[SOUND] Hi, my name is Stephan Schurer from the University of Miami Medical School. This lecture is an introduction to data standards, metadata, and ontologies. It follows the brief overview given in the previous lecture. In the biomedical data sciences, we can group data standards into three general categories. The first is reporting guidelines or checklists, which are also often called minimum information standards. The second is controlled vocabularies and terminologies, and the third is data exchange formats. Reporting guidelines are minimum information checklists that specify what kind of information needs to be captured about an experiment or a study for a particular purpose. For example, if you run a survey about people, you may want to capture their age, gender, or other information. Controlled vocabularies are terminological resources that provide identification and definition of entities, so that we have a unique way to refer to something; for example, for gender, people choose male or female. Data exchange formats are specifications for how data are encoded so that they are computer readable and computer processable. This is of course very important so that different software systems or different computational agents can talk to each other and understand each other's information. Typical data exchange formats would be, for example, JSON, tab-delimited files, or comma-delimited CSV files.

An important consideration for data is how they are organized. Data structures refer to the organization of data: for example, the entity-relationship diagram of a relational database schema, an ontology that organizes information in a certain way, the way data are structured in a schema-less database, or the relationships of objects in an object-oriented database. Many biomedical data standards have been developed over the years for many different applications, so it can be very hard to find them on the Internet. One solution to this problem is the BioSharing standards site, which is maintained at the Oxford University e-Research Centre. The BioSharing information resources have been curated, which means they have been reviewed and organized by experts. That adds a lot of value, because you can be sure that those resources are maintained and of high quality. The BioSharing site includes many terminological resources, data format and model resources, and reporting guidelines; these are the three kinds of data standards we covered earlier. In addition, it also progressively adds policies, databases, and other resources.

But why are metadata so important? And specifically, why are standardized, formalized metadata so important? What justifies the significant effort of developing reporting guidelines, terminologies, and data exchange formats? It should be quite clear that we really cannot do anything with data if we don't have metadata. Metadata are always about the data, but here we are talking specifically about standardized metadata, and that is very important if you want to exchange information, reuse information, share information, and build information systems. Specifically, metadata are very important to facilitate data replicability, reproducibility, and data reuse. They are also critical to enable us to interpret results, perform data analysis, and develop hypotheses. Metadata enable the repurposing of data for other projects: data that I generated for one purpose in one project can be used in another project, because the datasets fit together based on their specific metadata.
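To make the data exchange formats mentioned above concrete, here is a minimal sketch in Python showing the same survey record encoded as JSON and as a comma-delimited CSV file. The record and its field names are purely illustrative assumptions, not part of any particular standard.

```python
import csv
import io
import json

# One illustrative survey record (field names are made up for this example).
record = {"participant_id": "P001", "age": 34, "gender": "female"}

# JSON: a self-describing, machine-readable encoding.
print(json.dumps(record, indent=2))

# CSV: a flat, comma-delimited encoding with a header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["participant_id", "age", "gender"])
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

Both encodings carry the same data; what makes them useful for exchange is that any software system can parse them back into the same structure.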
Metadata are also important for information systems, so that we can query and search datasets, integrate data, and exchange data between different systems. For example, any term by which you search data is typically captured as metadata, and to do this in an organized way we need metadata standards.

Talking about data replicability, data reproducibility, and reuse of data, it is very important to precisely define what those terms actually mean. In many cases, people talk about reproducibility but in fact mean replicability, or in some cases they even mean just repeatability. So here is a very useful set of definitions of those different terms. Repeat means the same experiment done in the same laboratory, often by the same individual, in exactly the same way. So repeat simply means doing exactly the same thing in exactly the same way multiple times, and getting the same results. Replicate means doing the same experiment but in a different laboratory; it refers to the fact that somebody else can do the exact same experiment. Reproduce refers to the same experiment but in a different setup: different instrumentation, a different laboratory, different individuals, different reagents, or at least different reagent batches. So reproducibility means that one can obtain the same kind of biological result, but not by repeating exactly the same experiment in exactly the same way. I want to re-emphasize the point here: replicability is not reproducibility. Reuse of data involves, of course, a different experiment, but there is some connection between the two experiments or the two datasets, and this connection is made explicit by standardized metadata. So I can use data generated in one experiment to complement data generated in another experiment. This data integration, or data linking, is only possible with explicit, standardized metadata. Combining different types of datasets generated in different experiments that have some connection of course requires standardized and unified metadata across those experiments. And of course, to reuse data and add value requires data of high quality, which typically also implies that the data are reproducible.

What information, which specific metadata, should be captured for a particular type of experiment is defined in the minimum information reporting guidelines mentioned earlier. A great resource for minimum information reporting guidelines is MIBBI, Minimum Information for Biological and Biomedical Investigations. The MIBBI foundry is now integrated into the BioSharing site I mentioned earlier. If followed, minimum information reporting guidelines ensure that data can be easily verified, analyzed, and clearly interpreted by the wider scientific community. These recommendations also facilitate the foundation of structured databases and public repositories, and the development of data analysis tools. Building information systems and data analysis tools requires not only that the minimum metadata are captured, but also that the metadata are captured in a standardized and unified way. This is where controlled vocabularies come in. The BioSharing site makes an effort to actually link reporting guidelines to terminological resources. Controlled vocabularies, or thesauri, describe what things mean by linking terms to human-readable descriptions.
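To illustrate how a reporting guideline and a controlled vocabulary work together in software, here is a minimal sketch in Python that checks a metadata record against a required-field checklist and a set of allowed terms. The checklist, the vocabulary, and the record are illustrative assumptions only, not an actual MIBBI guideline or any official specification.

```python
# Illustrative checklist (reporting guideline) and allowed terms (controlled vocabulary).
REQUIRED_FIELDS = {"assay", "cell_line", "species"}
CONTROLLED_VOCAB = {"species": {"Homo sapiens", "Mus musculus"}}

def validate(metadata: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS - set(metadata)]
    for field, allowed in CONTROLLED_VOCAB.items():
        value = metadata.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: '{value}' is not a controlled term")
    return problems

record = {"assay": "transcriptional profiling", "cell_line": "MCF7", "species": "human"}
print(validate(record))  # -> ["species: 'human' is not a controlled term"]
```

The point is not the specific fields but the pattern: a checklist says which metadata must be present, and a controlled vocabulary says which values are acceptable, so that records from different laboratories remain comparable.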
Controlled vocabularies link entities to identity criteria, so that we can refer to the same thing using a defined term. By doing so, they enable us to share knowledge in a common language. Thesauri also often include natural language synonyms, and those are important for search and text mining applications. Controlled vocabularies therefore facilitate the linking and searching of information in software systems; for example, a system can search or analyze different types of information that are linked by those controlled vocabularies. And, of course, they enable humans, but not computational agents, to reference and exchange knowledge.

If you want to build a computational system that can operate on knowledge, you first have to formalize that knowledge in a way that can be understood and processed by a computer. One way of doing this is using ontologies. The original use of the term ontology goes back to the Greek philosophers Parmenides and Plato. The philosophical definition of ontology is the study of existence and the nature of being. So in philosophy, ontology asks which things exist, what is the nature of things, and what is reality? The computer science definition of ontology, which is what we are talking about here, is a formal description of knowledge of a subject domain of interest. An ontology is a formal, logic-based definition of the types, properties, and interrelationships of the real objects in a particular subject domain. For example, to formalize the domain of family, we can define the relationships husband-wife, mother-daughter, brother-sister, and so on. If done properly, the knowledge in the subject domain, in this case family, can be formalized using logical axioms. For example, in the case of family, we would understand that a cousin is a child of a father's or mother's brother or sister. To be computer processable, of course, all of this also needs to be expressed in a machine-processable specification. One such data format for ontologies is the Web Ontology Language, which uses a particular family of logics, description logics. Given all this, a very brief definition can be stated as: an ontology is a specification of a conceptualization. Beyond a controlled vocabulary, in which entities are defined using human-readable definitions, an ontology contains entities, which are called classes, and their relationships, which are called object properties. By doing so, an ontology allows us to capture abstract knowledge using logical axioms, and this is done as an explicit, logic-based specification in a language such as the Web Ontology Language with description logic semantics, OWL-DL. So an ontology allows us to build a formal knowledge model and to compute with that knowledge using so-called reasoning engines. Ontologies are a foundation of semantic web information systems; the semantic web is also sometimes referred to as Web 3.0. So ontologies are formalized representations of knowledge that enable computing with that knowledge. That is in contrast to controlled vocabularies, which allow humans, but not computational agents, to exchange knowledge.

For those of you who know relational databases and are interested in semantic web technologies, here is a brief contrast of relational database management systems versus ontologies. Relational databases operate under the closed world assumption. That means there is a pre-defined schema, and every entity has to fit somewhere in that schema.
Ontologies, in contrast, operate under the open world assumption. That means a class is defined by its relationships to other classes, and a reasoning engine decides where it is placed within the framework of the ontology. Relational database management systems do not support reasoning: everything we want to query from the system needs to be put into the database explicitly. In contrast, semantic web technologies support inference, or reasoning. For example, it would be relatively simple to infer somebody's cousins if you know his or her parents, the parents' siblings, and their children. So, in contrast to a relational database system, we do not have to explicitly record who somebody's cousins are; we can infer that information. To ask a specialized query of a relational database, you need to know the schema and how to combine information from the different tables. In contrast, semantic web technologies, via formal semantics, provide a restriction-free framework for asking queries. In a way, those queries work by traversing a graph that is defined by the ontology; a concrete sketch of the cousin example follows below. As a consequence, in relational database systems, data sharing is often not easy, because there are no formal semantics, and if you want to exchange any information, the schemas have to be somehow compatible. In contrast, semantic web technologies provide easier data sharing and knowledge sharing. Data and knowledge sharing using semantic web technologies requires that the ontologies talk to each other, and this is facilitated by formal semantics. That is relatively easy for common domains, for example family, where everybody agrees what the relationships and the classes are, but it is much more difficult, and not yet achieved, for very complex knowledge, for example in the biomedical domains, where there are many different ontologies for overlapping subdomains. Information systems based on semantic web technologies use so-called triple stores. In a triple store, all relationships between all individuals are explicitly stored. Relational databases are very powerful for very large datasets that all have the same structure, whereas a triple store is more flexible and better suited to very complex domain knowledge. Relational database systems are an established technology with industry standards, for example Oracle; open source examples include MySQL and Postgres. In contrast, semantic web technologies such as Jena or Virtuoso are still at a relatively early stage, and standards in this domain are still emerging.

Hundreds of biomedical ontologies have been developed over the years. One of the most comprehensive repositories of biomedical ontologies is the NCBO BioPortal, which was already introduced in the previous lecture. Another important resource for biological and biomedical ontologies is the OBO Foundry. In contrast to the NCBO BioPortal, which has a very open policy of allowing developers to deposit their ontologies, the OBO Foundry takes a much more selective approach. OBO Foundry ontologies have to comply with the Foundry principles, including design decisions, naming conventions, and the use of certain upper-level ontologies. New ontologies are typically admitted as OBO Foundry candidate ontologies, and only after a thorough review process can they be promoted to OBO Foundry ontologies. One of the goals of the OBO Foundry is to promote a collection of compatible and non-overlapping biological and biomedical ontologies.
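Here is a minimal sketch of the cousin inference described above, using the Python rdflib library (assuming it is installed). Only parent and sibling relationships are stored as triples; the cousin relationship is never stored explicitly but is derived by a query that traverses the graph. The example.org namespace and the family facts are illustrative assumptions.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/family#")
g = Graph()
g.bind("ex", EX)

# Explicitly stored facts: only parents and siblings, no cousins.
g.add((EX.Anna, EX.hasParent, EX.Maria))
g.add((EX.Maria, EX.hasSibling, EX.Paul))
g.add((EX.Tom, EX.hasParent, EX.Paul))

# A SPARQL query that derives cousins by traversing hasParent and hasSibling.
query = """
PREFIX ex: <http://example.org/family#>
SELECT ?person ?cousin WHERE {
  ?person ex:hasParent ?parent .
  ?parent ex:hasSibling ?auntOrUncle .
  ?cousin ex:hasParent ?auntOrUncle .
}
"""
for row in g.query(query):
    print(f"{row.person} is a cousin of {row.cousin}")
```

In a relational database, the same question would require either an explicit cousin table or a query written against a known schema with the appropriate joins; here the relationship falls out of the graph structure itself. A full OWL-DL ontology with a reasoner would go further and classify such relationships automatically.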
The OBO Foundry is a good starting point to research existing biological and biomedical ontologies. Another important resource for biological and biomedical ontologies is the EBI Ontology Lookup Service at the European Bioinformatics Institute. The EBI Ontology Lookup Service provides a centralized query interface for almost 100 ontologies.

In the last section of this lecture, I want to briefly introduce the LINCS metadata standards. Recall that LINCS signatures have three primary dimensions. The first dimension is the biological model system. LINCS biological model systems include proliferating immortalized cell lines, primary cells, induced pluripotent stem cells, and differentiated cells. Another dimension is the perturbation of the model system. LINCS perturbagens include small molecules, RNAi reagents, proteins, and other reagents. The third dimension of LINCS signatures is the molecular entities and cellular features that are detected and quantified in an assay. These can include, for example, genes, which are quantified in transcriptional profiling assays, or proteins, which are detected and quantified via antibodies in proteomics assays. They also include phosphoproteins, which can be quantified via mass spectrometry, for example in the P100 phosphoprotein profiling assay, and cellular features, which are detected in the various LINCS phenotypic cell profiling assays. We have been developing metadata standards to uniquely identify molecular entities, model systems, and other concepts that are important to describe LINCS signatures and assays. The metadata standards specifications ensure that all metadata are captured in sufficient detail to enable us to integrate the data and to create a common view across all the different LINCS data generated by the different LINCS assays. While identifying metadata entities, we also capture annotations that are important to link LINCS data to third-party resources and to analyze and query LINCS data. For example, cell lines are annotated by tissues and diseases, and small molecules are annotated by known targets. LINCS metadata standards specifications for many categories have been released and are published on the LINCSproject.org website under Data Standards. The standardized LINCS metadata entities, including small molecules, cells, proteins, transcribed genes, and their annotations, are registered and stored in a dedicated information system, the LINCS Metadata Registry. The LINCS Metadata Registry also captures assays, organizations, projects, roles, and datasets, and all the relationships between those and the LINCS metadata categories. It is accessible at the given URL. Clicking on any of the tiles in the user interface will bring up a table with all the entries and annotations of that particular category. For example, here are the LINCS cell lines with various annotations, including species, organ, and many other annotations, IDs, provider, and so on. This interface also allows filtering and text-based querying of all the records of the given category. It also allows downloading of all the entries and their annotations. One can also download a template, which can then be used to upload additional records of that given category. The LINCS Metadata Registry is integrated with other software systems; together, those systems form the integrated knowledge environment that we are developing for the LINCS project.
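To give a feel for what standardized, annotated metadata entities look like in practice, here is a minimal sketch in Python of a cell line record and of filtering a small set of records by one of their annotations. The field names and values are illustrative assumptions and do not reproduce the official LINCS metadata specification.

```python
import json

# An illustrative annotated cell line record (fields are assumptions, not the LINCS spec).
cell_line = {
    "name": "MCF7",
    "species": "Homo sapiens",
    "organ": "breast",
    "disease": "breast carcinoma",
    "provider": "ATCC",
}
print(json.dumps(cell_line, indent=2))

# With records annotated in a unified way, filtering and cross-linking become simple.
records = [
    cell_line,
    {"name": "A549", "species": "Homo sapiens", "organ": "lung",
     "disease": "lung carcinoma", "provider": "ATCC"},
]
lung_lines = [r["name"] for r in records if r["organ"] == "lung"]
print(lung_lines)  # -> ['A549']
```

This is the kind of operation that the registry's filtering and text-based querying supports across all registered categories.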
The design of the MDR (Metadata Registry) system as a core component of the LINCS integrated knowledge environment emphasizes the importance of structured and standardized metadata. Finally, here is a selected list of useful ontology resources. The NCBO BioPortal is the largest repository of biomedical ontologies. The OBO Foundry includes only carefully selected and thoroughly reviewed ontologies. The EBI Ontology Lookup Service provides a web service interface to query multiple ontologies from a single location with a unified output format. Ontobee is a linked data server for ontologies; it dereferences an ontology term and allows us to use that term, via its URL, in an HTML web page or other source code. OntoFox is a web-based ontology tool that allows us to extract ontology terms and axioms. It is useful to extract modules from existing ontologies and reuse those modules in new ontologies. Protégé is probably the most widely used ontology editor, which you will need to develop or modify an ontology. It is a free, open source tool developed at Stanford University. An introductory guide to developing your own first ontology, Ontology Development 101, is available at the Protégé website. Formal ontologies are developed in the Web Ontology Language; the direct model-theoretic semantics of the Web Ontology Language is available at the W3C site. That document requires a basic understanding of description logics, and an introduction to description logics is available on Wikipedia. [MUSIC]