The IRIDA platform is implementing ontologies to integrate the different data types required for outbreak investigation, surveillance and reporting. An ontology is a software tool that makes data both human- and machine-readable by standardizing terms and reusing them across information types. IRIDA’s Genomic Epidemiology Application Ontology (GenEpiO) is another innovation enhancing the platform’s analytical power.

The Need for Data Harmonization

Timeliness of infectious disease analyses is key for reducing the number of preventable cases of disease. The ability to resolve outbreaks relies heavily on good contextual information regarding “person, place and time”, which is crucial for identifying sources of contamination and exposure. Contextual information is also required for human health risk assessments, source attribution, ecosystems modelling, and in the simplest terms, to make sense of the genome data.

Contextual information includes details pertaining to how a sample was collected, types of samples from which a strain was isolated, methods for testing and typing, laboratory and epidemiological data. Contextual information informs the interpretation of WGS results used as evidence for decision making.


The digitization of genomics allows for increased resolution of infectious sequence types and rapid transmission of data; however, significant computational challenges remain in genomics result reporting and analysis. Raw genome sequences need to be processed and presented differently, in a timely and secure manner, to end-users in the health care environment with vastly different roles (attending physicians, infection control, environmental health officers, medical health officers, public health epidemiologists, etc.) and affiliations. The ability to share secure and standardized data within and across organizations is critical to implementing genomic epidemiology for public health microbiology.

Sequence data and digitized contextual information are known as digital assets – that is, they can be used for many different purposes and investigations. Best data stewardship practices state that digital assets like contextual information should be stored in a way that is FAIR (Findable, Accessible, Interoperable, Reusable) to maximize value and best prepare the data for future applications.

Significant challenges for public health and infectious disease data integration are posed by the lack of standardization. Contextual information is often recorded using free text or incompatible data dictionaries. During an outbreak, information from different sources must quickly be harmonized and combined in order to identify the source of a pathogen and its routes of transmission – especially when outbreak investigations extend beyond agencies and borders. Manual recoding and integration of data can take hours, days or even weeks to complete. These challenges impact computability for fast analyses, affecting time-to-response.

Challenges for fast and computer-amenable analysis using contextual information include the use of error-prone free text, which is difficult to mine. The use of shorthand and institution-specific jargon can result in semantic ambiguity (where words can have different meanings in different contexts). Contextual information is also often collected inconsistently, using different variables and fields.

Using standardized terms, or mapping institution-specific fields and terms to a controlled vocabulary, better enables software systems to communicate and facilitates data integration and exchange.
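As a rough illustration of this kind of mapping, the following Python sketch harmonizes free-text specimen descriptions from two agencies against a small controlled vocabulary. The agency names, the local terms and the "EX:" identifiers are all hypothetical placeholders chosen for this example; they are not real GenEpiO or OBO Foundry identifiers.

```python
# Hypothetical controlled vocabulary: placeholder IDs, not real ontology identifiers.
CONTROLLED_VOCAB = {
    "EX:0000001": "stool specimen",
    "EX:0000002": "blood specimen",
}

# Hypothetical institution-specific terms, each mapped to one standard ID.
LOCAL_TO_STANDARD = {
    "agency_a": {"stool": "EX:0000001", "feces": "EX:0000001", "bld": "EX:0000002"},
    "agency_b": {"faecal sample": "EX:0000001", "whole blood": "EX:0000002"},
}


def harmonize(agency, local_term):
    """Return (term_id, standard_label), or None if the local term is unmapped."""
    key = local_term.strip().lower()  # tolerate case and stray whitespace
    term_id = LOCAL_TO_STANDARD.get(agency, {}).get(key)
    if term_id is None:
        return None  # unmapped terms are flagged for a curator, never guessed
    return term_id, CONTROLLED_VOCAB[term_id]


records = [("agency_a", "Stool"), ("agency_b", "Faecal sample "), ("agency_a", "swab")]
for agency, term in records:
    print(agency, repr(term), "->", harmonize(agency, term))
```

Once both agencies' records resolve to the same identifier, they can be combined and queried together, while each agency keeps its own wording at data entry.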


Ontologies as a Framework for Data Integration

One framework for integrating clinical, epidemiological and laboratory (genomic) data types is the use of ‘ontologies’. Ontologies are well-defined, standardized vocabularies interconnected by logical relationships, constructed in such a way as to facilitate fast and automated querying.

Example of a simplified food ontology. Terms are organized hierarchically, with more general terms at the root and more granular terms at the tips. Terms are linked by logical relationships. The “is_a” relation is one of the most common relationships and forms the backbone of the hierarchy, e.g. Curled endive “is_a” type of Endive. Other types of relations can link information in different ways, e.g. Leafy vegetable “has_disposition (or the ability to) act as” a transmission vehicle for a pathogen. These relations can serve to integrate food types and products with pathogen data.
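The hierarchy described in this example can be sketched in a few lines of Python. This is only a toy model under stated assumptions: each term has a single “is_a” parent, and apart from “Curled endive is_a Endive” and the disposition attached to leafy vegetables, the terms and links below are invented for illustration. Real ontologies are normally stored in dedicated formats and queried with dedicated tools.

```python
# Toy "is_a" hierarchy: child -> parent. Only the curled endive/endive link and
# the leafy vegetable disposition come from the example; the rest is assumed.
IS_A = {
    "curled endive": "endive",
    "endive": "leafy vegetable",
    "romaine lettuce": "lettuce",
    "lettuce": "leafy vegetable",
    "leafy vegetable": "vegetable",
}

# A second relation type: terms able to act as a pathogen transmission vehicle.
HAS_DISPOSITION_VEHICLE = {"leafy vegetable"}


def ancestors(term):
    """Follow is_a links from a term up to the root, nearest ancestor first."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def descendants(term):
    """Return every term that sits below the given term in the hierarchy."""
    return sorted(t for t in IS_A if term in ancestors(t))


def may_act_as_vehicle(term):
    """A term inherits the disposition of any of its is_a ancestors."""
    return any(t in HAS_DISPOSITION_VEHICLE for t in [term] + ancestors(term))


print(ancestors("curled endive"))
print(descendants("leafy vegetable"))
print(may_act_as_vehicle("curled endive"))
```

Because the disposition is attached once to the general term, a query for potential transmission vehicles automatically picks up every more granular food type beneath it.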


Ontologies, simply put, are computer files which organize things into classes of terms, and link those classes together in different ways. Ontology files can be implemented in different spreadsheets, applications and platforms according to the needs of their users. Standardization of vocabulary allows for increased interoperability between systems and integration of previously isolated databases as well as resolving semantic ambiguity. Highlights of the benefits of ontologies for surveillance and detection activities include:

  1. Faster data integration and exchange based on standardized fields. The longitudinal nature of pathogen surveillance requires information to be propagated and compared between agencies, which can occur much more quickly and in a computer-amenable manner if contextual information is standardized.
  2. Mapping of institution-specific terms used in public health interfaces to standards allows for customized data entry while facilitating interoperability.
  3. Standardized quality control and result reporting trigger actionable events in the same way, which will contribute to the accreditation and validation of clinically implemented genomics pipelines.



The Open Biomedical Ontologies (OBO) Foundry

The particular uses of an ontology can influence the way it is constructed. The architecture of an ontology can significantly impact the way it can interact with other ontologies, resulting in incompatibility. The OBO Foundry is a community of scientists committed to creating interoperable biomedical ontologies through collaborative development. The principles and practices of the OBO Foundry (e.g. common architecture, multiple users to increase usability, the use of IDs to disambiguate terms and their meanings) have created >150 interoperable ontologies that describe many different domains of knowledge e.g. the Gene Ontology (GO).


IRIDA’s Genomic Epidemiology Application Ontology (GenEpiO)

Our research efforts include the development of a Genomic Epidemiology Application Ontology (GenEpiO), based on public-health stakeholder interviews and the harmonization of important laboratory, clinical and epidemiological resources. The goal is to develop an ontology that supports an end-to-end genomic epidemiology pipeline, in order to fully propagate all of the necessary contextual information required to interpret genomics data, from the point-of-intake through sequencing to end use (e.g. in an epidemiologic investigation).

Since diseases do not respect international borders, uptake of a common, standard vocabulary for describing outbreak and surveillance activities is crucial for inter-jurisdictional interpretation of results and data sharing.

GenEpiO has been built according to the principles and practices of the OBO Foundry, and aggregates pertinent terminology from a number of existing OBO Foundry ontologies. GenEpiO contains >4000 key fields and terms to describe sample metadata, lab analytics, clinical information as well as exposures and epidemiological data. GenEpiO incorporates fields from community standards e.g. NCBI BioSample and the MIxS minimum information checklist, as well as existing ontologies to ensure the accuracy of meaning and facilitate interoperability between software systems.

GenEpiO is an application ontology – that is, it contains a particular combination of existing ontology terms to fulfill a certain purpose. Organizations can use the whole ontology directly, or only the parts that they need. Users can also map their institution-specific terms (i.e. preferred labels) to ontology terms in order to increase interoperability while maintaining customization of their interfaces.


The Genomic Epidemiology Consortium

Harmonization of the genomic epidemiology ontology can only be achieved by consensus and wide adoption, and international input and expertise are crucial to achieve these goals. In order to ensure that GenEpiO is sufficiently robust to serve all use cases, we have formed an inclusive International Genomic Epidemiology Ontology Consortium to build partnerships and solicit domain expertise. GenEpiO has been developed in collaboration with the International GenEpiO consortium, which has >80 members from 15 different countries. The consortium includes leaders from different health, regulatory, academic and standards communities, and representatives from different sectors. All interested individuals are welcome to participate. More information regarding GenEpiO’s design, how to contribute new terms, and our goals and activities, can be found at To join, or find out more about our please contact

In addition, other key ontology domains under development include Antimicrobial Resistance (ARO), Pathogen Surveillance Ontology (SurvO), and the Mobile Elements Ontology (MobiO), all critical for tackling the global threats of antibiotic resistance and emerging pathogens. As good descriptors for food products and food production environments are key for surveillance and foodborne outbreak investigations, we have also created the Food Ontology (FoodOn) to hold this content, and have led the formation of the FoodOn Consortium to support its use in various academic, public health, and industry contexts.

Community contributions welcome.  


Ontology Tool Development

We are also developing tools to better enable users to interact with our ontologies, such as the Genomic Epidemiology Entity Mart (GEEM). GEEM enables software developers to shop for fields and terms appropriate for the needs of their users in order to create data specifications, which can then be used to build ontology-driven interfaces and applications. GEEM features browse & search, shopping cart and discussion tab (for ontology curators) functionalities.
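To give a sense of what an ontology-driven data specification might look like in practice, here is a minimal Python sketch. It does not show GEEM's actual output format or API, which are not described here; the field names, the "EX:" identifiers and the validate function are hypothetical, invented purely to illustrate the idea of checking a record against a chosen set of fields and allowed terms.

```python
# Hypothetical data specification: each field points at a placeholder ontology
# ID and, where relevant, the set of term IDs allowed as values.
SPEC = {
    "specimen_type": {
        "term_id": "EX:0000010",
        "allowed": {"EX:0000001", "EX:0000002"},
        "required": True,
    },
    "collection_date": {"term_id": "EX:0000011", "allowed": None, "required": True},
    "food_product": {"term_id": "EX:0000012", "allowed": None, "required": False},
}


def validate(record, spec):
    """Return a list of problems found in a record; an empty list means it conforms."""
    errors = []
    for field, rule in spec.items():
        value = record.get(field)
        if value is None:
            if rule["required"]:
                errors.append(f"missing required field: {field}")
            continue
        if rule["allowed"] is not None and value not in rule["allowed"]:
            errors.append(f"{field}: {value!r} is not an allowed term")
    for field in record:
        if field not in spec:
            errors.append(f"unknown field: {field}")
    return errors


good = {"specimen_type": "EX:0000001", "collection_date": "2017-05-01"}
bad = {"specimen_type": "stool", "sampler": "JS"}
print(validate(good, SPEC))
print(validate(bad, SPEC))
```

An interface built on such a specification could offer only the allowed terms in its drop-down menus, so that free text never enters the record in the first place.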

For more information regarding GEEM, as well as other text parsing and text matching tools under development, please contact