Whole genome sequencing (WGS) is a powerful tool for public health infectious disease investigations owing to its higher resolution, greater utility, and better cost-effectiveness compared with traditional genotyping methods. The Integrated Rapid Infectious Disease Analysis (IRIDA) platform is a user-friendly, distributed, open source bioinformatics and analytical web platform, developed to support real-time infectious disease outbreak investigations using whole genome sequencing data.
While IRIDA was initially created to support the Canadian public health system, instances can be independently installed in any high performance computing environment, enabling private and secure analyses according to organizational policies and governance. Communication between instances can be enabled through administrator-controlled sharing of sample and project information. IRIDA’s data management capabilities enable secure upload, storage and management of all sequences and metadata, while providing the transparent provenance and auditability required by clinical and bioinformatics best practices. The Galaxy-driven execution of quality control checks, assembly, annotation, SNV-based phylogenetic analysis, in silico serotyping and cgMLST pipelines simplifies visualization and evaluation of results for lab analysts, while providing power-users with modularity and customizability.
The IRIDA platform enables fast, scalable, private (and shareable) analytics and visualizations for WGS-based microbial pathogen investigations, and is currently transforming the Canadian public health ecosystem.
The Goal of IRIDA
The IRIDA (Integrated Rapid Infectious Disease Analysis) project is a collaborative effort between Canada’s National Microbiology Laboratory (Public Health Agency of Canada), BC’s Centre for Disease Control, and Simon Fraser University, to develop a free, secure, open source, user-friendly web platform to support real-time infectious disease outbreak investigations and pathogen surveillance using genomic data.
Under the umbrella of the IRIDA project, the team was tasked with creating the platform, as well as building functionality through additional tool development. These efforts include the genomic island visualization resources IslandViewer and IslandCompare; the phylogeography and statistical analysis platform GenGIS; the Salmonella in silico Typing Resource SISTR; and the cgMLST web-based visualization software PhyloViz Online. The IRIDA vision includes a plugin architecture allowing developers and instance administrators to individually package their pipelines and link out to remote services, according to the needs of their users. IRIDA is under continual development, and welcomes collaborations to better enable 3rd party application integration.
Currently, IRIDA offers:
- Management of genomic sequence data and metadata
- Rapid processing and analysis of genomic data, e.g. QA/QC, assembly, annotation, SNV-based phylogeny
- MLST-based phylogeny (MentaLiST)
- Salmonella cgMLST-based genotyping and in silico serotyping
- Informative visualizations of genomic analysis results
- Controlled access to data via standardized REST API
- Data sharing between instances
- Open source (under Apache 2.0 license)
- Privacy protection
- Data standardization enabling value-added activities
- User support
- IRIDA software is FREE
- VMs and a public instance (sfu.ca/XXX) are available for user evaluation
- Development reflects best clinical, bioinformatics, data stewardship practices
Benefits of Using the IRIDA platform
IRIDA’s Core Functionality:
IRIDA instances behave like secure, independent, data management environments, allowing users to maintain complete control over their data during upload, analyses and storage. Through IRIDA’s distributed framework, data sharing between IRIDA instances can be enabled by altering access permissions on samples, projects and analyses.
Workflows and Analyses
Users can upload sequence reads from Illumina MiSeq sequencing instruments using the MiSeq Uploader, as well as FASTQ files and assemblies from various sources, e.g. public repositories (NCBI). Raw sequence data can then be curated for quality and submitted to a number of different workflows for processing and analysis, such as IRIDA’s Assembly and Annotation Pipeline, which performs de novo or reference-based assembly and annotates genes and other genomic features, either singly or in batch format; IRIDA’s SNVPhyl Single Nucleotide Variant Phylogenomics pipeline, which generates SNV distance matrices and SNV-based phylogenies; the MentaLiST k-mer based core genome Multilocus Sequence Typing (cgMLST) pipeline, which genotypes bacterial isolates directly from reads; and the Salmonella In Silico Typing Resource (SISTR), which performs Salmonella genoserotyping, MLST, rMLST and cgMLST.
SNVPhyl has been used for analysis by hundreds of public health analysts at Canada’s National Microbiology Laboratory (Public Health Agency of Canada’s national reference lab) to support research and provincial laboratory services since 2010. The pipeline is currently being used for outbreak investigations and has been validated as part of a suite of tools used by PulseNet Canada for routine foodborne disease surveillance activities.
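To make the idea of an SNV distance matrix concrete, the following is a minimal sketch that counts pairwise differences between already-aligned variant positions. It is not SNVPhyl's implementation (the real pipeline also covers read mapping, variant calling and filtering); the function names, the isolate names and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Tuple


def snv_distance(seq_a: str, seq_b: str) -> int:
    """Count differing positions between two aligned sequences.

    Positions where either sequence has a gap ('-') or an ambiguous
    base ('N') are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ignored = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignored and b not in ignored
    )


def snv_distance_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return symmetric pairwise SNV distances keyed by (sample, sample)."""
    matrix = {(name, name): 0 for name in alignment}
    for a, b in combinations(alignment, 2):
        distance = snv_distance(alignment[a], alignment[b])
        matrix[(a, b)] = distance
        matrix[(b, a)] = distance
    return matrix


# Hypothetical aligned variant positions for three isolates.
alignment = {
    "isolate_A": "ACGTACGT",
    "isolate_B": "ACGTACGA",
    "isolate_C": "TCGNACGA",
}
matrix = snv_distance_matrix(alignment)
for (a, b), distance in sorted(matrix.items()):
    print(a, b, distance)
```

A matrix of this kind is what a distance-based clustering or tree-building step would typically take as input; skipping ambiguous positions is one possible design choice among several.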
Most epidemiological investigation pipelines have been built using categorical typing results as components of case definitions and as evidence for outbreak protocols. Multilocus Sequence Typing (MLST) constructs an ‘allelic profile’ based on a limited number of loci from an established MLST scheme. As genomic epidemiology studies strive to gain more detailed strain typing information, these MLST schemes have expanded to incorporate larger portions of the genome. These ‘core genome MLST’ (cgMLST) and ‘whole-genome MLST’ (wgMLST) analyses present a computational challenge. In order to provide similar fit-for-purpose categorical data, IRIDA has integrated the Salmonella In Silico Typing Resource (SISTR) and MentaLiST.
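An allelic profile can be pictured as a mapping from locus name to allele number, and two isolates can then be compared by counting the loci at which their alleles differ. The sketch below illustrates that comparison only; the allele numbers are invented, the helper name is hypothetical, and treating uncalled loci as "skip" is an assumption rather than a rule taken from any particular scheme.

```python
from typing import Dict, Optional

# A profile maps locus name -> allele number (None when no allele was called).
Profile = Dict[str, Optional[int]]


def allelic_distance(profile_1: Profile, profile_2: Profile) -> int:
    """Count loci with different alleles, skipping loci uncalled in either profile."""
    differences = 0
    for locus, allele_1 in profile_1.items():
        allele_2 = profile_2.get(locus)
        if allele_1 is None or allele_2 is None:
            continue  # missing data is not counted as a difference
        if allele_1 != allele_2:
            differences += 1
    return differences


# Toy seven-locus profiles; the allele numbers are made up for illustration.
profile_a: Profile = {"aroC": 1, "dnaN": 1, "hemD": 1, "hisD": 1, "purE": 1, "sucA": 1, "thrA": 1}
profile_b: Profile = dict(profile_a)
profile_c: Profile = {"aroC": 1, "dnaN": 2, "hemD": 1, "hisD": 1, "purE": None, "sucA": 1, "thrA": 3}

print(allelic_distance(profile_a, profile_b))
print(allelic_distance(profile_a, profile_c))
```

The same comparison applies unchanged to cgMLST or wgMLST profiles; only the number of loci grows, which is where the computational challenge comes from.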
The SISTR bioinformatics platform provides rapid in silico cgMLST and serotype inferencing based on draft Salmonella genome assemblies. SISTR is currently being used to generate serotype predictions for all genomes submitted to EnteroBase, the largest repository of Salmonella WGS data worldwide (https://enterobase.warwick.ac.uk/). SISTR was also recently used to generate predictions for the 30% of Salmonella genomes that have been deposited at NCBI with missing serovar information (Yoshida et al., 2016). Implementation of SISTR has also led to the phasing out of antigen-based serotyping at Canada’s National Microbiology Laboratory, which has moved to WGS as the primary means of characterization of Salmonella isolates from national surveillance programs as of May 2017.
The Salmonella In Silico Typing Resource (SISTR)
MentaLiST is an MLST analysis tool, based on a fast k-mer voting algorithm. MentaLiST is able to make MLST calls directly from raw sequence reads, avoiding a slow assembly stage common to many previous MLST tools. It is specifically designed and implemented to handle large typing schemes. MentaLiST supports automated downloads of typing schemes from public databases such as pubMLST.org and cgMLST.org.
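The general idea of k-mer voting can be sketched as follows: index every k-mer of every known allele, let each k-mer seen in the reads cast a vote for the alleles that contain it, and call the allele with the most votes at each locus. This is a deliberately simplified toy, not MentaLiST's actual algorithm or code; the function names, locus names and sequences are hypothetical, and details such as reverse complements, vote weighting and novel alleles are left out.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

Scheme = Dict[str, Dict[int, str]]  # locus -> {allele number -> sequence}


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every overlapping substring of length k."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def build_index(scheme: Scheme, k: int) -> Dict[str, Set[Tuple[str, int]]]:
    """Map each k-mer to the (locus, allele) pairs whose sequence contains it."""
    index: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
    for locus, alleles in scheme.items():
        for allele_id, sequence in alleles.items():
            for kmer in kmers(sequence, k):
                index[kmer].add((locus, allele_id))
    return index


def call_alleles(reads: Iterable[str], scheme: Scheme, k: int) -> Dict[str, Optional[int]]:
    """Call the most-voted allele per locus directly from reads (no assembly)."""
    index = build_index(scheme, k)
    votes = {locus: Counter() for locus in scheme}
    for read in reads:
        for kmer in kmers(read, k):
            for locus, allele_id in index.get(kmer, ()):
                votes[locus][allele_id] += 1
    return {
        locus: (counter.most_common(1)[0][0] if counter else None)
        for locus, counter in votes.items()
    }


# Toy scheme with two loci and two alleles each (sequences are invented).
scheme: Scheme = {
    "locusX": {1: "ACGTACGTAC", 2: "ACGTTCGTAC"},
    "locusY": {1: "GGGCCCAAAT", 2: "GGGCCGAAAT"},
}
reads = ["ACGTTCGT", "TCGTAC", "GGCCCAAA"]
calls = call_alleles(reads, scheme, k=5)
print(calls)
```

Because the reads are only scanned for k-mers and never assembled, the cost is dominated by index lookups, which is the property that makes this family of approaches attractive for large schemes.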
Sequence data and analyses can be exported for external storage and analyses using a variety of mechanisms. Data can be exported to third party applications (e.g. BioNumerics; tool available from Applied Maths) through IRIDA’s REST API, or retrieved using IRIDA’s command-line linker. Users can also upload sequence files to NCBI using IRIDA’s Upload to NCBI SRA export feature.
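As a rough illustration of how a third party script might talk to a REST API of this kind, the sketch below builds an authenticated request for a project listing using only the Python standard library. The host name, the `/api/projects` path, the bearer-token header and the helper names are assumptions for illustration and should be checked against the REST API documentation of the installed IRIDA version; the token is a placeholder and no request is sent when the example runs.

```python
import json
import urllib.request


def build_projects_request(base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the project listing.

    The '/api/projects' path and bearer-token scheme are assumptions.
    """
    url = base_url.rstrip("/") + "/api/projects"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_json(request: urllib.request.Request):
    """Send the request and decode the JSON body (requires a reachable server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical instance URL and placeholder token; nothing is sent here.
request = build_projects_request("https://irida.example.org/", "PLACEHOLDER_TOKEN")
print(request.full_url)
```

Calling `fetch_json(request)` against a real instance would require a valid access token issued by that instance's administrator.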
Metadata and Visualizations
Metadata can also be stored alongside sequences for describing and tracking samples. Contextual metadata such as specimen type, geographical location, exposures, collection date, and case and investigation IDs can be uploaded through IRIDA’s Metadata Manager uploader, which accepts user-defined spreadsheet (.csv) files. Metadata in Sample files can be edited once uploaded. Metadata can also be uploaded to Sample files without the addition of sequences, to facilitate the comparison of epidemiological and laboratory information in line lists in the absence of genomic data.
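Before uploading a metadata spreadsheet, it can help to check locally that every row names a sample and to see which fields are actually filled in. The following is a minimal sketch of such a check; the column names (`sample_name`, `specimen_type`, and so on) and the choice of `sample_name` as the identifying column are hypothetical, since the uploader accepts user-defined columns.

```python
import csv
import io
from typing import Dict, TextIO

SAMPLE_COLUMN = "sample_name"  # hypothetical identifying column


def read_metadata(handle: TextIO) -> Dict[str, Dict[str, str]]:
    """Parse a metadata CSV into {sample name: {field: value}}.

    Empty cells are dropped so that only populated fields are kept.
    """
    reader = csv.DictReader(handle)
    if SAMPLE_COLUMN not in (reader.fieldnames or []):
        raise ValueError(f"missing required column: {SAMPLE_COLUMN}")
    records: Dict[str, Dict[str, str]] = {}
    for row in reader:
        name = (row[SAMPLE_COLUMN] or "").strip()
        if not name:
            continue  # skip rows that do not identify a sample
        records[name] = {
            field: value.strip()
            for field, value in row.items()
            if field and field != SAMPLE_COLUMN
            and isinstance(value, str) and value.strip()
        }
    return records


# Toy spreadsheet content; values are invented for illustration.
example_csv = io.StringIO(
    "sample_name,specimen_type,collection_date,geographic_location\n"
    "S001,stool,2017-05-02,British Columbia\n"
    "S002,blood,,Ontario\n"
)
records = read_metadata(example_csv)
print(records)
```

Note that a sample such as `S002` can carry metadata with gaps, and nothing here requires sequence data to exist for the sample.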
Uploaded metadata can be used to label branches in phylogenomic trees for cluster analysis and decision-making using IRIDA’s Advanced Phylogenomic Visualization tools.
Enhanced Data Harmonization and Integration
IRIDA will be the world’s first genomic epidemiology platform to develop and implement an ontology that offers standardized terms for describing metadata, which can improve data harmonization and integration between different groups and organizations. Best data stewardship practices encourage the encoding and storage of digital assets such as sequence metadata through the use of community standards like minimum information checklists and ontologies, which better prepare this information for different applications. As such, the IRIDA team has created the Genomic Epidemiology Ontology (GenEpiO) to describe sample details and provenance, as well as clinical, epidemiological, lab, and genomics data and methods. IRIDA developers are currently creating metadata templates offering ontology fields and terms for users to standardize their metadata. For more information regarding IRIDA’s data integration efforts leveraging our new ontologies, visit the Data Integration page (under Highlights).
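The benefit of standardized terms can be illustrated with a small sketch in which free-text values entered by different labs are mapped onto one agreed label. The synonym table, the labels and the function name below are invented for illustration; they are not GenEpiO terms or identifiers, and a real template would draw its vocabulary from the ontology itself.

```python
from typing import Dict, Optional

# Hypothetical synonym table: free-text spelling -> agreed label.
SPECIMEN_SYNONYMS: Dict[str, str] = {
    "stool": "feces",
    "faeces": "feces",
    "feces": "feces",
    "blood": "blood",
    "whole blood": "blood",
}


def harmonize(value: str, synonyms: Dict[str, str]) -> Optional[str]:
    """Return the agreed label for a free-text value, or None if unrecognized."""
    return synonyms.get(value.strip().lower())


# Values as three different labs might have entered them.
raw_values = [" Stool ", "FAECES", "whole blood", "swab"]
harmonized = [harmonize(value, SPECIMEN_SYNONYMS) for value in raw_values]
print(harmonized)
```

Returning `None` for unrecognized values, rather than passing them through, makes the entries that still need curation easy to find.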
IRIDA’s Design Philosophy
By providing rapid, secure sharing of WGS data between lab analysts, epidemiologists, and other stakeholders, data within IRIDA can be leveraged for faster trace back, enhanced risk assessment and regulatory actions, improved patient health, as well as discovery and expert analyses of genetic events and phenomena that potentially impact health outcomes.
Decentralization in health care is a common phenomenon pursued to promote efficiency and responsiveness at the local level. As such, no “one-size-fits-all” genomic epidemiology platform has been developed or universally accepted owing to differences in national health systems, data sharing policies, computational infrastructures, lack of interoperability and prohibitive costs.
To facilitate data exchange between different health agencies and health authorities, the IRIDA platform employs a distributed approach to providing programs and pipelines for users across Canada, and around the world, to manage outbreaks and perform microbial surveillance activities across different systems.
IRIDA development utilizes Galaxy’s collection of interoperable and open source bioinformatics scripts, pipelines and tools to create transparent (no black boxes), seamless and modular workflows. These design decisions were implemented in order to more easily integrate additional analytic and visualization modules, as well as import/export data to 3rd party applications and more centralized resources. By implementing Galaxy-driven software, IRIDA retains the flexibility and interoperability for further development according to user, developer and community needs.
The IRIDA platform is being designed in consultation with local, provincial, national and international stakeholders to ensure usability and improve interoperability with current systems and programs. Consultation by the design team with end users is critical to ensure that all the required and desired functionalities are implemented, that the patient and sample information necessary for analyses is captured, and that these elements have the highest accuracy and efficiency achievable.
This work was supported by the Genomics Research and Development Initiative (Public Health Agency of Canada), Genome Canada, Genome British Columbia, and Compute Canada, with the support of AllerGen NCE Inc.