Strategic Research Initiatives
Bringing digital era formats to omics and health
Idoia Ochoa, Umberto Ravaioli: Electrical and Computer Engineering
Mikel Hernaez: Carle R. Woese Institute for Genomic Biology
Joerg Heintz, Antonios Michalos: Health Care Engineering Systems Center
Addressing the Problem
Heterogeneous omics data are being generated at an unprecedented rate and volume owing to advances in biological data acquisition technologies, a significant drop in cost, and wider access to omics platforms. Consequently, data storage, transmission, visualization, and scalable processing have become major challenges in the advancement of biological and medical research. This sentiment is reflected in the National Human Genome Research Institute (NHGRI) roadmap, which asserts that “The major bottleneck in genome sequencing is no longer data generation - the computational challenges around data analysis, display and integration are now rate-limiting.”
Current omics data formats lack features needed to cope with this growth and with the increasingly involved analyses that combine several heterogeneous data types. These features include selective access, which enables data streaming and high-performance processing for fast clinical decision making, and security features for privacy protection, among others. As such, the development of “digital and programmable” formats is of utmost importance. These new formats can facilitate storage and transmission (by reducing file sizes through specialized compression), ensure security (through encryption mechanisms), and facilitate analysis (e.g., by providing annotations, linking related datasets, and allowing selective access to the data).
However, even when formats with some of these capabilities are proposed, they are not being adopted. The reason is that switching formats can be a tedious process: huge amounts of data would need to be converted from the old formats to the new ones, the numerous tools that exist for the analysis of omics data would need to be modified to accept the new files as input, and new APIs would need to be created to make use of the added capabilities. Therefore, for adoption to happen, the community first needs to be convinced that the benefits of the new formats are worth the work needed to convert the data and update the existing infrastructure. In addition, no transition will happen unless the community agrees on a given format.
The goal of this project is to develop new digital formats for different omics data that will facilitate their storage, transmission, and visualization, and that will allow for a seamless integration of heterogeneous omics data for joint processing and analysis, while ensuring compatibility with existing tools and infrastructure. In addition, the proposed formats will vastly facilitate data exchange between healthcare providers, advancing the integration of omics data in the clinical setting.
- We are working with the genomics community on the new standard for genomic information representation, dubbed MPEG-G. We are developing new technology to be included in the standard, and working on GENIE, the first open-source MPEG-G encoder/decoder, to showcase the benefits of MPEG-G.
- We are developing a new format for genomic annotation files that, in addition to compression, allows for fast selective access to the compressed data, and that can incorporate expression files derived from bulk and single-cell RNA sequencing data.
- We are working on an improved representation for mass spectrometry data that improves upon the state-of-the-art formats and allows for both lossless and lossy compression.
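To make the idea of selective access over compressed data more concrete, the following is a minimal Python sketch of one common way such access can be obtained: records are compressed in independent blocks, and a small index of starting positions lets a query decompress only the blocks that overlap the requested range. It is an illustration only, not the format developed in this project nor the MPEG-G design. The function names `build_blocks` and `query`, the `(position, text)` record layout, and the use of `zlib` are all assumptions made for the example.

```python
import zlib
from bisect import bisect_left


def build_blocks(records, block_size=1000):
    """Compress position-sorted (position, text) records in independent blocks.

    Hypothetical layout for illustration: `text` must not contain tabs or
    newlines. Returns (index, blocks), where index[i] is the first position
    stored in blocks[i].
    """
    index, blocks = [], []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        payload = "\n".join(f"{pos}\t{text}" for pos, text in chunk)
        index.append(chunk[0][0])
        blocks.append(zlib.compress(payload.encode("utf-8")))
    return index, blocks


def query(index, blocks, lo, hi):
    """Return records with lo <= position <= hi.

    Only blocks that can overlap [lo, hi] are decompressed.
    """
    hits = []
    # Start at the last block whose first position is below `lo`;
    # it may still hold records at or after `lo`.
    first = max(bisect_left(index, lo) - 1, 0)
    for i in range(first, len(blocks)):
        if index[i] > hi:
            break  # every later block starts beyond the requested range
        text_block = zlib.decompress(blocks[i]).decode("utf-8")
        for line in text_block.split("\n"):
            pos_str, text = line.split("\t", 1)
            pos = int(pos_str)
            if lo <= pos <= hi:
                hits.append((pos, text))
    return hits


if __name__ == "__main__":
    records = [(i * 10, f"gene{i}") for i in range(50)]
    index, blocks = build_blocks(records, block_size=8)
    print(query(index, blocks, 95, 130))
```

The block size controls the trade-off in this sketch: larger blocks usually compress better, while smaller blocks mean less data has to be decompressed per query.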