The newest sequencing technologies are providing fresh insights into disease etiology, individual susceptibility, drug discovery, and more. However, many questions and challenges remain about how to interpret and apply this newfound wealth of data. As data is generated at record rates, scientists are working to develop methods for archiving the massive volume of sequencing information and to create bioinformatics tools for analyzing these enormous data sets.
We are seeing in real time the generation of colossal amounts of data by single next-generation sequencing (NGS) runs. With thousands of runs performed in a short period, the result is massive sets of data whose significance is yet to be determined. This creates the need to efficiently process, archive, and secure the data (especially protected patient-related data). The data must then be interpreted to determine its significance and its utility for translation to the clinic, regulatory entities, and various medical industries.
Important steps in the data storage effort are the decisions made about data trimming and triage. Without this process, there would be massive duplication of data and an accumulation of background information, or “noise.” Policies are developed to systematically identify and remove redundant information or artifact-related data. For example, high-throughput sequencing (HTS) reads often contain regions of low base-calling quality, particularly at the final stretch of the read. These regions can be identified and excluded.
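The trimming step described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming FASTQ-style quality strings in Phred+33 encoding and a cutoff of Q20; the function name and parameters are hypothetical, and production trimmers typically use more robust sliding-window or running-sum methods.

```python
def trim_low_quality_tail(sequence, quality, min_phred=20, phred_offset=33):
    """Trim the 3' end of a read back to the last base scoring >= min_phred.

    Assumes `quality` is a Phred+33 encoded string, one character per base.
    The threshold of 20 is an illustrative default, not a recommendation.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    end = len(sequence)
    # Walk backward from the final base while the score is below the cutoff.
    while end > 0 and ord(quality[end - 1]) - phred_offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Q40, '#' encodes Q2, and '!' encodes Q0 under Phred+33.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_low_quality_tail(read_seq, read_qual)
print(trimmed_seq)  # ACGTACGT
```

Only the tail is examined here, which matches the observation that quality tends to drop toward the end of a read; low-quality bases in the middle of a read would need a different policy.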
The actual physical space for the data must also be considered. Just as zip files are used to compress large groups of files, compression methods exist to store large volumes of sequencing data and even facilitate analysis. General (gzip, bzip) and specialized (GenCompress, DNACompress, DNABIT) algorithms have been applied to compress large amounts of sequencing data (1). The DNA compression algorithm DNABIT Compress assigns binary bits to segments of DNA bases to compress repetitive and nonrepetitive DNA sequences (2). The binary coding significantly cuts down the file size. A reference-based compression method entails aligning new sequences to a reference genome, then encoding and storing only the differences between the new and reference sequences (3).
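To make the two ideas above concrete, here is a minimal Python sketch. It is not the DNABIT Compress algorithm or any published reference-based codec; it assumes sequences containing only A, C, G, and T, and it assumes the sample is already aligned to the reference with substitutions only (no insertions or deletions). All function names are hypothetical.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack_2bit(sequence):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    # The length is returned so padding can be dropped when unpacking.
    return bytes(packed), len(sequence)


def unpack_2bit(packed, length):
    """Reverse pack_2bit, discarding padding bases beyond `length`."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def diff_against_reference(reference, sample):
    """Store only (position, base) pairs where the sample differs."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference, differences):
    """Rebuild the sample from the reference and the stored differences."""
    bases = list(reference)
    for position, base in differences:
        bases[position] = base
    return "".join(bases)


packed, length = pack_2bit("ACGTAC")
print(len(packed), unpack_2bit(packed, length))  # 2 ACGTAC

reference = "ACGTACGT"
sample = "ACGAACGC"
differences = diff_against_reference(reference, sample)
print(differences)  # [(3, 'A'), (7, 'C')]
print(apply_diff(reference, differences) == sample)  # True
```

Two bits per base reduces a one-byte-per-base text encoding to a quarter of its size, while the reference-based approach stores only the positions that differ, which pays off when a sample closely matches the reference.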
Data storage challenges are not limited to the technical details of how to physically store the sequencing data. Other issues concern the use and protection of individual medical information. The availability of genetic testing for consumers raises ethical concerns and questions regarding the use of stored consumer data. Studies and surveys of genetic testing companies have revealed that some companies may have used consumer data in research efforts, and the existence of policies regarding data sharing was not always apparent (4). This demonstrates the need to develop standards and policies to protect the integrity of stored sensitive consumer DNA sequencing data.
Numerous tools exist for genomic data analysis. They vary according to the algorithms they use, the software and hardware needed to run them, and the programming languages in which they are written. There are also categories of bioinformatics tools based on the type of genome to be analyzed.
The challenge of the sequence analysis phase is to obtain meaningful, medically significant information from data that is now in the terabyte range. Software and apps are available for scientific teams to analyze their own data, and a growing list of companies offer analysis services. Courses and tutorials are also available to help scientists build skills in analyzing their NGS and other data.
Examples of sequence analysis objectives are variant detection, screening for protein-DNA interactions, and discovery of unique transcripts. Pabinger et al. published a survey of bioinformatics tools for variant analysis of NGS data (5). Zomer et al. developed an open-source, web-interface software tool that analyzes TnSeq-derived data to find essential genes (6). The typical process for sequence analysis begins with base calling to obtain raw reads. The reads are then assembled de novo or aligned to a reference. In the case of variant detection, differences between a sample and the associated reference genome are identified.
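As a rough illustration of the variant detection step, the following Python sketch piles up already-aligned reads over a reference and reports positions where a non-reference base dominates. It is a toy under stated assumptions, not how any tool surveyed in (5) works: reads are given as hypothetical (start, sequence) pairs with 0-based, ungapped alignments, and the depth and allele-fraction thresholds are arbitrary illustrative values. Real variant callers also model base quality, mapping quality, and indels.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_depth=3, min_alt_fraction=0.5):
    """Report single-base differences between aligned reads and a reference.

    aligned_reads: iterable of (start, sequence) pairs, 0-based, ungapped.
    Returns (position, ref_base, alt_base, alt_count, depth) tuples.
    """
    # Count the bases observed at each reference position.
    pileup = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call at this position
        ref_base = reference[position]
        for alt_base, count in counts.most_common():
            if alt_base != ref_base and count / depth >= min_alt_fraction:
                variants.append((position, ref_base, alt_base, count, depth))
                break
    return variants


reference = "ACGTACGTAC"
reads = [
    (0, "ACGAAC"),   # carries A at position 3
    (2, "GAACGT"),   # carries A at position 3
    (3, "AACGTAC"),  # carries A at position 3
    (1, "CGTACG"),   # matches the reference at position 3
]
print(call_variants(reference, reads))  # [(3, 'T', 'A', 3, 4)]
```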
Once data has been processed and analyzed, it is necessary to know what the data means for clinical, drug discovery, and other applied bioscience endeavors. Software exists that assists in making inferences from analyzed sequence data to determine medically relevant and actionable information. An example is Sequence to Medical Phenotypes (STMP), an open-source pipeline for clinical interpretation of sequence data. This program allows the determination of genetic drug responses as well as genetic disease risk (7).
The ENCODE (Encyclopedia of DNA Elements) Consortium, funded by the National Human Genome Research Institute, is designed to produce a comprehensive collection of functional elements in the human genome. The goal is to provide genomic information that will help to determine the relationship between DNA sequences and disease development and management. Researchers consult this and other developing databases to determine the role of analyzed sequences in biomedicine.
The impressive and ever-growing availability of informatics tools is proving indispensable in the effort to manage HTS data and apply it to biomedical and clinical efforts. However, numerous challenges remain. Improving the validation of bioinformatics tools and the reproducibility of variant detection are ongoing endeavors. Nevertheless, HTS data has been shown to provide significant information that can be applied to early clinical diagnosis and more successful treatment strategies.
References