Algorithms for the Analysis of Data from Massively-parallel Genome Sequencing

New-generation DNA sequencing technologies are revolutionizing modern biological research. Scientists can now generate the rough equivalent of an entire human genome (~3 billion base pairs of DNA) in just a few days on a single sequencing instrument. Until recently, such amounts of data could only be generated at large genome centers using hundreds of sequencers. The analysis of these data is complicated by their size: a single run of a sequencing instrument yields terabytes of information, often requiring a significant scale-up of the existing computational infrastructure. This project is developing parallel algorithms for analyzing new-generation sequencing data, with a specific focus on the Map-Reduce paradigm as implemented on a highly distributed computing cluster supported by Google and IBM. The project concentrates on algorithms for sequence alignment and sequence assembly, two critical tasks in the analysis of genomic data, and involves adapting string matching and graph algorithms to the Map-Reduce paradigm.
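To make the idea of adapting string matching to Map-Reduce concrete, the following is a minimal single-machine sketch of seed-based read mapping written as explicit map, shuffle, and reduce steps. It is an illustration under stated assumptions, not the project's actual algorithm: the function names (`map_kmers`, `shuffle`, `reduce_seed`, `seed_hits`), the seed length `K`, and the exact-match voting scheme are all hypothetical choices made for this example, and a real deployment would run the steps on a distributed framework rather than in one process.

```python
from collections import defaultdict

K = 4  # seed (k-mer) length; an illustrative value, not taken from the project


def map_kmers(tag, name, seq, k=K):
    """Map step: emit (k-mer, (tag, name, offset)) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], (tag, name, i)


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(kmer, values):
    """Reduce step: pair each read occurrence of a k-mer with each reference
    occurrence and emit the implied start position of the read."""
    refs = [(name, off) for tag, name, off in values if tag == "ref"]
    reads = [(name, off) for tag, name, off in values if tag == "read"]
    for read_name, read_off in reads:
        for _ref_name, ref_off in refs:
            yield read_name, ref_off - read_off


def seed_hits(reference, reads, k=K):
    """Count, for each (read, candidate start), how many shared seeds agree."""
    pairs = list(map_kmers("ref", "ref", reference, k))
    for name, seq in reads.items():
        pairs.extend(map_kmers("read", name, seq, k))
    votes = defaultdict(int)
    for kmer, values in shuffle(pairs).items():
        for read_name, start in reduce_seed(kmer, values):
            votes[(read_name, start)] += 1
    return dict(votes)


if __name__ == "__main__":
    print(seed_hits("ACGTTGCAAGCT", {"r1": "TTGCAA"}))
```

In this toy input, all three seeds of read `r1` agree on reference offset 3, so the sketch reports `{('r1', 3): 3}`. Because each k-mer key is reduced independently, the reduce step is the part that could be spread across many machines.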

This work could lead to parallel genomic analysis software that lets researchers analyze new-generation sequencing data on web-scale computational resources, removing the need to establish and maintain a local high-performance computing infrastructure. The software developed during this project is being made available under an open-source license to encourage broad use and to enable future research. The research is integrated with the teaching and mentoring of graduate and undergraduate students, and the results of the work will be disseminated through journal publications and conference presentations.

Principal Investigators