III: Small: Genome Assembly Using Sparse Sequence Info

Rapid advances in DNA sequencing technologies are giving scientists the ability to decode the genomes of organisms quickly and cost-effectively. Current technologies, however, can only reconstruct a fragmented picture of a genome's chromosomes. Stitching the resulting fragments together into a complete genome currently requires costly and time-intensive laboratory experiments. The goal of this proposal is to develop new computational approaches that combine sequencing data with the data generated by modern high-throughput mapping technologies in order to enable the automated reconstruction of much larger genomic segments, up to whole chromosomes, than is currently possible. The proposed research will be closely integrated with educational activities at the University of Maryland, College Park through the mentoring of undergraduate and graduate students and of a postdoctoral fellow.

Despite tremendous advances over the past 20+ years, both in sequencing technologies and in computational algorithms for genome assembly, the genomes of the majority of organisms cannot be completely reconstructed through fully automated processes. The best sequencing technologies can only "read" up to a few thousand letters at a time, yet most organisms contain millions to billions of letters in their genomes. At the same time, genome assembly is a difficult computational problem, and even the best assembly software can only generate fragmented reconstructions of the genomes being sequenced, primarily due to repetitive sequences found in the genomes of most organisms. The full completion of highly repetitive genomes requires time- and labor-intensive processes that often last multiple years. High-throughput optical mapping technologies provide a promising source of information that could be used to disambiguate genomic repeats and automatically reconstruct much larger segments of an organism's genome than is possible through the sole use of current sequencing data. Optical mapping data describe the relative placement of multiple genomic landmarks (e.g., restriction enzyme recognition sites) along large stretches of a genome, spanning hundreds of thousands of letters and even whole chromosomes. To date, however, there is no algorithmic framework that allows the incorporation of this rich source of information in the assembly process. Specifically, genome assembly can be formulated as a graph traversal problem: finding a path through a complex graph that satisfies the constraints imposed by the data provided to the assembler. Optical mapping data encode a new type of constraint on the possible traversals of a graph, potentially leading to a more complete reconstruction of genomes.
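To make the idea of an optical-map constraint on graph traversal concrete, the following toy sketch in Python may help. It is an illustration under stated assumptions, not the framework proposed here: the contig names and sequences, the repeat multiplicities, the observed fragment sizes, and the tolerance are all invented for the example, contigs are assumed to abut without overlap, and the search is a brute-force enumeration that would not scale to real assembly graphs. The only real-world detail is the recognition site GAATTC (EcoRI).

```python
from typing import Dict, List

def fragment_lengths(seq: str, site: str) -> List[int]:
    """In-silico digest: fragment sizes when cutting at the start of each site."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def matches(predicted: List[int], observed: List[int], tol: int) -> bool:
    """True if both maps have the same number of fragments with sizes within tol."""
    return len(predicted) == len(observed) and all(
        abs(p - o) <= tol for p, o in zip(predicted, observed)
    )

def consistent_paths(graph: Dict[str, List[str]], seqs: Dict[str, str],
                     counts: Dict[str, int], start: str,
                     observed: List[int], site: str, tol: int) -> List[List[str]]:
    """Enumerate traversals that use each contig `counts[c]` times and keep
    those whose predicted restriction map agrees with the observed one."""
    results: List[List[str]] = []
    remaining = dict(counts)
    remaining[start] -= 1

    def dfs(path: List[str]) -> None:
        if all(v == 0 for v in remaining.values()):
            genome = "".join(seqs[n] for n in path)
            if matches(fragment_lengths(genome, site), observed, tol):
                results.append(list(path))
            return
        for nxt in graph.get(path[-1], []):
            if remaining.get(nxt, 0) > 0:
                remaining[nxt] -= 1
                path.append(nxt)
                dfs(path)
                path.pop()
                remaining[nxt] += 1

    dfs([start])
    return results

# Hypothetical toy data: R is a repeat that occurs twice, so the sequence
# data alone cannot tell A-R-B-R-C apart from A-R-C-R-B.
SITE = "GAATTC"
seqs = {"A": "TTGAATTCAA", "R": "CCCCC", "B": "GGGGGGAATTCGG", "C": "AAAAAAA"}
graph = {"A": ["R"], "R": ["B", "C"], "B": ["R"], "C": ["R"]}
counts = {"A": 1, "R": 2, "B": 1, "C": 1}
observed_map = [2, 29, 9]  # invented, noisy fragment sizes

paths = consistent_paths(graph, seqs, counts, "A", observed_map, SITE, tol=2)
print(paths)  # [['A', 'R', 'C', 'R', 'B']]
```

In this toy case the two candidate traversals predict fragment sizes of 2, 18, 20 and 2, 30, 8 respectively, so only the second agrees with the observed map within the tolerance. The exhaustive search is used purely for clarity; it is exponential in general.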

The main goal of this proposal is to develop an algorithmic framework and associated software tools that enable the use of optical mapping and optical sequencing data during the assembly process. It is important to note that constrained graph traversal problems are generally computationally intractable. We propose several heuristic traversal algorithms that can use optical mapping information and are likely to perform well in practice. In addition, computational analyses will be used to determine the combination of parameters for the mapping experiment that generates the data most informative for the assembly process. These computational predictions will be validated in an experimental setting.

In addition to the main research objective, this proposal will directly contribute to the education of future generations of scientists, both through the mentoring of graduate students and postdoctoral fellows and through the continuation of a summer research internship for undergraduate and high school students. The software developed in this proposal, as well as the scientific publications arising from this work, will be freely and broadly disseminated through open licensing.

Principal Investigators