BioFast 2: Mediation Technology for Biological Pipeline Analysis

As scientific discovery shifts away from the traditional wet lab, access to digital data repositories assumes increasing importance for biological pipeline analysis. The proposed research aims to help biologists express their pipelines and access, collect, analyze, and integrate data. Our research is motivated by the following scenario: a biologist wants to explore all available knowledge about a set of cancer-related proteins. While biological databases may be individually well curated and richly interconnected, the tools for expressing complex data collection and analysis protocols are still primitive. Data integration remains a manual, time-intensive, and error-prone process. Although the query process can be automated with scripts, scripts are limited in the biological expertise that they can capture and reuse. We develop computational infrastructure, using XML-based wrapper technology and database/mediation technology, so that a biologist can exploit her domain expertise and express data integration protocols for biological pipelines as complex workflows.

Five intellectual challenges will be addressed:

  • Query interface for scientific exploration: The challenge is developing a high-level workflow-style language with appropriate operators and semantics that allow domain scientists to explore the contents and relationships among sources. The operators and semantics of this language must be at the level of the biologist's procedures and experiments, which may then be translated into lower-level data manipulation operators.
  • Specification of two protein analysis pipelines using the biological query interface: Biologists will evaluate both the results of the pipelines and the usability of the methodology.
  • Properties and semantics of links and paths: The metrics of links and paths may be used to characterize query results, e.g., to predict the cardinalities of query results along some path or to predict overlap of results. These properties are useful to both biologists and data administrators. The challenge is to model, measure and predict interesting metrics and to enrich the semantics of links and paths with domain knowledge.
  • A mediation testbed: The testbed will be implemented using XML-based wrapper technology and mediator technology (IBM DB2 II), both to validate the biological query methodology and to maintain the pipelines efficiently.
  • Cost- and semantics-based optimization for query answering: Answers to explorative queries typically require traversing a multitude of paths among highly inter-linked sources. Each path differs in cost and benefit (result cardinality), making it non-trivial to choose the best path or set of paths. To compound the problem, the results of different paths overlap, so cost and benefit must be considered for combinations of paths.
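
The interplay of path cost, benefit, and overlap described in the challenges above can be illustrated with a small sketch. It is not the project's optimizer: the path names, costs, and result identifiers are hypothetical, overlap is measured with a simple Jaccard ratio, and path selection uses a plain greedy heuristic (pick the affordable path with the most new results per unit cost) as a stand-in for a real cost- and semantics-based strategy.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical model: each path has a positive cost and the set of
# result identifiers (e.g. protein records) it would return.
Path = Tuple[float, Set[str]]


def overlap(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between the result sets of two paths."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_select(paths: Dict[str, Path], budget: float) -> Tuple[List[str], Set[str]]:
    """Greedily pick paths by new results gained per unit cost.

    Because results overlap, the benefit of a path is its marginal
    contribution given the paths already chosen, not its raw cardinality.
    """
    chosen: List[str] = []
    covered: Set[str] = set()
    remaining = budget
    candidates = dict(paths)
    while candidates:
        best_name = None
        best_ratio = 0.0
        for name, (cost, results) in candidates.items():
            if cost > remaining:
                continue  # path no longer fits in the cost budget
            ratio = len(results - covered) / cost
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable adds new results
        cost, results = candidates.pop(best_name)
        chosen.append(best_name)
        covered |= results
        remaining -= cost
    return chosen, covered


# Hypothetical example: three paths between two sources.
example_paths: Dict[str, Path] = {
    "A": (2.0, {"p1", "p2", "p3", "p4"}),
    "B": (1.0, {"p3", "p4"}),
    "C": (3.0, {"p5", "p6", "p7"}),
}

if __name__ == "__main__":
    selected, results = greedy_select(example_paths, budget=5.0)
    print(selected, len(results))
    print(overlap(example_paths["A"][1], example_paths["B"][1]))
```

In this toy setting, path B is cheap but contributes nothing once A has been chosen, which is why benefit has to be evaluated for combinations of paths and not for each path in isolation.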

Participants

  • Zoe Lacroix, Arizona State University, PI
  • Louiqa Raschid, University of Maryland, PI
  • Terry Gaasterland, Rockefeller University, PI
  • Maria Esther Vidal, Universidad Simon Bolivar (Collaborator)
  • Felix Naumann, Humboldt University of Berlin (Collaborator)

A specific example of a pipeline for genomic alignment

LOOKING FOR A GRA OR RESEARCH TOPIC?

Recent papers

A Data Model and Query Language to Explore Enhanced Links and Paths in Life Science Sources. George Mihaila, Felix Naumann, Louiqa Raschid, and Maria Esther Vidal.

Challenges in Selecting Paths for Navigational Queries: Trade-Off of Benefit of Path versus Cost of Plan. M.E. Vidal, L. Raschid, and J. Mestre. Proceedings of the Seventh International Workshop on the Web and Databases (WebDB 2004), in conjunction with ACM SIGMOD.

Querying Web-Accessible Life Science Sources: Which Paths to Choose? Jens Bleiholder, Felix Naumann, Louiqa Raschid, and Maria Esther Vidal. Proceedings of the Workshop on Information Integration on the Web (IIWeb), in conjunction with VLDB 2004.

A Protocol to Extract and Generate Links Capturing Marker Semantics from PubMed to the Human Genome. Alex Lash, Woie-Jyu Lee, and Louiqa Raschid. Under review.

Enhancing the Semantics of Links and Paths in Life Science Sources. S. Heymann, F. Naumann, L. Raschid, and P. Rieger. Workshop on Database Issues in Biological Databases (DBiBD), in conjunction with ICDT 2005.

Project Sponsors

This research is sponsored by the National Science Foundation (NSF) SEIII Program.