BioFast 2: Mediation Technology for Biological Pipeline Analysis
As scientific discovery shifts away from the traditional wet lab, access to
digital data repositories assumes increasing importance for biological pipeline
analysis. The proposed research aims to help biologists express their pipelines
and access, collect, analyze, and integrate data. Our research is motivated
by the following scenario:
A biologist wants to explore all available knowledge about a set of
cancer-related proteins. While biological databases may be individually
well curated and richly interconnected, the tools for expressing complex data
collection and analysis protocols are still primitive. Data integration remains
a manual, time-intensive, and error-prone process. Although the query process
can be automated with scripts, scripts are limited in the biological expertise
they can capture and reuse. We develop computational infrastructure, using
XML-based wrapper technology and database/mediation technology, so that a
biologist can exploit her domain expertise and express data integration
protocols for biological pipelines as complex workflows.
Five intellectual challenges will be addressed:
- Query interface for scientific exploration:
The challenge is to develop a high-level workflow-style language with
appropriate operators and semantics that allow domain scientists to explore
the contents of sources and the relationships among them. The operators and
semantics of this language must be at the level of the biologist's procedures
and experiments, which may then be translated into lower-level data
manipulation operators.
- The specification of two protein analysis pipelines using the biological
query interface. Biologists will evaluate both the results of the pipelines
and the usability of the methodology.
- Properties and semantics of links and paths:
The metrics of links and paths may be used to characterize query results,
e.g., to predict the cardinalities of query results along some path or to
predict overlap of results. These properties are useful to both biologists
and data administrators. The challenge is to model, measure and predict
interesting metrics and to enrich the semantics of links and paths with
domain knowledge.
- A mediation testbed implemented using XML-based wrapper technology and
mediator technology (IBM DB2 II) to validate the biological query
methodology and to support the efficient maintenance of the pipelines.
- Cost- and semantics-based optimization for query answering:
Answers to exploratory queries typically require traversing a multitude of
paths among highly inter-linked sources. Each path differs in cost and
benefit (result cardinality), making it non-trivial to choose the best path
or set of paths. To compound the problem, the results of different paths
overlap, so cost and benefit must be considered for combinations of paths.
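The interaction between path overlap and path selection described in the last challenge can be illustrated with a small sketch. It is not the project's optimizer: it swaps in a simple greedy heuristic that repeatedly picks the affordable path with the most new results per unit cost. The path names, costs, result identifiers, and the functions `overlap` and `select_paths` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical model: each path has a cost and the set of result ids it returns.
Path = Tuple[float, FrozenSet[str]]


def overlap(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard overlap between the result sets of two paths."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_paths(paths: Dict[str, Path], budget: float) -> Tuple[List[str], Set[str]]:
    """Greedy heuristic (an assumption, not the project's method): repeatedly
    pick the affordable path that adds the most new results per unit cost."""
    chosen: List[str] = []
    covered: Set[str] = set()
    remaining = budget
    candidates = dict(paths)
    while True:
        best_name, best_ratio = None, 0.0
        for name, (cost, results) in candidates.items():
            if cost <= 0 or cost > remaining:
                continue  # skip invalid or unaffordable paths
            # Benefit counts only results not already returned by chosen paths.
            ratio = len(results - covered) / cost
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable adds new results
        cost, results = candidates.pop(best_name)
        chosen.append(best_name)
        covered |= results
        remaining -= cost
    return chosen, covered


# Made-up example data: three paths between two sources.
EXAMPLE_PATHS: Dict[str, Path] = {
    "A": (2.0, frozenset({"p1", "p2", "p3", "p4"})),
    "B": (2.0, frozenset({"p3", "p4", "p5"})),
    "C": (3.0, frozenset({"p6", "p7", "p8", "p9"})),
}

if __name__ == "__main__":
    chosen, covered = select_paths(EXAMPLE_PATHS, budget=5.0)
    print(chosen, len(covered))
```

Taken in isolation, path B has a better benefit-to-cost ratio than path C, but most of its results overlap with those of path A. Once A is chosen, the heuristic prefers C, which is why cost and benefit have to be evaluated for combinations of paths rather than path by path.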
Participants
- Zoe Lacroix, Arizona State University, PI
- Louiqa Raschid, University of Maryland, PI
- Terry Gaasterland, Rockefeller University, PI
- Maria Esther Vidal, Universidad Simon Bolivar (Collaborator)
- Felix Naumann, Humboldt University of Berlin (Collaborator)
- A specific example of a pipeline for genomic alignment
LOOKING FOR A GRA OR RESEARCH TOPIC?
Recent papers
- A Data Model and Query Language to Explore Enhanced Links and Paths in Life Science Sources. George Mihaila, Felix Naumann, Louiqa Raschid, and Maria Esther Vidal.
- Challenges in Selecting Paths for Navigational Queries: Trade-Off of Benefit of Path versus Cost of Plan. Vidal, M.E., Raschid, L., and Mestre, J. Proceedings of the Seventh International Workshop on the Web and Databases (WebDB 2004), in conjunction with ACM SIGMOD.
- Querying Web-Accessible Life Science Sources: Which paths to choose? Jens Bleiholder, Felix Naumann, Louiqa Raschid, and Maria Esther Vidal. Proceedings of the Workshop on Information Integration on the Web (IIWeb), in conjunction with VLDB 2004.
- A Protocol to Extract and Generate Links Capturing Marker Semantics from PubMed to the Human Genome. Alex Lash, Woie-Jyu Lee, and Louiqa Raschid. Under review.
- Enhancing the Semantics of Links and Paths in Life Science Sources. S. Heymann, F. Naumann, L. Raschid, and P. Rieger. Workshop on Database Issues in Biological Databases (DBiBD), in conjunction with ICDT 2005.
Project Sponsors
This research is sponsored by the National Science Foundation (NSF) SEIII Program.