BioFast: Efficient and Seamless Life Science Data Management
The amount of data on the web that is available to the biological
enterprise has grown exponentially. This wealth of complex and
diverse data presents significant opportunities, and significant
challenges, for data integration and seamless access to these sources.
Our research will apply our prior expertise in developing data
integration architectures based on wrappers and mediators to provide
seamless access to heterogeneous Web-accessible sources. We will
develop techniques drawing on areas such as query optimization,
adaptive query evaluation, machine learning, and schema mapping and
integration of heterogeneous databases to solve data integration
problems for biological data sources. Our research will address the
following tasks:
- Seamless and Efficient Access: The task of query optimization
becomes increasingly complex as we consider the Logical Map of
scientific classes, e.g., genes, and the Physical Map of multiple
alternate sources, multiple links between sources, and diverse query
processing or domain-specific computational capabilities. We extend our
prior expertise in Web Query Optimization (WQO) to the biological
domain to select sources and capabilities and to generate efficient
query execution plans. We use a simple regular-expression-based query
language to explore the search space of alternate physical paths
through the Physical Map.
- Entity Identification: A scientific entity instance, e.g., the gene
TP53, may be described in multiple autonomous sources. Multi-strategy
learning at the data source, schema, domain and instance levels uses
the concept of an Identity Link to identify equivalent identifiers for
the same entity instance.
- Logical Link (Path) Characterization: A biologist is interested
in logical links (paths) between scientific objects in the Logical Map,
e.g., a sequence and a publication that studies the sequence. There may
be multiple independent physical implementations of these links between
different sources in the Physical Map. We define parameterized logical
links (logical paths) between data sources and define (physical)
properties to describe these links. A topology of links (paths) based
on their properties and concepts of Logical Link (Path) Equivalence
will be developed.
- An interesting application of our research is Cancer Related
Target Proteins: Identification and Complete Characterization.
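
The idea of exploring alternate physical paths with a regular
expression over scientific classes can be illustrated with a minimal
sketch. It is not the BioFast query language or optimizer. The source
names (GeneDB, SeqDB, ProtDB, PubDB), the links between them, and the
function physical_paths are all hypothetical, and the sketch assumes
that each source implements exactly one logical class. A query is
written as a regular expression over a space-separated sequence of
class names, and every acyclic path of sources whose class sequence
matches the expression is returned.

```python
import re

# Hypothetical Physical Map (illustrative names only): each source
# implements one logical class, and LINKS lists the outgoing links.
SOURCE_CLASS = {
    "GeneDB": "gene",
    "SeqDB": "sequence",
    "ProtDB": "protein",
    "PubDB": "publication",
}
LINKS = {
    "GeneDB": ["SeqDB", "ProtDB"],
    "SeqDB": ["PubDB", "ProtDB"],
    "ProtDB": ["PubDB"],
    "PubDB": [],
}


def physical_paths(pattern, max_len=4):
    """Enumerate acyclic source paths whose class sequence matches pattern."""
    regex = re.compile(pattern)
    results = []

    def extend(path):
        # Translate the physical path into its logical class sequence.
        classes = " ".join(SOURCE_CLASS[s] for s in path)
        if regex.fullmatch(classes):
            results.append(list(path))
        if len(path) < max_len:
            for nxt in LINKS[path[-1]]:
                if nxt not in path:  # keep paths acyclic
                    extend(path + [nxt])

    for source in SOURCE_CLASS:
        extend([source])
    return results


# Logical query: from a gene to a publication through any classes.
paths = physical_paths(r"gene( \w+)* publication")
for p in paths:
    print(" -> ".join(p))
```

Each returned path is one candidate physical implementation of the
same logical link. A query optimizer would then have to choose among
such candidates, for example using the physical properties of the
links involved; that selection step is not shown here.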