Web-accessible data sources contain an abundance of data on scientific objects such as genes, proteins, sequences, and citations. Biologists typically explore these sources by navigating links between entries (objects) in the data sources, as well as paths (informally, concatenations of links). While these links carry rich semantics that are often well understood by the scientist, the link itself does not explicitly capture or represent that meaning. Consequently, scientists spend significant time following links only to reject many of the data entries they reach. The lack of explicit meaning also limits the sharing of this knowledge among groups of scientists who do not share a specialization. Finally, automated tools such as scripts or mediators that may be used for data gathering and data integration are limited, since they have no knowledge of the implicit semantics.
Links between entries in the sources are created for many different reasons. Biologists use links to capture new discoveries from an experiment or study, whereas data curators add links to augment, complete, or make consistent the knowledge captured across multiple sources. For example, a result reported in a paper in PubMed may lead a curator to insert a link from a data entry in, say, OMIM to this publication in PubMed. Algorithms insert links automatically when they discover similarities between two data items, e.g., to represent sequence similarity following a BLAST search. Manually curated links added by record originators or curators are generally inserted into the database record itself, whereas algorithmically generated links are generally kept in a separate linking table. Thus, the simple unlabeled physical links that are in use today are insufficient to represent such subtle and diverse relationships.
We have addressed this problem by developing a methodology for e-links (enhanced links) between entries in sources. An e-link enhances an existing link with a label that states its meaning. We further develop a data model and a query language that can exploit e-links while traversing paths through the data sources. Contributions of this research include the following: (1) A methodology that includes information extraction, link generation, and link labeling to enhance the semantics of e-links. (2) An extended example of e-link extraction and labeling, in which we enhance the link from PubMed entries to markers in the human genome. (3) A proof-of-concept prototype comprising the extraction protocol, a hierarchy of link labels, and an experiment on machine-assisted labeling of links.
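To make the idea of a labeled link concrete, the following is a minimal sketch in Python of how an e-link and a label-constrained path traversal might be represented. It is not the project's data model or query language; the class names (`Entry`, `ELink`, `ELinkGraph`), the `follow` method, the label strings, and the entry identifiers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One data entry (object) in a source. Identifiers below are made up."""
    source: str
    entry_id: str


@dataclass(frozen=True)
class ELink:
    """A link between two entries, enhanced with an explicit label."""
    origin: Entry
    target: Entry
    label: str  # hypothetical label vocabulary, not the project's hierarchy


class ELinkGraph:
    """Stores e-links and follows paths whose links carry the given labels."""

    def __init__(self):
        self._outgoing = defaultdict(list)

    def add(self, link):
        self._outgoing[link.origin].append(link)

    def follow(self, start, labels):
        """Return the entries reached from `start` by following one e-link
        per label in `labels`, in order. Links with other labels are skipped."""
        frontier = [start]
        for label in labels:
            frontier = [
                link.target
                for entry in frontier
                for link in self._outgoing.get(entry, [])
                if link.label == label
            ]
        return frontier


# Illustrative data only: an OMIM entry linked to a PubMed publication,
# which in turn is linked to two markers with different (assumed) labels.
omim = Entry("OMIM", "omim-1")
paper = Entry("PubMed", "pmid-1")
marker_a = Entry("HumanGenome", "marker-a")
marker_b = Entry("HumanGenome", "marker-b")

graph = ELinkGraph()
graph.add(ELink(omim, paper, "reported_in"))
graph.add(ELink(paper, marker_a, "describes_marker"))
graph.add(ELink(paper, marker_b, "mentions_marker"))

reached = graph.follow(omim, ["reported_in", "describes_marker"])
```

With unlabeled links, a traversal from the OMIM entry would reach both markers and leave the scientist to reject the irrelevant one by hand. With labels, the path can be restricted up front, so in this sketch `reached` contains only `marker_a`.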
A Data Model and Query Language to Explore Enhanced Links and Paths in Life Science Sources. George Mihaila and Felix Naumann and Louiqa Raschid and Maria Esther Vidal. PDF
Challenges in Selecting Paths for Navigational Queries: Trade-Off of Benefit of Path versus Cost of Plan.
Querying Web-Accessible Life Science Sources: Which paths to choose? Jens Bleiholder, Felix Naumann, Louiqa Raschid, Maria Esther Vidal. PDF Proceedings of the Workshop on Information Integration on the Web (IIWeb), in conjunction with VLDB 2004.
A Protocol to Extract and Generate Links Capturing Marker Semantics from PubMed to the Human Genome. Alex Lash and Woie-Jyu Lee and Louiqa Raschid. PDF Under review.
Enhancing the Semantics of Links and Paths in Life Science Sources. S. Heymann, F. Naumann, L. Raschid and P. Rieger. PDF Workshop on Database Issues in Biological Databases (DBiBD), in conjunction with ICDT 2005.
This research is sponsored by the National Science Foundation (NSF) SEIII Program.