III EAGER: Exploratory Research on the Annotated Biological Web

The life science research community generates an abundance of data on genes, proteins, sequences, etc. These are captured in publicly available resources such as Entrez Gene, PDB and PubMed and in focused collections such as TAIR and OMIM. A number of ontologies such as GO, PO and UMLS are in use to increase interoperability. Records in these resources are typically annotated with controlled vocabulary (CV) terms from one or more ontologies. Records are often hyperlinked to those in other repositories, creating a richly curated biological Web of semantic knowledge.

The objective of this project is to develop tools to explore and mine this rich Web of annotated and hyperlinked entries so as to discover meaningful patterns. The approach builds upon finding potentially meaningful and novel associations between pairs of CV terms across multiple ontologies. The bridge of associations across ontologies reflects annotation practices across repositories. A variety of graph data mining and network analysis techniques are being explored to find complex patterns among groups of CV terms across multiple ontologies. The intent is to identify biologically meaningful associations that yield nuggets of actionable knowledge, to be made available to the scientist together with a set of golden publications that support the identified patterns.
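As a rough illustration of what finding associations between pairs of CV terms can look like, the sketch below counts how often a term from one ontology and a term from another annotate the same record, then scores each pair with a hypergeometric tail probability. This is a minimal sketch under stated assumptions, not the project's method: the record layout, the function names `hypergeom_tail` and `associated_pairs`, the significance threshold, and the choice of the hypergeometric test are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import comb


def hypergeom_tail(k, n_records, n_a, n_b):
    """P(X >= k), where X is the number of records carrying both terms
    if term b were assigned to n_b of the n_records uniformly at random
    and term a annotates n_a of them."""
    upper = min(n_a, n_b)
    total = comb(n_records, n_b)
    hits = sum(
        comb(n_a, i) * comb(n_records - n_a, n_b - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def associated_pairs(records, alpha=0.05):
    """Return (term_a, term_b, co_count, p_value) for pairs with p <= alpha.

    `records` is a list of (terms_a, terms_b) tuples, where each element
    is the set of CV terms from one ontology annotating that record.
    This layout is hypothetical.
    """
    n = len(records)
    count_a, count_b, count_ab = Counter(), Counter(), Counter()
    for terms_a, terms_b in records:
        count_a.update(terms_a)
        count_b.update(terms_b)
        # Every cross-ontology pair on the same record co-occurs once.
        count_ab.update(product(terms_a, terms_b))

    results = []
    for (a, b), k in count_ab.items():
        p = hypergeom_tail(k, n, count_a[a], count_b[b])
        if p <= alpha:
            results.append((a, b, k, p))
    # Most significant pairs first.
    return sorted(results, key=lambda r: r[3])
```

Called on a list of annotated records, `associated_pairs` returns the candidate term pairs ordered by p-value. The sketch applies no multiple-testing correction, so with real ontologies, where the number of candidate pairs is large, the threshold would need to be adjusted before any pair is treated as meaningful.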

The intellectual merit of the project lies in how it differs from other bioinformatics data integration and analysis projects: data is integrated from numerous sources, including genes, gene annotations, ontologies, and the literature. The research is exploratory (EAGER) with respect to both the biological and the computer science disciplines. From the biological viewpoint, a high level of speculation is associated with any discovered biological patterns. Discovered patterns might not necessarily meet criteria for experimental validation. The research methodology combines algorithmic and analytical techniques from multiple computer science sub-disciplines. While specific technical innovations are expected, an inter-related set of computer science challenges needs to be defined. This research has the potential for broader impact since the methodology can be applied to any type of interlinked resources on the biological semantic Web as well as to any collection of hyperlinked resources.

Interesting Links

  • LSLINKS - A searchable database of annotated hyperlinked data between Entrez Gene, PubMed and OMIM and a data mining tool to discover significant pairs of terms between two ontologies.
  • GeneDocs

Project Sponsors

This award is sponsored by the National Science Foundation CISE III Program.

Participants

Recent papers

  • Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs. Barna Saha and Allie Hoch and Samir Khuller and Louiqa Raschid and Xiao-Ning Zhang. Please email Louiqa (louiqa at umiacs dot umd dot edu) for a copy of the paper.
  • Ranking Target Objects of Navigational Queries. Yao Wu, Louiqa Raschid, Maria Esther Vidal, Panayiotis Tsaparas, Woei-Jyh Lee, Padmini Srinivasan and Aditya Sehgal. Proceedings of the Workshop on Information and Data Management, p. 1, vol. 1, (2006). PDF.
  • Using Annotations from Controlled Vocabularies to Find Patterns in LSLinks. Woei-Jyh Lee and Louiqa Raschid and Padmini Srinivasan and Nigam Shah and Daniel Rubin and Natasha Noy. Proceedings of the Conference on Data Integration for the Life Sciences, p. 1, vol. 1, (2007). PDF.
  • Mining Meaningful Associations from Annotations in Life Science Data Resources. Lee, Woei-Jyh; Raschid, Louiqa; Srinivasan, Padmini. Proceedings of the Conference on Data Integration for the Life Sciences, p. 1, vol. 1, (2008). Winner of the Swiss Institute of Bioinformatics Best Paper Award. PDF.
  • Flexible and Efficient Querying and Ranking on Hyperlinked Data Sources. Varadarajan, Ramakrishna; Hristidis, Vagelis; Raschid, Louiqa; Vidal, Maria Esther; Rodriguez, Hector; Ibanez, Luis. Proceedings of the EDBT Conference, p. 1, vol. 1, (2009). PDF.