BIP on a Mediator Platform
The BIP Toolbox is currently being developed for use on a mediator platform. Mediator technology offers several important advantages over a non-mediated solution.
Why a mediator platform?
BIP, like other bioinformatics pipelines, depends on the chosen data sources and target objects (sequence alignment, gene finding, etc.).
The available data sources are many, yet they exhibit a considerable amount of overlap. The choice of data sources used for a particular BIP run should be left up to the scientist, not the pipeline design.
Also, there are often alternative tools that can be used for various processing tasks within the pipeline. There is currently little support for providing the desired degree of flexibility in the resource and tool implementations.
Additionally, support is needed to properly evaluate costs and benefits associated with the chosen implementation and options.
A mediator platform provides these benefits:
- Workflow alternatives:
- Data sources. The user can select different data sources or combinations thereof for genome and transcript data.
- Parameters. Operating parameters, such as quality filters, can be fine-tuned according to the scientist's needs.
- Alternate implementations. There may be several tools available to perform a function in the pipeline. For example, the BIP pipeline uses a BLAT and SIM4 combination for sequence alignment. A different tool, such as GeneWise, may be used instead.
- Cost analysis. Remote and local cost metrics are used to determine the costs associated with each pipeline implementation, from the data sources chosen to the tools used and their parameters.
- Benefits. The key metric for determining the benefit of any one implementation is result cardinality, i.e., how much input data makes its way into the final data object. Different workflow alternatives and alternate implementations provide different benefits.
Using the mediator platform, a scientist can weigh the costs against the benefits. Cost analysis on a non-mediated platform is much more difficult and not always reliable.
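The trade-off described above can be made concrete with a small sketch. The Python below is illustrative only and is not part of BIP or of the mediator platform: the class PipelineAlternative, the functions benefit_per_cost and rank_by_benefit_per_cost, the cost units, and all numbers are assumptions made for the example. It treats benefit as result cardinality expressed as the fraction of input records that reach the final data object, which is one plausible reading of the metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineAlternative:
    """One hypothetical way of running the pipeline (names are illustrative)."""
    name: str
    data_sources: tuple      # e.g. ("local_genome", "local_transcripts")
    aligner: str             # e.g. "BLAT+SIM4" or "GeneWise"
    estimated_cost: float    # combined remote + local cost, arbitrary units
    input_records: int       # records fed into the pipeline
    output_records: int      # records that reach the final data object

    @property
    def benefit(self) -> float:
        """Result cardinality as the fraction of input that survives."""
        if self.input_records == 0:
            return 0.0
        return self.output_records / self.input_records


def benefit_per_cost(alt: PipelineAlternative) -> float:
    """Benefit obtained per unit of cost; cost must be positive."""
    if alt.estimated_cost <= 0:
        raise ValueError("estimated_cost must be positive")
    return alt.benefit / alt.estimated_cost


def rank_by_benefit_per_cost(alternatives):
    """Return the alternatives with the best benefit-to-cost ratio first."""
    return sorted(alternatives, key=benefit_per_cost, reverse=True)


# Made-up figures, used only to exercise the functions above.
alternatives = [
    PipelineAlternative("blat_sim4_local", ("local_genome", "local_transcripts"),
                        "BLAT+SIM4", 10.0, 1000, 800),
    PipelineAlternative("genewise_remote", ("remote_genome", "remote_transcripts"),
                        "GeneWise", 40.0, 1000, 900),
]
ranked = rank_by_benefit_per_cost(alternatives)
```

A single ratio is only one way to combine the two measures. A scientist might instead fix a cost budget and choose the alternative with the highest result cardinality within it.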
Which platform is being used?
We chose IBM's WebSphere Information Integrator as the mediator platform. WebSphere Information Integrator (WS II) provides federation of data from disparate data sources, making them appear to the front end as a single large RDBMS. The SQL query language is supported over all of the data sources, even if the sources' native search capabilities fall short of full-featured SQL. Non-SQL search capabilities of these data sources are also made available by WS II.
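To illustrate what "a single large RDBMS" means in practice, the following sketch uses Python's built-in sqlite3 module as a stand-in. It is not WS II and does not use any WS II API: two independent SQLite databases are attached to one connection so that a single SQL statement can join across them. The table names, column names, and rows are invented for the example.

```python
import sqlite3


def build_federation():
    """Create one connection that exposes two separate databases."""
    conn = sqlite3.connect(":memory:")  # first source, schema name "main"
    # Second, independent in-memory database attached under its own schema name.
    conn.execute("ATTACH DATABASE ':memory:' AS transcripts_db")

    conn.execute(
        "CREATE TABLE main.genome_seq (seq_id TEXT PRIMARY KEY, organism TEXT)"
    )
    conn.execute(
        "CREATE TABLE transcripts_db.transcript "
        "(tx_id TEXT PRIMARY KEY, seq_id TEXT, quality REAL)"
    )
    conn.executemany(
        "INSERT INTO main.genome_seq VALUES (?, ?)",
        [("chr1", "human"), ("chrA", "mouse")],
    )
    conn.executemany(
        "INSERT INTO transcripts_db.transcript VALUES (?, ?, ?)",
        [("tx1", "chr1", 0.95), ("tx2", "chr1", 0.40), ("tx3", "chrA", 0.90)],
    )
    return conn


def transcripts_for(conn, organism, min_quality):
    """One SQL query spanning both sources, with a quality filter parameter."""
    sql = """
        SELECT t.tx_id, g.seq_id, t.quality
        FROM main.genome_seq AS g
        JOIN transcripts_db.transcript AS t ON t.seq_id = g.seq_id
        WHERE g.organism = ? AND t.quality >= ?
        ORDER BY t.tx_id
    """
    return conn.execute(sql, (organism, min_quality)).fetchall()


conn = build_federation()
human_hits = transcripts_for(conn, "human", 0.5)
```

The caller writes one query and never needs to know that the two tables live in different databases, which is the property the mediator provides across far more heterogeneous sources.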
WS II is general-purpose database integration software, but it also includes wrappers for sources specific to the bioinformatics domain. Wrappers currently available or coming soon provide access to relational data sources (Oracle, Microsoft SQL Server, Sybase), ODBC (e.g., MS Access, MySQL), XML documents, Web Services using WSDL (e.g., many NCBI databases, EMBL, DDBJ), column-delimited text files, Venetica VeniceBridge (for text search engines like Google, Lotus Notes, and Documentum), and scripts that output XML. Bioinformatics-specific wrappers and functions include BLAST; the NCBI Entrez E-utilities with schemas for PubMed, GenBank Nucleotide, and OMIM; HMMER (hmmpfam and hmmsearch); GeneWise; KEGG; and BioRS. Extensibility is provided in two ways: the script and web services wrappers make it easy to incorporate an arbitrary data source into a WS II federation, and wrapper development kits in C++ and Java enable users to write their own wrappers.
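As a rough idea of what a script that outputs XML might look like, the sketch below converts tabular results into an XML document with Python's standard xml.etree.ElementTree module. The element names (results, hit) and the sample fields are hypothetical. The XML layout that the WS II script wrapper expects is not described here, so a real script would need to follow the wrapper's own documentation.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, root_tag="results", row_tag="hit"):
    """Serialize a list of dicts as XML, one child element per row.

    Each dict key becomes an element name, so keys must be valid XML names.
    """
    root = ET.Element(root_tag)
    for row in rows:
        hit = ET.SubElement(root, row_tag)
        for field, value in row.items():
            ET.SubElement(hit, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Invented alignment-style rows, used only to exercise the function.
example_xml = rows_to_xml([
    {"query": "tx1", "target": "chr1", "score": 97},
    {"query": "tx2", "target": "chr1", "score": 64},
])
```

Any tool that can print its results in a predictable XML shape could, in principle, be wrapped this way without writing a full wrapper in C++ or Java.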