Relational, XML and IR systems are converging to text-enabled search in entity-relation (ER) graphs (V,E) where nodes V are typed entities (email, paper, person, conference, company) and edges E are typed relations (wrote, cited, works-for). ER-graphs also arise when sentences are parsed, or named entities annotated, and then these annotations are connected to lexical networks and ontologies.
Queries involve uninterpreted strings and structure restrictions. We may wish to rank mentions of instances of a given type (distance) based on their textual proximity to query keywords (Rome Helsinki). Such named-entity searches are crude but effective filters for simple questions like "what is the distance between Rome and Helsinki". Or, in a desktop or enterprise search setting, we may wish to rank potential reviewers for a paper, based on proximity along diverse paths between the paper under consideration and people in our emails and address books.
We are building public-domain graph database systems that can deal with both forms of proximity search. I will focus on two aspects of the work. The first is the design of broad families of parameterized ranking functions suited to ER graphs, which can be trained effectively from relevance judgments. The second is the design of new kinds of workload-cognizant indices to support graph proximity queries, while striking useful trade-offs between index space and query execution cost. I will describe new techniques to estimate query-processing cost and cost-driven index compaction based on query logs.
Joint work with Alekh Agarwal, Manish Gupta, Amit Pathak, Kriti Puniyani, Sujatha Das. More infomation is available at http://www.cse.iitb.ac.in/~soumen/doc/netrank/ and http://www.cse.iitb.ac.in/~soumen/doc/www2006i/. Partly supported by IBM Research and Microsoft Research.
Soumen Chakrabarti received his B.Tech in Computer Science from the Indian Institute of Technology, Kharagpur, in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley in 1992 and 1996. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message passing multiprocessors.
He was a Research Staff Member at IBM Almaden Research Center from 1996 to 1999, where he worked on the Clever Web search project and led the Focused Crawling project. In 1999 he joined the Department of Computer Science and Engineering at the Indian Institute of Technology, Bombay, where he has been an Associate professor since 2003. In Spring 2004 he was Visiting Associate professor at Carnegie-Mellon University.
He has published in the WWW, SIGIR, SIGKDD, SIGMOD, VLDB, ICDE, SODA, STOC, SPAA and other conferences as well as Scientific American, IEEE Computer, VLDB and other journals. He holds eight US patents on Web-related inventions. He has served as technical advisor to search companies and vice-chair or program committee member for WWW, SIGIR, SIGKDD, VLDB, ICDE, SODA and other conferences, and guest editor or editorial board member for DMKD and TKDE journals. He is also author of a book on Web Mining.
His current research interests include integrating, searching, and mining text and graph data models, exploiting types and relations in search, and Web graph and popularity analysis. More information is available at http://www.cse.iitb.ac.in/~soumen/.
This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.