Content Diffusion on the Web Graph

Dragomir R. Radev

University of Michigan, Ann Arbor


UMIACS Computational Linguistics Colloquium

March 3, 2004, 11:00am, AVW Room 2120


It is known that content words tend to be self-triggers: the probability that a content word appears again in a document, given that it has already appeared once, is significantly higher than the probability of its first occurrence. We look at self-triggering across hyperlinks on the Web and show that the probability that a word $w_j$ appears in a Web document $d_i$ depends on the presence of $w_j$ in documents pointing to $d_i$.

We back our claim with two experiments, on single-word and two-word queries drawn from the query log of a major commercial search engine, and discuss the implications of the findings for Document Modeling and Information Retrieval. In Document Modeling, we propose a correction factor, $R$, which indicates how much more likely a word is to appear in a document given that another document containing the same word is linked to it. In Information Retrieval, we suggest the concept of a document closure, which can be computed automatically to identify words that are not present in a document but, by virtue of their presence in documents linked to it, should be added to its index.
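As a rough illustration of the two ideas above, the following Python sketch estimates a ratio in the spirit of $R$ and computes a naive document closure on a toy link graph. Everything here is an assumption made for illustration and is not taken from the talk: the function names, the representation of documents as sets of words, the reading of "linked to it" as in-links, and the choice to estimate $R$ as the conditional document frequency divided by the overall document frequency.

```python
from collections import defaultdict


def correction_factor(docs, links, word):
    """Estimate P(word in d | an in-linking doc has word) / P(word in d).

    docs:  dict mapping a document id to its set of words (toy representation)
    links: iterable of (source, target) hyperlink pairs
    Returns None when there is too little evidence to form the ratio.
    """
    inlinks = defaultdict(set)
    for src, dst in links:
        inlinks[dst].add(src)

    has_word = {d for d, words in docs.items() if word in words}
    # Documents with at least one in-linking document that contains the word.
    exposed = {d for d in docs if any(s in has_word for s in inlinks[d])}

    if not has_word or not exposed:
        return None
    p_base = len(has_word) / len(docs)
    p_cond = len(has_word & exposed) / len(exposed)
    return p_cond / p_base


def document_closure(docs, links, doc_id):
    """Words absent from doc_id but present in documents that link to it."""
    neighbours = {src for src, dst in links if dst == doc_id}
    borrowed = set()
    for n in neighbours:
        borrowed |= docs.get(n, set())
    return borrowed - docs[doc_id]


# Hypothetical toy collection: four pages and three hyperlinks.
toy_docs = {
    "a": {"radev", "web"},
    "b": {"web", "graph"},
    "c": {"graph"},
    "d": {"web"},
}
toy_links = [("a", "b"), ("a", "d"), ("c", "a")]

r_web = correction_factor(toy_docs, toy_links, "web")
closure_b = document_closure(toy_docs, toy_links, "b")
```

A value of the ratio above 1 would suggest that a word is more likely to appear in a document when a linking document contains it. A practical closure would presumably filter or weight the borrowed words rather than take them all, as this sketch does.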


About the Speaker:

 

Dragomir R. Radev is an Assistant Professor of Information, Electrical Engineering and Computer Science, and Linguistics at the University of Michigan, Ann Arbor. He holds a Ph.D. in Computer Science from Columbia University. Before joining Michigan, he was a Research Staff Member at IBM's TJ Watson Research Center in Hawthorne, NY. He is the author of more than 45 papers on text summarization, question answering, machine translation, text generation, information extraction, and information retrieval. Dr. Radev's current research on probabilistic and link-based methods for exploiting very large textual repositories, representing and acquiring knowledge of genome regulation, and semantic entity and relation extraction from Web-scale text document collections is supported by NSF and NIH. He serves on the HLT-NAACL advisory committee, was recently reelected as treasurer of NAACL, is a member of the editorial boards of JAIR and Information Retrieval, and is a four-time finalist at the ACM programming finals (as a contestant in 1993 and as a coach in 1995-1997). Additional information is available at http://tangra.si.umich.edu/clair/.

 

For the colloquium series schedule, see http://www.umiacs.umd.edu/research/CLIP/colloq/. If you are interested in meeting with the speaker, please contact Doug Oard (http://www.glue.umd.edu/~oard/, oard@umiacs.umd.edu).