Content Diffusion on the Web Graph
Dragomir R. Radev
It is known that content words tend to be self-triggers, that is, the probability
of a content word appearing more than once in a document, given that it already
appears once, is significantly higher than the probability of the first
occurrence. We look at self-triggerability across hyperlinks on the Web. We show
that the probability of a word $w_j$ appearing in a Web document $d_i$ depends on
the presence of $w_j$ in documents pointing to $d_i$.
We back our claim with two experiments on single-word and two-word queries from
the query log of a major commercial search engine and discuss the implications of
the findings for Document Modeling and Information Retrieval. In Document
Modeling, we propose a correction factor, $R$, which indicates how much more
likely a word is to appear in a document given that another document containing
the same word is linked to it. In Information Retrieval, we suggest the concept
of a document closure, which can be computed automatically to identify words that
are not present in a document but, by virtue of their presence in documents
linked to it, should be added to its index.
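
As a rough illustration of the two ideas above, the following sketch estimates a
correction factor and a naive document closure on a toy link graph. It is not the
speaker's method: the exact form of $R$ is assumed here to be the ratio
P(word in d | some document linking to d contains the word) / P(word in d), the
closure is assumed to borrow every word from in-linking documents without any
filtering, and the names correction_factor, closure_terms, docs, and links are
hypothetical.

```python
def correction_factor(docs, links, word):
    """Assumed form of R: P(word in d | an in-linking doc has word) / P(word in d).

    docs  maps a document id to the set of words it contains.
    links is a list of (source, target) hyperlink pairs.
    Returns None when the ratio is undefined.
    """
    # Baseline: fraction of all documents that contain the word.
    base = sum(word in words for words in docs.values()) / len(docs)
    # Documents pointed to by at least one document containing the word.
    targets = {dst for src, dst in links if word in docs[src]}
    if not targets or base == 0:
        return None
    # Fraction of those link targets that contain the word themselves.
    cond = sum(word in docs[dst] for dst in targets) / len(targets)
    return cond / base


def closure_terms(docs, links, doc_id):
    """Words absent from doc_id but present in documents that link to it."""
    borrowed = set()
    for src, dst in links:
        if dst == doc_id:
            borrowed |= docs[src]
    return borrowed - docs[doc_id]


# Toy data, invented purely for illustration.
docs = {
    "a": {"web", "graph"},
    "b": {"web", "search"},
    "c": {"graph"},
    "d": {"music"},
}
links = [("a", "b"), ("c", "d")]

print(correction_factor(docs, links, "web"))
print(closure_terms(docs, links, "b"))
```

A value of $R$ above 1 on real data would mean that a link from a document
containing the word makes the word more likely in the target; a practical closure
would presumably keep only borrowed words whose estimated $R$ is high enough.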
About the Speaker:
Dragomir R. Radev is an Assistant
Professor of Information, Electrical Engineering and Computer Science, and
Linguistics at the
For the colloquium series schedule, see the UMD Computational
http://www.umiacs.umd.edu/research/CLIP/colloq/. If you are interested in
meeting with the speaker, please contact Doug Oard
<http://www.glue.umd.edu/~oard/> (oard@umiacs.umd.edu).