The CLIP Colloquium Series presents...


Using Sketches to Estimate Associations

Ken Church (Microsoft Research)
April 5, 2006, 11:00am, AVW 2120

Slides

Suppose one wanted to use page hits to estimate associations or mutual information, as Turney, Etzioni and others have done. We should not have to look at the entire corpus (e.g., the Web) to know if two (or more) words are associated or not. A powerful sampling technique called Sketches was originally introduced by Broder to remove duplicate Web pages in AltaVista.

We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies) and the size of the collection. When we know the margins, we ought to use them. Doing so, reduces the variance (errors) by about 1/2. Margin constraints and maximum likelihood methods also have applications in other dimension reducing approximate algorithms such as random projections. Sampling methods become more and more important with larger and larger collections. At Web scale, sampling rates as low as 10^-4 may suffice, suggesting that a single PC (with sampling) can keep up with a cluster of 10^4 machines (without sampling).

(joint work with Ping Li from Stanford Statistics Dept)

About the Speaker

I moved to Microsoft Research in late 2003. Before that, I was the head of a data mining department in AT&T Labs-Research (formally AT&T Bell Labs, Murray Hill, NJ). I received my BS, Masters and PhD from MIT in computer science in 1978, 1980 and 1983, and immediately joined AT&T. I have worked in many areas of computational linguistics including: acoustics, speech recognition, speech synthesis, OCR, phonetics, phonology, morphology, word-sense disambiguation, spelling correction, terminology, translation, lexicography, information retrieval, compression, language modeling and text analysis. I enjoy working with very large corpora such as the Associated Press newswire (1 million words per week) and larger datasets such as larger data sets such as telephone call detail (1-10 billion records per month).


This talk is part of the CLIP Colloquium Series, organized by Jimmy Lin (jimmylin -at- umd .dot. edu). For the complete schedule, please visit http://www.umiacs.umd.edu/research/CLIP/colloq/.