Many vector-based methods of information retrieval (IR) use a document collection for training: to build a term-expansion thesaurus, a reduced-dimensional ("latent") semantic space, or a cross-language term-association network. We can think of the training corpus as defining a vector space; at retrieval time, documents and queries are projected onto this space, and their similarities are computed there. We describe how various IR methods, including the vector space model (VSM), latent semantic indexing (LSI), and the generalized vector space model (GVSM), can be regarded as "projection methods," and we highlight the critical role played by the singular values of the training matrix in these algorithms. An operation we call "spectral flattening" provides a unifying view of IR algorithms for monolingual and translingual retrieval. Based on an empirical study of the singular values of a variety of text collections, we define a computationally tractable approximation to spectral flattening. We provide preliminary results for several large monolingual and bilingual test collections.
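One way to picture the "projection method" view is the minimal sketch below. It is an illustration under stated assumptions, not the speaker's formulation: the function names (`projected_similarity`, `gvsm_weights`, `lsi_weights`, `flat_weights`) are hypothetical, the training matrix `A` is taken to be terms by documents, and similarities are plain inner products rather than cosines. Each method is written as a different weighting of the singular directions of `A`: GVSM weights them by the squared singular values, LSI keeps the top k with equal weight and drops the rest, and an all-ones weighting treats every direction equally. Whether that last choice matches the abstract's "spectral flattening" is an assumption here; the abstract does not define the operation.

```python
import numpy as np

def projected_similarity(A, d, q, weights):
    """Inner-product similarity of d and q after projecting onto the
    left singular vectors of the training matrix A (terms x documents).
    `weights` maps the singular values to per-direction weights."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    w = weights(s)
    # d^T U diag(w) U^T q
    return float((U.T @ d) @ (w * (U.T @ q)))

def gvsm_weights(s):
    # GVSM compares A^T d with A^T q, i.e. d^T U S^2 U^T q.
    return s ** 2

def lsi_weights(k):
    # LSI projection: keep the top-k singular directions with equal
    # weight, discard the rest.
    return lambda s: (np.arange(s.size) < k).astype(float)

def flat_weights(s):
    # Assumed reading of a "flattened" spectrum: every direction gets
    # weight 1, regardless of its singular value.
    return np.ones_like(s)

# Toy training matrix: 4 terms x 6 documents (made-up counts).
A = np.array([
    [2.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0, 1.0],
])
doc = np.array([1.0, 0.0, 2.0, 0.0])
query = np.array([0.0, 1.0, 1.0, 0.0])

for name, w in [("GVSM", gvsm_weights), ("LSI, k=2", lsi_weights(2)),
                ("flat", flat_weights)]:
    print(name, projected_similarity(A, doc, query, w))
```

In this sketch the three methods differ only in the weight function applied to the singular values, which is why the spectrum of the training matrix matters so much. When `A` has at least as many documents as terms, the all-ones weighting reduces to the ordinary inner product of the two term vectors.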
For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Philip Resnik (resnik@umiacs.umd.edu).