A semidiscrete matrix decomposition for latent semantic indexing information retrieval

TitleA semidiscrete matrix decomposition for latent semantic indexing information retrieval
Publication TypeJournal Articles
Year of Publication1998
AuthorsKolda TG, O'Leary DP
JournalACM Trans. Inf. Syst.
Volume16
Issue4
Pagination322 - 346
Date Published1998/10//
ISBN Number1046-8188
Keywordsdata mining, latent semantic indexing, semidiscrete decomposition, singular-value decomposition, text retrieval
Abstract

The vast amount of textual information available today is useless unless it can be effectively and efficiently searched. The goal in information retrieval is to find documents that are relevant to a given user query. We can represent and document collection by a matrix whose (i, j) entry is nonzero only if the ith term appears in the jth document; thus each document corresponds to a columm vector. The query is also represented as a column vector whose ith term is nonzero only if the ith term appears in the query. We score each document for relevancy by taking its inner product with the query. The highest-scoring documents are considered the most relevant. Unfortunately, this method does not necessarily retrieve all relevant documents because it is based on literal term matching. Latent semantic indexing (LSI) replaces the document matrix with an approximation generated by the truncated singular-value decomposition (SVD). This method has been shown to overcome many difficulties associated with literal term matching. In this article we propose replacing the SVD with the semidiscrete decomposition (SDD). We will describe the SDD approximation, show how to compute it, and compare the SDD-based LSI method to the SVD-based LSI methods. We will show that SDD-based LSI does as well as SVD-based LSI in terms of document retrieval while requiring only one-twentieth the storage and one-half the time to compute each query. We will also show how to update the SDD approximation when documents are added or deleted from the document collection.

URLhttp://doi.acm.org/10.1145/291128.291131
DOI10.1145/291128.291131