1 Lawrence Berkeley National Laboratory, Berkeley, California, U.S.A.
2 Pennsylvania State University, University Park, Pennsylvania, U.S.A.
We present a theoretical foundation based on subspaces for latent semantic indexing (LSI) in information retrieval. We show that our model leads to a low-rank-plus-shift structure that is approximately satisfied by the cross-product of the term-document matrices. This structure can be exploited for the computation of the partial singular value decomposition (SVD) of a large sparse term-document matrix used in LSI. We also discuss several parallel implementation issues and present empirical numerical results on a Cray T3E using text collections with millions of documents.
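As a rough illustration of what a low-rank-plus-shift structure can look like, the sketch below builds a small synthetic matrix of the form C W C^T + sigma2 * I and inspects its eigenvalues. The abstract does not spell out this form, so the form itself, the function name `low_rank_plus_shift`, and the parameters `n`, `k`, and `sigma2` are all assumptions made for illustration. The matrix is random and does not come from a real text collection.

```python
import numpy as np

def low_rank_plus_shift(n, k, sigma2, seed=0):
    """Build an n-by-n matrix C W C^T + sigma2 * I (assumed form, synthetic data)."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((n, k))      # n-by-k factor, full column rank in practice
    M = rng.standard_normal((k, k))
    W = M @ M.T + np.eye(k)              # symmetric positive definite k-by-k core
    return C @ W @ C.T + sigma2 * np.eye(n)

# Hypothetical sizes: an 8-by-8 cross-product with a rank-2 part and shift 0.5.
G = low_rank_plus_shift(n=8, k=2, sigma2=0.5)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals = np.linalg.eigvalsh(G)
print(eigvals)
```

In this idealized setting the n - k smallest eigenvalues all equal the shift sigma2, and only the k largest carry information about the low-rank part. A real term-document cross-product would satisfy this only approximately.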