I found a couple of problems with this approach: very long sentences and running time & space.
Very Long Sentences
Other test data I looked at were technical papers. These often have very long sentences. The problem is not controlling for sentence length, but that there are too many theme words for the sentence to express one theme well. One could try breaking these up into sentence shards: a 50-word sentence would become three 15-word shards. If we use parts-of-speech trimming, we can break the shards within runs of filler words.
OptimizationsIt is possible to create an optimized version of LSA that only ranks sentences or terms via Random Indexing. RI deliberately throws away the identity of one side of the sentence/term matrix. It can rank sentences or terms, but not both. This algorithm runs much faster than the full sentence/term version, and is the next step in creating an interactive document summarizer or a high-quality tag cloud generator.
Random ProjectionRandom Projection is a wonderfully non-intuitive algorithm. If you multiply a matrix with another matrix filled with random numbers, the output matrix will have the same cosine distance ratios across all pairs of vectors. In a document-term matrix, all of the documents will still have the same mutual 'relevance' (measured by cosine distance which magically matches tf-idf relevance ranking).
If the random matrix is the (transposed) size of the original matrix, the row and column vectors will have the above distance property. If the random matrix has more or fewer dimensions on one side, the resulting matrix will retain the identity of vectors from the constant side, but will lose the identity of the vectors on the varying side. If you have a sentence-term matrix of 50 sentences x 1000 terms, and you multiply it by a random matrix of 100 x 50, you get a matrix of 50 sentences x 100 "sketch" vectors, where every sketch is a low-quality summarization of the term vectors. The 50 sentences will still have the same cosine ratios; we have merely thrown away the identity of the term vectors. Since the running time of SVD is very non-linear, we now have a much faster dataset that will give us the orthogonal decomposition of the sentences (but the terms are forgotten). We can invert this and calculate the SVD for terms without sentences.
Random IndexingThe above description posits creating the entire sentence/term matrix, then doing the complete multiplication. In fact, you can create the resulting sentence/sketch matrix directly, one term at a time. This will considerably cut memory usage and running time. I include this explanation because there are few online sources describing Random Indexing, and the following forgets to explain RI's roots in Random Projection.
Fast Random ProjectionAlso see Achlioptas et. al. for a wicked speed-up of random indexing that is very suitable for densely populated data.