Thursday, September 6, 2012

Document Summarization with LSA #1: Introduction

This is a multi-part series on document summarization using Latent Semantic Analysis (LSA). I wrote a document summarizer and did an exhaustive measurement pass, using it to summarize newspaper articles from the first Reuters corpus. The code is structured as a web service in Solr, using Lucene for text analysis and the OpenNLP package for tuning the algorithm with part-of-speech analysis.


Document summarization is about finding the "themes" in a document: the important words and sentences that contain the core concepts. There are many algorithms for document summarization; this one uses Latent Semantic Analysis, which applies linear algebra to analyze how words and sentences are used together. LSA is based on the "bag of words" concept of "term vectors": a list of all words and how often each is used in a document. LSA uses Singular Value Decomposition (SVD) to tease out which words are used most often with other words, and which sentences use the most theme words together. Document summarization with LSA uses SVD to give us main and secondary sentences which have the strongest collections of theme words and yet a minimal number of theme words in common.
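As a minimal sketch of the bag-of-words idea (this is illustrative Python, not the Solr/Lucene code from this series, and the function name is my own), here is how a term-vector matrix can be built with one row per term and one column per sentence:

```python
def term_sentence_matrix(sentences):
    """Build a bag-of-words count matrix: one row per term,
    one column per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(sentences) for _ in vocab]
    for j, toks in enumerate(tokenized):
        for w in toks:
            matrix[index[w]][j] += 1
    return vocab, matrix
```

In a real pipeline the tokenization, stemming, and stopword removal would come from the analyzer (here, Lucene); the simple whitespace split above is just to keep the sketch self-contained.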

Every document has words which express its themes. These words are not just frequent; they are also used together in sentences. We want to find the most important sentences and words: the main sentences should be shown as the summary, and the most important words are good search words for the document.

Orthogonal Sentences

A key idea is that the most important and second most important sentences in a document are independent: they tend to share few words. The most important sentence in a document expresses the main theme of the document, and the second most important uses other theme words to elaborate on the main sentence. When we express the sentences and words in a bag-of-words matrix, SVD can analyze the sentences and words in relation to each other, by how the terms are used together in sentences. It creates a sorted list of sentences which are as orthogonal as possible, meaning that their collective theme words are as different as possible.
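This selection step can be sketched with NumPy, assuming a term-by-sentence count matrix like the one above (the function name and the choice of picking one sentence per singular vector follow the Gong & Liu paper cited below, but this is my own illustration, not the series code):

```python
import numpy as np

def pick_summary_sentences(A, k=2):
    """Gong & Liu-style selection on a term x sentence matrix A:
    each right singular vector captures one theme. For each of the
    top-k themes, pick the as-yet-unchosen sentence with the largest
    weight, so the chosen sentences share few theme words."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    picked = []
    for row in Vt[:k]:
        for j in np.argsort(-np.abs(row)):
            if j not in picked:
                picked.append(int(j))
                break
    return picked
```

For example, with a matrix whose first two columns (sentences) use one theme's words and whose last two use another's, the function returns one sentence from each theme rather than the two strongest sentences of the same theme.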

Note that since this technique treats sentences and terms symmetrically, it also creates a sorted list of terms by how important they are; this makes for a good tag cloud.
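By the same symmetry, the left singular vectors score the terms. One simple weighting (an assumption on my part, not necessarily the weighting used later in this series) sums each term's absolute loadings on the singular vectors, scaled by the singular values:

```python
import numpy as np

def rank_terms(A, vocab, top=3):
    """Score each term (row of A) by its absolute loadings on the
    left singular vectors, weighted by the singular values, and
    return the highest-scoring terms for a tag cloud."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.abs(U) @ s          # combined importance across themes
    order = np.argsort(-scores)
    return [vocab[i] for i in order[:top]]
```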


Here is an example of the first two sentences from a financial newswire article.
Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share.
The first sentence is long, and expresses the theme of the article. The second sentence elaborates on the first, and shares few content words with it beyond earnings and fourth quarter. Note that, to avoid repeating the company name and to give more information, the second sentence refers to Dean Foods as "the food processor".

Further Reading

This is an important paper on using SVD to summarize documents. It appears to be the original proposal for this technique:
Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis
Gong and Liu, SIGIR 2001
