Document Summarization with LSA
Introduction
Every document contains words that express its themes. These words are not only frequent; they also tend to appear together in sentences. We want to find the most important sentences and words: the most important sentences serve as the summary, and the most important words make good search terms for the document.
Orthogonal Sentences
A key idea is that the most important and second most important sentences in a document are independent: they tend to share few words. The most important sentence expresses the main theme of the document, and the second most important uses other theme words to elaborate on it. When we express the sentences and words as a bag-of-words matrix, Singular Value Decomposition (SVD) can analyze them in relation to each other, by how the terms are used together in sentences. It produces a sorted list of sentences (each sentence plays the role of a "document" in LSA) that are as orthogonal as possible, meaning that their collective theme words are as different as possible.
Note that since this technique treats sentences and terms symmetrically, it also produces a list of terms sorted by importance, which makes for a good tag cloud.
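The idea above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration, not a definitive implementation: the function name summarize, its parameters k and top_n, the simple regex tokenizer and the raw word counts are all assumptions made for the example. Sentence selection follows the general approach of picking, for each of the top k right singular vectors, the sentence with the largest absolute weight; terms are ranked by their weight in the first left singular vector.

```python
import re
import numpy as np

def summarize(sentences, k=2, top_n=5):
    """Hypothetical helper: pick k summary sentences and top_n terms via SVD."""
    # Tokenize each sentence into lowercase words (a deliberately simple tokenizer).
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}

    # Bag-of-words matrix: one row per term, one column per sentence.
    A = np.zeros((len(vocab), len(sentences)))
    for j, toks in enumerate(tokenized):
        for w in toks:
            A[index[w], j] += 1

    # Rows of Vt describe sentences, columns of U describe terms;
    # both are ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # For each of the top k singular vectors, take the sentence with the
    # largest absolute weight that has not been picked already.
    picked = []
    for row in Vt[:min(k, len(S))]:
        for j in np.argsort(-np.abs(row)):
            if j not in picked:
                picked.append(int(j))
                break

    # Rank terms by absolute weight in the first left singular vector.
    term_order = np.argsort(-np.abs(U[:, 0]))[:top_n]
    return [sentences[j] for j in picked], [vocab[i] for i in term_order]
```

Absolute values are used because the sign of each singular vector is arbitrary. A real system would probably also remove stop words and weight the counts (for example with tf-idf) before running the SVD.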
Example
Here is an example of the first two sentences from a financial newswire article.
Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share.
The first sentence is long and expresses the theme of the article. The second sentence elaborates on the first and shares few content words with it: only earnings and fourth quarter. Note that, to avoid repeating the company name and to add information, the second sentence refers to Dean Foods as the food processor.
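A quick way to check which content words the two example sentences have in common is to compare their word sets. The snippet below is a small sketch: the STOP_WORDS list and the content_words helper are assumptions chosen for this example, not part of any library.

```python
import re

# A tiny, hand-picked stop word list (an assumption for this example only).
STOP_WORDS = {"a", "the", "of", "in", "for", "to"}

def content_words(sentence):
    """Return the set of lowercase words in a sentence, minus stop words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}

first = ("Dean Foods Co expects earnings for the fourth quarter ending May 30 "
         "to exceed those of the same year-ago period, Chairman Kenneth Douglas "
         "told analysts.")
second = ("In the fiscal 1986 fourth quarter the food processor reported "
          "earnings of 40 cts a share.")

shared = sorted(content_words(first) & content_words(second))
print(shared)  # ['earnings', 'fourth', 'quarter']
```

Without stemming, "Foods" and "food" count as different words, so the reference to the company as the food processor does not add to the overlap.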
Further Reading
This is an important paper on using SVD to summarize documents. It appears to be the original proposal for this technique:
Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis
Gong and Liu, 2001
http://www.cs.bham.ac.uk/~pxt/IDA/text_summary.pdf
Here are two good tutorials on Singular Value Decomposition and LSA:
Singular Value Decomposition (SVD) Tutorial
Latent Semantic Analysis (LSA) Tutorial