Sunday, April 8, 2018

Document Summarization with LSA Part 0: Basics

This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper: 
The goal is to add summaries to the LucidFind application, a search engine devoted to open-source text processing projects.

Core concepts

Document Corpus

A corpus is just a collection of documents. There are many standardized text corpora used as data in research, the way standardized strains of white mice are used in laboratory research on RAs. In this case, we treat each sentence of a single document as its own "document", so the sentences of that document form its corpus. The resulting matrix has sentences for rows and terms for columns.
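
To make the "sentences as corpus" idea concrete, here is a minimal sketch of turning one document into a list of sentences. The function name `split_sentences` and the punctuation-based rule are assumptions for illustration only; a real project would more likely use a proper sentence tokenizer.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    This is a rough heuristic (hypothetical helper): it will mis-split
    abbreviations such as "e.g. this".
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

document = "LSA finds structure in text. It uses linear algebra! Does it work?"
corpus = split_sentences(document)  # one entry per sentence
```

Each entry of `corpus` then plays the role of one "document" and becomes one row of the matrix.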

Term Vector

A term vector is a "bag of words" with numbers attached. We take all of the words in the document corpus and create a matrix with documents as rows and words as columns, and fill in a value wherever a word occurs in a document. The standard choices are the number of times the word occurs, or simply one. Each row of this matrix is the term vector for one document. This gives a numerical representation of the document corpus which has all grammatical information ripped out. This bag-of-words representation is surprisingly effective, as we will see.
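
As a small illustration of the matrix described above, the following sketch builds both the count and the binary variant. The names `tokenize` and `build_term_matrix` are hypothetical, and the tokenizer (lowercase runs of letters and digits) is an assumption rather than anything this project prescribes.

```python
import re

def tokenize(sentence):
    # Assumed tokenizer: lowercase alphanumeric runs, punctuation dropped.
    return re.findall(r"[a-z0-9]+", sentence.lower())

def build_term_matrix(documents, binary=False):
    """Return (vocab, matrix) with one row per document, one column per term."""
    vocab = sorted({t for d in documents for t in tokenize(d)})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for d in documents:
        row = [0] * len(vocab)
        for t in tokenize(d):
            if binary:
                row[column[t]] = 1    # presence only
            else:
                row[column[t]] += 1   # occurrence count
        matrix.append(row)
    return vocab, matrix

documents = ["the cat sat on the mat", "the dog sat"]
vocab, counts = build_term_matrix(documents)
_, flags = build_term_matrix(documents, binary=True)
```

Here "the" appears twice in the first document, so its entry in the first row is 2 in `counts` and 1 in `flags`. Word order is gone in both.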

Singular Value Decomposition

SVD is a linear algebra operation. Given a matrix, it factors it into three matrices. The one we care about is the "left feature matrix" (the matrix of left singular vectors), commonly called U in the literature. Its rows are "document feature vectors", which represent a kind of importance for each document. A feature, in machine learning, refers to signal in noisy data. Applied to a matrix of sentences vs. terms, the sentence with the largest value (in magnitude, since the sign of a singular vector is arbitrary) in the leftmost column of U is the most "important" in the corpus. The remaining values in the row also contribute to importance.
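
A minimal sketch of this step using NumPy's `numpy.linalg.svd`. The tiny matrix below is made-up illustration data, not taken from any real paper, and ranking by the absolute value of the first column of U is one simple reading of "importance".

```python
import numpy as np

# Toy sentence-by-term count matrix: 3 sentences (rows), 5 terms (columns).
A = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
], dtype=float)

# Reduced SVD: A == U @ diag(s) @ Vt, singular values in s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The sign of each singular vector is arbitrary, so compare magnitudes.
first_feature = np.abs(U[:, 0])

# Sentence indices ordered from most to least "important".
ranking = np.argsort(-first_feature)
```

Multiplying the three factors back together recovers `A`, which is a quick sanity check that the decomposition was read correctly.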

Tying it together

LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.
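
The whole pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not the method this project settles on: `summarize`, its parameters, the regex-based sentence splitting and tokenizing, and the scoring rule (length of each sentence's first `n_features` feature values, weighted by the singular values) are all choices made for this example.

```python
import re
import numpy as np

def summarize(text, n_sentences=2, n_features=2):
    """Return the n_sentences highest-scoring sentences, in document order."""
    # Assumed heuristics: split after ., ! or ?; tokens are lowercase alphanumerics.
    sentences = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    tokens = [re.findall(r"[a-z0-9]+", p.lower()) for p in sentences]
    vocab = sorted({t for toks in tokens for t in toks})
    if not sentences or not vocab:
        return []

    # Sentence-by-term count matrix.
    column = {term: j for j, term in enumerate(vocab)}
    A = np.zeros((len(sentences), len(vocab)))
    for i, toks in enumerate(tokens):
        for t in toks:
            A[i, column[t]] += 1

    U, s, _ = np.linalg.svd(A, full_matrices=False)

    # Assumed scoring: combine the first k feature values of each row,
    # weighted by their singular values.
    k = min(n_features, len(s))
    scores = np.sqrt(((U[:, :k] * s[:k]) ** 2).sum(axis=1))

    top = sorted(np.argsort(-scores)[:n_sentences])
    return [sentences[i] for i in top]

text = ("Cats chase mice. Dogs chase cats. Mice eat cheese. "
        "The weather is nice today.")
summary = summarize(text, n_sentences=2)
```

Using more than the first column reflects the point that the remaining values in each row also contribute to importance; how many columns to use, and how to weight them, is left open here.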
