Monday, February 20, 2012

Summarizing Documents with Machine Learning: Part 1

This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper.
The goal is to add summaries to the LucidFind application, a search engine devoted to open-source text processing projects.

Core concepts

Document Corpus

A corpus is just a collection of documents. There are many standardized text corpora used as data in research, much the way standardized strains of white mice are used in laboratory work. In this case, we use the sentences of one document as the corpus for that document, so the matrix has sentences for rows and terms for columns.

Term Vector

A term vector is a "bag of words" with numbers attached. We take all of the words in the document corpus and create a matrix where documents are on one side and words are on the other, with a value for each word that occurs in a document. The standard choices are the word's count in that document, or simply 1 if the word occurs at all. The result is a matrix whose rows are the term vectors of the documents: a numerical representation of the corpus with all grammatical information ripped out. This bag-of-words representation is surprisingly effective, as we will see.
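
To make that concrete, here is a minimal sketch of building the sentence-by-term count matrix in Java. The tokenization (lowercase, split on non-letters) is a stand-in I've assumed for illustration, not the real pipeline's tokenizer.

```java
import java.util.*;

public class TermMatrix {
    /** Builds a sentences-by-terms count matrix; fills vocabOut with the terms in column order. */
    public static double[][] countMatrix(List<String> sentences, List<String> vocabOut) {
        Map<String, Integer> index = new LinkedHashMap<>();
        List<List<String>> tokenized = new ArrayList<>();
        for (String s : sentences) {
            List<String> tokens = new ArrayList<>();
            // Naive tokenization for the sketch: lowercase, split on anything that is not a letter.
            for (String t : s.toLowerCase().split("[^a-z]+")) {
                if (t.isEmpty()) continue;
                tokens.add(t);
                if (!index.containsKey(t)) index.put(t, index.size());
            }
            tokenized.add(tokens);
        }
        double[][] m = new double[sentences.size()][index.size()];
        for (int row = 0; row < tokenized.size(); row++) {
            for (String t : tokenized.get(row)) {
                m[row][index.get(t)] += 1.0;   // simple counts; could also be 0/1 presence
            }
        }
        vocabOut.addAll(index.keySet());
        return m;
    }
}
```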

Singular Value Decomposition

SVD is a linear algebra operation. Given a matrix, it factors it into three matrices. The one we care about is the "left feature matrix", commonly called U in the literature. Its rows are "document feature vectors", which represent a kind of importance for each document. A feature, in machine learning, is the signal hiding in noisy data. Applied to a matrix of sentences vs. terms, the document with the largest value in the left-hand column of U is the most "important" in the corpus; the remaining values in the row also contribute to importance.
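
Here is roughly what that step looks like, using Apache Commons Math for the decomposition (my choice of library for the sketch, not necessarily what the project uses). Each sentence gets a score from the norm of its row of U:

```java
import org.apache.commons.math3.linear.*;

public class SentenceRank {
    /** Scores each sentence by the Euclidean norm of its row in U. */
    public static double[] scores(double[][] counts) {
        RealMatrix a = MatrixUtils.createRealMatrix(counts);
        SingularValueDecomposition svd = new SingularValueDecomposition(a);
        RealMatrix u = svd.getU();               // "left feature matrix": one row per sentence
        double[] scores = new double[u.getRowDimension()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = u.getRowVector(i).getNorm();   // larger norm ~ more "important" sentence
        }
        return scores;
    }
}
```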

Tying it together

LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.

Preliminary Results

The preliminary results are interesting, but nowhere near perfect. I've run this on one test document, and it clearly homes in on important declarative sentences. The test document is a poor fit for this, though: it's filled with formulae, which raises the question of whether to make terms out of Greek letters.

Raw Data

I've attached the raw text used in this exercise, and the sentences ordered by their SVD norm. I've tried several variations on the algorithm, and the first few sentences always turn up, so it's a pretty strong signal. Here are the sorted sentences, with the sentence # in the paper and the sorting value.
The most interesting:
{216, 1.0000000000000173} PERFORMANCE OF AVERAGED CLASSIFIERS 1719 and Mason [9] show how to break down the problem of learning alternating decision trees (a class of rules which generalizes decision trees and boosted decision trees) into a sequence of simpler learning problems using boosting.
{36, 1.0000000000000162} Second, we choose a class of prediction rules (mappings from the input to the binary output) and assume that there are prediction rules in that class whose probability of error (with respect to the distribution D) is small, but not necessarily equal to zero.
{44, 1.0000000000000142} On the other hand, if we allow the algorithm to output a zero, we can hope that the algorithm will output zero on about 4% of the input, and will be incorrect on about 1% of the data.
{26, 1.0000000000000129} Instead the basic argument is that the Bayesian method is always the best method, and therefore, the only important issues are how to choose a good prior distribution and how to efficiently calculate the posterior average.
{123, 1.0000000000000127} Unlike the event of a classification mistake, which depends both on the predicted label and the actual label, the event of predicting 0 does not depend on the actual label.
{179, 1.0000000000000122} For example, if the goal is to detect a rare type of instance within a large set, the correct method might be to sort all instances according to their log- ratio score and output the instances with the highest scores.
The middle:
{202, 1.0000000000000022} In other words, the behavior of our algorithm is, in fact, similar to that of large margin classifiers.
{11, 1.000000000000002} By minimiz- ing this cost, the learning algorithm attempts to minimize both the training error and the amount of overfitting.
{20, 1.000000000000002} PERFORMANCE OF AVERAGED CLASSIFIERS 1699 considerable experimental evidence that such averaging can significantly reduce the amount of overfitting suffered by the learning algorithm.
{21, 1.000000000000002} However, there is, we believe, a lack of theory for explaining this reduction.
{42, 1.000000000000002} MANSOUR AND R.
{55, 1.000000000000002} Of course, if the generated classifier outputs zero most of the time, then there is no benefit from having it.
{139, 1.000000000000002} Doing this also improves the guaranteed performance bounds because it reduces |H |.
{210, 1.000000000000002} Obviously, calculating the error of all of the hypotheses in the class is at least as hard as finding the best hypothesis and probably much harder.
{9, 1.0000000000000018} To overcome this problem, one usually uses either model-selection or reg- ularization terms.
{41, 1.0000000000000018} FREUND, Y.
{88, 1.0000000000000018} Note that the statements of Theorems 1 and 2 have no dependence on the hypothesis class H .
{92, 1.0000000000000018} The second inequality uses the fact that changing one example can change the empirical error by at most 1/m.
{95, 1.0000000000000018} Given x, we partition the hypothesis set H into two.
And the least:
{6, 1.0} Consider a binary classification learning problem.
{16, 1.0} 1Supported in part by a grant from the Israel Academy of Science.
{50, 1.0} Each of these makes two mistakes on this data set.
{149, 1.0} 1714 THEOREM 6.
{150, 1.0} FREUND, Y.
{238, 1.0} FREUND, Y.
{112, 0.9999999999999999} FREUND, Y.
{155, 0.9999999999999998} MANSOUR AND R.
{208, 0.9999999999999998} Consider first the computational issue.
{105, 0.999999999999999} MANSOUR AND R.
{239, 0.999999999999999} MANSOUR AND R.
{5, 0.9999999999999972} Introduction.
Notes to experts:
  • The terms were simple counts.
  • The sentences are sorted by the norms of the feature vectors. With the vectors multiplied by S, the longest sentences went straight to the top; without S, length was nowhere near as strong a factor (there is a code sketch of this comparison after these notes).
  • The input data came from block copy&paste out of a PDF reader, processed with the OpenNLP Sentence Detector. It is suitably messy.
  • MANSOUR and FREUND are the authors; these terms appear in page headers and footers, so they end up looking like important terms.
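
Here is what that with-and-without-S comparison looks like in code (again assuming Commons Math; this is an illustration, not the exact code I ran):

```java
import org.apache.commons.math3.linear.*;

public class ScaledScores {
    /** Compares sentence scores from plain U rows with rows scaled by the singular values. */
    public static void score(double[][] counts) {
        SingularValueDecomposition svd =
                new SingularValueDecomposition(MatrixUtils.createRealMatrix(counts));
        RealMatrix u = svd.getU();
        RealMatrix us = u.multiply(svd.getS());    // rows scaled by the singular values
        for (int i = 0; i < u.getRowDimension(); i++) {
            double plain = u.getRowVector(i).getNorm();    // sentence length matters much less
            double scaled = us.getRowVector(i).getNorm();  // long sentences dominate
            System.out.printf("sentence %d: %f (U) vs %f (U*S)%n", i, plain, scaled);
        }
    }
}
```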

Next steps

Data cleaning with OpenNLP

Many of the sentences are very long; breaking them into clauses may give a more tractable dataset. Sentences of the form "A, however, B" and questions like "C?" are probably not central to the discussion. OpenNLP is an Apache project for natural-language processing. It has a sentence detector and a parsing toolkit that can help us split long sentences.
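
For reference, this is roughly what sentence detection with OpenNLP looks like; the model path is a placeholder for whichever pre-trained English sentence model you have on hand (splitting sentences into clauses would be a further step, using the parser):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceSplit {
    /** Splits raw text into sentences using a pre-trained OpenNLP model. */
    public static String[] split(String text, String modelPath) throws Exception {
        try (InputStream in = new FileInputStream(modelPath)) {
            SentenceModel model = new SentenceModel(in);        // e.g. an English sentence model
            SentenceDetectorME detector = new SentenceDetectorME(model);
            return detector.sentDetect(text);                   // one string per detected sentence
        }
    }
}
```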

Tuning

The literature recommends various formulae for the term counts.
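
One formula that turns up frequently is tf-idf, which damps terms that occur in most of the sentences. Here is a sketch of reweighting the count matrix that way (one candidate among several, not necessarily the one I'll settle on):

```java
public class TfIdf {
    /** Reweights a sentences-by-terms count matrix with tf-idf, in place. */
    public static void reweight(double[][] counts) {
        int numSentences = counts.length;
        int numTerms = counts[0].length;
        double[] docFreq = new double[numTerms];
        for (double[] row : counts) {
            for (int t = 0; t < numTerms; t++) {
                if (row[t] > 0) docFreq[t]++;          // number of sentences containing term t
            }
        }
        for (double[] row : counts) {
            for (int t = 0; t < numTerms; t++) {
                // Terms that appear in every sentence get weight 0; rare terms are boosted.
                if (row[t] > 0) row[t] = row[t] * Math.log(numSentences / docFreq[t]);
            }
        }
    }
}
```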

Tag Clouds

The important words should be the tag cloud, right?
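
One plausible way to build it, assuming the right singular vectors (V) play the same role for terms that U plays for sentences: score each term by the norm of its row in V and keep the top few. A sketch, again with Commons Math:

```java
import java.util.*;
import org.apache.commons.math3.linear.*;

public class TagCloud {
    /** Returns the top-k terms scored by the norm of their rows in V (one row per term). */
    public static List<String> topTerms(double[][] counts, List<String> vocab, int k) {
        SingularValueDecomposition svd =
                new SingularValueDecomposition(MatrixUtils.createRealMatrix(counts));
        RealMatrix v = svd.getV();
        Integer[] order = new Integer[vocab.size()];
        double[] score = new double[vocab.size()];
        for (int t = 0; t < vocab.size(); t++) {
            order[t] = t;
            score[t] = v.getRowVector(t).getNorm();
        }
        // Sort term indices by descending score and keep the first k.
        Arrays.sort(order, (a, b) -> Double.compare(score[b], score[a]));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, order.length); i++) top.add(vocab.get(order[i]));
        return top;
    }
}
```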