Thursday, September 6, 2012

Document Summarization with LSA #2: Test with newspaper articles

The Experiment

This analysis evaluates many variants of the LSA algorithm against measurements appropriate for an imaginary document summarization UI. This UI displays the two most important sentences with the important theme words highlighted. The measurements are designed to match the expectations of such a user.

Supervised vs. Unsupervised Learning

Machine Learning algorithms are classified as supervised, unsupervised, or semi-supervised. A supervised algorithm creates a model (usually statistical) from training data, then applies test data against the model. An unsupervised algorithm is applied directly to test data without a pre-created model. A semi-supervised algorithm uses both tagged and untagged data; this works surprisingly well in many contexts.

The Data

We are going to use unsupervised learning. The test data is a corpus of newspaper articles called Reuters-21578, which was collected and published by the Reuters news agency to assist text analysis research. This dataset is not tagged, but it is appropriate for this experiment. Newspaper articles are written in a particular style which is essentially pre-summarized. In a well-written newspaper article, the first sentence (the lede) is the most important sentence, and the second sentence is complementary and usually shares few words with the first. The rest of the article is structured in order from abstraction to detail, a form called the Inverted Pyramid. Finally, newspaper articles are the right length to summarize effectively.

Example Data

We limited the tests to articles which were from 15 to 75 sentences long. The entire corpus is 21 thousand articles; this limits the test space to just under 2000. Here is a sample newspaper article:
26-FEB-1987 15:26:26.78
DEAN FOODS <DF> SEES STRONG 4TH QTR EARNINGS
Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share. Douglas also said the year's sales should exceed 1.4 billion dlrs, up from 1.27 billion dlrs the prior year. He repeated an earlier projection that third-quarter earnings "will probably be off slightly" from last year's 40 cts a share, falling in the range of 34 cts to 36 cts a share. Douglas said it was too early to project whether the anticipated fourth quarter performance would be "enough for us to exceed the prior year's overall earnings" of 1.53 dlrs a share. In 1988, Douglas said Dean should experience "a 20 pct improvement in our bottom line from effects of the tax reform act alone."
President Howard Dean said in fiscal 1988 the company will derive  benefits of various dairy and frozen vegetable acquisitions from Ryan Milk to the Larsen Co. Dean also said the company will benefit from its acquisition in late December of Elgin Blenders Inc, West Chicago. He said the company is a major shareholder of E.B.I. Foods Ltd, a United Kingdom blender, and has licensing arrangements in Australia, Canada, Brazil and Japan. "It provides an entry to McDonalds Corp <MCD> we've been after for years," Douglas told analysts.  Reuter
As you can see, the text matches the concept of the inverted pyramid. The first two sentences are complementary, with no repeated words and few words in common. Repeated concepts are described with different words: "Dean Foods Co" in the first sentence is echoed as "the food processor" in the second, while "expects earnings" is matched by "reported earnings". This style seems real-world enough to be good test data for this algorithm suite. We did not comb the data for poorly structured articles or garbled text.

The Code

There are two bits of code involved: Singular Value Decomposition (explained previously) and Matrix "Regularization", or "Conditioning". The latter refers to applying non-linear algorithms to the document-term matrix to make the data more amenable to analysis. Several algorithms have been investigated for this purpose.

The raw term counts supplied by a document/term matrix are not always the best way to approach the data. Matrix Regularization algorithms are functions which use the entire dataset to adjust each cell in the matrix. The contents of a document/term matrix after regularization are referred to as "term weights".
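
To make the later discussion concrete, here is a minimal sketch of building a document/term count matrix, using plain Python with numpy. The names are mine for illustration, not the Summarizer's actual code; for sentence-level summarization, each sentence is treated as a "document".

import numpy as np

def count_matrix(sentences):
    # Build a document/term matrix of raw counts: rows are terms,
    # columns are sentences, and cell [t, s] holds the number of
    # times term t occurs in sentence s.
    vocab = {}
    tokenized = [s.lower().split() for s in sentences]
    for tokens in tokenized:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    counts = np.zeros((len(vocab), len(tokenized)))
    for col, tokens in enumerate(tokenized):
        for tok in tokens:
            counts[vocab[tok], col] += 1
    return counts, vocab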

There are two classes of algorithm for creating term weights. Local algorithms alter each cell independently, while global algorithms create a global vector of values per term that is applied to all documents containing that term, and likewise a global vector of values per document which is applied to all terms in that document. Local algorithms include term frequency (tf), which uses the raw matrix; binary, which replaces each nonzero term count with a one; and some others, which find a unique value for a cell based on the document and term vectors which cross at that cell.
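
Two of the local algorithms are simple enough to sketch directly (again my illustrative code, not the project's): tf passes the raw counts through untouched, while binary clamps every nonzero count to one.

def local_tf(counts):
    # Term frequency: the raw counts, unchanged.
    return counts

def local_binary(counts):
    # Binary: every nonzero count becomes exactly one.
    return (counts > 0).astype(float)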

Global algorithms for term vectors include normalizing the term vector and finding the inverse document frequency (idf) of the term. The LSA implementation includes implementations of these local and global algorithms, and any pair can be used together. Thus, tf-idf is achieved by combining the local tf algorithm with the global idf algorithm (see the sketch after the list below). The literature recommends a few of these combinations (tf-idf and log-entropy) as the most effective, but this test found a few other combinations which were superior. For document vectors, cosine normalization and a new "pivoted cosine" normalization are recommended for counteracting the dominance of longer sentences. These are not yet implemented; existing combinations of local and term vector algorithms do a good job of suppressing sentence length problems. We will see later on that:
  • one of the term vector algorithms is by far the best at everything, 
  • one of the local algorithms is the overall winner, and 
  • one of the others does a fantastic job at sentence length problems but is an otherwise average performer.
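
As promised above, here is a sketch of how a local algorithm and a global term vector combine. It follows the same assumptions as the earlier snippets (numpy, terms as rows, my own names) and is not the Summarizer's actual implementation: tf-idf is simply the local tf output scaled per-term by the global idf vector.

import numpy as np

def global_idf(counts):
    # Inverse document frequency per term: log(N / df), where df is
    # the number of documents containing the term.
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)
    return np.log(n_docs / df)

def weigh(counts, local_fn, global_fn):
    # Combine a local algorithm with a global term vector: every cell
    # is scaled by its term's global weight.
    return local_fn(counts) * global_fn(counts)[:, np.newaxis]

# tf-idf = the local tf algorithm combined with the global idf algorithm:
# weights = weigh(counts, local_tf, global_idf)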

The Result

The above article gave the following result for the 'binary' algorithm:
<lst name="analysis">
  <lst name="summary">
    <lst name="stats">
      <lst name="sentences">
        <int name="count">12</int>
      </lst>
      <lst name="terms">
        <int name="count">150</int>
      </lst>
      <lst name="stems">
        <int name="count">150</int>
      </lst>
    </lst>
    <lst name="terms">
      <lst name="term">
        <str name="text">the</str>
        <double name="strength">1.0</double>
      </lst>
      <lst name="term">
        <str name="text">of</str>
        <double name="strength">0.942809041582063</double>
      </lst>
      <lst name="term">
        <str name="text">said</str>
        <double name="strength">0.8164965809277259</double>
      </lst>
      <lst name="term">
        <str name="text">from</str>
        <double name="strength">0.7453559924999301</double>
      </lst>
      <lst name="term">
        <str name="text">Douglas</str>
        <double name="strength">0.7453559924999298</double>
      </lst>
    </lst>
    <lst name="sentences">
      <lst name="sentence">
        <str name="text">Douglas said it was too early to project whether the anticipated fourth quarter performance would be "enough for us to exceed the prior year's overall earnings" of 1.53 dlrs a share.</str>
        <int name="index">4</int>
        <int name="start">533</int>
        <int name="end">715</int>
        <double name="strength">1.0</double>
        <int name="terms">29</int>
      </lst>
      <lst name="sentence">
        <str name="text">He repeated an earlier projection that third-quarter earnings "will probably be off slightly" from last year's 40 cts a share, falling in the range of 34 cts to 36 cts a share.</str>
        <int name="index">3</int>
        <int name="start">355</int>
        <int name="end">531</int>
        <double name="strength">0.999999999999999</double>
        <int name="terms">29</int>
      </lst>
      <lst name="sentence">
        <str name="text">President Howard Dean said in fiscal 1988 the company will derive  benefits of various dairy and frozen vegetable acquisitions from Ryan Milk to the Larsen Co.</str>
        <int name="index">6</int>
        <int name="start">847</int>
        <int name="end">1006</int>
        <double name="strength">0.9284766908852594</double>
        <int name="terms">25</int>
      </lst>
    </lst>
    <lst name="highlighted">
      <lst name="sentence">
        <str name="text">&lt;em&gt;Douglas&lt;/em&gt; &lt;em&gt;said&lt;/em&gt; it was too early to project whether &lt;em&gt;the&lt;/em&gt; anticipated fourth quarter performance would be "enough for us to exceed &lt;em&gt;the&lt;/em&gt; prior year's overall earnings" &lt;em&gt;of&lt;/em&gt; 1.53 dlrs a share.</str>
        <int name="index">4</int>
      </lst>
      <lst name="sentence">
        <str name="text">He repeated an earlier projection that third-quarter earnings "will probably be off slightly" &lt;em&gt;from&lt;/em&gt; last year's 40 cts a share, falling in &lt;em&gt;the&lt;/em&gt; range &lt;em&gt;of&lt;/em&gt; 34 cts to 36 cts a share.</str>
        <int name="index">3</int>
      </lst>
      <lst name="sentence">
        <str name="text">President Howard Dean &lt;em&gt;said&lt;/em&gt; in fiscal 1988 &lt;em&gt;the&lt;/em&gt; company will derive  benefits &lt;em&gt;of&lt;/em&gt; various dairy and frozen vegetable acquisitions &lt;em&gt;from&lt;/em&gt; Ryan Milk to &lt;em&gt;the&lt;/em&gt; Larsen Co.</str>
        <int name="index">6</int>
      </lst>
    </lst>
  </lst>
</lst>
Because the analyzer does not remove stopwords, these extraneous words ("the", "of", "from") form the most common theme words. But notice that "Douglas said" is a common phrase in the article, and both of its words are in the top five. Theme words are not merely common words; they are words which are used together. Theme sentences are effectively those with the most and strongest theme words.
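
For the curious, the sentence-scoring step can be sketched in the style of the classic Gong and Liu approach to LSA summarization: take the SVD of the weighted term/sentence matrix and rank sentences by their weight in the dominant latent topic. Whether the Summarizer scores sentences exactly this way is an assumption on my part.

import numpy as np

def theme_sentences(weights, k=3):
    # weights = U * diag(s) * Vt; row 0 of Vt gives each sentence's
    # weight in the strongest latent topic.
    _, _, vt = np.linalg.svd(weights, full_matrices=False)
    strengths = np.abs(vt[0])
    strengths /= strengths.max()  # normalize so the top sentence scores 1.0
    top = np.argsort(strengths)[::-1][:k]
    return [(int(i), float(strengths[i])) for i in top]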

The Summarizer tool can return the most important sentences, and can highlight the most important theme words. This can be used to show highlighted snippets in a UI.
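
A toy version of the highlighting step could be as simple as the following sketch (mine, not the tool's code): wrap each whole-word occurrence of a top theme word in <em> tags.

import re

def highlight(sentence, theme_words):
    # Wrap every whole-word occurrence of a theme word in <em> tags.
    pattern = re.compile(
        r'\b(' + '|'.join(map(re.escape, theme_words)) + r')\b',
        re.IGNORECASE)
    return pattern.sub(r'<em>\1</em>', sentence)

# highlight('Douglas said it was too early to project ...',
#           ['the', 'of', 'said', 'from', 'Douglas'])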

Further Reading

Machine Learning
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Inverted Pyramid
Reuters 21578 text corpus

Next post: the overall results


