Uncle Lance's Ultra Whiz Bang: Document Summarization with LSA #5: Software

The Test Harness

The scripts for running these tests, along with the data, are in my GitHub repo: https://github.com/LanceNorskog/lsa.

LSA toolkit and Solr DocumentSummarizer class

The LSA toolkit is available at https://github.com/LanceNorskog/lsa/tree/master/research. The Solr code that uses the LSA library is not yet published: it is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as LUCENE-2899.

The Summarizer optionally uses OpenNLP for sentence parsing and parts-of-speech (POS) analysis. It uses the OpenNLP parts-of-speech tool to keep nouns and verbs and drop all other words. Previous experiments compared raw sentences against sentences limited to nouns and verbs, and the POS-stripped sentences worked 10-15% better in every algorithm combination. This set of benchmarks therefore did not try the full sentences.
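As a rough illustration of the noun-and-verb filter, here is a minimal Python sketch. It is not the Summarizer's code: it assumes the sentence has already been tagged, that the tags follow the Penn Treebank convention (noun tags start with NN, verb tags with VB), and the function name `filter_nouns_verbs` is made up for this example.

```python
# Hypothetical sketch, not the actual Solr/OpenNLP implementation.
# Assumption: input is a list of (word, tag) pairs using Penn Treebank
# style tags, where noun tags start with "NN" and verb tags with "VB".

def filter_nouns_verbs(tagged):
    """Return only the words tagged as nouns (NN*) or verbs (VB*)."""
    return [word for word, tag in tagged if tag.startswith(("NN", "VB"))]

# A hand-tagged example sentence (the tags were assigned by hand here).
tagged_sentence = [
    ("The", "DT"), ("company", "NN"), ("reported", "VBD"),
    ("strong", "JJ"), ("quarterly", "JJ"), ("earnings", "NNS"), (".", "."),
]

filtered = filter_nouns_verbs(tagged_sentence)
print(filtered)  # ['company', 'reported', 'earnings']
```

The stripped sentence is what would then be handed to the LSA step in place of the full sentence.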

Reuters Corpus

The Reuters data and scripts for this analysis project are under https://github.com/LanceNorskog/lsa/tree/master/reuters. ...../data/raw is the preprocessed Reuters article corpus: the articles are reformatted into one sentence per line, and only articles with 10 or more sentences are included. The toolkit includes two scripts. One runs each article against the Solr Document Summarizer and saves the XML output. The other applies XPath expressions to that output to produce one CSV line per article, collected into one CSV file per algorithm. The per-algorithm keys encode both the regularization algorithm and whether parts-of-speech filtering was applied.
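The XPath-to-CSV step might look something like the following Python sketch. Everything specific in it is an assumption for illustration: the XML layout, the field names, the algorithm key `example_algo`, and the helper names are invented, since the real Summarizer output format is not shown here. It uses only the limited XPath subset supported by the standard library's `xml.etree.ElementTree`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML for one article. The real Summarizer output is not
# published, so these element and field names are invented.
SAMPLE_XML = """<response>
  <lst name="summary">
    <str name="algorithm">example_algo</str>
    <int name="sentences">12</int>
    <int name="bestSentence">0</int>
  </lst>
</response>"""

# CSV column name -> XPath expression (ElementTree's XPath subset).
FIELDS = {
    "algorithm": ".//str[@name='algorithm']",
    "sentences": ".//int[@name='sentences']",
    "best_sentence": ".//int[@name='bestSentence']",
}

def xml_to_row(article_id, xml_text):
    """Apply each XPath expression to one article's XML and build a CSV row."""
    root = ET.fromstring(xml_text)
    row = {"article": article_id}
    for column, path in FIELDS.items():
        row[column] = root.findtext(path, default="")
    return row

def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["article", *FIELDS],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

row = xml_to_row("article-0001", SAMPLE_XML)
csv_text = rows_to_csv([row])
print(csv_text)
```

To get one CSV file per algorithm, the rows would be grouped by the `algorithm` value before writing. The sketch keeps everything in memory to stay short.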

Analysis

The analysis phase used KNIME to preprocess the CSV data. KNIME rules added more columns, calculated from the generated columns, and then created a pivot table that summarized the data per algorithm. This data was saved into a new CSV file. KNIME's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I had to make the charts in LibreOffice instead, and then copy them into a DOC file in MS Word (and not LibreOffice!) to get plain JPEGs of the charts.
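For readers without KNIME, the derive-then-pivot step can be approximated in a few lines of Python. This is only a sketch of the general idea: the column names, the derived `relative_position` column, the algorithm keys, and the sample numbers are all invented for illustration and are not the columns or results from my workflow.

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows standing in for the per-article CSV data.
rows = [
    {"algorithm": "algo_a", "sentences": 10, "best_sentence": 0},
    {"algorithm": "algo_a", "sentences": 20, "best_sentence": 5},
    {"algorithm": "algo_b", "sentences": 10, "best_sentence": 5},
]

def add_relative_position(row):
    """Derive a new column from existing ones (a hypothetical example)."""
    derived = dict(row)
    derived["relative_position"] = row["best_sentence"] / row["sentences"]
    return derived

def pivot_mean(rows, key, value):
    """Group rows by `key` and average `value` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {name: mean(values) for name, values in sorted(groups.items())}

derived_rows = [add_relative_position(r) for r in rows]
summary = pivot_mean(derived_rows, "algorithm", "relative_position")
print(summary)  # {'algo_a': 0.125, 'algo_b': 0.5}
```

The resulting per-algorithm summary is the kind of table that would then be saved to a new CSV file and charted.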

Thursday, September 6, 2012

Further Reading

Next post: further tuning
