Thursday, September 6, 2012

Document Summarization with LSA #5: Software

The Test Harness

The scripts for running these tests and the data are in my GitHub repo:

LSA toolkit and Solr DocumentSummarizer class

The LSA toolkit is available under ... The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as LUCENE-2899.

The Summarizer optionally uses OpenNLP to do sentence parsing and parts-of-speech (POS) analysis. It uses the OpenNLP parts-of-speech tool to filter for nouns and verbs, dropping all other words. Previous experiments used both raw sentences and sentences limited to nouns and verbs; the POS-stripped sentences worked 10-15% better in every algorithm combination, so this set of benchmarks did not bother to try the full sentences.
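The noun-and-verb filter is easy to picture in a few lines. This is only a minimal sketch in Python, not the Summarizer's code: it assumes the tokens have already been tagged, and that the tagger emits Penn Treebank-style tags where noun tags start with "NN" and verb tags start with "VB". The function name `keep_nouns_and_verbs` is made up for illustration.

```python
from typing import Iterable, List, Tuple

# Assumption: Penn Treebank-style tags. Noun tags begin with "NN"
# (NN, NNS, NNP, NNPS) and verb tags begin with "VB" (VB, VBD, VBZ, ...).
KEEP_PREFIXES = ("NN", "VB")


def keep_nouns_and_verbs(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Return only the words whose POS tag marks a noun or a verb."""
    return [word for word, tag in tagged if tag.startswith(KEEP_PREFIXES)]


# Hand-tagged example sentence: (word, tag) pairs.
sentence = [
    ("The", "DT"),
    ("bank", "NN"),
    ("raised", "VBD"),
    ("rates", "NNS"),
    ("sharply", "RB"),
]
stripped = keep_nouns_and_verbs(sentence)
```

The stripped sentence, rather than the raw one, is what would be handed to the term-vector stage.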

Reuters Corpus

The Reuters data and scripts for this analysis project are under ... The ...../data/raw directory is the preprocessed Reuters article corpus: the articles are reformatted into one sentence per line, and only articles with 10 or more sentences are kept. The toolkit includes a script to run each article against the Solr Document Summarizer and save the XML output, and a second script that applies XPath expressions to turn each article's XML into one CSV line, written to one CSV file per algorithm. The per-algorithm keys encode both the regularization algorithm and whether parts-of-speech filtering was applied.
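The XML-to-CSV step can be sketched roughly as follows. This is not the toolkit's script: the XML layout, the element names, and the column names are all hypothetical, since the Summarizer's real output format is not shown here. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree` module.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical Summarizer response; the real element names will differ.
SAMPLE = """<response>
  <article>
    <id>r-001</id>
    <algorithm>binary_pos</algorithm>
    <sentence rank="1">4</sentence>
    <sentence rank="2">0</sentence>
  </article>
</response>"""

# (CSV column name, XPath expression) pairs; both are illustrative assumptions.
FIELDS = [
    ("article_id", "./article/id"),
    ("algorithm", "./article/algorithm"),
    ("top_sentence", "./article/sentence[@rank='1']"),
]


def xml_to_row(xml_text, fields=FIELDS):
    """Evaluate each XPath against one article's XML and return one CSV row."""
    root = ET.fromstring(xml_text)
    return [root.findtext(path, default="").strip() for _, path in fields]


def rows_to_csv(rows, fields=FIELDS):
    """Write a header line plus one line per article into a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([name for name, _ in fields])
    writer.writerows(rows)
    return buf.getvalue()


row = xml_to_row(SAMPLE)
csv_text = rows_to_csv([row])
```

To get one CSV file per algorithm, the rows would be grouped by the algorithm key before writing, with one output file per group.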


The analysis phase used KNIME to preprocess the CSV data. KNIME rules created additional columns calculated from the generated columns, and then a pivot table that summarized the data per algorithm. This data was saved into a new CSV file. KNIME's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I had to make the charts in LibreOffice instead, and then copy them into a DOC file in MS Word (and not LibreOffice!) to get plain JPEGs of the charts.
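For readers without a visual analytics tool at hand, the per-algorithm pivot can be approximated in plain Python. This is only a sketch of the idea, not the workflow used for these posts: the column names `algorithm` and `score` and the sample numbers are invented, and the mean is just one possible summary.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def pivot_mean(csv_text, key="algorithm", value="score"):
    """Group CSV rows by `key` and return the mean of `value` for each group."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[key]].append(float(row[value]))
    return {name: mean(values) for name, values in sorted(groups.items())}


# Invented sample data: one line per article per algorithm.
data = "algorithm,score\nbinary_pos,1.0\nbinary_pos,3.0\ntf_pos,2.0\n"
summary = pivot_mean(data)
```

The resulting dictionary has one entry per algorithm key and could be written back out as the summary CSV that feeds the charts.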

Further Reading

The KNIME data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.

The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.

Next post: further tuning
