Sunday, April 15, 2018

Fun with SVD: MNIST Digits

Singular Value Decomposition is a very powerful linear algebra operation. It factorizes a matrix into three matrices with some interesting properties. Among other things, SVD gives you a ranking of how 'typical' or 'normal' a row or column is in comparison to the other rows or columns. I created a 100x784 matrix of gray-scale values, where each row is a sample image and each column is a position in the 28x28 gray-scale raster. The following chart gives the 100 digits ranked from "outlier" to "central", with the normalized rank above each digit.


By this ranking, fuzzy or smudgy images are outliers and cleanly drawn digits are central. Or, perhaps, 4's are central. Here's the code:

# variation of code from Tariq Rashid's book
# python notebook for Make Your Own Neural Network
# working with the MNIST data set
#
# (c) Tariq Rashid, 2016
# license is GPLv2

import numpy as np
import matplotlib.pyplot as py
%matplotlib inline

# strangely, does not exist in numpy
def normalize(v):
    # min-max scale the vector into the range [0, 1]
    v = np.asarray(v, dtype=float)
    min_v = v.min()
    max_v = v.max()
    return (v - min_v) / (max_v - min_v)

# open the CSV file and read its contents into a list
data_file = open("mnist_dataset/mnist_train_100.csv", 'r')
data_list = data_file.readlines()
data_file.close()
rows = len(data_list)
image_mat = np.zeros((rows, 28 * 28))
for row in range(rows):
    dig = data_list[row][0]   # first character is the digit label (not used below)
    all_values = data_list[row].split(',')
    image_vector = np.asfarray(all_values[1:])
    image_mat[row] = (image_vector / 255.0 * 0.99) + 0.01
(u, s, v) = np.linalg.svd(image_mat)
row_features = normalize(u.dot(s))
# py.plot(np.sort(row_features))
keys = np.argsort(row_features)

grid=10
fig,axes = py.subplots(nrows=rows//grid, ncols=grid)
fig.set_figheight(15)
fig.set_figwidth(15)
py.subplots_adjust(top=1.1)
for row in range(rows):
    ax = axes[row//grid][row%grid]
    ax.set_title("{0:.2f}".format(row_features[keys[row]]), fontsize=10)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(image_mat[keys[row]].reshape(28,28), cmap='Greys', interpolation='none')
fig.savefig('foo.png', bbox_inches='tight')

Sunday, April 8, 2018

Document Summarization with LSA Appendix B: Software

The Test Harness

The scripts for running these tests and the data are in my github repo: https://github.com/LanceNorskog/lsa.

LSA toolkit

The LSA toolkit is available under https://github.com/LanceNorskog/lsa/tree/master/research. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as LUCENE-2899.

Reuters Corpus

The Reuters data and scripts for this analysis project are under https://github.com/LanceNorskog/lsa/tree/master/reuters. ...../data/raw is the preprocessed Reuters article corpus: the articles are reformatted into one sentence per line and limited to those with 10+ sentences. The toolkit includes a script to run each article against the Solr Document Summarizer and save the XML output, and a script to apply XPath expressions that extract a CSV line per article, collected into one CSV file per algorithm. The per-algorithm keys include both the regularization algorithm and whether parts-of-speech filtering was applied.

Analysis

The analysis phase used KNime to preprocess the CSV data. KNime rules computed additional columns from the generated ones, and then built a pivot table that summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I made the charts in LibreOffice instead, but then had to copy them into a DOC file in MS Word (and not LibreOffice!) to get plain JPEGs out of the charts.

Further Reading

The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.

The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.

Document Summarization with LSA Part 0: Basics

This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper: 
The application is to add summaries to the LucidFind application, a search engine devoted to open-source text processing projects.

Core concepts

Document Corpus

A corpus is just a collection of documents. There are many standardized text document corpuses out there used as data in research, the way standardized white mice strains are used in laboratory research on RAs. In this case, we use the sentences in one document as the corpus for that document. The matrix has sentences for rows and terms for columns.

Term Vector

A term vector is a "bag of words" with numbers attached. We are going to take all of the words in the document corpus and create a matrix where documents are on one side and words are on the other, with a value for each word where it occurs in a document. The standard choices are to use the word's count, or to just set it to one. This is a matrix where the rows are the term vectors of the documents. This gives a numerical representation of the document corpus with all grammatical information ripped out. This Bayesian representation is surprisingly effective, as we will see.
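
As a concrete illustration, here is a minimal sketch of building such a matrix in plain Python and numpy. It assumes whitespace tokenization and simple count-or-binary weighting; the real toolkit uses OpenNLP tokenization and parts-of-speech filtering.

import numpy as np

def term_matrix(sentences, binary=False):
    # vocabulary over the whole corpus (here, the sentences of one document)
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    mat = np.zeros((len(sentences), len(vocab)))
    for row, sent in enumerate(sentences):
        for w in sent.lower().split():
            if binary:
                mat[row, index[w]] = 1.0      # just flag the word
            else:
                mat[row, index[w]] += 1.0     # count occurrences
    return mat, vocab

sentences = ["the cat sat on the mat",
             "the dog chased the cat",
             "dogs and cats are pets"]
mat, vocab = term_matrix(sentences)
print(mat.shape)    # (number of sentences, number of distinct terms)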

Singular Value Decomposition

SVD is a linear algebra operation. Given a matrix, it creates three matrices based on the original matrix. The one we care about is the "left feature matrix", commonly called U in the literature. Its rows are "document feature vectors", which represent a kind of importance for each document. A feature, in machine learning, refers to signal in noisy data. Applied to a matrix of sentences vs. terms, the document with the largest value in the left-hand column is the most "important" in the corpus. The remaining values in the row also contribute to importance.
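
A minimal sketch of that ranking step, reusing term_matrix() and sentences from the sketch above (numpy already imported). Ranking by the absolute value of the first column of U is just the simplest reading of "importance" (the absolute value guards against the sign ambiguity of SVD); the later posts compare many regularization variants.

mat, vocab = term_matrix(sentences)
u, s, vt = np.linalg.svd(mat, full_matrices=False)
importance = np.abs(u[:, 0])            # weight of each sentence on the strongest feature
for i in np.argsort(-importance):       # most "important" sentence first
    print("{:.3f}  {}".format(importance[i], sentences[i]))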

Tying it together

LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.

Thursday, September 6, 2012

Document Summarization with LSA Appendix: Tuning

Two problems

I found a couple of problems with this approach: very long sentences and running time & space.

Very Long Sentences

Other test data I looked at were technical papers, which often have very long sentences. The problem is not controlling for sentence length, but that there are too many theme words for the sentence to express one theme well. One could try breaking these up into sentence shards: a 50-word sentence would become three 15-word shards. If we use parts-of-speech trimming, we can break the shards within runs of filler words.
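
A rough sketch of the sharding idea, splitting purely on word count; breaking shards at runs of filler words would need the parts-of-speech tagger.

def shard(words, target=15):
    # split a long token list into roughly equal shards of about `target` words each
    n = max(1, round(len(words) / float(target)))
    size = -(-len(words) // n)          # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]

long_sentence = ("the quick brown fox jumps over the lazy dog " * 6).split()
for piece in shard(long_sentence):
    print(len(piece), " ".join(piece))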

Optimizations

It is possible to create an optimized version of LSA that only ranks sentences or terms via Random Indexing. RI deliberately throws away the identity of one side of the sentence/term matrix. It can rank sentences or terms, but not both. This algorithm runs much faster than the full sentence/term version, and is the next step in creating an interactive document summarizer or a high-quality tag cloud generator.

Random Projection

Random Projection is a wonderfully non-intuitive algorithm. If you multiply a matrix by another matrix filled with random numbers, the output matrix will have approximately the same cosine distance ratios across all pairs of vectors. In a document-term matrix, all of the documents will still have roughly the same mutual 'relevance' (measured by cosine distance, which magically matches tf-idf relevance ranking).

If the random matrix is the (transposed) size of the original matrix, both the row and column vectors will retain the above distance property. If the random matrix has more or fewer dimensions on one side, the resulting matrix will retain the identity of vectors from the constant side, but will lose the identity of the vectors on the varying side. If you have a sentence-term matrix of 50 sentences x 1000 terms, and you multiply it by a random matrix of 1000 x 100, you get a matrix of 50 sentences x 100 "sketch" vectors, where every sketch is a low-quality summarization of the term vectors. The 50 sentences will still have roughly the same cosine ratios; we have merely thrown away the identity of the term vectors. Since the running time of SVD grows much faster than linearly, we now have a much smaller matrix that will give us the orthogonal decomposition of the sentences (but the terms are forgotten). We can invert this and calculate the SVD for terms without sentences.
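
A small numpy demonstration of the claim, with the caveat that the cosines are only approximately preserved and the quality depends on the sketch width (100 columns here).

import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((50, 1000))                 # 50 sentences x 1000 terms
proj = rng.standard_normal((1000, 100))       # random projection: terms x sketch width
sketch = docs.dot(proj)                       # 50 sentences x 100 "sketch" columns

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

for i, j in [(0, 1), (0, 2), (10, 40)]:       # a few pairs, before and after projection
    print("%.3f  %.3f" % (cosine(docs[i], docs[j]), cosine(sketch[i], sketch[j])))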

Random Indexing

The above description posits creating the entire sentence/term matrix, then doing the complete multiplication. In fact, you can create the resulting sentence/sketch matrix directly, one term at a time. This considerably cuts memory usage and running time. I include this explanation because there are few online sources describing Random Indexing, and the following two forget to explain RI's roots in Random Projection.
http://www.sics.se/~mange/random_indexing.html
http://www.sics.se/~mange/papers/RI_intro.pdf
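
Here is a minimal sketch of that incremental construction, assuming each term gets a fixed sparse random "index vector" that is added into the sentence's sketch wherever the term occurs. This is my paraphrase of the papers above, not their reference implementation.

import zlib
import numpy as np

WIDTH = 100          # sketch width
INDEX_VECTORS = {}   # term -> its fixed sparse random "index vector"

def index_vector(term, nonzero=8):
    # a mostly-zero vector with a few +1/-1 entries, generated lazily and deterministically per term
    if term not in INDEX_VECTORS:
        rng = np.random.default_rng(zlib.crc32(term.encode("utf-8")))
        vec = np.zeros(WIDTH)
        slots = rng.choice(WIDTH, size=nonzero, replace=False)
        vec[slots] = rng.choice([-1.0, 1.0], size=nonzero)
        INDEX_VECTORS[term] = vec
    return INDEX_VECTORS[term]

def sentence_sketch(sentence):
    # accumulate the sketch one term at a time -- no full sentence/term matrix is ever built
    sketch = np.zeros(WIDTH)
    for term in sentence.lower().split():
        sketch += index_vector(term)
    return sketch

print(sentence_sketch("the cat sat on the mat")[:10])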

Fast Random Projection

Also see Achlioptas' paper for a wicked speed-up of random projection that is very suitable for densely populated data.
http://users.soe.ucsc.edu/~optas/papers/jl.pdf
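
As I read the paper, the speed-up comes from replacing the Gaussian random matrix with a sparse one whose entries are +1, 0, or -1 with probabilities 1/6, 2/3, 1/6, scaled by sqrt(3), so two thirds of the multiplications drop out. A sketch under that reading:

import numpy as np

def achlioptas_matrix(rows, cols, seed=0):
    # entries are sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6
    rng = np.random.default_rng(seed)
    signs = rng.choice([1.0, 0.0, -1.0], size=(rows, cols), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0) * signs

proj = achlioptas_matrix(1000, 100)
docs = np.random.default_rng(1).random((50, 1000))
sketch = docs.dot(proj)       # same shape as the dense Gaussian projection above
print(sketch.shape)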

Document Summarization with LSA #5: Software

The Test Harness

The scripts for running these tests and the data are in my github repo: https://github.com/LanceNorskog/lsa.

LSA toolkit and Solr DocumentSummarizer class

The LSA toolkit is available under https://github.com/LanceNorskog/lsa/tree/master/research. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as LUCENE-2899.

The Summarizer optionally uses OpenNLP to do sentence parsing and parts-of-speech analysis. It uses the OpenNLP parts-of-speech tool to filter for nouns and verbs, dropping all other words. Previous experiments used both raw sentences and sentences limited to nouns & verbs, and the POS-stripped sentences worked 10-15% better in every algorithm combination, so this set of benchmarks did not bother to try the full sentences.

Reuters Corpus

The Reuters data and scripts for this analysis project are under https://github.com/LanceNorskog/lsa/tree/master/reuters. ...../data/raw is the preprocessed Reuters article corpus: the articles are reformatted into one sentence per line and limited to those with 10+ sentences. The toolkit includes a script to run each article against the Solr Document Summarizer and save the XML output, and a script to apply XPath expressions that extract a CSV line per article, collected into one CSV file per algorithm. The per-algorithm keys include both the regularization algorithm and whether parts-of-speech filtering was applied.

Analysis

The analysis phase used KNime to preprocess the CSV data. KNime rules computed additional columns from the generated ones, and then built a pivot table that summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I made the charts in LibreOffice instead, but then had to copy them into a DOC file in MS Word (and not LibreOffice!) to get plain JPEGs out of the charts.

Further Reading

The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.

The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.

Next post: further tuning


Document Summarization with LSA #4: Individual measurements

The Measurements

In this post we review the individual measures. These charts show each measure applied to all the algorithms, along with the basic statistical summary of the full dataset (not the per-algorithm aggregates). The measures are MRR, Rating, Sentence Length, and Non-zero. MRR is a common method for evaluating search results and other kinds of recommendations. The other three were fabricated for this analysis.

Mean Reciprocal Rank

MRR is a common measure for search results. It attempts to model the unforgiving way in which users react to mistakes in the order of search results. It measures the position of a preferred search result in a result list: if the "right" answer is third in the list, its reciprocal rank is 1/3.

This statistic is the mean of three reciprocal ranks, one for each recommended sentence, based on how far it lands from where it should be. If the second sentence is recommendation #2, that scores a 1. If it is #1 or #3, that scores 1/2. If the third sentence is recommendation #3, that scores a 1; if it is #2, that scores 1/2, and if it is #1, that scores 1/3. The measure goes down to the 5th sentence.
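
This is how I would code the variant just described; the exact cutoff and tie-breaking in the real test harness (see the github repo) may differ.

def mrr_variant(recommended, expected=(1, 2, 3), cutoff=5):
    # recommended: document sentence numbers (1-based) in recommendation order.
    # each expected sentence scores 1/(1 + distance between the slot it lands in
    # and the slot it "should" occupy); missing from the top `cutoff` scores 0.
    top = list(recommended[:cutoff])
    scores = []
    for slot, sentence in enumerate(expected, start=1):
        if sentence in top:
            found = top.index(sentence) + 1
            scores.append(1.0 / (1 + abs(found - slot)))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

print(mrr_variant([2, 1, 3]))   # first two swapped: (1/2 + 1/2 + 1) / 3 = 0.667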

Rating (0-5)

Rating is a heavily mutated form of "Precision@3". It tries to model how the user reacts in a UI that shows snippets of the top three recommended sentences. 0 means no interest. 1 means at least one sentence placed (the Non-zero measure). 2-5 measure how well the first three recommendations match the first and second spots. In detail (a code sketch follows the list):
5: first and second results are lede and subordinate
4: first result is lede
3: second result is lede
2: first or second are within first three sentences
1: third result is within first three sentences
0: anything else
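
Here is a sketch of my reading of that scale; the authoritative version is the harness code in the github repo.

def rating(recommended):
    # `recommended` holds the document sentence numbers (1-based) of the
    # top three recommendations, in order
    first, second, third = (list(recommended) + [0, 0, 0])[:3]
    if first == 1 and second == 2:
        return 5
    if first == 1:
        return 4
    if second == 1:
        return 3
    if first in (1, 2, 3) or second in (1, 2, 3):
        return 2
    if third in (1, 2, 3):
        return 1
    return 0

print(rating([1, 2, 7]))   # lede and subordinate in the top two spots -> 5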



MRR and Rating (green and yellow) correlate very consistently in the chart in Post #3. Rating tracks with MRR, but is more extreme; note the wider standard deviation. This indicates that the Rating formula is a good statistic for modeling unforgiving users.

Sentence Length

This measures the mean sentence length of the top two recommended sentences. The sentence length is the number of nouns and verbs in the sentence, not the number of words. This indicates how well the algorithm compensates for the dominance of longer sentences.


Non-Zero

The percentage of results in which the first, second, or third sentence was recommended for one of the three top spots.

The mean article length in the corpus is (I think) 22 sentences. A mean of 60% when choosing 3 sentences out of 22 is much better than random recommendations.
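
For comparison, assuming the Non-zero measure means "at least one of the first three sentences lands in the top three recommendations", the random baseline works out to roughly 37%.

from math import comb

# chance that 3 sentences picked at random from a 22-sentence article
# include at least one of the first three sentences
baseline = 1 - comb(19, 3) / comb(22, 3)
print(round(baseline, 3))    # ~0.371, versus ~0.60 for the better algorithms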

Precision vs. Recall

In Information Retrieval jargon, precision is the accuracy of a ranking algorithm, and recall is the ability to find results. The three success measurements are precision measures. The "dartboard" measure is a recall measure.

From reading the actual sentences and recommendations, binary+normal and augNorm+normal had pretty good precision. These two also achieved the best recall at around 65%. This level would not be useful in a document summarization UI. I would pair this with a looser strategy to find related sentences by searching with the theme words. 

Previous Example

In the example in Post #2, the three top-rated sentences were 4, 3, and 6. Since only one of three made the cut, the rating algorithm gave this a three. Note that the example was not processed with Parts-of-Speech removal and used the binary algorithm, and still hit the dartboard. This article is the first in the dataset, and was chosen effectively at random.

Further reading


Next post: software details



Document Summarization with LSA #3: Preliminary results

Overall Analysis

Measures are tuned for Interactive UI

This implementation is targeted for interactive use in search engines. A search UI usually has the first few results shown in ranked order, with the option to go to the next few results.  This UI is intended to show the first three ranked sentences at the top of an entry with the theme words highlighted. Users are not forgiving of mistakes in these situations. The first result is much more important than the second, and so forth. People rarely click through to the second page.

The measures of effectiveness are formulated with this in mind. We used three:
  1. A variant of Mean Reciprocal Rank (MRR).
  2. "Rating" is a measure we created to model the user's behavior in a summarization UI. Our MMR variant and Rating are defined in the next Post. 
  3. Non-zero counts whether the algorithm placed any recommendations in the top three. "Did we even hit the dartboard?"
A separate problem is that sentences with more words can dominate. The Sentence Length measure reports the length of the highest-rated sentences; in this chart, "Inverse Length" measures how well the algorithm countered the effects of sentence length.

Overall Comparison Chart


  • Key to algorithm names: "binary_normal" means that "binary" was used to create each cell, while "normal" multiplied each term vector with the mean normalized term vector. If there is no second key, the global value was 1. See post #1 for the full list of algorithms.
This display is a normalized version of the mean results for all 24 algorithm pairs, with four different measures. In all four, higher is better. "Inverse Length" means "how well it suppresses the length effect", "Rating" is the rating algorithm described above, "MRR" is our variant implementation of Mean Reciprocal Rank, and >0 is the number of results where any of the first three were in the first three sentences. None of these are absolutes, and the scales do not translate between measures. They simply show a relative ranking for each algorithm pair in the four measures: compare green to green, etc. The next post gives the detailed measurements in real units.

Grand Revelation

The grand revelation is: always normalize the term vector! All 5 local algorithms worked best with 'normal' as the global algorithm. The binary function truncates the term counts to 1. Binary plus normalizing the term vector was by far the best in all three success measurements, and was middling in counteracting sentence length. AugNorm + normal was the highest achiever that also compensates well for sentence length. TF + normal was the best overall for sentence length, but was only average for the three effectiveness measures.
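
For readers who don't want to dig back to post #1, here is a rough sketch of the kind of local weighting plus term-vector normalization being compared. The exact formulas in the toolkit may differ; in particular, I am reading 'normal' here as scaling each term vector to unit length.

import numpy as np

def local_weight(counts, kind="binary"):
    # common LSA-style local weightings of a sentence/term count matrix
    counts = np.asarray(counts, dtype=float)
    if kind == "binary":
        return (counts > 0).astype(float)                 # truncate counts to 1
    if kind == "tf":
        return counts                                     # raw term frequency
    if kind == "augnorm":
        maxes = counts.max(axis=1, keepdims=True)
        maxes[maxes == 0] = 1.0
        return (counts > 0) * (0.5 + 0.5 * counts / maxes)
    raise ValueError(kind)

def normalize_rows(mat):
    # scale each term vector (row) to unit length
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return mat / norms

counts = np.array([[2, 0, 1], [0, 3, 1]])
weighted = normalize_rows(local_weight(counts, "binary"))   # the winning combination
print(weighted)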

Next post: detailed analysis