Sunday, April 15, 2018

Fun with SVD: MNIST Digits

Singular Value Decomposition is a very powerful linear algebra operation. It factorizes a matrix into three matrices which have some interesting properties. Among other things, SVD gives you a ranking of how 'typical' or 'normal' a row or column is in comparison to the other rows or columns. I created a 100x784 matrix of gray-scale values, where each row is a sample image and each column is a position in the 28x28 gray-scale raster. The following chart gives the 100 digits ranked from "outlier" to "central", with the normalized rank above each digit.


By this ranking, fuzzy/smudgy images are outliers and cleaner lines are central. Or, 4's are central. Here's the code:

# variation of code from Tariq's book
# python notebook for Make Your Own Neural Network
# working with the MNIST data set
#
# (c) Tariq Rashid, 2016
# license is GPLv2

import numpy as np
import matplotlib.pyplot as py
%matplotlib inline

# strangely, does not exist in numpy
def normalize(v):
    max_v = -10000000
    min_v = 10000000
    for x in v:
        if (x > max_v):
            max_v = x
        if (x < min_v):
            min_v = x
    scale = 1.0/(max_v - min_v)
    offset = -min_v
    for i in range(len(v)):
        v[i] = (v[i] + offset) * scale 
    return v
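
# note: for a numpy array, the same min-max scaling could also be written as
#   v = (v - v.min()) / (v.max() - v.min())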

# open the CSV file and read its contents into a list
data_file = open("mnist_dataset/mnist_train_100.csv", 'r')
data_list = data_file.readlines()
data_file.close()
rows = len(data_list)
image_mat = np.zeros((rows, 28 * 28))
for row in range(rows):
    dig = data_list[row][0]  # first character of the line is the label digit (not used in this analysis)
    all_values = data_list[row].split(',')
    image_vector = np.asfarray(all_values[1:])
    image_mat[row] = (image_vector / 255.0 * 0.99) + 0.01
(u, s, v) = np.linalg.svd(image_mat)
row_features = normalize(u.dot(s))
# py.plot(np.sort(row_features))
keys = np.argsort(row_features)

grid=10
fig,axes = py.subplots(nrows=rows//grid, ncols=grid)
fig.set_figheight(15)
fig.set_figwidth(15)
py.subplots_adjust(top=1.1)
for row in range(rows):
    ax = axes[row//grid][row%grid]
    ax.set_title("{0:.2f}".format(row_features[keys[row]]), fontsize=10)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(image_mat[keys[row]].reshape(28,28), cmap='Greys', interpolation='None')
fig.savefig('foo.png', bbox_inches='tight')

Sunday, April 8, 2018

Document Summarization with LSA Appendix B: Software

The Test Harness

The scripts for running these tests and the data are in my github repo: https://github.com/LanceNorskog/lsa.

LSA toolkit

The LSA toolkit is available under https://github.com/LanceNorskog/lsa/tree/master/research. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as LUCENE-2899.

Reuters Corpus

The Reuters data and scripts for this analysis project are under https://github.com/LanceNorskog/lsa/tree/master/reuters. ...../data/raw contains the preprocessed Reuters article corpus: the articles are reformatted to one sentence per line, and only articles with at least 10 sentences are kept. The toolkit includes a script that runs each article against the Solr Document Summarizer and saves the XML output, and a script that applies XPath expressions to that output, producing one CSV line per article in one CSV file per algorithm. The per-algorithm keys include both the regularization algorithm and whether parts-of-speech filtering was applied.
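
I won't reproduce the actual scripts here, but the XML-to-CSV step looks roughly like this sketch in Python (the file paths and element names are placeholders, not the real output schema, which is in the repo):

import csv
import glob
import xml.etree.ElementTree as ET

# hypothetical layout: one summarizer XML file per article;
# the element names below are placeholders for the real schema
with open("algorithm_x.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for path in sorted(glob.glob("output/algorithm_x/*.xml")):
        root = ET.parse(path).getroot()
        scores = [e.text for e in root.findall(".//sentence/score")]
        writer.writerow([path] + scores)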

Analysis

The analysis phase used KNime to preprocess the CSV data. KNime rules created additional columns calculated from the generated ones, and then built a pivot table which summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I made the charts in LibreOffice instead, but then had to copy them into a DOC file in MS Word (and not LibreOffice!) to get plain JPEGs of the charts.

Further Reading

The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.

The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.

Document Summarization with LSA Part 0: Basics

This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper: 
The application is to add summaries to the LucidFind application, a search engine devoted to open-source text processing projects.

Core concepts

Document Corpus

A corpus is just a collection of documents. There are many standardized text document corpuses out there used as data in research, the way standardized strains of white mice are used in laboratory research. In this case, we use the sentences in one document as the corpus for that document. The matrix has sentences for rows and terms for columns.

Term Vector

A term vector is a "bag of words" with numbers attached. We are going to take all of the words in the document corpus and create a matrix where documents are on one side and words are on the other, and add a value for each word where it occurs in a document. The standard ways are to add a count, or just set it to one. This is a matrix where the rows are term vectors for the documents. This gives a numerical representation of the document corpus which has all grammatical information ripped out. This Bayesian representation is surprisingly effective, as we will see.
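
As a rough sketch of this step (my own illustration in plain Python/numpy, not the toolkit's code, with deliberately naive whitespace tokenization), the counting version looks like this:

import numpy as np

sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
]

# vocabulary: one column per distinct word
vocab = sorted({word for s in sentences for word in s.split()})
col = {word: j for j, word in enumerate(vocab)}

# one row per sentence, one column per term; each cell is a word count
term_matrix = np.zeros((len(sentences), len(vocab)))
for i, sentence in enumerate(sentences):
    for word in sentence.split():
        term_matrix[i, col[word]] += 1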

Singular Value Decomposition

SVD is a linear algebra operation. Given a matrix, it creates three matrices based on the original matrix. The one we care about is the "left feature matrix", commonly called U in the literature. Its rows are "document feature vectors", which represent a kind of importance for each document. A feature, in machine learning, refers to signal in noisy data. Applied to a matrix of sentences vs. terms, the document with the largest value in the left-hand column of U is the most "important" in the corpus. The remaining values in the row also contribute to importance.
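
Continuing the sketch above (again my own illustration, not the summarizer's code), numpy's SVD gives the left feature matrix directly, and the first column of U can be read as a rough importance score for each sentence:

# term_matrix is the sentence-by-term count matrix from the previous sketch
u, s, vt = np.linalg.svd(term_matrix, full_matrices=False)

# column 0 of U corresponds to the largest singular value; the sign of a
# singular vector is arbitrary, so compare magnitudes
importance = np.abs(u[:, 0])
print(sentences[int(np.argmax(importance))])   # the most "important" sentence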

Tying it together

LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.
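
To make that concrete, here is a minimal end-to-end sketch under the same assumptions as above (naive tokenization, raw counts, plain numpy, no parts-of-speech filtering); it returns the top sentences for the summary and the top terms for a tag cloud:

import numpy as np

def summarize(sentences, num_sentences=2, num_terms=5):
    # bag-of-words matrix: rows are sentences, columns are terms
    vocab = sorted({w for s in sentences for w in s.split()})
    col = {w: j for j, w in enumerate(vocab)}
    mat = np.zeros((len(sentences), len(vocab)))
    for i, sentence in enumerate(sentences):
        for w in sentence.split():
            mat[i, col[w]] += 1

    u, sing, vt = np.linalg.svd(mat, full_matrices=False)

    # summary sentences: largest magnitudes in the first column of U
    top_sents = np.argsort(-np.abs(u[:, 0]))[:num_sentences]
    # tag cloud: terms with the largest weights in the first row of V^T
    top_terms = np.argsort(-np.abs(vt[0]))[:num_terms]

    return ([sentences[i] for i in sorted(top_sents)],
            [vocab[j] for j in top_terms])

print(summarize([
    "latent semantic analysis finds structure in a corpus of text",
    "singular value decomposition factors the corpus matrix",
    "we rank sentences by their svd feature vectors",
    "the cat sat on the mat",
]))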