tag:blogger.com,1999:blog-30950493722196747922023-11-15T22:31:13.529-08:00Uncle Lance's Ultra Whiz BangLance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.comBlogger28125tag:blogger.com,1999:blog-3095049372219674792.post-75038401960129529262018-10-27T19:12:00.000-07:002018-10-27T19:12:02.627-07:00Deep Meter - #0b (Post-mortem on first attempt)<h3>
Problems</h3>
The project has some serious flaws:<br />
<br />
<ol>
<li>The symbol distribution has a severe "Zipf problem". The majority of syllables occur only once in the corpus, while common ones appear tens of thousands of times. Tensorflow has a feature to counter this problem; I will try it out soon.</li>
<li>More important is the encoding itself. I wanted to keep it to a fixed number of syllables to avoid using sequential neural networks. However, the syllable format causes the severe spread above, and a huge dictionary size (15k for a larger corpus, 6k for this one). The CMUdict is itself in ARPAbet phonemes, and there are only 50 of those. They will also have a severe Zipf distribution, but not the ridiculously long tail. A phoneme encoding will require a variable-size output, but it should take at most 10*4 phonemes to encode a ten-syllable sentence. That's the same information stored in 10*4*50 bits, instead of 10*15k bits.</li>
<ol>
<li>It is possible to keep the current encoding and hash syllables from the long tail. If they are only syllables that are part of longer words, the decoder can hunt for words of the form 'syllable-?' or 'syllable-?-syllable' when turning 10 one-hots into a sentence. This feels like hacking instead of solving the problem well.</li>
</ol>
<li>Not enough training data. There are many resources for well-formed sentences on the web. Just scanning any text for "10 syllables with the right meter" should flush out more sentences. Also, the "Paraphrase Database" project supplies millions of word & phrase pairs that mean the same thing. It should be possible to generate variations of a sentence by swapping in paraphrases that have a different meter.</li>
<ol>
<li>Another option is to split larger sentences into clauses. This is very slow, though, and can't really scale to the sizes we need.</li>
</ol>
<li>Storing training data. I tried pre-generating the USE vectors for the sentences and it ran into the gigabytes quickly. This is why the notebook re-generates the USE vectors for each epoch. I believe this is the gating factor, since adding model layers did not slow training down appreciably. The Keras "read from directory" feature might be what I need, though I'm not sure it will run faster from disk; that feature is designed for image processing.</li>
<li>Source data for short sentences is hard to find. The MSCOCO database is a good place to start: it has 5 captions apiece for 29k images.</li>
<li>Evaluation functions: the loss function is wrong. There is no loss/eval function pair for "all 1s must be correct, all 0s must be 0, treating the two failures with equal weight".</li>
<li>CMUdict decoder: I need to write the "[[syllable, weight], ...] -> word set" decoder, which searches for possible sentences and scores them based on the one-hot value for the given syllable inside each syllable slot.</li>
</ol>
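The loss described in the evaluation-functions item above doesn't exist off the shelf, but a version can be sketched by hand: score the 1-positions and the 0-positions separately, then weight the two failure modes equally. This is an illustrative pure-Python sketch, not the project's actual code:

```python
import math

def balanced_multihot_loss(y_true, y_pred, eps=1e-7):
    """Hypothetical loss: penalize wrong 1s and wrong 0s with equal weight,
    regardless of how many of each appear in the target vector."""
    ones = [p for t, p in zip(y_true, y_pred) if t == 1]
    zeros = [p for t, p in zip(y_true, y_pred) if t == 0]
    # average cross-entropy over the 1-positions and the 0-positions separately
    loss_ones = -sum(math.log(max(p, eps)) for p in ones) / max(len(ones), 1)
    loss_zeros = -sum(math.log(max(1.0 - p, eps)) for p in zeros) / max(len(zeros), 1)
    return 0.5 * (loss_ones + loss_zeros)
```

A Keras version would wrap the same arithmetic in tensor ops and pass it to `model.compile` as a custom loss.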
<h3>
Larger vision</h3>
<div>
The concept is to generate various autoencoders for different meters. Since the decoder phase has 3 hidden layers, it might be possible to freeze the first two, and swap in a separate final hidden and decoder weight set for each different meter. This is on the supposition that the inner layers store higher abstractions and the outer layers deal with word generation. Dubious, but worth trying.</div>
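The freeze-and-swap idea can be sketched in plain Python; all the shapes, weights, and meter names below are made up for illustration:

```python
# Sketch of the proposed decoder: two shared hidden layers whose weights are
# frozen after initial training, plus a swappable final layer per meter.

def dense(x, w):
    # toy dense layer: w is a list of per-output-unit weight vectors
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in w]

shared_w1 = [[0.1, 0.2], [0.3, 0.4]]   # frozen (higher abstractions)
shared_w2 = [[0.5, 0.6], [0.7, 0.8]]   # frozen

meter_heads = {
    "iambic_pentameter": [[1.0, 0.0], [0.0, 1.0]],
    "dactylic_hexameter": [[0.0, 1.0], [1.0, 0.0]],
}

def decode(thought_vector, meter):
    h = dense(dense(thought_vector, shared_w1), shared_w2)
    return dense(h, meter_heads[meter])   # only this layer differs per meter
```

In Keras the same effect comes from setting `trainable = False` on the shared layers and compiling one model per meter head.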
<div>
<br /></div>
<div>
And then find a host & make a website where you can interactively try it.</div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-12380096359086718592018-10-27T18:29:00.001-07:002018-11-04T16:21:02.440-08:00Deep Meter - #0a (Sneak Peek!)It's time to unveil my little toy project in Natural Language Processing (NLP). "Deep Meter" is a deep learning project which rephrases arbitrary English text in various poetic meters. The raw materials for this fever dream are as follows:<br />
1) The Universal Sentence Encoder. This is a set of deep models which transform a clause, a sentence, or a paragraph into a "thought vector". That is, it turns the sentence "I hate my new iphone" into a set of 512 numbers that (very opaquely) encode these concepts: "recent purchase, mobile phone, strong negative sentiment, present tense". The USE also turns "This new mobile is killing me." into a different set of 512 numbers, but the cosine distance between the two vectors is very small. Since it encodes a set of concepts and not just a sequence of words, the USE could be the basis of an English-German translator. The USE is hosted and updated at Google in their "Tensorflow-Hub" model library project.<br />
<a href="https://arxiv.org/abs/1803.11175" style="color: #3465a4;">https://arxiv.org/abs/1803.11175</a><br />
2) The Gutenberg Poetry Corpus. This is a conveniently packaged archive of most of the body text of every book in Project Gutenberg that is a compilation of poetry.<br />
<a href="https://github.com/aparrish/gutenberg-poetry-corpus" style="color: #3465a4;">https://github.com/aparrish/gutenberg-poetry-corpus</a><br />
3) The CMU Pronunciation Dictionary (CMUdict) is a database of common and uncommon English words, proper names, loanwords etc. which gives common pronunciations for each word. The pronunciations are given in the ARPAbet phoneme system. The entries are in fact in a variant of the ARPAbet that includes word stresses.<br />
<a href="http://www.speech.cs.cmu.edu/cgi-bin/cmudict" style="color: #3465a4;">http://www.speech.cs.cmu.edu/cgi-bin/cmudict</a><br />
4) A version of the CMUdict which has syllable markers added. Used for early experiments. This is crucial to classifying the meter of lines from #2.<br />
<a href="https://webdocs.cs.ualberta.ca/~kondrak/cmudict.html" style="color: #3465a4;">https://webdocs.cs.ualberta.ca/~kondrak/cmudict.html</a><br />
5) Tensorflow, Keras and Google Colaboratory<br />
Tensorflow (TF) is a library (really an operating system packaged as a library) for doing machine learning. Keras is an abstraction layer over TF and similar projects.<br />
<a href="https://colab.research.google.com/notebooks/welcome.ipynb" style="color: #3465a4;">https://colab.research.google.com/notebooks/welcome.ipynb</a><br />
6) An example notebook that wraps #1 in a convenient package for experimenting with the features of all the items listed above.<br />
<a href="https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/" style="color: #3465a4;">https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/</a><br />
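The "small cosine distance" claim in #1 is easy to check once the 512-number vectors are in hand. The short vectors below are stand-ins for real USE output, which requires the Tensorflow-Hub module to fetch:

```python
import math

def cosine_similarity(u, v):
    # 1.0 means same direction (same "thought"), 0.0 means unrelated
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# illustrative stand-ins for USE vectors of two sentences with the same meaning
hate_iphone = [0.9, 0.1, 0.4]
mobile_killing_me = [0.8, 0.2, 0.5]
print(cosine_similarity(hate_iphone, mobile_killing_me))   # close to 1.0
```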
<br />
Project:<br />
The USE is distributed as only an encoder: it does not generate English sentences from its vectors.<br />
<br />
This project creates a neural network that decodes vectors into sentences. The project plays a sneaky trick: it only trains the network on sentences which are in iambic pentameter. (What's i-p? Poetry in the stress format "du-DUH du-DUH du-DUH du-DUH du-DUH". Rhyme doesn't matter; ten syllables with a hard rhythm is all that counts.) Since the network only knows how to output ten syllables in i-p form, and since the USE turns any sentence into an abstract thought vector, this network should be able to restate any short sentence (or sentence clause) in iambic pentameter.<br />
<br />
Current status: I've written a few tools for data-wrangling (the bane of any machine learning project).<br />
<ul>
<li dir="ltr">a library of utilities for parsing #4</li>
<li dir="ltr">code that reads lines of text, keeps those lines which are in i-p, and records their syllable-ization according to CMUdict. </li>
<li dir="ltr">a Jupyter notebook (based on #6) that reads the above data. </li>
</ul>
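The i-p filter in the second bullet can be sketched as follows. CMUdict marks each vowel phoneme with a stress digit (0 unstressed, 1 primary, 2 secondary); the tiny dictionary here is a stand-in for the real CMUdict lookup:

```python
# Toy stress dictionary; the real filter derives these strings from CMUdict.
TOY_STRESSES = {
    "a": "0", "spear": "1", "the": "0", "hero": "10", "bore": "1",
    "of": "0", "wondrous": "10", "strength": "1",
}

def is_iambic_pentameter(words):
    stresses = "".join(TOY_STRESSES[w] for w in words)
    if len(stresses) != 10:          # exactly ten syllables
        return False
    # du-DUH five times: even positions unstressed, odd positions stressed
    return (all(s == "0" for s in stresses[0::2])
            and all(s in "12" for s in stresses[1::2]))
```

For example, `is_iambic_pentameter("a spear the hero bore of wondrous strength".split())` accepts one of the corpus lines quoted later in this post.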
<div>
The experiment is showing positive signs. The network does some interesting generation: it often finds the right word, or an associated word. Since it works by syllable, it has to generate syllables in sequence that together form words. In one case it found a three-syllable sequence ('eternal'). </div>
<div>
<br /></div>
These samples are cherry-picked for quality and illumination. There are a lot of failures. There's also a lot of stuttering, both of source words and of synonyms of a dominant theme of the line. (These themes are often common in the corpus: stories of ancient heroes, etc.) Here are the original sentences, and the hand-interpreted output of some interesting successes. Capitalized words are loose syllables that didn't match any surrounding words.<br />
<br />
Stuttering.<br />
<ul>
<li>A spear the hero bore of wondrous strength</li>
<li>a mighty sword the spear a sword of spear</li>
</ul>
<br />
A common occurrence is single syllables that are clearly part of a word or a synonym: 'annoy' is the only word that matches NOY.<br />
<ul>
<li>And noise, and tumult rises from the crowd</li>
<li>and crowd a loud the NOY the in the air</li>
</ul>
<br />
It's like an infomercial.<br />
<ul>
<li>Forgot, nutritious, grateful to the taste</li>
<li>and health the goodness sweet the LEE the taste</li>
</ul>
'Cheery' for 'happy'. Trying for 'country'. <br />
<ul>
<li>A happy nation, and a happy king.</li>
<li>the cheery proud TREE and his happy state</li>
</ul>
<br />
'AR' - army. 'joust' could be a cool association with army.<br />
<ul>
<li>Of Greeks a mighty army, all in vain</li>
<li>of all the AR the Greeks the joust of Greeks</li>
</ul>
'SPIH' is 'spirit'. 'TER' is part of 'eternal', which it got correctly later in the sentence. This was the only 3-syllable success.<br />
<ul>
<li>With this eternal silence; more a god</li>
<li>with TER IH SPIH IH god eternal god</li>
</ul>
<br />
Both end in 'DURE'; it wants 'endure' for 'anguish' and 'torment'. WIH is as in 'with', DIH as in 'did'.<br />
<ul>
<li>'And suffer me in anguish to depart'</li>
<li>and leave for WIH and ang in EH DIH DUR</li>
<li><br /></li>
<li>Cannot devise a torment, so it be</li>
<li>cannot a not by by it by DIH DURE</li>
</ul>
<div>
<div>
In short, many examples of finding a two-syllable word that is either in place or an associated word (synonym or more distant association). One example of a three-syllable word.<br />
<br /></div>
Google Colaboratory is a free online service that hosts Jupyter Notebooks and includes a free GPU (badly needed!). You have to sign up for Colab first, and then you can open any Jupyter notebook file from your Google Drive. The Deep Meter notebook here checks out the github project and uses the CMUdict code and cached Project Gutenberg poetry files that I classified by meter. If you sign up for Colab and upload this notebook from this branch, it might actually work. On a GPU runtime it takes maybe 1/2 hour to train. The VMs can be very slow, but GPU speed does not suffer. Don't try it on CPU or TPU, it will take forever!<br />
<br />
If you have your own GPU farm, the only Colaboratory-specific code is the github check-out directory dance at the top. (A Colab notebook starts out in /content on a virtual machine). Everything else should be reproducible. The code uses a cached version of CMUdict.<br />
<br />
<a href="https://github.com/LanceNorskog/deep_meter/tree/First-blog-post" style="color: #3465a4;">https://github.com/LanceNorskog/deep_meter/tree/First-blog-post</a></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-21030499150100916172018-09-23T17:28:00.001-07:002018-10-27T18:31:04.509-07:00Backpropagation<br />
<h2>
Backpropagation</h2>
<h4>
UAT</h4>
The Universal Approximation Theorem (UAT) concerns the simplest version of the structure of weights & non-linear functions used in deep learning. It states roughly that this equation can be solved, to an approximation:<br />
<div style="text-align: center;">
logit(numbers[] * input_weights[][]) * hidden_weights[][] ~= other_numbers[]</div>
<h4>
Backpropagation</h4>
The UAT does not tell us how to do this, only that it is possible. Later, the backpropagation (BP) algorithm was created to realize the possibility of the UAT. BP, as created, is only intended to solve one equation of N inputs and M unknowns. Backpropagation uses "hill-climbing", which assumes that the solution of an equation is at the top of a hill. Wherever it starts, if it climbs uphill, it will eventually hit the top.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnI_kkyfxuCco-OigwJjHz6CuuPOxEnj28rQdCYq6spQaAuT7LOGRaX5sMi9koFGbDeZkSpyD1U11_mJ3IX4fMQKOuqCDLP-IdetbMPQENfUeZw5oMtHHzBL0icIdNveKEr-ywvo2eHdA/s1600/Very_Smooth_Solver.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="260" data-original-width="640" height="81" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnI_kkyfxuCco-OigwJjHz6CuuPOxEnj28rQdCYq6spQaAuT7LOGRaX5sMi9koFGbDeZkSpyD1U11_mJ3IX4fMQKOuqCDLP-IdetbMPQENfUeZw5oMtHHzBL0icIdNveKEr-ywvo2eHdA/s200/Very_Smooth_Solver.png" width="200" /></a></div>
<br />
In a typical Deep Learning application, we take thousands of input samples which are similar, and turn them into equations of the above form. For a simple image classifier, this would be typical. The right-hand side corresponds to whether the image is a picture of a cat, a building or a car.<br />
<div style="text-align: center;">
logit(image_pixels[] * input_weight_matrix[]) * hidden_weight_matrix[] ~= [1,0,0]</div>
Wait a minute! In the math of equation-solving, each of these equations is a separate subspace, to be solved independently.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD2U_WQpdLjj307_4x66OKv8vr5Jj0vIVyt4JoqGsY3VfSscU9eyfXDVf9ue-MOw_sLOYS8x51szZlG58-8ORQo6NrRK5iB-fMNPrOpp_eZ1ffdqXB7NwORZsjGvOt3SvcBN1vNI9WRHQ/s1600/Separate_Universes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="310" data-original-width="640" height="155" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD2U_WQpdLjj307_4x66OKv8vr5Jj0vIVyt4JoqGsY3VfSscU9eyfXDVf9ue-MOw_sLOYS8x51szZlG58-8ORQo6NrRK5iB-fMNPrOpp_eZ1ffdqXB7NwORZsjGvOt3SvcBN1vNI9WRHQ/s320/Separate_Universes.png" width="320" /></a></div>
<br />
How did we get from approximating an equation (UAT) to approximating thousands of equations? The key idea of Deep Learning is to pretend that all of these equations are in one subspace, and that there is a mythical equation that is the centroid of all of these equations.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijeuQUUM9I_B_SQlD1T8EDFMELe_idc1daoj5BIM5QaM_1IiwvRCRy3TdC8DrV54OTUuaT0Z4FZL1I8RyTwOuYT-jLleo2eHA7l4i9CBOnqvWS_ULQxbTZDPKVOCRa64hL2ZCDYRWxmfk/s1600/One_Universe.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="311" data-original-width="644" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijeuQUUM9I_B_SQlD1T8EDFMELe_idc1daoj5BIM5QaM_1IiwvRCRy3TdC8DrV54OTUuaT0Z4FZL1I8RyTwOuYT-jLleo2eHA7l4i9CBOnqvWS_ULQxbTZDPKVOCRa64hL2ZCDYRWxmfk/s320/One_Universe.png" width="320" /></a></div>
Deep Learning takes many (thousand) sample equations which are mathematically different subspaces, and essentially averages the result of the application of BP to each equation. To put it another way, each equation tugs the BP algorithm toward itself.<br />
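The "tugging" can be seen in a toy version: several 1-D equations w * x = y with no single exact solution, solved by averaging the per-equation gradients. The numbers are made up for illustration:

```python
# Three inconsistent equations w * x = y, roughly satisfying y = 2x.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

w = 0.0          # start anywhere on the hill
lr = 0.05        # learning rate
for _ in range(200):
    # gradient of 0.5 * (w*x - y)**2 with respect to w, averaged over samples
    grad = sum((w * x - y) * x for x, y in samples) / len(samples)
    w -= lr * grad

# w settles at the compromise (least-squares) value near 2.04; no single
# equation is solved exactly, but each has tugged w toward itself
```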
<br />
This seems like a misuse of BP, but a misuse that has proven very, very fruitful.<br />
<h4>
References</h4>
See the wikipedia pages for more info on the UAT, Backpropagation, and Gradient Descent:<br />
<br />
<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem" style="color: #3465a4;">https://en.wikipedia.org/wiki/Universal_approximation_theorem</a><br />
<a href="https://en.wikipedia.org/wiki/Backpropagation">https://en.wikipedia.org/wiki/Backpropagation</a><br />
<a href="https://en.wikipedia.org/wiki/Gradient_descent">https://en.wikipedia.org/wiki/Gradient_descent</a><br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-38447329457675817062018-09-23T16:17:00.000-07:002018-09-23T16:17:03.646-07:00Neural Networks Series #1: What's an SVD?<b>Singular Value Decomposition (SVD)</b> is a linear algebra algorithm that decomposes any matrix of numbers into three matrices that, when multiplied together, will recreate the original matrix. There are several of these matrix decomposition techniques, useful for different purposes.<br />
Before we delve into the uses of SVD, we will take a short detour. The <i>rank</i> of a matrix is the number of linearly separable rows (or columns) in the matrix, which means that:<br />
<i>if (row[i] + x) * y + z = row[j] </i><br />
<i> then rows i and j are not linearly separable</i><br />
Computing the rank is done by mutating the matrix into <i>row echelon form</i>, which substitutes a row of zeros for all of the rows which are considered "duplicates" under linear separability. Here is a tutorial on row echelon form:<br />
<a href="https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA">https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA</a><br />
This matrix has a rank of 1 because it is possible to generate any of the other three rows from the first row with the above formula, and likewise for the columns:<br />
[1,1,1,1]<br />
[2,2,2,2]<br />
[3,3,3,3]<br />
[4,4,4,4]<br />
This is called <i>linear separability</i>. Rank gives a primitive measurement for the amount of correlation between rows or columns. SVD is more nuanced. It implements linear separability as a continuous measurement whereas rank is only a discrete measurement. Separability is scored by Euclidean distance, and SVD creates a relative ranking of how close each row is to all other rows, and how close each column is to every other column. This ranking gives a couple of useful measurements about the matrix data:<br />
<br />
<ol>
<li>It gives a more nuanced measurement of separability than just the rank. It gives a vector of numbers which give the relative amount of separability across multiple columns.</li>
<li>By ranking the distances among rows & columns, it gives a scoring of centrality, or "normal vs outlier".</li>
</ol>
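A minimal numpy sketch reproduces this kind of measurement for the rank-1 matrix above (exact tiny values vary by library):

```python
import numpy as np

# The rank-1 matrix from the text: every row is a multiple of the first.
m = np.array([[1, 1, 1, 1],
              [2, 2, 2, 2],
              [3, 3, 3, 3],
              [4, 4, 4, 4]], dtype=float)

print(np.linalg.matrix_rank(m))          # 1
s = np.linalg.svd(m, compute_uv=False)   # singular values, largest first
print(s / s.max())                       # ~[1, 0, 0, 0]: one row's worth of signal
```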
<div>
This matrix has a rank of 2: no row is a plain multiple of another, but any row can be built from two of the others, and all of the rows are close:</div>
<div>
<div>
[1,2,3,4]</div>
<div>
[2,3,4,5]</div>
<div>
[3,4,5,6]</div>
<div>
[4,5,6,7]</div>
</div>
<div>
SVD generates a list of numbers called the "singular values" of a matrix. Here are the singular values for the above 2 matrices:</div>
<div>
<pre style="background-color: white; border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">[1.00000000e+00 8.59975057e-17 2.52825014e-47 3.94185361e-79]</pre>
<pre style="background-color: white; border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><pre style="border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">[9.98565020e-01 5.35527830e-02 1.96745978e-17 9.98144122e-18]</pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">Let's chart these singular value vectors (linear scale):</span></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhONeolOggiZi4ZzsrkWJmx_xeEvdE2ksOG_l2hdPE6I1q99VDA1KFPsR18c8_n-FSrdmcLZnZPR3x9GndzQU_9jh9eSYJ7QKyNzFMioVM9fx-itQu0g5rurQK9w0CG1t2wZJSbVsvKfe4/s1600/svd_4_4.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em; text-align: left;"><span style="font-family: "times" , "times new roman" , serif;"></span></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhONeolOggiZi4ZzsrkWJmx_xeEvdE2ksOG_l2hdPE6I1q99VDA1KFPsR18c8_n-FSrdmcLZnZPR3x9GndzQU_9jh9eSYJ7QKyNzFMioVM9fx-itQu0g5rurQK9w0CG1t2wZJSbVsvKfe4/s1600/svd_4_4.png" imageanchor="1" style="clear: right; display: inline !important; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="288" data-original-width="432" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhONeolOggiZi4ZzsrkWJmx_xeEvdE2ksOG_l2hdPE6I1q99VDA1KFPsR18c8_n-FSrdmcLZnZPR3x9GndzQU_9jh9eSYJ7QKyNzFMioVM9fx-itQu0g5rurQK9w0CG1t2wZJSbVsvKfe4/s200/svd_4_4.png" width="200" /></a></div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
<span style="font-family: "times" , "times new roman" , serif; margin-left: 1em; margin-right: 1em;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCO3ziXlYYZqS073TWZErhPMe5VhJ7Ci57c-q8xsg8meee0Gsinm37qMdtmWAv8pETD8_lxvPCYSdXiUrUDOqTbBB6EYO1b50wLuLPLPidzifyIs2qmKjcLj17rJK0TqLUFMRzG49IIcQ/s1600/svd_4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="432" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCO3ziXlYYZqS073TWZErhPMe5VhJ7Ci57c-q8xsg8meee0Gsinm37qMdtmWAv8pETD8_lxvPCYSdXiUrUDOqTbBB6EYO1b50wLuLPLPidzifyIs2qmKjcLj17rJK0TqLUFMRzG49IIcQ/s200/svd_4.png" width="200" /></a></span></div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px;">
</div>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">These charts tell us that the first matrix has a rank of 1. The second matrix has a rank above one, but does not have two full rows of uniqueness. This chart is a measurement of the correlation between rows and correlation between columns. The singular values of a matrix are a good first look at the amount of "information" or "signal" or "entropy". "Quantity of signal" is a fairly slippery concept with mathematical definitions that border on the metaphysical, but the singular values are a useful (though simplistic) measurement of the amount of information contained in a matrix.</span></pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">Based on the term "entropy" in the previous paragraph, let's try filling a larger matrix with random numbers. These are the measurements of a 20x50 random matrix, with a chart log-scaled on the Y axis:</span></pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><pre style="border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">Rank = 20
Singular values: </pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">[0.35032972 0.322733 0.30419963 0.29269556 0.28137789 0.26201005
0.25742231 0.23075557 0.21945011 0.21763414 0.20245292 0.1976021
0.175207 0.16692658 0.14884194 0.14208821 0.13474842 0.12291441
0.10004981 0.08849001]</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfArilNq7YTW4_U1UelFzePqZTigWTLPfZ9YvnASrvDkam2I7IDdrImr8xEARzIx8nNOKmSv1S55o_eLCw3TITMeqFmK4d3mnc0ybo2Ql480hXUZwfvuIyOmDTmSDVo_ug5pKvk2g8T50/s1600/svd_20_50.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfArilNq7YTW4_U1UelFzePqZTigWTLPfZ9YvnASrvDkam2I7IDdrImr8xEARzIx8nNOKmSv1S55o_eLCw3TITMeqFmK4d3mnc0ybo2Ql480hXUZwfvuIyOmDTmSDVo_ug5pKvk2g8T50/s320/svd_20_50.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "times" , "times new roman" , serif;">
</span></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-family: "times" , "times new roman" , serif;">The graph should be a straight line but is not, due mainly to the fact that "random numbers" are very hard to come by. This matrix has a rank of 20: you cannot multiply & add any row and get another row. In fact, all rows are equidistant under </span><i style="font-family: Times, "Times New Roman", serif;">Euclidean distance</i><span style="font-family: "times" , "times new roman" , serif;">. Let's do something sneaky: we're going to multiply its transpose by it to get a 50x50 matrix of random numbers. Here are the measurements (again, a log-Y chart):</span></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "times" , "times new roman" , serif;">
</span></div>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><pre style="border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">Rank = 20
[1.14271040e+02 9.69770480e+01 8.61587896e+01 7.97653909e+01
7.37160653e+01 6.39172571e+01 6.16984957e+01 4.95777248e+01
4.48387835e+01 4.40997633e+01 3.81619300e+01 3.63550994e+01
2.85815085e+01 2.59437792e+01 2.06268491e+01 1.87974231e+01
1.69055603e+01 1.40665565e+01 9.31997487e+00 7.29072452e+00
1.50274491e-14 1.29473245e-14 1.24897761e-14 1.11334989e-14
1.03937520e-14 9.90176181e-15 9.17746567e-15 8.94074462e-15
8.58805508e-15 7.94287894e-15 6.90962668e-15 6.72135898e-15
6.02639348e-15 5.66356533e-15 5.16429606e-15 4.84762468e-15
4.47711092e-15 4.37266408e-15 4.27780206e-15 3.43884568e-15
3.21440817e-15 2.83815151e-15 2.61710432e-15 2.34685343e-15
2.24348424e-15 1.32915906e-15 9.12392110e-16 5.65320744e-16
2.64346262e-16 1.11941235e-16]</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdhH2oU1EtmxTNTZpM3btKUa5lyzX5xLPkNwuxcnfZMhk20SWxPTHmWnjyVxPeNteEHCjxt6NfiJQAG6D4oaLboBZ-1EKlkYCXA3aj1ZG1BhmbogCA4T4TxwJrZfsM_5TNvOrTV2EfbvE/s1600/svd_50_50.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdhH2oU1EtmxTNTZpM3btKUa5lyzX5xLPkNwuxcnfZMhk20SWxPTHmWnjyVxPeNteEHCjxt6NfiJQAG6D4oaLboBZ-1EKlkYCXA3aj1ZG1BhmbogCA4T4TxwJrZfsM_5TNvOrTV2EfbvE/s320/svd_50_50.png" width="320" /></a></div>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">Huh? Why is the rank still 20? And what's going on with that chart? <i>When we multiplied the matrix by itself, we did not add any information/signal/entropy to the matrix!</i> Therefore it still has the same amount of separability. It's as if we weighed a balloon, filled it with weightless air, then weighed it again.</span></pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">In this post, we discussed SVD and the measurement of information in a matrix; this trick of multiplying a random matrix by itself to blow it up was what helped me understand SVD ten years ago. (Thanks to Ted Dunning on the Mahout mailing list.)</span></pre>
</pre>
</pre>
</pre>
</div>
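The balloon experiment takes a few lines of numpy to reproduce; to make the dimensions work, "multiply it by itself" means multiplying the transpose by the matrix (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((20, 50))               # 20x50 matrix of random numbers
print(np.linalg.matrix_rank(a))        # 20: every row independent

b = a.T @ a                            # 50x50, built only from a's 20 rows
print(np.linalg.matrix_rank(b))        # still 20: no information was added
```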
<div>
I've fast-forwarded like mad through days of linear algebra lecture, in order to give intuitive understanding about matrix analysis. We'll be using SVD in a few ways in this series on my explorations in Neural Networks. See this page for a good explanation of SVD and a cool example of how it can be used to clean up photographs. :</div>
<div>
<a href="http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/">http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/</a></div>
<div>
Source code for the above charts & values available at:</div>
<div>
<a href="https://github.com/LanceNorskog/blob_posts/blob/master/SVD%20Basics.ipynb">https://github.com/LanceNorskog/blob_posts/blob/master/SVD%20Basics.ipynb</a></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-89409453075364830442018-04-15T18:36:00.000-07:002018-04-18T18:42:42.966-07:00Fun with SVD: MNIST DigitsSingular Value Decomposition is a very powerful linear algebra operation. It factorizes a matrix into three matrices which have some interesting properties. Among other things, SVD gives you a ranking of how 'typical' or 'normal' a row or column is in comparison to the other rows or columns. I created a 100x784 matrix of gray scale values, where each row is a sample image and each column is a position in the 28x28 gray-scale raster. The following chart gives the 100 digits ranked from "outlier" to "central", with the normalized rank above each digit.<br />
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNcAX2AEhYfpU0IXR89AK9u1j1aSRlp_bv4oFnJZnPfmclOqBXNlRyTbi-HhgEEdGHB_vRVTwGwV4Z8D5ZjphPd067XKBZtrAHXL3GCHxcHIB-QEOSsUpFs1oDCzjBJE5dUD2VZdzB2lE/s1600/foo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1062" data-original-width="851" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNcAX2AEhYfpU0IXR89AK9u1j1aSRlp_bv4oFnJZnPfmclOqBXNlRyTbi-HhgEEdGHB_vRVTwGwV4Z8D5ZjphPd067XKBZtrAHXL3GCHxcHIB-QEOSsUpFs1oDCzjBJE5dUD2VZdzB2lE/s640/foo.png" width="512" /></a></div>
<div>
<br /></div>
<div>
By this ranking, fuzzy/smudgy images are outliers and cleaner lines are central. Or, 4's are central. Here's the code:</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># variation of code from Tariq's book</span><br /><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># python notebook for Make Your Own Neural Network<br /># working with the MNIST data set<br />#<br /># (c) Tariq Rashid, 2016<br /># license is GPLv2</span><br />
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">import numpy as np</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">import matplotlib.pyplot as py</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">%matplotlib inline</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># strangely, does not exist in numpy</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">def normalize(v):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> max_v = -10000000</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> min_v = 10000000</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> for x in v:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> if (x > max_v):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> max_v = x</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> if (x < min_v):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> min_v = x</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> scale = 1.0/(max_v - min_v)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> offset = -min_v</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> for i in range(len(v)):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> v[i] = (v[i] + offset) * scale </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> return v</span></div>
</div>
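For what it's worth, the same min-max scaling can be done with numpy's vectorized min/max in a couple of lines. This is my own equivalent sketch (the `minmax_normalize` name is mine), not code from the book:

```python
import numpy as np

def minmax_normalize(v):
    # vectorized min-max scaling into [0, 1]; equivalent to the loop version above
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())
```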
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># open the CSV file and read its contents into a list</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">data_file = open("mnist_dataset/mnist_train_100.csv", 'r')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">data_list = data_file.readlines()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">data_file.close()</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">rows = len(data_list)</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">image_mat = np.zeros((rows, 28 * 28))</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">for row in range(rows):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> dig = data_list[row][0]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> all_values = data_list[row].split(',')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> image_vector = np.asfarray(all_values[1:])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> image_mat[row] = (image_vector / 255.0 * 0.99) + 0.01</span></div>
</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">(u, s, v) = np.linalg.svd(image_mat)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">row_features = normalize(u.dot(s))</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># py.plot(np.sort(row_features))</span></div>
</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">keys = np.argsort(row_features)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">grid=10</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">fig,axes = py.subplots(nrows=rows//grid, ncols=grid)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">fig.set_figheight(15)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">fig.set_figwidth(15)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">py.subplots_adjust(top=1.1)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">for row in range(rows):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax = axes[row//grid][row%grid]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax.set_title("{0:.2f}".format(row_features[keys[row]]), fontsize=10)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax.set_xticks([])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax.set_yticks([])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax.imshow(image_mat[keys[row]].reshape(28,28), cmap='Greys', interpolation='None')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">fig.savefig('foo.png', bbox_inches='tight')</span></div>
</div>
<div>
<br /></div>
</div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-587139504812903742018-04-08T19:19:00.001-07:002018-04-08T19:19:57.137-07:00Document Summarization with LSA Appendix B: Software<h3>
The Test Harness</h3>
The scripts for running these tests and the data are in my github repo: <a href="https://github.com/LanceNorskog/lsa">https://github.com/LanceNorskog/lsa</a>.<br />
<h4>
LSA toolkit</h4>
The LSA toolkit is available under <a href="https://github.com/LanceNorskog/lsa/tree/master/research">https://github.com/LanceNorskog/lsa/tree/master/research</a>. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as <a href="https://issues.apache.org/jira/browse/LUCENE-2899" target="_blank">LUCENE-2899</a>.<br />
<h4>
Reuters Corpus</h4>
The Reuters data and scripts for this analysis project are under <a href="https://github.com/LanceNorskog/lsa/tree/master/reuters" target="_blank">https://github.com/LanceNorskog/lsa/tree/master/reuters</a>. <a href="https://github.com/LanceNorskog/lsa/tree/master/reuters/data/raw">..../data/raw</a> contains the preprocessed Reuters article corpus: the articles are reformatted into one sentence per line and are limited to 10+ sentences. The toolkit includes a script to run against the Solr Document Summarizer and save the XML output for each article, and a script to apply XPath expressions to extract a CSV line for each article, collected into one CSV file per algorithm. The per-algorithm keys include both the regularization algorithms and whether parts-of-speech filtering was applied.<br />
<h4>
Analysis</h4>
The analysis phase used KNime to preprocess the CSV data. KNime rules created additional columns calculated from the generated ones, and then a pivot table which summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I had to make the charts in LibreOffice instead, and then copy them into a DOC file in MS Word (and not LibreOffice!) to get plain jpegs out of the charts.<br />
<h3>
Further Reading</h3>
<div>
The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.</div>
<div>
<a href="http://knime.org/">KNime.org</a></div>
<div>
<br /></div>
<div>
The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.</div>
<div>
<a href="http://www.r-project.org/">http://www.r-project.org/</a><br />
<a href="http://rstudio.org/" target="_blank">http://rstudio.org/</a><br />
<a href="http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/">http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/</a></div>
<div>
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-59169763494454643012018-04-08T19:19:00.000-07:002018-04-08T19:19:44.309-07:00Document Summarization with LSA Part 0: Basics<div>
This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper: </div>
<div>
<blockquote class="tr_bq">
<i><a href="http://scholar.google.com/scholar?hl=en&q=%22Creating+Generic+Text+Summaries.%22+gong"><span style="font-size: large;">Creating Generic Text Summaries</span></a></i></blockquote>
</div>
The application is to add summaries to the <a href="http://www.lucidimagination.com/search" target="_blank">LucidFind</a> application, a search engine devoted to open-source text processing projects.<br />
<h3>
Core concepts</h3>
<h4>
Document Corpus</h4>
A corpus is just a collection of documents. There are many standardized text document corpora out there used as data in research, the way standardized strains of white mice are used in laboratory research. In this case, we use the sentences in one document as the corpus for that document. The matrix has sentences for rows and terms for columns.<br />
<h4>
Term Vector</h4>
A <b>term vector</b> is a "bag of words" with numbers attached. We are going to take all of the words in the document corpus and create a matrix where documents are on one side and words are on the other, and add a value for each word where it occurs in a document. The standard ways are to add a count, or just set it to one. This is a matrix where the rows are term vectors for the documents. This gives a numerical representation of the document corpus which has all grammatical information ripped out. This <i>Bayesian</i> representation is surprisingly effective, as we will see.<br />
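As a concrete sketch of this matrix (a toy three-sentence corpus of my own invention, using the set-it-to-one scheme):

```python
import numpy as np

# toy corpus: each "document" is one sentence
sentences = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for s in sentences for w in s.split()})

# binary bag-of-words: 1 if the term occurs in the sentence, else 0
matrix = np.zeros((len(sentences), len(vocab)))
for i, s in enumerate(sentences):
    for w in s.split():
        matrix[i, vocab.index(w)] = 1
```

Each row is the term vector of one sentence; all word order and grammar is gone.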
<h4>
Singular Value Decomposition</h4>
<b>SVD</b> is a linear algebra operation. Given a matrix, it creates three matrices based on the original matrix. The one we care about is the "left feature matrix", commonly called U in the literature. The rows are "document feature vectors", which represent a kind of importance for each document. A <i>feature</i>, in machine learning, refers to signal in noisy data. Applied to a matrix of sentences vs. terms, the document with the largest value in the left-hand column is the most "important" in the corpus. The remaining values in the row also contribute to importance.<br />
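A hedged sketch of ranking sentences with the left feature matrix U, using a tiny hand-made sentence/term matrix (my own construction, not the production code):

```python
import numpy as np

# rows are sentences, columns are terms (binary bag-of-words)
matrix = np.array([[1., 0., 0., 1., 1.],
                   [0., 1., 0., 1., 1.],
                   [1., 0., 1., 0., 1.]])

# U's rows are the "document feature vectors" described above
u, s, vt = np.linalg.svd(matrix, full_matrices=False)

# score each sentence by its weight on the strongest feature (first column of U)
importance = np.abs(u[:, 0])
ranked = np.argsort(-importance)  # most "important" sentence index first
```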
<h4>
Tying it together</h4>
LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-29433319216871967062012-09-06T04:42:00.001-07:002012-09-06T04:42:14.844-07:00Document Summarization with LSA Appendix: Tuning<h2>
Two problems</h2>
<div>
I found a couple of problems with this approach: very long sentences and running time & space.</div>
<h3>
Very Long Sentences</h3>
<div>
Other test data I looked at were technical papers. These often have very long sentences. The problem is not controlling for sentence length, but that there are too many theme words for the sentence to express one theme well. One could try breaking these up into sentence shards: a 50-word sentence would become three 15-word shards. If we use parts-of-speech trimming, we can break the shards within runs of filler words.</div>
<h3>
Optimizations</h3>
It is possible to create an optimized version of LSA that only ranks sentences or terms via Random Indexing. RI deliberately throws away the identity of one side of the sentence/term matrix. It can rank sentences or terms, but not both. This algorithm runs much faster than the full sentence/term version, and is the next step in creating an interactive document summarizer or a high-quality tag cloud generator.<br />
<br />
<div class="p1">
<h4>
Random Projection</h4>
Random Projection is a wonderfully non-intuitive algorithm. If you multiply a matrix with another matrix filled with random numbers, the output matrix will have the same cosine distance ratios across all pairs of vectors. In a document-term matrix, all of the documents will still have the same mutual 'relevance' (measured by cosine distance which magically matches tf-idf relevance ranking).<br />
<br />
If the random matrix is the (transposed) size of the original matrix, the row and column vectors will have the above distance property. If the random matrix has more or fewer dimensions on one side, the resulting matrix will retain the identity of vectors from the constant side, but will lose the identity of the vectors on the varying side. If you have a sentence-term matrix of 50 sentences x 1000 terms, and you multiply it by a random matrix of 1000 x 100, you get a matrix of 50 sentences x 100 "sketch" vectors, where every sketch is a low-quality summarization of the term vectors. The 50 sentences will still have the same cosine ratios; we have merely thrown away the identity of the term vectors. Since the running time of SVD is very non-linear in matrix size, we now have a much smaller matrix that will give us the orthogonal decomposition of the sentences much faster (but the terms are forgotten). We can invert this and calculate the SVD for terms without sentences.<br />
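The cosine-preservation claim is easy to check numerically. This is my own sketch, projecting 50 sentences x 1000 terms down to 100 sketch dimensions and comparing one pairwise cosine before and after:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((50, 1000))             # 50 sentences x 1000 terms
proj = rng.standard_normal((1000, 100))   # random projection matrix
sketch = docs @ proj                      # 50 sentences x 100 "sketch" vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# pairwise cosines survive the projection, up to small distortion
before = cosine(docs[0], docs[1])
after = cosine(sketch[0], sketch[1])
```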
<h4>
Random Indexing</h4>
The above description posits creating the entire sentence/term matrix, then doing the complete multiplication. In fact, you can create the resulting sentence/sketch matrix directly, one term at a time. This considerably cuts memory usage and running time. I include this explanation because there are few online sources describing Random Indexing, and the following references forget to explain RI's roots in Random Projection.<br />
<a href="http://www.sics.se/~mange/random_indexing.html">http://www.sics.se/~mange/random_indexing.html</a><br />
<a href="http://www.sics.se/~mange/papers/RI_intro.pdf">http://www.sics.se/~mange/papers/RI_intro.pdf</a><br />
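A minimal illustration of the one-term-at-a-time construction (entirely my own sketch; the dimension and sparsity choices are arbitrary): give every term a fixed sparse random "index vector" and add it into a sentence's sketch each time the term occurs, never materializing the full sentence/term matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 64  # sketch dimensionality

index_vectors = {}  # one fixed random vector per term, created on demand

def index_vector(term):
    if term not in index_vectors:
        # sparse ternary vector: mostly 0, a few +1/-1 entries
        v = np.zeros(K)
        slots = rng.choice(K, size=8, replace=False)
        v[slots] = rng.choice([-1.0, 1.0], size=8)
        index_vectors[term] = v
    return index_vectors[term]

def sketch(sentence):
    # accumulate term index vectors one term at a time
    acc = np.zeros(K)
    for term in sentence.split():
        acc += index_vector(term)
    return acc

s = sketch("the cat sat on the mat")
```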
<h4>
Fast Random Projection</h4>
Also see Achlioptas et al. for a wicked speed-up of random projection that is very suitable for densely populated data.<br />
<a href="http://users.soe.ucsc.edu/~optas/papers/jl.pdf" target="_blank">http://users.soe.ucsc.edu/~optas/papers/jl.pdf</a></div>
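A sketch of the database-friendly projection from the linked paper, as I understand it: entries are sqrt(3) times +1, 0, or -1 with probabilities 1/6, 2/3, 1/6, so two-thirds of the matrix is zero and multiplication needs no floating-point random numbers. My own transcription, not code from the paper:

```python
import numpy as np

def achlioptas_matrix(d, k, seed=0):
    # entries are sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6
    rng = np.random.default_rng(seed)
    signs = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0) * signs

R = achlioptas_matrix(1000, 100)
projected = np.random.default_rng(1).random((50, 1000)) @ R
```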
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-1346528202862711652012-09-06T04:41:00.002-07:002012-09-06T04:52:08.508-07:00Document Summarization with LSA #5: Software<h3>
The Test Harness</h3>
The scripts for running these tests and the data are in my github repo: <a href="https://github.com/LanceNorskog/lsa">https://github.com/LanceNorskog/lsa</a>.<br />
<h4>
LSA toolkit and Solr DocumentSummarizer class</h4>
The LSA toolkit is available under <a href="https://github.com/LanceNorskog/lsa/tree/master/research">https://github.com/LanceNorskog/lsa/tree/master/research</a>. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as <a href="https://issues.apache.org/jira/browse/LUCENE-2899" target="_blank">LUCENE-2899</a>.<br />
<br />
The Summarizer optionally uses OpenNLP to do sentence parsing and parts-of-speech analysis. It uses the OpenNLP parts-of-speech tool to filter for nouns and verbs, dropping all other words. Previous experiments used both raw sentences and sentences limited to nouns &amp; verbs, and the POS-filtered sentences worked 10-15% better in every algorithm combination. This set of benchmarks did not bother to try the full sentences.<br />
<h4>
Reuters Corpus</h4>
The Reuters data and scripts for this analysis project are under <a href="https://github.com/LanceNorskog/lsa/tree/master/reuters" target="_blank">https://github.com/LanceNorskog/lsa/tree/master/reuters</a>. <a href="https://github.com/LanceNorskog/lsa/tree/master/reuters/data/raw">..../data/raw</a> contains the preprocessed Reuters article corpus: the articles are reformatted into one sentence per line and are limited to 10+ sentences. The toolkit includes a script to run against the Solr Document Summarizer and save the XML output for each article, and a script to apply XPath expressions to extract a CSV line for each article, collected into one CSV file per algorithm. The per-algorithm keys include both the regularization algorithms and whether parts-of-speech filtering was applied.<br />
<h4>
Analysis</h4>
The analysis phase used KNime to preprocess the CSV data. KNime rules created additional columns calculated from the generated ones, and then a pivot table which summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I had to make the charts in LibreOffice instead, and then copy them into a DOC file in MS Word (and not LibreOffice!) to get plain jpegs out of the charts.<br />
<h3>
Further Reading</h3>
<div>
The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.</div>
<div>
<a href="http://knime.org/">KNime.org</a></div>
<div>
<br /></div>
<div>
The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.</div>
<div>
<a href="http://www.r-project.org/">http://www.r-project.org/</a><br />
<a href="http://rstudio.org/" target="_blank">http://rstudio.org/</a><br />
<a href="http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/">http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/</a></div>
<div>
</div>
<h3>
</h3>
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa.html">further tuning</a></h3>
<br />Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-30707565216324444602012-09-06T04:41:00.000-07:002012-09-15T19:42:54.467-07:00Document Summarization with LSA #4: Individual measurements<h3>
The Measurements</h3>
<div class="p1">
In this post we review the individual measures. These charts show each measure applied to all the algorithms, and also the basic statistics summary of the full dataset (not the per-algorithm aggregates). The measures are MRR, Rating, Sentence Length, and Non-zero. MRR is a common method for evaluating search results and other kinds of recommendations. The other three are fabricated for this analysis.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<h4>
Mean Reciprocal Rank</h4>
<div>
MRR is a common measure for search results. It attempts to model the unforgiving way in which users react to mistakes in the order of search results. It measures the position of a preferred search result in a result list. If the "right" answer is third in the list, the MRR is 1/3. </div>
<div>
<br /></div>
<div>
This statistic is the mean of three MRRs, one for each result based on how far it is from where it should be. If the second sentence is #2, that is a 1. If it is #1 or #3, that is 1/2. If the third is #3, that is a 1. If it is #2, that is 1/2 and if it is #1, that is 1/3. The measures go down to the 5th sentence.</div>
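As I read that description, each tracked result scores the reciprocal of one plus its displacement from its ideal slot. A sketch of my formulation (not the actual harness code):

```python
def mrr_variant(recommended, expected):
    """Mean of per-result reciprocal ranks.

    recommended: sentence ids in the order the algorithm ranked them
    expected: sentence ids in their ideal order
    A result in its ideal slot scores 1; each slot of displacement adds
    1 to the denominator (one slot off scores 1/2, two slots off 1/3).
    """
    scores = []
    for ideal_pos, sent in enumerate(expected):
        if sent in recommended:
            distance = abs(recommended.index(sent) - ideal_pos)
            scores.append(1.0 / (1 + distance))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```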
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-ki58ZJ1AZiQ/UECKC5lTjZI/AAAAAAAAAMo/4S85XG7NzZc/s1600/MRR.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="193" src="http://2.bp.blogspot.com/-ki58ZJ1AZiQ/UECKC5lTjZI/AAAAAAAAAMo/4S85XG7NzZc/s400/MRR.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h4 style="text-align: start;">
Rating (0-5)</h4>
<div style="text-align: start;">
Rating is a heavily mutated form of "Precision@3". It tries to model how the user reacts in a UI that shows snippets of the top three recommended sentences. 0 means no interest. 1 means at least one sentence placed (the Non-zero measure). 2-5 measure how well the first three recommendations match the first and second spots. In detail:</div>
<div style="text-align: start;">
5: first and second results are lede and subordinate<br />
4: first result is lede<br />
3: second result is lede<br />
2: first or second are within first three sentences<br />
1: third result is within first three sentences<br />
0: anything else</div>
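Those rules can be transcribed directly. This sketch is my own reading of them (0-based sentence indices, where sentence 0 is the lede and sentence 1 its subordinate), not the harness implementation:

```python
def rating(results):
    """results: ranked sentence indices (0-based) recommended by the algorithm."""
    first, second, third = results[0], results[1], results[2]
    if first == 0 and second == 1:
        return 5   # first and second results are lede and subordinate
    if first == 0:
        return 4   # first result is lede
    if second == 0:
        return 3   # second result is lede
    if first in (0, 1, 2) or second in (0, 1, 2):
        return 2   # first or second lands within the first three sentences
    if third in (0, 1, 2):
        return 1   # third result lands within the first three sentences
    return 0
```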
<div style="text-align: start;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-754Q6FZynZ8/UECKDtaUmOI/AAAAAAAAAM0/aeOn7g36H88/s1600/Rating.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="193" src="http://1.bp.blogspot.com/-754Q6FZynZ8/UECKDtaUmOI/AAAAAAAAAM0/aeOn7g36H88/s400/Rating.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br class="Apple-interchange-newline" />
MRR and Rating (green and yellow) correlate very consistently in the graphic on Post #3. Rating tracks with MRR, but is more extreme. Note the wider standard deviation. This indicates that the Rating formula is a good statistic for modeling unforgiving users.</div>
<h4 style="text-align: start;">
Sentence Length</h4>
<div style="text-align: start;">
This measures the mean sentence length of the top two recommended sentences. The sentence length is the number of nouns and verbs in the sentence, not the number of words. This indicates how well the algorithm compensates for the dominance of longer sentences.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-0UZorVQtxm0/UECKDvvlz9I/AAAAAAAAAM8/RxRosIB9Mzw/s1600/Sentence_Length.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="196" src="http://4.bp.blogspot.com/-0UZorVQtxm0/UECKDvvlz9I/AAAAAAAAAM8/RxRosIB9Mzw/s400/Sentence_Length.jpg" width="400" /></a></div>
<h4>
Non-Zero</h4>
<div>
Every result that recommended the first, second or third sentence for one of the three top spots, by percentage.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-Lnt5zXy8BBU/UECKC8FBsNI/AAAAAAAAAMk/QKFwAVS0glA/s1600/Nonzero_percentage.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="193" src="http://1.bp.blogspot.com/-Lnt5zXy8BBU/UECKC8FBsNI/AAAAAAAAAMk/QKFwAVS0glA/s400/Nonzero_percentage.jpg" width="400" /></a></div>
The mean article length in the corpus is (I think) 22 sentences. A mean non-zero rate of 60% for placing 3 recommendations out of 22 sentences is much better than random recommendations.<br />
<h3>
Precision vs. Recall</h3>
<h4>
<span style="font-weight: normal;">In Information Retrieval jargon, precision is the accuracy of a ranking algorithm, and recall is the ability to find results. The three success measurements are precision measures. The "dartboard" measure is a recall measure.</span></h4>
<h4>
<span style="font-weight: normal;">From reading the actual sentences and recommendations, binary+normal and augNorm+normal had pretty good precision. These two also achieved the best recall at around 65%. This level would not be useful in a document summarization UI. I would pair this with a looser strategy to find related sentences by searching with the theme words. </span></h4>
<h3>
Previous Example</h3>
<div>
In the example in <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-2-test.html">Post #2</a>, the three top-rated sentences were 4, 3, and 6. Since only one of three made the cut, the rating algorithm gave this a three. Note that the example was not processed with Parts-of-Speech removal and used the binary algorithm, and still hit the dartboard. This article is the first in the dataset, and was chosen effectively at random.</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: start;">
<h3>
Further reading</h3>
</div>
<div style="text-align: start;">
<a href="http://en.wikipedia.org/wiki/Mean_reciprocal_rank">http://en.wikipedia.org/wiki/Mean_reciprocal_rank</a></div>
<br />
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-5.html">software details</a></h3>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-28179573638616594942012-09-06T04:34:00.001-07:002012-09-06T05:03:55.117-07:00Document Summarization with LSA #3: Preliminary results<h3>
Overall Analysis</h3>
<h4>
Measures are tuned for Interactive UI</h4>
This implementation is targeted for interactive use in search engines. A search UI usually has the first few results shown in ranked order, with the option to go to the next few results. This UI is intended to show the first three ranked sentences at the top of an entry with the theme words highlighted. Users are not forgiving of mistakes in these situations. The first result is much more important than the second, and so forth. People rarely click through to the second page.<br />
<br />
The measures of effectiveness are formulated with this in mind. We used three:<br />
<ol>
<li>A variant of Mean Reciprocal Rank (MRR).</li>
<li>"Rating" is a measure we created to model the user's behavior in a summarization UI. Our MRR variant and Rating are defined in the next post. </li>
<li>Non-zero counts whether the algorithm placed any recommendations in the top three. <i>"Did we even hit the dartboard?"</i></li>
</ol>
A separate problem is that sentences with more words can dominate the rankings. Sentence Length measures the length of the highest-rated sentences; in this chart, "Inverse Length" measures how well the algorithm countered the effects of sentence length.<br />
<div class="p1">
<h4>
Overall Comparison Chart</h4>
</div>
<div class="p1">
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-dcHI2mVXxlQ/UDNGUmhdVuI/AAAAAAAAAKE/9D6HR9nqtqc/s1600/Overall.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-dcHI2mVXxlQ/UDNGUmhdVuI/AAAAAAAAAKE/9D6HR9nqtqc/s1600/Overall.jpg" /></a></div>
<br />
<ul>
<li>Key to algorithm names: "binary_normal" means that "binary" was used to create each cell, while "normal" multiplied each term vector with the mean normalized term vector. If there is no second key, the global value was 1. See post #1 for the full list of algorithms.</li>
</ul>
This display is a normalized version of the mean results for all 24 algorithm pairs, with four different measures. In all four, higher is better. "Inverse Length" means "how well it suppresses the length effect", "Rating" is the rating algorithm described above, "MRR" is our variant implementation of Mean Reciprocal Rank, and >0 is the number of results where any of the first three were in the first three sentences. None of these are absolutes, and the scales do not translate between measures. They simply show a relative ranking for each algorithm pair in the four measures: compare green to green, etc. The next post gives the detailed measurements in real units.<br />
<h4>
Grand Revelation</h4>
The grand revelation is: <i>always normalize the term vector!</i> All 5 local algorithms worked best with normal as the global algorithm. The binary function truncates the term counts to 1. Binary plus normalizing the term vector was by far the best in all three success measurements, and was middling in counteracting sentence length. AugNorm + normal was the highest achiever which compensates well for sentence length. TF + normal was the best overall for sentence length, but was only average for the three effectiveness measures.<br />
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-4.html">detailed analysis</a></h3>
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-85409114602031335262012-09-06T04:33:00.003-07:002012-09-15T19:43:01.863-07:00Document Summarization with LSA #2: Test with newspaper articles<h3>
The Experiment</h3>
This analysis evaluates many variants of the LSA algorithm against some measurements appropriate for an imaginary document summarization UI. This UI displays the most important two sentences with the important theme words highlighted. The measurements try to match the expectations of the user.<br />
<h4>
Supervised vs. Unsupervised Learning</h4>
<div class="p1">
Machine Learning algorithms are classified as <i>supervised</i>, <i>unsupervised</i>, and <i>semi-supervised</i>.<br />
A supervised algorithm creates a <i>model</i> (usually statistical) from <i>training data</i>, then applies <i>test data</i> against the model. An unsupervised algorithm is applied directly against test data without a pre-created model. A semi-supervised algorithm uses tagged and untagged data; this works surprisingly well in a lot of contexts.<br />
<h3>
The Data</h3>
We are going to use unsupervised learning. The test data is a corpus of newspaper articles called Reuters-21578, which was collected and published by the Reuters news agency to assist text analysis research. This dataset is not really tagged, but it is appropriate for this experiment. Newspaper articles are written in a particular style which is essentially pre-summarized. In a well-written newspaper article, the first sentence (the <i>lede</i>) is the most important sentence, and the second sentence is complementary and usually shares few words with the first. The rest of the article is structured in order from abstraction to detail, a form called the Inverted Pyramid. Newspaper articles are also the right length to summarize effectively. </div>
<div class="p2">
<h4>
Example Data</h4>
</div>
<div class="p1">
We limited the tests to articles which were from 15 to 75 sentences long. The entire corpus is 21 thousand articles; this limit narrows the test space to just under 2,000. Here is a sample newspaper article:<br />
<blockquote class="tr_bq">
<blockquote class="tr_bq">
<span style="font-size: x-small;"><b>26-FEB-1987 15:26:26.78</b></span></blockquote>
<blockquote class="tr_bq">
<span style="font-size: x-small;"><b>DEAN FOODS <DF> SEES STRONG 4TH QTR EARNINGS</b></span></blockquote>
<blockquote class="tr_bq">
<b><span style="font-size: x-small;">Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. </span><span style="font-size: x-small;">In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share. </span><span style="font-size: x-small;">Douglas also said the year's sales should exceed 1.4 billion dlrs, up from 1.27 billion dlrs the prior year. </span><span style="font-size: x-small;">He repeated an earlier projection that third-quarter earnings "will probably be off slightly" from last year's 40 cts a share, falling in the range of 34 cts to 36 cts a share. </span><span style="font-size: x-small;">Douglas said it was too early to project whether the anticipated fourth quarter performance would be "enough for us to exceed the prior year's overall earnings" of 1.53 dlrs a share. </span><span style="font-size: x-small;">In 1988, Douglas said Dean should experience "a 20 pct improvement in our bottom line from effects of the tax reform act alone."</span></b></blockquote>
<blockquote class="tr_bq">
<b><span style="font-size: x-small;">President Howard Dean said in fiscal 1988 the company will derive benefits of various dairy and frozen vegetable acquisitions from Ryan Milk to the Larsen Co. </span><span style="font-size: x-small;">Dean also said the company will benefit from its acquisition in late December of Elgin Blenders Inc, West Chicago. </span><span style="font-size: x-small;">He said the company is a major shareholder of E.B.I. Foods Ltd, a United Kingdom blender, and has licensing arrangements in Australia, Canada, Brazil and Japan. </span><span style="font-size: x-small;">"It provides an entry to McDonalds Corp <MCD> we've been after for years," Douglas told analysts. Reuter &#3;</span></b></blockquote>
</blockquote>
</div>
<div class="p1">
As you can see, the text matches the concept of the inverted pyramid. The first two sentences are complementary, have no repeated words, and have few words in common. Repeated concepts are described with different words: "Dean Foods Co" in the first sentence is echoed as "the food processor" in the second, while "expects earnings" is matched by "reported earnings". This style seems real-world enough to be good test data for this algorithm suite. We did not comb the data for poorly structured articles or garbled text.<br />
<h3>
The Code</h3>
<div>
There are two bits of code involved: Singular Value Decomposition (explained previously) and Matrix "Regularization", or "Conditioning". The latter refers to applying non-linear algorithms to the document-term matrix which make the data somewhat more amenable to analysis. Several algorithms have been investigated for this purpose. </div>
<div>
<br /></div>
<div>
The raw term counts supplied by a document/term matrix are not always the best representation of the data. Matrix Regularization algorithms are functions which use the entire dataset to adjust each cell in the matrix. The contents of a document/term matrix after regularization are referred to as "term weights".<br />
<br />
There are two classes of algorithm for creating term weights. <i>Local</i> algorithms alter each cell, while <i>global</i> algorithms create a global vector of values per term that are applied to all documents with that term, and likewise a global vector of values per document which is applied to all terms in that document. Local algorithms include <i>term frequency</i> (tf) which uses the raw matrix, <i>binary</i> which replaces each term count greater than one with a one, and some others which find a unique value for a cell based on the document and term vectors which cross at that cell.<br />
<br />
Global algorithms for term vectors include normalizing the term vector and finding the inverse document frequency (idf) of the term. The LSA implementation includes implementations of these local and global algorithms, and any pair can be used together. Thus, tf-idf is achieved by combining the local tf algorithm and the global idf algorithm. The literature recommends a few of these combinations (tf-idf and log-entropy) as the most effective, but this test found a few other combinations which were superior. For document vectors, cosine normalization and a new "pivoted cosine" normalization are recommended for counteracting the dominance of longer sentences. These are not yet implemented. Existing combinations of local and term vector algorithms do a good job of suppressing sentence length problems. We will see later on that:<br />
<ul>
<li>one of the term vector algorithms is by far the best at everything, </li>
<li>one of the local algorithms is the overall winner, and </li>
<li>one of the others does a fantastic job at sentence length problems but is an otherwise average performer.</li>
</ul>
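As a concrete illustration, here is a minimal numpy sketch of how a local algorithm and a global term-vector algorithm combine into term weights. This is not the actual Solr/Lucene code, and the matrix values are made up; it shows the pairing the experiment found strongest (binary + term-vector normalization):

```python
import numpy as np

# Toy document/term count matrix: rows are sentences, columns are terms.
counts = np.array([
    [2, 1, 0, 0],
    [0, 3, 1, 0],
    [1, 0, 0, 4],
], dtype=float)

def binary(m):
    # Local "binary" weighting: truncate every count greater than one to one.
    return (m > 0).astype(float)

def tf(m):
    # Local "term frequency" weighting: the raw counts, unchanged.
    return m.copy()

def normalize_terms(m):
    # Global weighting: scale each term (column) vector to unit length.
    norms = np.linalg.norm(m, axis=0)
    return m / np.where(norms == 0, 1.0, norms)

def idf(m):
    # Global weighting: inverse document frequency per term.
    df = (m > 0).sum(axis=0)
    return m * np.log(m.shape[0] / df)

# The winning pair from these tests: binary local + normalized term vectors.
weights = normalize_terms(binary(counts))
# tf-idf, for comparison, is tf local + idf global:
tfidf = idf(tf(counts))
```

Any local function can be composed with any global function this way, which is how the test harness enumerated all 24 pairs.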
<h3>
The Result</h3>
The above article gave the following result for the 'binary' algorithm:</div>
<div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><lst name="analysis"><br /> <lst name="summary"><br /> <lst name="stats"><br /> <lst name="sentences"><br /> <int name="count">12</int><br /> </lst><br /> <lst name="terms"><br /> <int name="count">150</int><br /> </lst><br /> <lst name="stems"><br /> <int name="count">150</int><br /> </lst><br /> </lst><br /> <lst name="terms"><br /> <lst name="term"><br /> <str name="text">the</str><br /> <double name="strength">1.0</double><br /> </lst><br /> <lst name="term"><br /> <str name="text">of</str><br /> <double name="strength">0.942809041582063</double><br /> </lst><br /> <lst name="term"><br /> <str name="text">said</str><br /> <double name="strength">0.8164965809277259</double><br /> </lst><br /> <lst name="term"><br /> <str name="text">from</str><br /> <double name="strength">0.7453559924999301</double><br /> </lst><br /> <lst name="term"><br /> <str name="text">Douglas</str><br /> <double name="strength">0.7453559924999298</double><br /> </lst><br /> </lst><br /> <lst name="sentences"><br /> <lst name="sentence"><br /> <str name="text">Douglas said it was too early to project whether the anticipated fourth quarter performance would be "enough for us to exceed the prior year's overall earnings" of 1.53 dlrs a share.</str><br /> <int name="index">4</int><br /> <int name="start">533</int><br /> <int name="end">715</int><br /> <double name="strength">1.0</double><br /> <int name="terms">29</int><br /> </lst><br /> <lst name="sentence"><br /> <str name="text">He repeated an earlier projection that third-quarter earnings "will probably be off slightly" from last year's 40 cts a share, falling in the range of 34 cts to 36 cts a share.</str><br /> <int name="index">3</int><br /> <int name="start">355</int><br /> <int name="end">531</int><br /> <double name="strength">0.999999999999999</double><br /> <int name="terms">29</int><br /> </lst><br /> <lst name="sentence"><br /> <str 
name="text">President Howard Dean said in fiscal 1988 the company will derive benefits of various dairy and frozen vegetable acquisitions from Ryan Milk to the Larsen Co.</str><br /> <int name="index">6</int><br /> <int name="start">847</int><br /> <int name="end">1006</int><br /> <double name="strength">0.9284766908852594</double><br /> <int name="terms">25</int><br /> </lst><br /> </lst><br /> <lst name="highlighted"><br /> <lst name="sentence"><br /> <str name="text">&lt;em&gt;Douglas&lt;/em&gt; &lt;em&gt;said&lt;/em&gt; it was too early to project whether &lt;em&gt;the&lt;/em&gt; anticipated fourth quarter performance would be "enough for us to exceed &lt;em&gt;the&lt;/em&gt; prior year's overall earnings" &lt;em&gt;of&lt;/em&gt; 1.53 dlrs a share.</str><br /> <int name="index">4</int><br /> </lst><br /> <lst name="sentence"><br /> <str name="text">He repeated an earlier projection that third-quarter earnings "will probably be off slightly" &lt;em&gt;from&lt;/em&gt; last year's 40 cts a share, falling in &lt;em&gt;the&lt;/em&gt; range &lt;em&gt;of&lt;/em&gt; 34 cts to 36 cts a share.</str><br /> <int name="index">3</int><br /> </lst><br /> <lst name="sentence"><br /> <str name="text">President Howard Dean &lt;em&gt;said&lt;/em&gt; in fiscal 1988 &lt;em&gt;the&lt;/em&gt; company will derive benefits &lt;em&gt;of&lt;/em&gt; various dairy and frozen vegetable acquisitions &lt;em&gt;from&lt;/em&gt; Ryan Milk to &lt;em&gt;the&lt;/em&gt; Larsen Co.</str><br /> <int name="index">6</int><br /> </lst><br /> </lst><br /> </lst><br /></lst></span></blockquote>
</div>
<div>
Because the analyzer does not remove stop words, extraneous words such as "the" and "of" rank as the most common theme words. But notice that "Douglas said" is a common phrase, and these two words are in the top 5. Theme words are not merely common words; they are words which are used together. Theme sentences are effectively those with the most and strongest theme words.<br />
<br /></div>
<div>
The Summarizer tool can return the most important sentences, and can highlight the most important theme words. This can be used to show highlighted snippets in a UI.</div>
<div>
<h3>
Further Reading</h3>
</div>
<a href="http://en.wikipedia.org/wiki/Machine_learning" target="_blank">Machine Learning</a><br />
<a href="http://en.wikipedia.org/wiki/Supervised_learning" target="_blank">Supervised Learning</a><br />
<a href="http://en.wikipedia.org/wiki/Unsupervised_learning" target="_blank">Unsupervised Learning</a><br />
<a href="http://en.wikipedia.org/wiki/Semi-supervised_learning" target="_blank">Semi-Supervised Learning</a><br />
<a href="http://en.wikipedia.org/wiki/Inverted_pyramid" target="_blank">Inverted Pyramid</a><br />
<a href="http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection" target="_blank">Reuters 21578 text corpus</a><br />
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-3.html" target="">the overall results</a></h3>
<div>
<br /></div>
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-30353261887081578462012-09-06T04:33:00.002-07:002012-09-15T19:37:21.064-07:00Document Summarization with LSA #1: Introduction<h3>
Document Summarization with LSA</h3>
<div class="p1">
<div class="p1">
</div>
This is a several-part series on document summarization using Latent Semantic Analysis (LSA). I wrote a document summarizer and did an exhaustive measurement pass using it to summarize newspaper articles from the first Reuters corpus. The code is structured as a web service in Solr, using Lucene for text analysis and the OpenNLP package for tuning the algorithm with Parts-of-Speech analysis.<br />
<h4>
Introduction</h4>
<div class="p1">
</div>
Document summarization is about finding the "themes" in a document: the important words and sentences that contain the core concepts. There are many algorithms for document summarization. This one uses Latent Semantic Analysis, which applies linear algebra to analyze how words and sentences are used in common. LSA is based on the "bag of words" concept of "term vectors": a list of all words and how often each is used in each document. LSA uses Singular Value Decomposition (SVD) to tease out which words are used the most with other words, and which sentences use the most theme words together. Document summarization with LSA uses SVD to give us main and secondary sentences which have the strongest collections of theme words and yet a minimal number of theme words in common.</div>
<div class="p1">
<br />
Every document has words which express the themes of the document. These words are frequent, but they are also used together in sentences. We want to find the most important sentences and words; the main sentences should be shown as the summary, and the most important words are good search words for this document.<br />
<br /></div>
<div class="p2">
<h4>
Orthogonal Sentences</h4>
</div>
<div class="p2">
A key idea is that the most important and second most important sentences in a document are independent: they tend to share few words. The most important sentence expresses the main theme of the document, and the second most important uses other theme words to elaborate on it. When we express the sentences and words in a bag-of-words matrix, SVD can analyze the sentences and words in relation to each other, by how the terms are used together in sentences. It creates a sorted list of documents which are as orthogonal as possible, meaning that their collective theme words are as different as possible.<br />
<br />
Note that since this technique treats documents and terms symmetrically, it also creates a sorted list of terms by how important they are; this makes for a good tag cloud.<br />
<br />
<h3>
Example</h3>
<div>
Here is an example of the first two sentences from a financial newswire article.</div>
<blockquote class="tr_bq">
<b><span style="font-size: x-small;">Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. </span><span style="font-size: x-small;">In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share.</span></b></blockquote>
<div>
The first sentence is long, and expresses the theme of the article. The second sentence elaborates on the first, and shares no substantial words with it except <i>earnings.</i> Note that in order to avoid repeating the company name, and to give more information, the second sentence refers to Dean Foods as <i>the food processor</i>. </div>
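To make the orthogonality idea concrete, here is a small hypothetical sketch. The vocabulary and vectors are simplified stand-ins for the two sentences above (not real analyzer output), overlapping only on "earnings"; the low cosine similarity is what "near-orthogonal" means here:

```python
import numpy as np

# Hypothetical bag-of-words vectors for the lede and the second sentence
# over a tiny made-up vocabulary; only "earnings" appears in both.
vocab  = ["dean", "expects", "earnings", "analysts",
          "processor", "reported", "cts", "share"]
lede   = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
second = np.array([0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Cosine similarity: near zero means the sentences are nearly orthogonal,
# i.e. they elaborate on each other with mostly different theme words.
cos = lede @ second / (np.linalg.norm(lede) * np.linalg.norm(second))
```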
<div>
<br /></div>
<h4>
Further Reading</h4>
</div>
<div class="p1">
This is an important paper on using SVD to summarize documents. It appears to be the original proposal for this technique:</div>
<div class="p1">
</div>
<div class="p1">
<i>Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis</i></div>
<div class="p1">
<i>Gong</i> <i>and</i> <i>Liu, 2002</i></div>
<div class="p1">
<a href="http://www.cs.bham.ac.uk/~pxt/IDA/text_summary.pdf">http://www.cs.bham.ac.uk/~pxt/IDA/text_summary.pdf</a><br />
<br />
Here are two good tutorials on Singular Value Decomposition and LSA:<br />
<a href="http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html">Singular Value Decomposition (SVD) Tutorial</a><br />
<a href="http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html">Latent Semantic Analysis (LSA) Tutorial</a><br />
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-2-test.html" target="_blank">the experiment explained</a></h3>
</div>
<br />
<div class="p1">
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-29632202985908662472012-08-21T01:08:00.001-07:002012-08-21T01:08:05.164-07:00SentenceLength<div style="margin: 0 0 10px 0; padding: 0; font-size: 0.8em; line-height: 1.6em;"><a href="http://www.flickr.com/photos/54866255@N00/7829695468/" title="SentenceLength"><img src="http://farm9.staticflickr.com/8287/7829695468_43e751cef8.jpg" alt="SentenceLength by Norskhaus" /></a><br/><span style="margin: 0;"><a href="http://www.flickr.com/photos/54866255@N00/7829695468/">SentenceLength</a>, a photo by <a href="http://www.flickr.com/photos/54866255@N00/">Norskhaus</a> on Flickr.</span></div><p></p>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-31105373476876077462012-02-20T00:19:00.000-08:002018-09-23T16:17:50.305-07:00Summarizing Documents with Machine Learning: Part 1<h2>
Summarizing Documents with Machine Learning: Part 1</h2>
<div>
This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper: </div>
<div>
<blockquote class="tr_bq">
<in><a href="http://scholar.google.com/scholar?hl=en&q=%22Creating+Generic+Text+Summaries.%22+gong"><span style="font-size: large;">Creating Generic Text Summaries</span></a></in></blockquote>
</div>
The application is to add summaries to the <a href="http://www.lucidimagination.com/search" target="_blank">LucidFind</a> application, a search engine devoted to open-source text processing projects.<br />
<h3>
Core concepts</h3>
<h4>
Document Corpus</h4>
A corpus is just a collection of documents. There are many standardized text corpuses used as data in research, much as standardized strains of white mice are used in laboratory research. In this case, we use the sentences in one document as the corpus for that document. The matrix has sentences for rows and terms for columns.<br />
<h4>
Term Vector</h4>
A <b>term vector</b> is a "bag of words" with numbers attached. We are going to take all of the words in the document corpus and create a matrix where documents are on one side and words are on the other, and add a value for each word where it occurs in a document. The standard ways are to add a count, or just set it to one. This is a matrix where the rows are term vectors for the documents. This gives a numerical representation of the document corpus which has all grammatical information ripped out. This <i>Bayesian</i> representation is surprisingly effective, as we will see.<br />
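For illustration, a toy version of this matrix can be built in a few lines of Python. This is a sketch, not the Lucene analysis chain, and the sentences paraphrase the Dean Foods example:

```python
import re
from collections import Counter

sentences = [
    "Dean Foods expects earnings to exceed the year-ago period.",
    "The food processor reported earnings of 40 cts a share.",
]

# Tokenize each sentence; grammar and word order are thrown away.
tokens = [re.findall(r"[a-z0-9-]+", s.lower()) for s in sentences]
vocab = sorted({w for t in tokens for w in t})

# Rows are sentences, columns are terms, cells are raw counts
# (use 1/0 instead of counts for the "binary" variant).
matrix = [[Counter(t)[w] for w in vocab] for t in tokens]
```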
<h4>
Singular Value Decomposition</h4>
<b>SVD</b> is a linear algebra operation. Given a matrix, it creates three matrices based on the original matrix. The one we care about is the "left feature matrix", commonly called U in the literature. The rows are "document feature vectors", which represent a kind of importance for each document. A <i>feature</i>, in machine learning, refers to signal in noisy data. Applied to a matrix of sentences v.s. terms, the document with the largest value in the left-hand column is the most "important" in the corpus. The remaining values in the row also contribute to importance.<br />
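Here is a minimal numpy sketch of the idea (the matrix is invented, and this is not the project's code): factor the sentence/term matrix, then score each sentence by its row in the left feature matrix, scaled by the singular values as in the notes below:

```python
import numpy as np

# Toy sentence/term matrix: rows are sentences, columns are terms.
A = np.array([
    [1, 1, 1, 1, 0],   # uses many theme words together
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],   # off-theme sentence
], dtype=float)

# SVD factors A into U @ diag(s) @ Vt; U is the "left feature matrix",
# whose rows are the document feature vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank sentences by the norm of their feature vectors multiplied by S;
# the sentence with the largest score is the most "important".
scores = np.linalg.norm(U * s, axis=1)
ranked = np.argsort(-scores)
```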
<h4>
Tying it together</h4>
LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.<br />
<h3>
Preliminary Results</h3>
Prelims are interesting, but nowhere near perfect. I've done this on one test document, and it clearly homes in on important declarative sentences. The test document is a poor fit, though: it's filled with formulae, and do you make terms out of Greek letters?<br />
<h4>
Raw Data</h4>
I've attached the raw text used in this exercise, and the sentences ordered by their SVD norm. I've tried several variations on the algorithm, and the first few sentences always turn up, so it's a pretty strong signal. Here are the sorted sentences, with the sentence # in the paper and the sorting value. <br />
The most interesting:<br />
<blockquote>
<span style="font-size: x-small;">{216, 1.0000000000000173} PERFORMANCE OF AVERAGED CLASSIFIERS 1719 and Mason [9] show how to break down the problem of learning alternating decision trees (a class of rules which generalizes decision trees and boosted decision trees) into a sequence of simpler learning problems using boosting.</span><br />
<span style="font-size: x-small;">{36, 1.0000000000000162} Second, we choose a class of prediction rules (mappings from the input to the binary output) and assume that there are prediction rules in that class whose probability of error (with respect to the distribution D) is small, but not necessarily equal to zero.</span><br />
<span style="font-size: x-small;">{44, 1.0000000000000142} On the other hand, if we allow the algorithm to output a zero, we can hope that the algorithm will output zero on about 4% of the input, and will be incorrect on about 1% of the data.</span><br />
<span style="font-size: x-small;">{26, 1.0000000000000129} Instead the basic argument is that the Bayesian method is always the best method, and therefore, the only important issues are how to choose a good prior distribution and how to efficiently calculate the posterior average.</span><br />
<span style="font-size: x-small;">{123, 1.0000000000000127} Unlike the event of a classification mistake, which depends both on the predicted label and the actual label, the event of predicting 0 does not depend on the actual label.</span><br />
<span style="font-size: x-small;">{179, 1.0000000000000122} For example, if the goal is to detect a rare type of instance within a large set, the correct method might be to sort all instances according to their log- ratio score and output the instances with the highest scores.</span></blockquote>
The middle:<br />
<blockquote>
<span style="font-size: x-small;">{202, 1.0000000000000022} In other words, the behavior of our algorithm is, in fact, similar to that of large margin classifiers.</span><br />
<span style="font-size: x-small;">{11, 1.000000000000002} By minimiz- ing this cost, the learning algorithm attempts to minimize both the training error and the amount of overfitting.</span><br />
<span style="font-size: x-small;">{20, 1.000000000000002} PERFORMANCE OF AVERAGED CLASSIFIERS 1699 considerable experimental evidence that such averaging can significantly reduce the amount of overfitting suffered by the learning algorithm.</span><br />
<span style="font-size: x-small;">{21, 1.000000000000002} However, there is, we believe, a lack of theory for explaining this reduction.</span><br />
<span style="font-size: x-small;">{42, 1.000000000000002} MANSOUR AND R.</span><br />
<span style="font-size: x-small;">{55, 1.000000000000002} Of course, if the generated classifier outputs zero most of the time, then there is no benefit from having it.</span><br />
<span style="font-size: x-small;">{139, 1.000000000000002} Doing this also improves the guaranteed performance bounds because it reduces |H |.</span><br />
<span style="font-size: x-small;">{210, 1.000000000000002} Obviously, calculating the error of all of the hypotheses in the class is at least as hard as finding the best hypothesis and probably much harder.</span><br />
<span style="font-size: x-small;">{9, 1.0000000000000018} To overcome this problem, one usually uses either model-selection or reg- ularization terms.</span><br />
<span style="font-size: x-small;">{41, 1.0000000000000018} FREUND, Y.</span><br />
<span style="font-size: x-small;">{88, 1.0000000000000018} Note that the statements of Theorems 1 and 2 have no dependence on the hypothesis class H .</span><br />
<span style="font-size: x-small;">{92, 1.0000000000000018} The second inequality uses the fact that changing one example can change the empirical error by at most 1/m.</span><br />
<span style="font-size: x-small;">{95, 1.0000000000000018} Given x, we partition the hypothesis set H into two.</span></blockquote>
And the least:<br />
<blockquote>
<span style="font-size: x-small;">{6, 1.0} Consider a binary classification learning problem.</span><br />
<span style="font-size: x-small;">{16, 1.0} 1Supported in part by a grant from the Israel Academy of Science.</span><br />
<span style="font-size: x-small;">{50, 1.0} Each of these makes two mistakes on this data set.</span><br />
<span style="font-size: x-small;">{149, 1.0} 1714 THEOREM 6.</span><br />
<span style="font-size: x-small;">{150, 1.0} FREUND, Y.</span><br />
<span style="font-size: x-small;">{238, 1.0} FREUND, Y.</span><br />
<span style="font-size: x-small;">{112, 0.9999999999999999} FREUND, Y.</span><br />
<span style="font-size: x-small;">{155, 0.9999999999999998} MANSOUR AND R.</span><br />
<span style="font-size: x-small;">{208, 0.9999999999999998} Consider first the computational issue.</span><br />
<span style="font-size: x-small;">{105, 0.999999999999999} MANSOUR AND R.</span><br />
<span style="font-size: x-small;">{239, 0.999999999999999} MANSOUR AND R.</span><br />
<span style="font-size: x-small;">{5, 0.9999999999999972} Introduction.</span></blockquote>
Notes to experts:<br />
<ul>
<li>The terms were simple counts.</li>
<li>The sentences are sorted by the norms of the feature vectors. With the vectors multiplied by S, the longest sentences went straight to the top. Without S, length was nowhere near as strong.</li>
<li>The input data came from block copy&paste out of a PDF reader, processed with the OpenNLP Sentence Detector. It is suitably messy.</li>
<li>MANSOUR and FREUND are the authors; these terms appear in headers and footers and are thus important terms.</li>
</ul>
<h3>
Next steps<br />
</h3>
<h4>
Data cleaning with OpenNLP</h4>
Many of the sentences are so long that perhaps breaking longer sentences into clauses will give a more tractable dataset. Sentences with "A, however, B" and "C?" are probably not central to the discussion. OpenNLP is an Apache project for Natural-Language Processing. It has a sentence parsing toolkit that can help us split long sentences.<br />
<h4>
Tuning</h4>
The literature recommends various formulae for the term counts.<br />
<h4>
Tag Clouds</h4>
The important words should be the tag cloud, right?Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-56575192426422021662011-08-15T02:03:00.000-07:002011-08-17T01:49:24.124-07:00Singular vectors for recommendationsThis is a project to research:<br />
<ol><li>Reproducing these results: <a href="http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/">http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/</a></li>
<li>Correlating the feature vectors and singular values from an SVD-based recommender to the generated vectors for users and items. </li>
</ol>This was inspired by a lecture by one of the top five finishers in the Netflix contest: the speaker demonstrated axes of interest: chick flicks vs. Star Trek, Harry Potter vs. Stanley Kubrick, etc. These clusters in the full item space are at the endpoints of vectors which can be recovered from the feature vectors and singular values.<br />
<br />
<a href="https://github.com/LanceNorskog/LSH-Hadoop/blob/master/extras/mahout/test/java/org/apache/mahout/cf/taste/impl/common/TestOpposites.java">TestOpposites</a>.java<br />
<br />
This program and the following chart are my recreation of the raw data from the article above. BTJF are the original users whose ratings created the projection: Ben, Tom, Jeff, and Fred. Bob is the Bob from the article; Love and Hate are users who love and hate all six seasons.<br />
<br />
"Singular" and "Singular Div" are the first two feature vector columns of the SVD left-hand matrix. They are orthogonal. In the later chart, "Shifted data", we will use them to find "axes of interest" for the different items.<br />
<br />
<span style="font-size: large;">Raw data:</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqP8Nw4tHi-Ahml88WUpLLyxaoyFEUa7_bvIE-2zjdvSpjTlMW7_zuTkmJYCkUJSLl3hbEKQ4UYOq3Go-HAbcRVFPNFPzk9ZmcRWIvO5pAzmtz7sjMlZJAAkOV8y02lkMPIkqW_yTLUrI/s1600/Raw_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="270" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqP8Nw4tHi-Ahml88WUpLLyxaoyFEUa7_bvIE-2zjdvSpjTlMW7_zuTkmJYCkUJSLl3hbEKQ4UYOq3Go-HAbcRVFPNFPzk9ZmcRWIvO5pAzmtz7sjMlZJAAkOV8y02lkMPIkqW_yTLUrI/s400/Raw_data.jpg" width="400" /></a></div><div></div><span style="font-size: large;">Shifted Data:</span><br />
And now, the magic. The space of users is centered at 0. The two feature vectors are mirrored across (0,0), and the two orthogonal singular/feature vectors are scaled down by their singular values. With the singular values normalized, the first is 0.6 and the second is 0.25. In this chart, we take the original positions of the feature vectors and multiply them by 0.6 (yellow triangles) and 0.25 (red asterisks). And, I've drawn lines between the downsized versions. Now we have the 4 original users who established this space, three new users who are projected into this space, and the two major axes of interest.<br />
<br />
<div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWa3_M7p0-q_iPJTLatYdif8vQ6BH7A4LPVvseHr9tZjBHTwhiOZJl9_wowA67MBra8s_fhdM7BtbbLEANMJSs9IWI_ca0zoB_uzn-agP9YpJnsbMdA4AIEAJoiYqQZOdbASGVRvxtQ7g/s1600/Shifted_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="270" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWa3_M7p0-q_iPJTLatYdif8vQ6BH7A4LPVvseHr9tZjBHTwhiOZJl9_wowA67MBra8s_fhdM7BtbbLEANMJSs9IWI_ca0zoB_uzn-agP9YpJnsbMdA4AIEAJoiYqQZOdbASGVRvxtQ7g/s400/Shifted_data.jpg" width="400" /></a></div>Observations:<br />
<ol><li>The four original users (Billy-Bob, Trimolchio, Jenga and Ferdinand) all had somewhat orthogonal item ratings, and come out in an arc. One of them is far from the others on a large circle, and the feature vectors make sense given the "gravity" of the four users on the circle.</li>
<li>Bob also had an item rating vector with the same style of pluses&minuses as BJTF, and appears at an expected place.</li>
<li>The singular feature vectors (yellow triangles, red asterisks) do give "axes of interest" that make sense v.s. BTJF and Bob.</li>
<li>Love and Hate are the nearest to the two ends of the dominant axis of interest. They are also nearly between the endpoints of the axes of interest.</li>
</ol>Conclusions:<br />
<ul><li>This technique gives two results:</li>
<ul><li>It supplies axes of interest.</li>
<li>It allows a new user to describe himself based on the major axes of interest.</li>
</ul><li>There is a fine yellow line between Love and Hate.</li>
</ul>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-54724125509051449512011-08-14T20:37:00.000-07:002011-08-14T20:39:09.814-07:00Sorted recommender dataThese are some images from experiments in sorting a ratings data model. This post has a few sorted versions of the GroupLens 1million sample database. Green and red are sequence values in the sorted output; they help visualize dense to sparse. Red is dense, green is sparse. 5% of men have red-green colorblindness.<br />
<br />
Sorted by user:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh80yj4X-DtN1RAI3wxQtlPPxwXCQSicDZZgkZts5EyEzS3cfjq5wHECCa4RxzMr6t0a6bblx5JtIas-NVmk1UNG_5ym5ocEssJ9DHCPSPXDmXBebjbcmN9dbzcPPDKvXS_deWcc0Hskfc/s1600/UsersSorted_100k_of_1m_GL.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh80yj4X-DtN1RAI3wxQtlPPxwXCQSicDZZgkZts5EyEzS3cfjq5wHECCa4RxzMr6t0a6bblx5JtIas-NVmk1UNG_5ym5ocEssJ9DHCPSPXDmXBebjbcmN9dbzcPPDKvXS_deWcc0Hskfc/s320/UsersSorted_100k_of_1m_GL.png" width="315" /></a></div>Sorted by item:<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeUi9mG-x79D-e3rnXU4RCUB7sigtUuZxa-k-Whik6QSvsLZyFxs-iXhdsAOJpj-LkqezIhETy6iQ56q9vxmx-u7Zw0D1h2LIM99ywxSRgXt7BumAUwM0KZAQg_OOm61McP1wwCiH4lNQ/s1600/UsersSorted_100k_of_1m_GL.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeUi9mG-x79D-e3rnXU4RCUB7sigtUuZxa-k-Whik6QSvsLZyFxs-iXhdsAOJpj-LkqezIhETy6iQ56q9vxmx-u7Zw0D1h2LIM99ywxSRgXt7BumAUwM0KZAQg_OOm61McP1wwCiH4lNQ/s320/UsersSorted_100k_of_1m_GL.png" width="315" /></a></div>Sorted by user and item:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWyt6pm2PuQby5gFAGgSleeC6kFQoc73JaabSYYJ5yyggn_ZDubRmiDmdZp9glj2Z0_GEC6WJoie3V6ocFhUFNICYKCBW0KfnOJ115Oa1XzsZ6mgGsgWchxoPQIMofECgArnFtdQgVa_k/s1600/DualSorted_500k_of_1m_GL.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="313" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWyt6pm2PuQby5gFAGgSleeC6kFQoc73JaabSYYJ5yyggn_ZDubRmiDmdZp9glj2Z0_GEC6WJoie3V6ocFhUFNICYKCBW0KfnOJ115Oa1XzsZ6mgGsgWchxoPQIMofECgArnFtdQgVa_k/s320/DualSorted_500k_of_1m_GL.png" width="320" /></a></div><span style="font-size: large;">Why is this interesting?</span><br />
The lower left, red corner, has popular items. The upper right has the "long tail". I am a long-tail guy; I don't really care about popular movies. Japanese female-assassin movies, BBC comedy, and Brazilian horror movies are on my list. An acquaintance received recommendations from Amazon for French Post-Structuralist literature (I don't know either) and pornographic comic books. "Someone finally understands me!"<br />
A recommendation for me should be biased to the upper right. This sorting gives one algorithm to add to the pile.<br />
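The density sort itself is simple; here is a hedged sketch (toy data, not the SortDataModel.java implementation linked below) that ranks users and items from dense to sparse and re-indexes ratings for the dual-sorted view:

```python
from collections import Counter

# Toy (user, item, rating) triples; real data would come from the GroupLens files.
ratings = [(1, 'a', 4), (1, 'b', 5), (2, 'a', 3), (3, 'a', 2), (3, 'c', 4), (1, 'c', 1)]

# Count ratings per user and per item.
user_counts = Counter(u for u, i, r in ratings)
item_counts = Counter(i for u, i, r in ratings)

# Rank users and items from most-rated (dense) to least-rated (sparse);
# the dense-dense corner holds popular items, the sparse corner the long tail.
user_rank = {u: n for n, (u, _) in enumerate(user_counts.most_common())}
item_rank = {i: n for n, (i, _) in enumerate(item_counts.most_common())}

# Re-index each rating by (user rank, item rank) for the dual-sorted plot.
dual_sorted = sorted(ratings, key=lambda t: (user_rank[t[0]], item_rank[t[1]]))
```

Plotting `(user_rank, item_rank)` for each rating gives the dense-to-sparse pictures above; biasing recommendations toward high ranks biases them toward the long tail.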
<br />
<span style="font-size: large;">Details</span><br />
Program: SortDataModel.java, ModelComparator.java and maybe some other things in this repository.<br />
<div class="separator" style="clear: both; text-align: center;"></div><a href="https://github.com/LanceNorskog/LSH-Hadoop/blob/master/extras/mahout/test/java/org/apache/mahout/cf/taste/impl/common/SortDataModel.java">https://github.com/LanceNorskog/LSH-Hadoop/blob/master/extras/mahout/test/java/org/apache/mahout/cf/taste/impl/common/SortDataModel.java</a><br />
<br />
Visuals by KNime.Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-84473129753769403272011-08-14T20:19:00.000-07:002011-08-14T20:19:56.187-07:00Koren on recommenders - comments about temporal changes in rating datahttp://videolectures.net/kdd09_koren_cftd/<br />
<br />
Koren talks about how the timestamps of watching/rating events are important. In general, items change slowly in rating value, while users change their rating behavior suddenly.<br />
<br />
Slides stolen:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://carbon.videolectures.net/2009/contrib/kdd09_paris/koren_cftd/kdd09_koren_cftd.zip.slides/kdd09_koren_cftd_Page_18.480.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="360" src="http://carbon.videolectures.net/2009/contrib/kdd09_paris/koren_cftd/kdd09_koren_cftd.zip.slides/kdd09_koren_cftd_Page_18.480.jpg" width="480" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://carbon.videolectures.net/2009/contrib/kdd09_paris/koren_cftd/kdd09_koren_cftd.zip.slides/kdd09_koren_cftd_Page_19.480.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="360" src="http://carbon.videolectures.net/2009/contrib/kdd09_paris/koren_cftd/kdd09_koren_cftd.zip.slides/kdd09_koren_cftd_Page_19.480.jpg" width="480" /></a></div><br />
<br />
<br />
<br />
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-87072017375524913222011-07-10T19:39:00.000-07:002011-07-10T19:39:13.355-07:00Dimensional Reduction via Random Projection, Part 4: Better Living Through Rotation!Thanks to a <a href="http://www.lucidimagination.com/search/document/d222a84092b4c854/dimensional_reduction_via_random_projection_investigations#90657a5825e03ad2">tip</a> from <a href="http://tdunning.blogspot.com/">Ted Dunning</a>, I have rotated the four distributions via factorizing the distributions. <br />
<br />
<table><tbody>
<tr><td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji3ouEA1xNADgjKtwhTTLuu2hzKC6FNFhWreN5LIvnulGgdJqKrcNnNn2ZbrKW3k7DVzFevn6u4RTldHHx5wtN94pJu4vZGuDmuZgAQ4tc7M_DIuJj_7O78jFrcbGohm5UePwGmzkKSQI/s1600/SVR_200x2_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji3ouEA1xNADgjKtwhTTLuu2hzKC6FNFhWreN5LIvnulGgdJqKrcNnNn2ZbrKW3k7DVzFevn6u4RTldHHx5wtN94pJu4vZGuDmuZgAQ4tc7M_DIuJj_7O78jFrcbGohm5UePwGmzkKSQI/s320/SVR_200x2_gaussian.png" width="240" /></a></div></td><td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8AVMa-M99CmNs-FZVzKTo3-6OMvs3iH-7KJSfBCirm2FcDKwHUGGkBYN6D30dJcKxDL7QiZJohIXnge4DP_IGfWifncFDSaXGD6HmhxW3aQwZFZKDxFEoJiVWy0zbAWRfYr6XLWhLPWQ/s1600/SVR_200x2_plusminus.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8AVMa-M99CmNs-FZVzKTo3-6OMvs3iH-7KJSfBCirm2FcDKwHUGGkBYN6D30dJcKxDL7QiZJohIXnge4DP_IGfWifncFDSaXGD6HmhxW3aQwZFZKDxFEoJiVWy0zbAWRfYr6XLWhLPWQ/s320/SVR_200x2_plusminus.png" width="240" /></a></div></td></tr>
<tr><td><em>Gaussian</em></td><td><em>+1/-1</em></td></tr>
<tr><td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj26ReSE9eLIwxRN42F3usJdP8B1A3mj7W0-8xECfXqa8c6eFwdtw6cpyI8Hc8TnPFFCPmXsbLsfpD0XEd_aV6fm5XI6xIePlg6zPcjywB5oLZXAlaKTVFk1xcP1q-cXYMrDXUawFye_3E/s1600/SVR_200x2_sqrt3.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj26ReSE9eLIwxRN42F3usJdP8B1A3mj7W0-8xECfXqa8c6eFwdtw6cpyI8Hc8TnPFFCPmXsbLsfpD0XEd_aV6fm5XI6xIePlg6zPcjywB5oLZXAlaKTVFk1xcP1q-cXYMrDXUawFye_3E/s320/SVR_200x2_sqrt3.png" width="240" /></a></div></td><td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFwcc1tm4xec4d4cu1t4CmN93PwTLpLuW76bNAQneKNeaxjWQb0Wr1P568ywR__slRTtFQiWU9t3K3VkQGMmv3t9CJVFJaTV9hEqO5nQN2yQa6OvkWa0NwxZT-UQ0ocs2-jF-Gt8o1OiM/s1600/SVR_200x2_linear.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFwcc1tm4xec4d4cu1t4CmN93PwTLpLuW76bNAQneKNeaxjWQb0Wr1P568ywR__slRTtFQiWU9t3K3VkQGMmv3t9CJVFJaTV9hEqO5nQN2yQa6OvkWa0NwxZT-UQ0ocs2-jF-Gt8o1OiM/s320/SVR_200x2_linear.png" width="240" /></a></div></td></tr>
<tr><td><em>Sqrt3</em></td><td><em>Linear</em></td></tr>
</tbody></table><br />
These look mighty funky. I've taken this as far as I can without being able to browse the item metadata (movie names).Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-23824918273575392882011-07-07T20:13:00.000-07:002011-07-07T20:13:41.186-07:00R Programming Resources #1<a href="http://www.johndcook.com/R_language_for_programmers.html">The R type system</a>: I knew there was something weird going on.<br />
<a href="http://www.r-bloggers.com/select-operations-on-r-data-frames/">Operations on data frames </a><br />
<a href="http://www.r-bloggers.com/r-tutorial-series-r-beginners-guide-and-r-bloggers-updates/">R Tutorial Series</a><br />
<a href="http://www.r-bloggers.com/" target="_blank">R Bloggers</a>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-67853633761136792532011-07-01T02:33:00.000-07:002011-07-01T02:33:31.145-07:00Dimensional Reduction via Random Projection, Part 1: Huh?This recounts a short research project in using random projection for visualization.<br />
<br />
I started with vectors which represent items in a data model. (See my Semantic Vectors posts.) They are from 100k entries in the GroupLens film ratings dataset, representing ratings by 669 users on 3274 items. The 200 vectors are semi-randomly chosen from the item pool. All are 200-dimensional and dense. Euclidean distances signify nearness for item-item recommendation. The item-item distances give good ratings in the 150-200 dimension range, so it is reasonable to treat 200 dimensions as having rich support. The subset of item vectors has strong user-user bonds, so the recommendation structure is somewhat visible. This was a mistake, but I like it.<br />
<br />
The project uses simple 2D visualization to decide whether random projection is as good as other, older methods of dimensional reduction. I used KNime's MDS implementation. MDS (Multi-Dimensional Scaling) is an older algorithm for creating a projection from a high-dimensional space to a low-dimensional one while retaining pairwise distances. MDS is tunable for the amount of information dropped.<br />
<br />
Each test run uses the same 200 vectors in the same order. Each vector has the same color each time (from vivid green to black to vivid red).<br />
<br />
The main axis of investigation is between three projection methods: complete Random Projection, partial Random Projection, and complete MDS. The following scatter plots show the vectors reduced to 2D (X and Y are the generated dimensions; color identifies the vector).<br />
<ol><li>Random projection from 200 dimensions to 2 dimensions.</li>
<li>Random projection from 200 dimensions to 20 dimensions, then use MDS to boil down to 2 dimensions.</li>
<li>Use MDS to convert from 200 dimensions to two dimensions.</li>
</ol><table><tbody>
<tr> <td><b><i>1.</i></b><br />
<i>Full Random Projection from 200 dimensions to 2 dimensions (Gaussian)</i></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4Yr4hyphenhyphenhnS6UR4ZCHe4WKiHo62mPnnj0JH-0-DjWYDpzs79PNCn-NmgkiavZRlIey8mbwhzC2nv-k9k6Zelgaz5F_b0EO-TLCavXJg1n-_khaBBra9rd_SNbDkEQ79SnND8mnWadWyQSo/s1600/SVR_200x2_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4Yr4hyphenhyphenhnS6UR4ZCHe4WKiHo62mPnnj0JH-0-DjWYDpzs79PNCn-NmgkiavZRlIey8mbwhzC2nv-k9k6Zelgaz5F_b0EO-TLCavXJg1n-_khaBBra9rd_SNbDkEQ79SnND8mnWadWyQSo/s320/SVR_200x2_gaussian.png" width="320" /></a></div></td></tr>
<tr> <td><b><i>2.</i></b><br />
<i>Random Projection from 200D to 20D, then MDS to 2D (Gaussian, KNime MDS implementation)</i></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja8uixE65QpxiTm557WRxHyA7lCrH5j5yWMkzVIZCWApmBx19BJXiGKUObZ7QYVf0RAb3yBSLwl6IgzYNJiI4EXBhAaWejxVa919OdvPsFMVonU85BrMGZyM1tBt-flGFV8H4RqWSBuqA/s1600/SVR_200x20_MDS_40_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja8uixE65QpxiTm557WRxHyA7lCrH5j5yWMkzVIZCWApmBx19BJXiGKUObZ7QYVf0RAb3yBSLwl6IgzYNJiI4EXBhAaWejxVa919OdvPsFMVonU85BrMGZyM1tBt-flGFV8H4RqWSBuqA/s320/SVR_200x20_MDS_40_gaussian.png" width="320" /></a></div></td></tr>
<tr> <td><b><i>3.</i></b><br />
<i>Full MDS from 200d to 2d (KNime MDS Implementation)</i></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRwbndKYEJCvNEI1qshBaRxrPkabI1_ZXfbmQMufNGgdSl4C2t0BnpRtYUDf5Yg7_3Utph-8ujy3TKUHgqCJH8ada_AbH3B_HnCjvrGUpVBKujkxWeMy10ACOcBmMTVBAhytTORkD9Yrg/s1600/SVR_200_full_MDS_40.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRwbndKYEJCvNEI1qshBaRxrPkabI1_ZXfbmQMufNGgdSl4C2t0BnpRtYUDf5Yg7_3Utph-8ujy3TKUHgqCJH8ada_AbH3B_HnCjvrGUpVBKujkxWeMy10ACOcBmMTVBAhytTORkD9Yrg/s320/SVR_200_full_MDS_40.png" width="320" /></a></div></td></tr>
</tbody> </table>Conclusions:<br />
<ul><li>Using full random projection does not completely trash the data, but it's challenging to interpret.</li>
<li> Downsizing the data to 20 dimensions and then using MDS the rest of the way is somewhat cleaner, and good enough to work with. It's also faster.</li>
<li>Full MDS shows a strikingly clear picture of the item-item structure.</li>
</ul>With later work it has become clear that these are the same projection, rotated. Later I would like to reduce to 3D and do animations of rotating these vector sets. <br />
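The three reductions compared above can be sketched in a few lines of numpy (a hedged sketch on synthetic vectors, not the KNime pipeline that produced the charts; the MDS stages would be supplied by a separate library):

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(200, 200))   # stand-in for 200 dense 200-d item vectors

# Method 1: full Gaussian random projection straight to 2-d.
# Entries are N(0, 1/k) so expected squared norms are preserved.
R2 = rng.normal(size=(200, 2)) / np.sqrt(2)
proj_2d = items @ R2

# Method 2, first stage: random projection to 20-d.
# (MDS would then boil these 20-d vectors down to 2-d.)
R20 = rng.normal(size=(200, 20)) / np.sqrt(20)
proj_20d = items @ R20

# Method 3 is full MDS on the original 200-d vectors, with no RP stage.
```

The 1/sqrt(k) scaling is the standard Johnson-Lindenstrauss convention; it keeps pairwise distances approximately unchanged in expectation.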
<br />
<i>Onward to Part 2: there is more than one random projection.</i><br />
<br />
I did these diagrams with the KNime visual programming app for data mining. <a href="http://www.knime.org/">All Hail KNime!</a>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-62131739681780941022011-07-01T02:32:00.001-07:002011-07-01T02:33:07.202-07:00Dimensional Reduction via Random Projection, Part 2: DistributionsThere is more than one random distribution, and there are some surprises in store.<br />
<br />
Achlioptas (2001) claims that random projection does not need fully Gaussian values: +1/-1 chosen with equal probability suffices. Even better, a uniform draw from [0, 0, 0, 0, sqrt(3), -sqrt(3)] also works; it throws away 4 out of 6 input values.<br />
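The four distributions can be sampled like this (a minimal sketch; the variable names are mine, and "linear" follows this post's usage meaning a uniform draw):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

gaussian  = rng.normal(size=n)                           # standard normal
plusminus = rng.choice([1.0, -1.0], size=n)              # +1/-1, equal probability
sqrt3     = rng.choice([0.0, np.sqrt(3), -np.sqrt(3)],   # Achlioptas' sparse variant:
                       size=n, p=[2/3, 1/6, 1/6])        # 4 of 6 entries are zero
linear    = rng.uniform(-1.0, 1.0, size=n)               # uniform ("linear") draw

# gaussian, plusminus and sqrt3 all have mean 0 and variance 1,
# which is what makes them interchangeable as projection entries.
```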
<br />
This post explores applying these four different distributions to the same random projection. To recap, here is the best I've got: a full MDS projection from 200 dimensions to 2 dimensions.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgINHdGprVUyVztwUAqVkBYYqJuCTkr4lwy0mc3ukOA-hxIHat1INk0HGGR2sctmPqxCVQ1zgN8mt93YZxkEiHoFLq2mgx49bJDXuRen9-8gezk8iwt_nCVOURR3wR5p1ENFYMt9pFQdAQ/s1600/SVR_200_full_MDS_40.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgINHdGprVUyVztwUAqVkBYYqJuCTkr4lwy0mc3ukOA-hxIHat1INk0HGGR2sctmPqxCVQ1zgN8mt93YZxkEiHoFLq2mgx49bJDXuRen9-8gezk8iwt_nCVOURR3wR5p1ENFYMt9pFQdAQ/s320/SVR_200_full_MDS_40.png" width="320" /></a></div><br />
Here are four versions of the same dataset, reducing 200 dimensions to 2 dimensions via random projection:<br />
<br />
<table><tbody>
<tr> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoc_SN903oxTI_AbZRMTNr4K5yAedXsZYalx67n_rF6aV3AkwLlplmqwxA_1yfqswryUqaoMfe933TsoriTgyEPjAkBkfiiE8xh3GbauGxmwht_3uFrw7L7SAKFtF6DsN25EuGJ9Yzb5U/s1600/SVR_200x2_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoc_SN903oxTI_AbZRMTNr4K5yAedXsZYalx67n_rF6aV3AkwLlplmqwxA_1yfqswryUqaoMfe933TsoriTgyEPjAkBkfiiE8xh3GbauGxmwht_3uFrw7L7SAKFtF6DsN25EuGJ9Yzb5U/s320/SVR_200x2_gaussian.png" width="240" /></a></div></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipFkQkce3J2TvlvuYkY96dfwYH4cJRwAtyKtkLwJtNb-1FrADBFfCmeTtKsrskyxiRg7PdsN32fM8yC4ar3UOZHg5xXMLccTbzPbQM0MgYn5HHw0bmS5aEDgK0KkgQxz9_gHT55VIs2gs/s1600/SVR_200x2_half.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipFkQkce3J2TvlvuYkY96dfwYH4cJRwAtyKtkLwJtNb-1FrADBFfCmeTtKsrskyxiRg7PdsN32fM8yC4ar3UOZHg5xXMLccTbzPbQM0MgYn5HHw0bmS5aEDgK0KkgQxz9_gHT55VIs2gs/s320/SVR_200x2_half.png" width="240" /></a></div></td> </tr>
<tr><td><i>Gaussian</i></td><td><i>+1/-1</i></td></tr>
<tr> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPADX9IV0ZB6kpFVOVdaQyAeeBIjOQ-_TuC4CQk3IPHw_L7toLC9uKrz1cM7w3RLxiErvK9SQAdElC52BjfYaoQWzsV8deEOoSadPPxGZmDUmIQ_DnIMTOP9A7zdTjxXRmXRktG9CVkYY/s1600/SVR_200x2_sqrt3.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPADX9IV0ZB6kpFVOVdaQyAeeBIjOQ-_TuC4CQk3IPHw_L7toLC9uKrz1cM7w3RLxiErvK9SQAdElC52BjfYaoQWzsV8deEOoSadPPxGZmDUmIQ_DnIMTOP9A7zdTjxXRmXRktG9CVkYY/s320/SVR_200x2_sqrt3.png" width="240" /></a></div></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7f2U4JPuJMENFoODBKb6ZoYr0o6oXz80pwEHsqqPJd528LuB8FyCDTMAvFUKJ2dCuN26zRo4lFGPiFcy-xFlgxQ1C4b1mjHH4PIH6ERMPz1nWMvNSDPWI4wiH-r2JYj0pLV8c-CoAZjQ/s1600/SVR_200x2_linear.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7f2U4JPuJMENFoODBKb6ZoYr0o6oXz80pwEHsqqPJd528LuB8FyCDTMAvFUKJ2dCuN26zRo4lFGPiFcy-xFlgxQ1C4b1mjHH4PIH6ERMPz1nWMvNSDPWI4wiH-r2JYj0pLV8c-CoAZjQ/s320/SVR_200x2_linear.png" width="240" /></a></div></td> </tr>
<tr><td><i>Sqrt(3)</i></td><td><i>Linear</i></td></tr>
</tbody></table><br />
The full MDS version is certainly the most pleasant to look at. The Gaussian and the two Achlioptas distributions all seem to be different rotations of a cylinder. The Linear distribution is useless in this situation.<br />
<br />
Given these results, to do a quick visualization of your data, I would try all four random distributions; you may get lucky like I did with somewhat similar rotations. And, I recommend the colorizing trick; it really helps show what's going on here. After that, I would get a good dimensional reduction algorithm and do the 2-stage process:<br />
<br />
high-dimensional -> RP -> low-dimensional -> formal dimensional reduction -> 2D or 3D.<br />
<br />
On to part 3 for a discussion of noise. <br />
<br />
<i><b>Achlioptas, 2001</b></i><br />
<i>Database-friendly random projections: Johnson-Lindenstrauss with binary coins</i><br />
<br />
PDF available online at various places: <br />
<ul><li><a href="http://www.mendeley.com/research/databasefriendly-random-projections">Mendeley</a></li>
<li><a href="http://www.google.com/search?q=database+friendly+random+projections">You-know-where</a></li>
</ul>I did these diagrams with the KNime visual programming app for data mining. <a href="http://www.knime.org/">All Hail KNime!</a> <br />
<ul></ul>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-51683589302965169522011-07-01T02:32:00.000-07:002011-07-01T02:32:24.694-07:00Dimensional Reduction via Random Projection, Part 3: NoiseNow for a third axis of investigation: distances.<br />
<br />
Random Projection preserves pairwise distances; this is its claim to fame. To measure this, I created matrices of pairwise distances before and after "full RP": 200-d -> RP -> 2d. Here are the spreads of distances, one color per vector: the X axis is the 200-d distance, and the Y axis is the 2-d distance.<br />
<table><tbody>
<tr> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Fvy059mXQ_c9lvv2FbmbOJe2TiuHYBboOGSKtdJywAcyCaYJ1EQtd3UwHfWhRmKKv99y07OEo-c4nJ59rZlPQc6d_MFFG-chY0RrDTOEJLQYrnL1VS1TVEIsWJKzsF6C2ebep5lB-Vg/s1600/SVR_200_distances_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Fvy059mXQ_c9lvv2FbmbOJe2TiuHYBboOGSKtdJywAcyCaYJ1EQtd3UwHfWhRmKKv99y07OEo-c4nJ59rZlPQc6d_MFFG-chY0RrDTOEJLQYrnL1VS1TVEIsWJKzsF6C2ebep5lB-Vg/s320/SVR_200_distances_gaussian.png" width="240" /></a></div></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFJvgvU-5l39zNsZ5eUvHaP49dQS__jSGJdyvNdP-UIQo4narf-NS0oBQB7i9GLoauSc4I3JJRSdQMBVZD-9dmB2ddUMVUb1XYaqpsiUdtuLFGTWe9B4ZzBqLpMhkuTF6G6y7sSGizYjU/s1600/SVR_200_distances_plusminus.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFJvgvU-5l39zNsZ5eUvHaP49dQS__jSGJdyvNdP-UIQo4narf-NS0oBQB7i9GLoauSc4I3JJRSdQMBVZD-9dmB2ddUMVUb1XYaqpsiUdtuLFGTWe9B4ZzBqLpMhkuTF6G6y7sSGizYjU/s320/SVR_200_distances_plusminus.png" width="240" /></a></div></td> </tr>
<tr> <td><i>Gaussian</i></td><td><i>+1/-1</i></td> </tr>
<tr> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTRpntHT0WumXW5xNDKm2MqSfN5x-U5VlIi4a8TEsvlxdHbrHNKNYgCjAMHfAdH1knByybDYzfv-C6F0aDWKiKQEOSy1lXQ8MQo3aNIGDedF6_2zOFt6VxvtNXMaMZn4Gzcq8ejSDtL5M/s1600/SVR_200_distances_sqrt3.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTRpntHT0WumXW5xNDKm2MqSfN5x-U5VlIi4a8TEsvlxdHbrHNKNYgCjAMHfAdH1knByybDYzfv-C6F0aDWKiKQEOSy1lXQ8MQo3aNIGDedF6_2zOFt6VxvtNXMaMZn4Gzcq8ejSDtL5M/s320/SVR_200_distances_sqrt3.png" width="240" /></a></div><br />
<br />
</td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhbipCETFaNxP4eeKtFyKzL5m3yExI6-dmViQaPgUdTKVzpr-ngbdQPbWNlMkPO99EX41CIISalop1w5K2BlAGXpDr_JsxhJLjzmXFMIG29GYRH-28Oh3btFW2bZJsR3myQ0QVdnJYJ-A/s1600/SVR_200_distances_linear.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhbipCETFaNxP4eeKtFyKzL5m3yExI6-dmViQaPgUdTKVzpr-ngbdQPbWNlMkPO99EX41CIISalop1w5K2BlAGXpDr_JsxhJLjzmXFMIG29GYRH-28Oh3btFW2bZJsR3myQ0QVdnJYJ-A/s320/SVR_200_distances_linear.png" width="240" /></a></div><br />
<br />
</td> </tr>
<tr> <td><i>Sqrt(3)</i></td><td><i>Linear</i></td> </tr>
</tbody></table><br />
From tightest to loosest spreads, it's Linear, +1/-1, Sqrt(3) and Gaussian. Linear is so clean because it is overconstrained in this example. +1/-1 and Sqrt(3) are usable, and Gaussian looks like a bomb with smallpox. The +1/-1 and Sqrt(3) projectors look best for this case. If I wanted to work harder, I would compare the distance matrices directly: compute matrix norms, take standard deviations of the distances, etc. But for visualization, these worked well.<br />
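The spread charts can be quantified with a sketch like this (plain numpy on synthetic vectors, standing in for the KNime workflow; `distortion` is the relative distance error the scatter plots show as spread):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 200))            # 200 vectors in 200-d

def pairwise(M):
    # Euclidean distance matrix via the squared-norm expansion.
    sq = (M * M).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * M @ M.T
    return np.sqrt(np.clip(d2, 0, None))   # clip guards against float error

# Project to k dimensions and compare distance matrices before and after.
k = 20
R = rng.normal(size=(200, k)) / np.sqrt(k)
Y = X @ R

before, after = pairwise(X), pairwise(Y)
mask = ~np.eye(200, dtype=bool)            # ignore the zero diagonal
distortion = np.abs(after[mask] / before[mask] - 1.0)
mean_error = distortion.mean()             # average relative distance error
```

Swapping the Gaussian `R` for the +1/-1 or Sqrt(3) entries and comparing `mean_error` (or its standard deviation) gives a number to back the "tightest to loosest" ordering above.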
<br />
I did these diagrams with the KNime visual programming app for data mining. <a href="http://www.knime.org/">All Hail KNime!</a>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-2599810179081656572010-11-27T18:57:00.000-08:002010-12-26T01:26:44.562-08:00Semantic Vectors for Recommenders<div class="MsoNormal" style="font-family: "Trebuchet MS",sans-serif;"><b>Semantic Vectors for Recommenders</b></div><div class="Section1" style="font-family: "Trebuchet MS",sans-serif;"><br />
<div class="MsoNormal"><span style="font-size: 10pt;"> Start with a User->Item preference matrix. The values range from -0.5 to 0.5 in “linear” interpretation: 0 is neutral. -0.5 is twice as negative as -0.25. Blank values default to 0.</span></div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableList4" style="border-collapse: collapse; border: medium none; width: 287px;"><tbody>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: black -moz-use-text-color black black; border-style: solid none solid solid; border-width: 1.5pt medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48pt;" valign="top" width="64"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">Item1</span></div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">Item2</span></div></td> <td nowrap="nowrap" style="border-color: black black black -moz-use-text-color; border-style: solid solid solid none; border-width: 1.5pt 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 69.65pt;" valign="top" width="93"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">Item3</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48pt;" valign="top" width="64"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.6</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 69.65pt;" valign="top" width="93"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.9</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48pt;" valign="top" width="64"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User2</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 69.65pt;" valign="top" width="93"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.4</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1.5pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48pt;" valign="top" width="64"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1.5pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 69.65pt;" valign="top" width="93"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.7</span></div></td> </tr>
</tbody></table><br />
</div><div class="MsoNormal"><span style="font-size: 10pt;">Now, let's do a semantic vectors projection. The User vector is random:</span></div><br />
<div align="center"><table border="1" cellpadding="0" cellspacing="0" class="MsoTableList4" style="border-collapse: collapse; border: medium none; width: 95px;"><tbody>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: black -moz-use-text-color black black; border-style: solid none solid solid; border-width: 1.5pt medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 44.25pt;" valign="top" width="59"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: black black black -moz-use-text-color; border-style: solid solid solid none; border-width: 1.5pt 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 27pt;" valign="top" width="36"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">Random</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 44.25pt;" valign="top" width="59"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 27pt;" valign="top" width="36"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.2</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 44.25pt;" valign="top" width="59"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User2</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 27pt;" valign="top" width="36"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.6</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1.5pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 44.25pt;" valign="top" width="59"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1.5pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 27pt;" valign="top" width="36"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.9</span></div></td> </tr>
</tbody></table><br />
</div><div class="MsoNormal"><span style="font-size: 10pt;">The formula for the Item outputs is:</span></div><div align="center" class="MsoNormal" style="text-align: center;">Item(i) = ((sum(U) + sum(pref(u,i))/#U)/2)/#U</div><br />
<div align="center" class="MsoNormal" style="text-align: center;">Where #U = the number of users expressing a preference for Item(i)</div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-collapse: collapse; border: medium none;"><tbody>
<tr style="height: 13.4pt;"> <td style="border: 1pt solid windowtext; height: 13.4pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I1</span></b></div></td> <td style="border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 13.4pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal" style="margin-right: -3.95in;"><b><span style="font-size: 10pt;">Not enough preferences</span></b></div></td> </tr>
<tr style="height: 13.4pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.4pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I2</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.4pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">(((U1 + U2) + ((pref(u1,i2) + pref(u2,i2))/#U))/2)/#U</span></b></div></td> </tr>
<tr style="height: 13.4pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.4pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I3</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.4pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">(((U1 + U2 + U3) + ((pref(u1,i3) + pref(u2,i3) + pref(u3,i3))/#U))/2)/#U</span></b></div></td> </tr>
</tbody></table><br />
</div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-collapse: collapse; border: medium none;"><tbody>
<tr style="height: 13.2pt;"> <td style="border: 1pt solid windowtext; height: 13.2pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I1</span></b></div></td> <td style="border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 13.2pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">No vector</span></b></div></td> </tr>
<tr style="height: 13.2pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.2pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I2</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.2pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">(((0.2 + 0.6) + (0.6 + 0.1)/2)/2)/2</span></b></div></td> </tr>
<tr style="height: 13.2pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.2pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I3</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.2pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">(((0.2 + 0.6 + 0.9) + (0.9 + 0.4 + 0.7)/3)/2)/3</span></b></div></td> </tr>
</tbody></table><br />
</div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-collapse: collapse; border: medium none;"><tbody>
<tr style="height: 13.15pt;"> <td style="border: 1pt solid windowtext; height: 13.15pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I1</span></b></div></td> <td style="border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 13.15pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b>No vector</b></div></td> </tr>
<tr style="height: 13.15pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.15pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I2</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.15pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b>0.2875 </b></div></td> </tr>
<tr style="height: 13.15pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.15pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I3</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.15pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b>0.3944…</b></div></td>
</tbody></table><div class="MsoNormal"><br />
</div></div><br />
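As a sketch of the arithmetic above (the function and variable names here are mine, not from any library), the item projections can be reproduced in a few lines of Python:

```python
def item_vector(user_pos, prefs):
    """Project one item: ((sum(U) + sum(pref)/#U)/2)/#U, where U are the
    random positions of the users who rated the item and #U their count."""
    n = len(prefs)
    if n < 2:
        return None  # not enough preferences to place the item
    sum_u = sum(user_pos[u] for u in prefs)  # sum(U)
    sum_p = sum(prefs.values())              # sum(pref(u, i))
    return ((sum_u + sum_p / n) / 2) / n

user_pos = {"User1": 0.2, "User2": 0.6, "User3": 0.9}  # random User vector
prefs = {
    "Item1": {"User1": 0.3},
    "Item2": {"User1": 0.6, "User2": 0.1},
    "Item3": {"User1": 0.9, "User2": 0.4, "User3": 0.7},
}
for item, p in prefs.items():
    print(item, item_vector(user_pos, p))
```

Item1 gets no vector, Item2 comes out at 0.2875 and Item3 at 0.3944…, matching the tables.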
<div class="MsoNormal"><span style="font-size: 10pt;">The resulting semantic vectors projection:</span></div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableList4" style="border-collapse: collapse; border: medium none; width: 407px;"><tbody>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: black -moz-use-text-color black black; border-style: solid none solid solid; border-width: 1.5pt medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><span style="font-size: 10pt;">Item1</span></div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><span style="font-size: 10pt;">Item2</span></div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><span style="font-size: 10pt;">Item3</span></div></td> <td style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><br />
</div></td> <td style="border-color: black black black -moz-use-text-color; border-style: solid solid solid none; border-width: 1.5pt 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><b><span style="font-size: 10pt;">User Vector</span></b></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.6</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.9</span></div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.2</span></b></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User2</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.4</span></div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.6</span></b></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.7</span></div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.9</span></b></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1.5pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">Item<br />
Vector</span></b></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.2875</span></b></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.3944…</span></b></div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1.5pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> </tr>
</tbody></table><br />
<div style="text-align: left;"><span style="font-size: small;">Here is a very difficult-to-read graph of the relationships:</span></div><div style="text-align: left;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdZvwog8Ms_lICFtGAu_Hphj4y8jMPKYuqp38P0DwS3LSR0cp9bqa2FwogiPclbSCAkYiKCGkbA2_wyebIflWRG_r6pZqY_xYY47Yhu6IInMP9I6h6RlhrobOlTkUTyjP2UzsmeFeFvkQ/s1600/trash.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdZvwog8Ms_lICFtGAu_Hphj4y8jMPKYuqp38P0DwS3LSR0cp9bqa2FwogiPclbSCAkYiKCGkbA2_wyebIflWRG_r6pZqY_xYY47Yhu6IInMP9I6h6RlhrobOlTkUTyjP2UzsmeFeFvkQ/s320/trash.jpg" width="320" /></a></div><div style="text-align: left;"><br />
</div></div><div class="MsoNormal"><span style="font-size: small;"><b>Recommendations</b></span></div><span style="font-size: small;"><br />
</span><br />
<div class="MsoNormal"><span style="font-size: small;">The recommendations for the users are given by each user vector’s distance from each item vector:</span></div><div class="MsoNormal" style="margin-left: 0.25in; text-indent: -0.25in;"><span style="font-size: small;">·<span style="font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal;"> <br />
</span></span><span style="font-size: small;">User1 would be most interested in Item3, then Item2.</span></div><div class="MsoNormal" style="margin-left: 0.25in; text-indent: -0.25in;"><span style="font-size: small;">·<span style="font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal;"> <br />
</span></span><span style="font-size: small;">User2’s interests are in the same order, but User2 would find the whole list less interesting.</span></div><div class="MsoNormal" style="margin-left: 0.25in; text-indent: -0.25in;"><span style="font-size: small;">·<span style="font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal;"> <br />
</span></span><span style="font-size: small;">User3 would be most interested in Item2, then Item3.</span></div><div class="MsoNormal"><span style="font-size: small;"><br />
</span></div><div class="MsoNormal"><span style="font-size: small;"><b>Item-Item Similarity </b></span></div><span style="font-size: small;"><br />
</span><br />
<div class="MsoNormal"><span style="font-size: small;">The Item-Item similarities are the distances between the item vectors. Unfortunately, since Item1 has only one preference, it has no vector projection.</span></div><div class="MsoNormal" style="margin-left: 0.25in; text-indent: -0.25in;"><span style="font-size: small;"><span style="font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal;"> <br />
</span></span><span style="font-size: small;">Item2:Item3 = 0.11</span></div><div class="MsoNormal"><span style="font-size: small;"><b>User Vectors</b></span></div><div><span style="font-size: small;"><br />
</span></div><div class="MsoNormal"><span style="font-size: small;">The User-User distances are random; there is nothing to be learned from them.</span></div><span style="font-size: small;"><br />
</span><br />
<div class="MsoNormal"><span style="font-size: small;"><b>Summary</b></span></div><span style="font-size: small;"><br />
</span><br />
<div class="MsoPlainText"><span style="font-size: small;">The Semantic Vectors algorithm takes a matrix of Row->Column relationships and creates a set of Row-Column distances and a set of Column-Column relationships in a new, common numerical space. In this example, we created two recommenders: a User->Item recommender and an Item->Item recommender.</span></div><br />
</div>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-81584032297849768772010-11-26T02:47:00.000-08:002010-11-26T17:42:43.110-08:00Semantic Vectors - Part 1<span style="font-family: "Trebuchet MS", sans-serif;">A matrix represents a set of relationships between different rows and different columns. For </span><span style="font-family: "Trebuchet MS", sans-serif;">some purposes it is useful to derive from the matrix a set of relationships between the rows, or between the columns. </span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The <strong>Semantic Vectors</strong> algorithm projects a matrix onto two sets of vectors, one set for the rows and one set for the columns.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<ul type="disc"><li class="MsoNormal" style="mso-layout-grid-align: none; mso-list: l0 level1 lfo2; tab-stops: list .5in;"><span style="font-family: "Trebuchet MS", sans-serif;">Each vector corresponds to one row or column of the matrix.</span></li>
<li class="MsoNormal" style="mso-layout-grid-align: none; mso-list: l0 level1 lfo2; tab-stops: list .5in;"><span style="font-family: "Trebuchet MS", sans-serif;">Each vector is a point in space.</span></li>
<li class="MsoNormal" style="mso-layout-grid-align: none; mso-list: l0 level1 lfo2; tab-stops: list .5in;"><span style="font-family: "Trebuchet MS", sans-serif;">The distance between any two vectors corresponds to the delta between those rows or columns.</span></li>
</ul><div class="MsoNormal" style="mso-layout-grid-align: none; mso-list: l0 level1 lfo2; tab-stops: list .5in;"><span style="font-family: "Trebuchet MS", sans-serif;">The two sets of vectors exist in two “parallel universes” but in</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">matching positions.</span></div><span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">Semantic Vectors uses <b>Random Projection</b> (see other posts) to achieve this. The algorithm is one-pass, simple, and intuitive: each row vector "tugs" each column vector into a position that matches its relationship to all of the rows. The relative distances of the column vectors encode the average of their respective relationships with each row. (If the matrix has rank one, all of the projected vectors will be the same; all pairs will have a distance of zero. I am not sure about this.)</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif;">Definition</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The formal definition is: for all (Row, Column, Value) in R(0 -> 1), create a random number for each Row and each Column. These two sets of numbers are positions in two 1-dimensional parallel universes. Now, let the random position of each Row attract or repel the random position of each</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">Column, with the Value as the magnitude of attraction or repulsion.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<div style="text-align: center;"><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">C (in the parallel universe) = C + SUM for all r: (value * (r - C))/2</span></div><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;"><br />
</span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">In calculating the new C, the old C is ignored. If C = 0</span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">, then the function is:</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;"><br />
</span><br />
<div style="text-align: center;"><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">C = SUM for all r:R (value * r)/2</span></div><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;"><br />
</span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The informal explanation </span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">is that each row position tugs or repels the column position by one-half of its value for C: C moves half of the value-weighted distance between itself and each R. This guarantees that "lowest R" < C < "highest R". Each C is moved into a separate position in the C universe that corresponds to all of the R values for that C. The relative positions of all C express a (very) diluted representation of the relationships of all R vs. all C.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">User 1, +1.5, Item 1</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">User 2, +1.0, Item 1</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">User 1 at random position 0.1</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">User 2 at random position 0.6</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">Item position = </span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">(1.5 · 0.1)/2 + (1.0 · 0.6)/2 = 0.075 + 0.3 = <span style="font-size: large;">0.375</span></span><br />
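The worked example above can be sketched in a few lines of Python. This is a minimal sketch of the projection step; `project_column` is an illustrative name, not part of any library:

```python
def project_column(row_positions, values, c=0.0):
    """Project one column into the C universe: each row position tugs C
    by half of (value * distance to C); with c = 0 this reduces to
    SUM (value * r) / 2."""
    return c + sum(v * (r - c) / 2.0 for r, v in zip(row_positions, values))

# The example above: User 1 at 0.1 with value 1.5, User 2 at 0.6 with value 1.0
item_position = project_column([0.1, 0.6], [1.5, 1.0])  # 0.075 + 0.3 = 0.375
```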
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjv7nnthdao62PFJjwIiuhFp2sMn19q3uprQ9Gihne9lP-dO8TksXKnGDqreBaaD44sjcfdrfS97SfEVqExyUAKuskbfeOfFM99BmQhWq7ISx6L8pbmdH4NG72TC9Yg8pIUqoHej7yiuM/s1600/SemVecSketch.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="138" ox="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjv7nnthdao62PFJjwIiuhFp2sMn19q3uprQ9Gihne9lP-dO8TksXKnGDqreBaaD44sjcfdrfS97SfEVqExyUAKuskbfeOfFM99BmQhWq7ISx6L8pbmdH4NG72TC9Yg8pIUqoHej7yiuM/s320/SemVecSketch.jpg" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div><span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Resolution</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">Regard each R and C position as a 1-dimensional vector. In the above algorithm, the projected vectors are of length one, so the relationships are represented at poor resolution. The secret to making Semantic Vectors useful is to repeat the above operation many times, creating longer and longer vectors. Each index in the vector uses a different set of random positions for R, so the projected vectors contain uncorrelated placements at each index of the C vectors. The values at any single index N encode the distances at the same poor resolution, but combined they provide a quite serviceable representation of the original matrix.</span><br />
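The repetition can be sketched as follows. This is a hedged sketch, not Mahout's API: `semantic_vectors` and the sparse-triple input format are assumptions. Each index of the output vectors draws a fresh, independent set of random row positions:

```python
import random

def semantic_vectors(triples, n_rows, n_cols, dims, seed=0):
    """triples: (row, col, value) entries of a sparse matrix.
    Builds dims-dimensional column vectors; all columns share the same
    random row positions at index d, and each d draws fresh ones."""
    rng = random.Random(seed)
    col_vecs = [[0.0] * dims for _ in range(n_cols)]
    for d in range(dims):
        row_pos = [rng.random() for _ in range(n_rows)]
        for r, c, v in triples:
            col_vecs[c][d] += v * row_pos[r] / 2.0
    return col_vecs
```

Columns with identical row/value patterns land at identical positions at every index, so their distance is zero; unrelated columns drift apart as the vectors grow.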
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif; font-size: large;">Applications</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Recommendation systems</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">If the rows are users, the columns are items, and the matrix is populated with preferences, the projected vectors will encode item similarity in their relative distances. This is the </span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">basis of the SemanticVectorDataModel class: it stores all row vectors and column vectors, and the distance between two column vectors is their relative similarity. Overlaying the two universes, the item vector closest to a user's vector is the most interesting item for that user. Item-Item distances give item-to-item similarity, and the Item vectors can be clustered.</span><br />
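A minimal sketch of that lookup, assuming Euclidean distance (`nearest_item` is an illustrative name, not the SemanticVectorDataModel API):

```python
import math

def nearest_item(user_vec, item_vecs):
    """Overlay the two universes: return the index of the item vector
    closest to the user's vector."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(item_vecs)), key=lambda i: dist(user_vec, item_vecs[i]))
```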
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Word collocation</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">The rows are documents, the columns are words, and the matrix contents are the “collocation function”. A simple function is the number of times that word (C) appears in document (R). The resulting C Vectors encode the commonality of words in different documents. For example: if the rows represent titles of Disney</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">movies, and the columns represent the words in those titles, the C Vector for "Snow" will be nearest the vector for "White". (Assuming there are no other movies with 'snow' or 'white' in the title; at this point</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">there may be.) To find the similarity of documents, use words for the rows and documents for the columns.</span><br />
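Building the collocation matrix for that Disney-titles example might look like this. A sketch under the simple "count of word in document" function; the helper name is made up:

```python
from collections import Counter

def collocation_triples(documents):
    """Rows = documents, columns = words, value = count of the word in
    the document; returns sparse (row, col, value) triples plus the vocabulary."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    triples = []
    for row, doc in enumerate(documents):
        for word, count in Counter(doc.lower().split()).items():
            triples.append((row, index[word], float(count)))
    return triples, vocab
```

Feeding these triples through the projection above gives word vectors; "snow" and "white" share a row, so they receive the same tug at every index.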
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Number of Dimensions</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">To find the right number of dimensions for your application by hand, pick an</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">epsilon and drop dimensions while the vector distances still satisfy the epsilon.</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The smallest dimension count that still satisfies it is your lowest usable dimension.</span><br />
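One hand-rolled way to run that search (a sketch; it compares per-index-normalized Euclidean distances against the full-dimension distances, and this particular epsilon criterion is an assumption):

```python
import math

def lowest_usable_dimension(vectors, epsilon):
    """Drop trailing indices while every pairwise (normalized) distance
    stays within epsilon of its full-dimension value."""
    def dist(a, b, d):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:d], b[:d])) / d)

    full = len(vectors[0])
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    ref = {(i, j): dist(vectors[i], vectors[j], full) for i, j in pairs}
    d = full
    while d > 1 and all(abs(dist(vectors[i], vectors[j], d - 1) - ref[(i, j)]) <= epsilon
                        for i, j in pairs):
        d -= 1
    return d
```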
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif; font-size: large;">Random Projection</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">Random Projection projects a matrix onto another matrix (including vectors as one-row</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">matrices.) The algorithm populates a matrix with random numbers, and multiplies</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">the source matrix with it. The result is a projected matrix with (roughly) the</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">same rank as the source matrix. This allows the information encoded in the</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">source matrix to be projected with surprisingly good faithfulness into a</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">projected matrix of any shape. That is, if the source matrix is very low rank</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">(with an epsilon) the projected matrix will have a similar but degraded rank.</span><br />
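A bare-bones version of that multiplication (a pure-Python sketch; a Gaussian random matrix is one common choice for the multiplier):

```python
import random

def random_projection(matrix, out_cols, seed=0):
    """Project each row of `matrix` (a list of rows) into out_cols
    dimensions by multiplying with a seeded Gaussian random matrix."""
    rng = random.Random(seed)
    in_cols = len(matrix[0])
    proj = [[rng.gauss(0.0, 1.0) for _ in range(out_cols)] for _ in range(in_cols)]
    return [[sum(row[k] * proj[k][j] for k in range(in_cols)) for j in range(out_cols)]
            for row in matrix]
```

Because the projection is linear, exact structure such as "row 2 = row 0 + row 1" survives exactly; finer rank structure survives only approximately, which is the "similar but degraded rank" above.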
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Uses</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">If the projected matrix is smaller than the source matrix, this serves as a form</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">of dimensional reduction. If a random vector projects a source matrix onto a vector,</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">the projected vector will encode the information with very poor resolution. But</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">the resolution will be greater than zero, and this may be enough for some</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">algorithms.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif; font-size: large;">Implementation</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Order</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">Computation order is O(dimensions * rows * columns). Interior coefficients <em>(never forget the coefficients!)</em> are the time to iterate over the rows and columns, and the time to generate random numbers.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Random Distributions and Distance</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">Euclidean and Manhattan distance measurements are both acceptable. Sets of random numbers</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">added or multiplied quickly converge on a normal distribution. If you use a</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">linear distribution for the original random R positions, the distances between</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">vectors will be in a normal distribution anyway, because the distance measures</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">add and multiply hundreds of random numbers. It will be easier to look at your</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">numbers in small test cases if you start with a normal distribution for your R</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">positions.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif; font-size: large;">Mahout</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif;">RandomVector/RandomMatrix classes</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The RandomVector and RandomMatrix classes (in a Mahout patch) supply the random row positions above. Only the seed values need be stored to reproduce the random multipliers.</span><br />
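The seed trick can be illustrated in a couple of lines (an illustration of the idea, not the Mahout patch itself):

```python
import random

def random_row_positions(seed, n_rows):
    """Regenerate identical random positions from a stored seed, so the
    positions themselves never need to be persisted."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_rows)]
```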
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif;">SemanticVectors classes: DataModel/Recommenders/clusters</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">These create recommendations from a stored db of User and Item vector sets. The User vectors are created as needed from the seed configuration values.</span>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1