Saturday, October 27, 2018

Deep Meter - #0b (Post-mortem on first attempt)

Problems

The project has some serious flaws:

  1. The symbol distribution has a severe "Zipf problem": the majority of syllables occur only once in the corpus, while common ones appear tens of thousands of times. Tensorflow has a feature to counter this problem; I will try it out soon. (A sketch of one counter-measure follows this list.)
  2. More important: the encoding itself. I wanted to keep the output to a fixed number of syllables to avoid using sequential neural networks. However, the syllable format causes the severe spread described above, and a huge dictionary size (15k syllables for a larger corpus, 6k for this one). The CMUdict is itself in ARPAbet phonemes, and there are only 50 of those. A phoneme encoding will also have a severe Zipf distribution, but should not have the ridiculously long tail. It will require a variable-size output, but it should take at most 10*4 phonemes to encode a ten-syllable sentence. That's the same information stored in 10*4*50 bits, instead of 10*15k bits.
    1. It is possible to keep the current encoding and hash the syllables from the long tail. If those are only syllables that appear as parts of longer words, the decoder can hunt for words of the form 'syllable-?' or 'syllable-?-syllable' when turning the 10 one-hots into a sentence. This feels like hacking around the problem instead of solving it well.
  3. Not enough training data. There are many resources for well-formed sentences on the web, and just scanning any text for "10 syllables with the right meter" should flush out more sentences. Also, the "Paraphrase Database" project supplies millions of word and phrase pairs that mean the same thing. It should be possible to generate variations of a sentence by swapping in paraphrases that have a different meter.
    1. Another option is to split larger sentences into clauses. This is very slow; it can't really scale to the sizes we need.
  4. Storing training data. I tried pre-generating the USE vectors for my sentences and it ran into the gigabytes quickly. This is why the notebook re-generates the USE vectors each epoch. I believe this is the gating factor, since adding model layers did not slow training down appreciably. The Keras "read from directory" feature might be what I need, though I'm not sure it will run faster from disk, and that feature is designed for image processing.
  5. Source data for short sentences is hard to find. The MSCOCO database is a good place to start; it has 5 captions apiece for 29k images.
  6. Evaluation functions: the loss function is wrong. There is no stock loss/eval function pair for "all the 1s must be 1, all the 0s must be 0, with the two kinds of failure weighted equally". (A sketch of such a loss also follows this list.)
  7. Decoding CMUdict: I still need to write the "[[syllable, weight], ...] -> word set" decoder, which searches for possible sentences and scores them based on the one-hot value of each candidate syllable inside each syllable slot.
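For item 1, here is a minimal sketch of one family of counter-measures: weighting each syllable class by inverse frequency so that rare syllables are not drowned out. This is my own illustration (it assumes a single softmax over syllable IDs; the real model emits ten syllable slots, so the weights would have to be applied per slot or as sample weights), not necessarily the Tensorflow feature mentioned above.

def inverse_frequency_weights(syllable_counts):
    """syllable_counts: dict of syllable id -> occurrences in the corpus.
    Returns Keras-style class weights whose average per training example is 1.0."""
    total = sum(syllable_counts.values())
    n_classes = len(syllable_counts)
    return {syl: total / (n_classes * count)
            for syl, count in syllable_counts.items()}

# hypothetical usage with a Keras model:
#   model.fit(x, y, class_weight=inverse_frequency_weights(counts), ...)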
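For item 6, here is a sketch of a loss that treats "a 1 predicted low" and "a 0 predicted high" with equal total weight, assuming the target is a multi-hot vector over syllable slots. This is my own guess at such a loss, not what the notebook currently uses.

from keras import backend as K

def balanced_binary_crossentropy(y_true, y_pred):
    eps = K.epsilon()
    y_pred = K.clip(y_pred, eps, 1.0 - eps)
    pos = -y_true * K.log(y_pred)                 # penalty for 1s predicted low
    neg = -(1.0 - y_true) * K.log(1.0 - y_pred)   # penalty for 0s predicted high
    # average each side over its own count, then weight the two sides equally
    pos_loss = K.sum(pos) / (K.sum(y_true) + eps)
    neg_loss = K.sum(neg) / (K.sum(1.0 - y_true) + eps)
    return 0.5 * pos_loss + 0.5 * neg_loss

# hypothetical usage:
#   model.compile(optimizer='adam', loss=balanced_binary_crossentropy)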

Larger vision

The concept is to generate various autoencoders for different meters. Since the decoder phase has 3 hidden layers, it might be possible to freeze the first two, and swap in a separate final hidden layer and decoder weight set for each meter. This is on the supposition that the inner layers store higher abstractions and the outer layers deal with word generation. Dubious, but worth trying.
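A rough Keras sketch of that freeze-and-swap idea. The layer names and sizes here are hypothetical; only the 512-wide USE input, the three hidden layers, and the roughly 6k-syllable output come from the current design.

from keras.models import Sequential
from keras.layers import Dense

def make_decoder(num_syllables=6000):
    return Sequential([
        Dense(1024, activation='relu', input_shape=(512,), name='hidden1'),
        Dense(1024, activation='relu', name='hidden2'),
        Dense(1024, activation='relu', name='hidden3'),
        Dense(num_syllables, activation='sigmoid', name='syllables'),
    ])

iambic = make_decoder()
# ... train `iambic` on iambic-pentameter data ...

# new meter: copy and freeze the two inner layers, retrain the outer two
trochaic = make_decoder()
for name in ('hidden1', 'hidden2'):
    trochaic.get_layer(name).set_weights(iambic.get_layer(name).get_weights())
    trochaic.get_layer(name).trainable = False
trochaic.compile(optimizer='adam', loss='binary_crossentropy')
# ... train `trochaic` on trochaic data; only hidden3 and syllables update ...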

And then find a host & make a website where you can interactively try it.

Deep Meter - #0a (Sneak Peek!)

It's time to unveil my little toy project in Natural Language Processing (NLP). "Deep Meter" is a deep learning project which rephrases arbitrary English text in various poetic meters. The raw materials for this fever dream are as follows:
1) The Universal Sentence Encoder. This is a set of deep models which transform a clause, a sentence, or a paragraph into a "thought vector". That is, it turns the sentence "I hate my new iphone" into a set of 512 numbers that (very opaquely) encode these concepts: "recent purchase, mobile phone, strong negative sentiment, present tense". The USE also turns "This new mobile is killing me." into a different set of 512 numbers, but the cosine distance between the two vectors is very small. Since it encodes a set of concepts and not just a sequence of words, the USE could be the basis of an English-German translator. The USE is hosted and updated at Google in their "Tensorflow-Hub" model library project. (A short usage sketch follows this list of resources.)
https://arxiv.org/abs/1803.11175
2) The Gutenberg Poetry Corpus. This is a conveniently packaged archive of most of the body text of every book in Project Gutenberg that is a compilation of poetry.
https://github.com/aparrish/gutenberg-poetry-corpus
3) The CMU Pronunciation Dictionary (CMUdict) is a database of common and uncommon English words, proper names, loanwords etc. which gives common pronunciations for each word. The pronunciations are given in the ARPAbet phoneme system. The entries are in fact in a variant of the ARPAbet that includes word stresses.
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
4) A version of the CMUdict which has syllable markers added, used for early experiments. This is crucial for classifying the meter of lines from #2.
https://webdocs.cs.ualberta.ca/~kondrak/cmudict.html
5) Tensorflow, Keras and Google Colaboratory
Tensorflow (TF) is a library (really an operating system packaged as a library) for doing machine learning. Keras is an abstraction layer over TF and similar projects. Google Colaboratory is a free service that hosts Jupyter notebooks with a GPU attached (more on this below).
https://colab.research.google.com/notebooks/welcome.ipynb
6) An example notebook that wraps #1 in a convenient package for experimenting with it in Keras (#5).
https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/
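As a quick illustration of #1, here is a sketch of loading the USE from Tensorflow-Hub and comparing the two example sentences above. The sketch is my own (based on the style of #6); the module version and the exact score will vary.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# load the Universal Sentence Encoder from Tensorflow-Hub (big download on first run)
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

sentences = ["I hate my new iphone",
             "This new mobile is killing me."]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(sentences))    # shape (2, 512): two "thought vectors"

# cosine similarity between the two vectors should be relatively high
cos = np.dot(vectors[0], vectors[1]) / (
        np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(cos)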

Project:
The USE is distributed as only an encoder: it does not generate English sentences from its vectors.

This project creates a neural network that decodes vectors into sentences. The project plays a sneaky trick: it only trains the network on sentences which are in iambic pentameter. (What's i-p? Poetry in the stress format "du-DUH du-DUH du-DUH du-DUH du-DUH". Rhyme doesn't matter. Ten syllables with a hard rhythm is all that matters.) Since the network only knows how to output ten syllables in i-p form, and since the USE turns any sentence into an abstract thought vector, this network should be able to restate any short sentence (or sentence clause) in iambic pentameter.

Current status: I've written a few tools for data-wrangling (the bane of any machine learning project).
  • a library of utilities for parsing #4
  • code that reads lines of text and saves the lines that are in i-p, together with their syllabification according to CMUdict (a rough sketch of the meter check follows this list)
  • a Jupyter notebook (based on #6) that reads the above data. 
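Here is a rough sketch of the meter check, using NLTK's copy of CMUdict instead of the syllabified dictionary the real pipeline uses (run nltk.download('cmudict') first; the real filter is fuzzier than this):

import re
from nltk.corpus import cmudict

PRON = cmudict.dict()     # word -> list of ARPAbet pronunciations with stress digits

def stress_patterns(word):
    """All stress patterns for a word, e.g. 'hero' -> {'10'}."""
    out = set()
    for phones in PRON.get(word.lower(), []):
        out.add(''.join(ch for ph in phones for ch in ph if ch.isdigit()))
    return out

def is_iambic_pentameter(line):
    words = re.findall(r"[a-zA-Z']+", line)
    patterns = {''}
    for w in words:
        stresses = stress_patterns(w)
        if not stresses:
            return False                 # unknown word: skip the line
        # one-syllable words are allowed to land on either beat
        stresses = {'x' if len(s) == 1 else s for s in stresses}
        patterns = {p + s for p in patterns for s in stresses}
    # du-DUH x5: secondary stress (2) counts as unstressed, 'x' matches anything
    target = '0101010101'
    return any(len(p) == 10 and
               all(c in (t, 'x') for c, t in zip(p.replace('2', '0'), target))
               for p in patterns)

print(is_iambic_pentameter("A spear the hero bore of wondrous strength"))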
The experiment is showing positive signs. The network does some interesting generation: it often finds the right word, or an associated word. Since it works by syllable, it has to generate syllables in sequence that together form words. In one case it found a three-syllable sequence ('eternal').

These samples are cherry-picked for quality and illumination. There are a lot of failures. There's also a lot of stuttering, both of source words and of synonyms of the line's dominant theme. (These themes are often common in the corpus: stories of ancient heroes etc.) Here are the original sentences, and the hand-interpreted output of some interesting successes. Capitalized words are loose syllables that didn't match any surrounding words.

Stuttering.
  • A spear the hero bore of wondrous strength
  • a mighty sword the spear a sword of spear

A common occurrence is a single syllable that is clearly part of a word or of a synonym: 'annoy' is the only word that matches NOY. Synonyms are also common.
  • And noise, and tumult rises from the crowd
  • and crowd a loud the NOY the in the air

It's like an infomercial.
  • Forgot, nutritious, grateful to the taste
  • and health the goodness sweet the LEE the taste
'Cheery' for 'happy'. Trying for 'country'.
  • A happy nation, and a happy king.
  • the cheery proud TREE and his happy state

'AR' - army. 'joust' could be a cool association with army.
  • Of Greeks a mighty army, all in vain
  • of all the AR the Greeks the joust of Greeks
'SPIH' is 'spirit'. 'TER' is part of 'eternal', which it got correctly later in the sentence. This was the only 3-syllable success.
  • With this eternal silence; more a god
  • with TER IH SPIH IH god eternal god

Both lines end in 'DUR(E)'; it wants 'endure' for 'anguish' and 'torment'. WIH as in 'with', DIH as in 'did'.
  • 'And suffer me in anguish to depart'
  • and leave for WIH and ang in EH DIH DUR

  • Cannot devise a torment, so it be
  • cannot a not by by it by DIH DURE
In short, there are many examples of finding a two-syllable word that is either in place or an associated word (a synonym or a more distant association), and one example of a three-syllable word.

Google Colaboratory is a free online service that hosts Jupyter Notebooks and includes a free GPU (badly needed!). You have to sign up for Colab first, and then you can open any Jupyter notebook file from your Google Drive. The Deep Meter notebook here checks out the github project and uses the CMUdict code and cached Project Gutenberg poetry files that I classified by meter. If you sign up for Colab and upload this notebook from this branch, it might actually work. On a GPU runtime it takes maybe 1/2 hour to train. The VMs can be very slow, but GPU speed does not suffer. Don't try it on CPU or TPU, it will take forever!

If you have your own GPU farm, the only Colaboratory-specific code is the github check-out directory dance at the top. (A Colab notebook starts out in /content on a virtual machine). Everything else should be reproducible. The code uses a cached version of CMUdict.

https://github.com/LanceNorskog/deep_meter/tree/First-blog-post

Sunday, September 23, 2018

Backpropagation


UAT

The Universal Approximation Theorem (UAT) is about the simplest version of the structure of weights & non-linear functions used in deep learning. It states, roughly, that this equation can be solved to an approximation:
logit(numbers[] * input_weights[][]) * hidden_weights[][] ~= other_numbers[]

Backpropagation

The UAT does not tell how to do this, only that it is possible. Later, the backpropagation (BP) algorithm was created to realize the possibility of the UAT. BP, as created, is only intended to solve one equation of N inputs and M unknowns. Backpropagation uses "hill-climbing", which assumes that the solution of an equation is at the top of a hill: wherever it starts, if it keeps climbing uphill, it will eventually hit the top.

In a typical Deep Learning application, we take thousands of input samples which are similar, and turn them into equations of the above form. For a simple image classifier, this would be typical; the right-hand side corresponds to whether the image is a picture of a cat, a building or a car:
logit(image_pixels[] * input_weight_matrix[][]) * hidden_weight_matrix[][] ~= [1,0,0]
Wait a minute! In the math of equation-solving, each of these equations defines a separate subspace, to be solved independently.

How did we get from approximating one equation (UAT) to approximating thousands of equations? The key idea of Deep Learning is to pretend that all of these equations live in one subspace, and that there is a mythical equation that is the centroid of all of them.
Deep Learning takes many (thousands of) sample equations which are, mathematically, different subspaces, and essentially averages the result of applying BP to each equation. To put it another way, each equation tugs the BP algorithm toward itself.
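Here is a toy numpy sketch of that averaging, for a one-hidden-layer network of the form above (made-up data, squared-error loss, plain gradient descent):

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 1000 fake "images" of 64 pixels each, labeled with 3 one-hot classes
X = rng.normal(size=(1000, 64))
Y = np.eye(3)[np.argmax(X @ rng.normal(size=(64, 3)), axis=1)]

W1 = rng.normal(scale=0.1, size=(64, 32))   # input weights
W2 = rng.normal(scale=0.1, size=(32, 3))    # hidden weights
lr = 0.5

for step in range(200):
    H = sigmoid(X @ W1)          # forward pass, all samples at once
    P = sigmoid(H @ W2)
    err = P - Y                  # one "tug" per sample equation

    # backward pass: the matrix products sum each sample's gradient,
    # and dividing by len(X) averages the tugs into a single update
    dP = err * P * (1 - P)
    dW2 = H.T @ dP / len(X)
    dH = (dP @ W2.T) * H * (1 - H)
    dW1 = X.T @ dH / len(X)

    W2 -= lr * dW2
    W1 -= lr * dW1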

This seems like a misuse of BP, but a misuse that has proven very, very fruitful.

References

See the wikipedia pages for more info on the UAT, Backpropagation, and Gradient Descent:

https://en.wikipedia.org/wiki/Universal_approximation_theorem
https://en.wikipedia.org/wiki/Backpropagation
https://en.wikipedia.org/wiki/Gradient_descent

Neural Networks Series #1: What's an SVD?

Singular Value Decomposition (SVD) is a linear algebra algorithm that decomposes any matrix of numbers into 3 matrices that, when multiplied together, recreate the original matrix. There are several of these matrix decomposition techniques, useful for different purposes.
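A quick numpy check of that reconstruction claim:

import numpy as np

A = np.array([[3., 1., 2.],
              [2., 5., 4.]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(U @ np.diag(S) @ Vt, A))   # True: the three factors rebuild A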
Before we delve into the uses of SVD, we will take a short detour. The rank of a matrix is the number of linearly independent rows (or columns) in the matrix, which means that:
        if row[i] * y = row[j] for some number y
            then rows i and j are not linearly independent
Computing the rank is done by mutating the matrix into row echelon form, which substitutes a row of zeros for every row that is a "duplicate" in this sense. Here is a tutorial on row echelon form:
https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA
This matrix has a rank of 1 because any of its rows can be generated from any other row with the above formula (and the same holds for its columns):
[1,1,1,1]
[2,2,2,2]
[3,3,3,3]
[4,4,4,4]
This is called linear dependence. Rank gives a primitive measurement of the amount of correlation between rows or columns. SVD is more nuanced: it implements this idea as a continuous measurement, whereas rank is only a discrete one. Closeness is scored by Euclidean distance, and SVD creates a relative ranking of how close each row is to all other rows, and how close each column is to every other column. This ranking gives a couple of useful measurements about the matrix data:

  1. It gives a more nuanced measurement of dependence than just the rank: a vector of numbers which gives the relative amount of independent signal across the columns.
  2. By ranking the distances among rows & columns, it gives a scoring of centrality, or "normal vs outlier".
This matrix has a rank of 2: no row is a simple multiple of another, but every row can be generated from two others (for example, row 3 = 2*row 2 - row 1):
[1,2,3,4]
[2,3,4,5]
[3,4,5,6]
[4,5,6,7]
SVD generates a list of numbers called the "singular values" of a matrix. Here are the singular values for the above 2 matrices:
[1.00000000e+00 8.59975057e-17 2.52825014e-47 3.94185361e-79]
[9.98565020e-01 5.35527830e-02 1.96745978e-17 9.98144122e-18]
Let's chart these singular value vectors (linear scale):
These charts tell us that the first matrix has a rank of 1. The second matrix has a rank above one, but does not have two full rows of uniqueness. This chart is a measurement of the correlation between rows and correlation between columns. The singular values of a matrix are a good first look at the amount of "information" or "signal" or "entropy". "Quantity of signal" is a fairly slippery concept with mathematical definitions that border on the metaphysical, but the singular values are a useful (though simplistic) measurement of the amount of information contained in a matrix.
Based on the term "entropy" in the previous paragraph, let's try filling a larger matrix with random numbers. These are the measurements of a 20x50 random matrix, with a chart log-scaled on the Y axis:
Rank = 20
Singular values: 
[0.35032972 0.322733   0.30419963 0.29269556 0.28137789 0.26201005
 0.25742231 0.23075557 0.21945011 0.21763414 0.20245292 0.1976021
 0.175207   0.16692658 0.14884194 0.14208821 0.13474842 0.12291441
 0.10004981 0.08849001]
The graph should be a straight line but is not, due mainly to the fact that "random numbers" are very hard to come by. This matrix has a rank of 20: you cannot multiply and add rows to produce another row. In fact, all rows are equidistant under Euclidean distance. Let's do something sneaky: we're going to multiply the transpose of the matrix by the matrix itself to get a 50x50 matrix of random numbers. Here are the measurements (again, a log-Y chart):
Rank = 20
[1.14271040e+02 9.69770480e+01 8.61587896e+01 7.97653909e+01
 7.37160653e+01 6.39172571e+01 6.16984957e+01 4.95777248e+01
 4.48387835e+01 4.40997633e+01 3.81619300e+01 3.63550994e+01
 2.85815085e+01 2.59437792e+01 2.06268491e+01 1.87974231e+01
 1.69055603e+01 1.40665565e+01 9.31997487e+00 7.29072452e+00
 1.50274491e-14 1.29473245e-14 1.24897761e-14 1.11334989e-14
 1.03937520e-14 9.90176181e-15 9.17746567e-15 8.94074462e-15
 8.58805508e-15 7.94287894e-15 6.90962668e-15 6.72135898e-15
 6.02639348e-15 5.66356533e-15 5.16429606e-15 4.84762468e-15
 4.47711092e-15 4.37266408e-15 4.27780206e-15 3.43884568e-15
 3.21440817e-15 2.83815151e-15 2.61710432e-15 2.34685343e-15
 2.24348424e-15 1.32915906e-15 9.12392110e-16 5.65320744e-16
 2.64346262e-16 1.11941235e-16]
Huh? Why is the rank still 20? And what's going on with that chart? When we multiplied the matrix by its transpose, we did not add any information/signal/entropy to the matrix! Therefore it still has the same amount of independent signal. It's as if we weighed a balloon, filled it with weightless air, then weighed it again.
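For reference, here is a minimal numpy sketch of the measurements above. It is a reconstruction, not the original script: the singular values printed here are unscaled and use a different random seed, so the exact numbers will differ from the listings in this post.

import numpy as np

rank1 = np.array([[1, 1, 1, 1],
                  [2, 2, 2, 2],
                  [3, 3, 3, 3],
                  [4, 4, 4, 4]], dtype=float)
rank2 = np.array([[1, 2, 3, 4],
                  [2, 3, 4, 5],
                  [3, 4, 5, 6],
                  [4, 5, 6, 7]], dtype=float)
for m in (rank1, rank2):
    print(np.linalg.matrix_rank(m), np.linalg.svd(m, compute_uv=False))

a = np.random.RandomState(0).rand(20, 50)     # the 20x50 random matrix
b = a.T @ a                                   # "blown up" to 50x50, no new signal
for m in (a, b):
    print(np.linalg.matrix_rank(m), np.linalg.svd(m, compute_uv=False))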
In this post, we discussed SVD and the measurement of information in a matrix; this trick of blowing up a random matrix by multiplying it with its own transpose was what helped me understand SVD ten years ago. (Thanks to Ted Dunning on the Mahout mailing list.)
I've fast-forwarded like mad through days of linear algebra lectures, in order to give an intuitive understanding of matrix analysis. We'll be using SVD in a few ways in this series on my explorations in Neural Networks. See this page for a good explanation of SVD and a cool example of how it can be used to clean up photographs.
Source code for the above charts & values available at:

Sunday, April 15, 2018

Fun with SVD: MNIST Digits

Singular Value Decomposition is a very powerful linear algebra operation. It factorizes a matrix into three matrices which have some interesting properties. Among other things, SVD gives you a ranking of how 'typical' or 'normal' a row or column is in comparison to the other rows or columns. I created a 100x784 matrix of gray-scale values, where each row is a sample image and each column is a position in the 28x28 gray-scale raster. The following chart gives the 100 digits ranked from "outlier" to "central", with the normalized rank above each digit.


By this ranking, fuzzy/smudgy images are outliers and cleaner lines are central. Or, 4's are central. Here's the code:

# variation of code from Tariq's book: python notebook for Make Your Own Neural Network
# working with the MNIST data set
#
# (c) Tariq Rashid, 2016
# license is GPLv2

import numpy as np
import matplotlib.pyplot as py
%matplotlib inline

# scale a vector into the range [0, 1] (strangely, not built into numpy)
def normalize(v):
    return (v - v.min()) / (v.max() - v.min())

# open the CSV file and read its contents into a list
data_file = open("mnist_dataset/mnist_train_100.csv", 'r')
data_list = data_file.readlines()
data_file.close()
rows = len(data_list)
image_mat = np.zeros((rows, 28 * 28))
for row in range(rows):
    dig = data_list[row][0]                    # digit label (not used below)
    all_values = data_list[row].split(',')
    image_vector = np.asfarray(all_values[1:])
    # scale pixel values from 0..255 into 0.01..1.0
    image_mat[row] = (image_vector / 255.0 * 0.99) + 0.01
# u holds one feature vector per image; weighting the features by the
# singular values s gives a single "centrality" score per image
(u, s, v) = np.linalg.svd(image_mat)
row_features = normalize(u.dot(s))
# py.plot(np.sort(row_features))
keys = np.argsort(row_features)                # order images from outlier to central

grid=10
fig,axes = py.subplots(nrows=rows//grid, ncols=grid)
fig.set_figheight(15)
fig.set_figwidth(15)
py.subplots_adjust(top=1.1)
for row in range(rows):
    ax = axes[row//grid][row%grid]
    ax.set_title("{0:.2f}".format(row_features[keys[row]]), fontsize=10)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(image_mat[keys[row]].reshape(28,28), cmap='Greys', interpolation='None')
fig.savefig('foo.png', bbox_inches='tight')

Sunday, April 8, 2018

Document Summarization with LSA Appendix B: Software

The Test Harness

The scripts for running these tests and the data are in my github repo: https://github.com/LanceNorskog/lsa.

LSA toolkit

The LSA toolkit is available under https://github.com/LanceNorskog/lsa/tree/master/research. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as LUCENE-2899.

Reuters Corpus

The Reuters data and scripts for this analysis project are under https://github.com/LanceNorskog/lsa/tree/master/reuters. ...../data/raw is the preprocessed Reuters article corpus: the articles are reformatted to one sentence per line and limited to articles with 10 or more sentences. The toolkit includes a script that runs each article against the Solr Document Summarizer and saves the XML output, and a script that applies XPath expressions to turn each article into a CSV line, producing one CSV file per algorithm. The per-algorithm keys include both the regularization algorithm and whether parts-of-speech filtering was applied.

Analysis

The analysis phase used KNime to preprocess the CSV data. KNime rules created more columns calculated from the generated columns, and then created a pivot table which summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used Excel to generate the charts. Excel 2010 failed on my Mac, so I made the charts in LibreOffice instead, then copied them into a DOC file in MS Word (and not LibreOffice!) to get plain jpegs of the charts.

Further Reading

The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.

The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.

Document Summarization with LSA Part 0: Basics

This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper: 
The application is to add summaries to the LucidFind application, a search engine devoted to open-source text processing projects.

Core concepts

Document Corpus

A corpus is just a collection of documents. There are many standardized text document corpuses out there, used as data in research the way standardized strains of white mice are used in laboratory research. In this case, we use the sentences in one document as the corpus for that document. The matrix has sentences for rows and terms for columns.

Term Vector

A term vector is a "bag of words" with numbers attached. We are going to take all of the words in the document corpus and create a matrix where documents are on one side and words are on the other, and add a value for each word where it occurs in a document. The standard ways are to add a count, or to just set it to one. This is a matrix where the rows are term vectors for the documents. It gives a numerical representation of the document corpus with all grammatical information ripped out. This bag-of-words representation is surprisingly effective, as we will see.

Singular Value Decomposition

SVD is a linear algebra operation. Given a matrix, it creates three matrices based on the original matrix. The one we care about is the "left feature matrix", commonly called U in the literature. Its rows are "document feature vectors", which represent a kind of importance for each document. A feature, in machine learning, refers to signal in noisy data. Applied to a matrix of sentences vs. terms, the document with the largest value in the left-hand column of U is the most "important" in the corpus. The remaining values in the row also contribute to importance.

Tying it together

LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.
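As a concrete illustration, here is a tiny sketch of that pipeline in Python, using scikit-learn's CountVectorizer for the bag-of-words step and numpy for the SVD. This is a toy reconstruction of the idea; the real project does this inside Solr with OpenNLP part-of-speech filtering.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# the "corpus" is the sentences of one document
sentences = [
    "The cat sat on the mat.",
    "A cat and a dog played on the mat.",
    "Stock prices fell sharply on Tuesday.",
    "The dog slept near the cat.",
]

# bag-of-words matrix: rows are sentences, columns are terms
X = CountVectorizer(binary=True).fit_transform(sentences).toarray()

# left feature matrix U: one feature vector per sentence
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# score each sentence by its weight in the first (largest) feature
scores = np.abs(U[:, 0]) * S[0]
for idx in np.argsort(-scores):
    print("%.3f  %s" % (scores[idx], sentences[idx]))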