Saturday, October 27, 2018

Deep Meter - #0b (Post-mortem on first attempt)

Problems

The project has some serious flaws:

  1. The symbol distribution has a severe "Zipf problem". The majority of syllables occur only once in the corpus, while common ones appear tens of thousands of times. Tensorflow has a feature to counter this problem; I will try it out soon. (A frequency-weighting sketch appears after this list.)
  2. More important: the encoding itself. I wanted to keep the output to a fixed number of syllables to avoid using sequential neural networks. However, the syllable format causes the severe spread described above, and a huge dictionary size (15k syllables for a larger corpus, 6k for this one). The CMUdict is itself in ARPAbet phonemes, and there are only about 50 of those. A phoneme encoding will also have a severe Zipf distribution, but it should not have the ridiculously long tail. It will require a variable-size output, but it should take at most 10*4 phonemes to encode a ten-syllable sentence. That's the same information stored in 10*4*50 bits instead of 10*15k bits. (A sketch of the phoneme-level target appears after this list.)
    1. It is possible to keep the current encoding and hash syllables from the long tail. If they are only syllables that are part of longer words, the decoder can hunt for words of the form 'syllable-?' or 'syllable-?-syllable' when turning 10 one-hots into a sentence. This feels like hacking instead of solving the problem well.
  3. Not enough training data. There are many resources for well-formed sentences on the web. Just scanning any text for "10 syllables with the right meter" should flush out more sentences. Also, the "Paraphrase Database" project supplies millions of word & phrase pairs that mean the same thing. It should be possible to generate variations of a sentence by swapping in paraphrases that have a different meter.
    1. Another option is to split larger sentences into clauses. This is very slow, and can't really scale to the sizes we need.
  4. Storing training data. I tried pre-generating the USE vectors for the sentences and it ran into the gigabytes quickly. This is why the notebook re-generates the USE vectors each epoch, which I believe is the gating factor, since adding model layers did not slow training down appreciably. The Keras "read from directory" feature might be what I need, though I'm not sure it will run faster from disk; that feature is designed for image processing. (A sketch of streaming cached vectors from disk appears after this list.)
  5. Source data for short sentences is hard to find. The MSCOCO database is a good place to start: it has 5 captions apiece for 29k images.
  6. Evaluation functions: the loss function is wrong. There is no built-in loss/eval function pair for "all 1s must be 1, all 0s must be 0, with the two kinds of failure weighted equally". (A sketch of such a loss appears after this list.)
  7. CMUdict decoder: I need to write the "[[syllable, weight], ...] -> word set" decoder, which searches for possible sentences and scores them by the one-hot value of each candidate syllable in its syllable slot. (A beam-search sketch appears after this list.)
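
For the Zipf problem in item 1, one common counter (whether or not it is the Tensorflow feature I have in mind) is to weight the loss by inverse syllable frequency. A minimal sketch, assuming a syllable_counts dict of {syllable_id: count} built from the corpus; the smoothing exponent is a guess, not a tuned value:

    # Frequency-balanced class weights for the syllable one-hots.
    import numpy as np

    def make_class_weights(syllable_counts, smoothing=0.5):
        """Down-weight very common syllables, up-weight rare ones."""
        counts = np.array([syllable_counts[i] for i in sorted(syllable_counts)], dtype=np.float64)
        weights = (counts.sum() / counts) ** smoothing   # inverse frequency, softened
        weights /= weights.mean()                        # keep the average weight near 1.0
        return {i: w for i, w in enumerate(weights)}

    # class_weights = make_class_weights(syllable_counts)
    # model.fit(x_train, y_train, class_weight=class_weights, epochs=10)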
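
Item 2 proposes replacing the syllable one-hots with phoneme one-hots. A minimal sketch of what that target could look like, using a made-up mini-ARPAbet table (the real table has roughly 50 symbols, plus stress markers):

    # 10 syllables x 4 phoneme slots, each slot a one-hot over the ARPAbet plus padding.
    import numpy as np

    ARPABET = ['PAD', 'AH', 'ER', 'IH', 'L', 'N', 'T']   # tiny illustrative subset
    INDEX = {p: i for i, p in enumerate(ARPABET)}

    def encode_line(syllables, max_syllables=10, max_phonemes=4):
        """Turn [[phoneme, ...], ...] into a (10, 4, len(ARPABET)) one-hot target."""
        target = np.zeros((max_syllables, max_phonemes, len(ARPABET)), dtype=np.float32)
        target[:, :, INDEX['PAD']] = 1.0          # start with every slot set to padding
        for s, phones in enumerate(syllables[:max_syllables]):
            for p, phone in enumerate(phones[:max_phonemes]):
                target[s, p, INDEX['PAD']] = 0.0
                target[s, p, INDEX[phone]] = 1.0
        return target

    # 'eternal' as three syllables of phonemes, roughly IH . T-ER . N-AH-L:
    print(encode_line([['IH'], ['T', 'ER'], ['N', 'AH', 'L']]).shape)
    # (10, 4, 7) with this toy table; (10, 4, ~50) with the full ARPAbet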
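
For item 4, one way to stop re-generating the USE vectors every epoch is to cache them on disk in shards and stream them with a Keras Sequence. A minimal sketch with made-up shard file names and shapes (recent Keras accepts a Sequence directly in fit(); older versions use fit_generator()):

    import glob
    import numpy as np
    import tensorflow as tf

    class CachedUSESequence(tf.keras.utils.Sequence):
        """Streams pre-computed (USE vector, syllable target) shards from disk."""
        def __init__(self, vector_glob, target_glob):
            self.vector_files = sorted(glob.glob(vector_glob))
            self.target_files = sorted(glob.glob(target_glob))

        def __len__(self):
            return len(self.vector_files)

        def __getitem__(self, idx):
            x = np.load(self.vector_files[idx])   # shape (batch, 512)
            y = np.load(self.target_files[idx])   # shape (batch, num_syllable_classes)
            return x, y

    # seq = CachedUSESequence('use_vectors_*.npy', 'targets_*.npy')
    # model.fit(seq, epochs=10)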
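
For item 6, a sketch of the kind of loss I have in mind: average the penalty over the 1 positions and over the 0 positions separately, then average the two, so both kinds of failure carry equal total weight however lopsided the counts are. This is not the loss the current notebook uses:

    from tensorflow.keras import backend as K

    def balanced_binary_crossentropy(y_true, y_pred):
        """Equal total weight for '1s predicted low' and '0s predicted high'."""
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        pos_loss = -y_true * K.log(y_pred)                  # 1s predicted low
        neg_loss = -(1.0 - y_true) * K.log(1.0 - y_pred)    # 0s predicted high
        pos_mean = K.sum(pos_loss) / (K.sum(y_true) + K.epsilon())
        neg_mean = K.sum(neg_loss) / (K.sum(1.0 - y_true) + K.epsilon())
        return 0.5 * (pos_mean + neg_mean)

    # model.compile(optimizer='adam', loss=balanced_binary_crossentropy)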
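
For item 7, a sketch of the decoder as a beam search: stitch syllable slots into dictionary words and score each candidate sentence by the product of its per-slot probabilities. Here words_by_syllables (tuple of syllables -> list of words) is assumed to come from the syllabified CMUdict, and slot_probs is one {syllable: probability} dict per slot:

    import itertools

    def decode(slot_probs, words_by_syllables, max_word_len=3, top_k=5, beam_width=10):
        # keep only the top-k syllables per slot to bound the search
        top = [sorted(p.items(), key=lambda kv: kv[1], reverse=True)[:top_k] for p in slot_probs]
        beams = [([], 0, 1.0)]                  # (words so far, next slot index, score)
        while any(pos < len(slot_probs) for _, pos, _ in beams):
            new_beams = []
            for words, pos, score in beams:
                if pos >= len(slot_probs):
                    new_beams.append((words, pos, score))   # already a full line
                    continue
                for length in range(1, min(max_word_len, len(slot_probs) - pos) + 1):
                    # try every combination of the top-k syllables over this span
                    for combo in itertools.product(*top[pos:pos + length]):
                        matches = words_by_syllables.get(tuple(s for s, _ in combo), [])
                        if not matches:
                            continue
                        span_score = 1.0
                        for _, prob in combo:
                            span_score *= prob
                        for word in matches:
                            new_beams.append((words + [word], pos + length, score * span_score))
            beams = sorted(new_beams, key=lambda b: b[2], reverse=True)[:beam_width]
        return [(words, score) for words, pos, score in beams if pos == len(slot_probs)]

    # for words, score in decode(slot_probs, words_by_syllables):
    #     print(score, ' '.join(words))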

Larger vision

The concept is to generate various autoencoders for different meters. Since the decoder phase has 3 hidden layers, it might be possible to freeze the first two, and swap in a separate final hidden and decoder weight set for each different meter. This is on the supposition that the inner layers store higher abstractions and the outer layers deal with word generation. Dubious, but worth trying.
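
A minimal sketch of the freeze-and-swap idea, assuming the decoder is a stack of Keras Dense layers; the layer index, sizes and activations here are placeholders, not the real architecture:

    from tensorflow.keras import layers, models

    def meter_head(shared_base, num_outputs):
        """Attach a fresh final hidden layer + output layer for one meter."""
        for layer in shared_base.layers:
            layer.trainable = False          # freeze the shared "abstraction" layers
        x = layers.Dense(1024, activation='relu')(shared_base.output)
        out = layers.Dense(num_outputs, activation='sigmoid')(x)
        return models.Model(shared_base.input, out)

    # shared_base = models.Model(full_model.input, full_model.layers[2].output)
    # iambic_model = meter_head(shared_base, num_syllable_classes)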

And then find a host & make a website where you can interactively try it.

Deep Meter - #0a (Sneak Peek!)

It's time to unveil my little toy project in Natural Language Processing (NLP). "Deep Meter" is a deep learning project which rephrases arbitrary English text in various poetic meters. The raw materials for this fever dream are as follows:
1) The Universal Sentence Encoder. This is a set of deep models which transform a clause, a sentence, or a paragraph into a "thought vector". That is, it turns the sentence "I hate my new iphone" into a set of 512 numbers that (very opaquely) encode these concepts: "recent purchase, mobile phone, strong negative sentiment, present tense". The USE also turns "This new mobile is killing me." into a different set of 512 numbers, but the cosine distance between the two vectors is very small. Since it encodes a set of concepts and not just a sequence of words, the USE could be the basis of an English-German translator. The USE is hosted and updated at Google in their "Tensorflow-Hub" model library project. (A short usage sketch appears after this list of resources.)
https://arxiv.org/abs/1803.11175
2) The Gutenberg Poetry Corpus. This is a conveniently packaged archive of the body text of every book in Project Gutenberg that is a compilation of poetry.
https://github.com/aparrish/gutenberg-poetry-corpus
3) The CMU Pronunciation Dictionary (CMUdict) is a database of common and uncommon English words, proper names, loanwords etc. which gives common pronunciations for each word. The pronunciations are given in the ARPAbet phoneme system. The entries are in fact in a variant of the ARPAbet that includes word stresses.
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
4) A version of the CMUdict which has syllable markers added; used for early experiments. This is crucial for classifying the meter of lines from #2.
https://webdocs.cs.ualberta.ca/~kondrak/cmudict.html
5) Tensorflow, Keras and Google Colaboratory
Tensorflow (TF) is a library (really an operating system packaged as a library) for doing machine learning. Keras is an abstraction layer over TF and similar projects.
https://colab.research.google.com/notebooks/welcome.ipynb
6) An example notebook that wraps #1 in a convenient package for experimenting with it using the tools listed in #5.
https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/
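
Here is the short usage sketch promised in #1: embed two sentences with the USE from Tensorflow-Hub and compare them by cosine similarity. This uses the TF2-style hub.load() call and a public module handle; the exact module version and API style may differ from what the notebook in #6 uses:

    import numpy as np
    import tensorflow_hub as hub

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    vectors = embed(["I hate my new iphone",
                     "This new mobile is killing me."]).numpy()   # shape (2, 512)

    a, b = vectors
    print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # close to 1.0 for sentences that mean roughly the same thing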

Project:
The USE is distributed as only an encoder: it does not generate English sentences from its vectors.

This project creates a neural network that decodes vectors into sentences. The project plays a sneaky trick: it only trains the network on sentences which are in iambic pentameter. (What's i-p? Poetry in the stress format "du-DUH du-DUH du-DUH du-DUH du-DUH". Rhyme doesn't matter. Ten syllables with a hard rhythm is all that matters.) Since the network only knows how to output ten syllables in i-p form, and since the USE turns any sentence into an abstract thought vector, this network should be able to restate any short sentence (or sentence clause) in iambic pentameter.

Current status: I've written a few tools for data-wrangling (the bane of any machine learning project).
  • a library of utilities for parsing #4
  • code that reads lines of text and saves those which are in i-p, along with their syllable-ization according to CMUdict (a sketch of the meter check appears after this list).
  • a Jupyter notebook (based on #6) that reads the above data. 
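
The meter check mentioned in the second bullet amounts to something like the following. This sketch uses NLTK's copy of CMUdict for convenience and only accepts a strict 0101010101 stress pattern; a check this strict rejects many legitimate lines (monosyllables carry arbitrary stress marks), so it illustrates the idea rather than reproducing the project's classifier:

    import re
    from nltk.corpus import cmudict

    PRON = cmudict.dict()

    def stress_pattern(word):
        """Return e.g. '01' for 'delight', using the first CMUdict pronunciation."""
        phones = PRON.get(word.lower())
        if not phones:
            return None
        digits = [ph[-1] for ph in phones[0] if ph[-1].isdigit()]
        return ''.join('1' if d in '12' else '0' for d in digits)

    def is_iambic_pentameter(line):
        pattern = ''
        for word in re.findall(r"[a-zA-Z']+", line):
            s = stress_pattern(word)
            if s is None:
                return False
            pattern += s
        return pattern == '0101010101'

    # print(is_iambic_pentameter("A spear the hero bore of wondrous strength"))
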
The experiment is showing positive signs. The network does some interesting generation: it often finds the right word, or an associated word. Since it works by syllable, it has to generate syllables in sequence that together form words. In one case it found a three-syllable sequence ('eternal').

These samples are cherry-picked for quality and illumination. There are a lot of failures. There's also a lot of stuttering, both of source words and of synonyms, around a dominant theme of the line. (These themes are often common in the corpus: stories of ancient heroes, etc.) Here are the original sentences, and the hand-interpreted output of some interesting successes. Capitalized words are loose syllables that didn't match any surrounding words.

Stuttering.
  • A spear the hero bore of wondrous strength
  • a mighty sword the spear a sword of spear

A common occurrence is single syllables that are clearly part of a word or a synonym: 'annoy' is the only word that matches NOY. Synonyms are common, too.
  • And noise, and tumult rises from the crowd
  • and crowd a loud the NOY the in the air

It's like an infomercial.
  • Forgot, nutritious, grateful to the taste
  • and health the goodness sweet the LEE the taste
'Cheery' for 'happy'. Trying for 'country'.
  • A happy nation, and a happy king.
  • the cheery proud TREE and his happy state

'AR' - army. 'joust' could be a cool association with army.
  • Of Greeks a mighty army, all in vain
  • of all the AR the Greeks the joust of Greeks
'SPIH' is spirit. 'TER' is part of 'eternal', which it got correctly later in the sentence. This was the only 3-syllable success.
  • With this eternal silence; more a god
  • with TER IH SPIH IH god eternal god

Both end in 'DURE'; it wants 'endure' for 'anguish' and 'torment'. WIH as in 'with', DIH as in 'did'.
  • 'And suffer me in anguish to depart'
  • and leave for WIH and ang in EH DIH DUR

  • Cannot devise a torment, so it be
  • cannot a not by by it by DIH DURE
In short, many examples of finding a two-syllable word that is either in place or an associated word (synonym or more distant association). One example of a three-syllable word.

Google Colaboratory is a free online service that hosts Jupyter Notebooks and includes a free GPU (badly needed!). You have to sign up for Colab first, and then you can open any Jupyter notebook file from your Google Drive. The Deep Meter notebook here checks out the github project and uses the CMUdict code and cached Project Gutenberg poetry files that I classified by meter. If you sign up for Colab and upload this notebook from this branch, it might actually work. On a GPU runtime it takes maybe 1/2 hour to train. The VMs can be very slow, but GPU speed does not suffer. Don't try it on CPU or TPU, it will take forever!

If you have your own GPU farm, the only Colaboratory-specific code is the github check-out directory dance at the top. (A Colab notebook starts out in /content on a virtual machine). Everything else should be reproducible. The code uses a cached version of CMUdict.
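
For reference, the check-out dance amounts to something like this in the first notebook cell (branch name taken from the repository link below; the actual notebook may differ slightly):

    !git clone --branch First-blog-post https://github.com/LanceNorskog/deep_meter
    %cd deep_meter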

https://github.com/LanceNorskog/deep_meter/tree/First-blog-post