Saturday, October 27, 2018

Deep Meter - #0a (Sneak Peek!)

It's time to unveil my little toy project in Natural Language Processing (NLP). "Deep Meter" is a deep learning project which rephrases arbitrary English text in various poetic meters. The raw materials for this fever dream are as follows:
1) The Universal Sentence Encoder (USE). This is a set of deep models which transform a clause, a sentence, or a paragraph into a "thought vector". That is, it turns the sentence "I hate my new iPhone" into a set of 512 numbers that (very opaquely) encode these concepts: "recent purchase, mobile phone, strong negative sentiment, present tense". The USE also turns "This new mobile is killing me." into a different set of 512 numbers, but the cosine distance between the two vectors is very small (see the sketch after this list). Since it encodes a set of concepts and not just a sequence of words, the USE could be the basis of an English-German translator. The USE is hosted and updated by Google in their "TensorFlow Hub" model library project.
https://arxiv.org/abs/1803.11175
2) The Gutenberg Poetry Corpus. This is a conveniently packaged archive of most of the body text of every book in Project Gutenberg that is a compilation of poetry.
https://github.com/aparrish/gutenberg-poetry-corpus
3) The CMU Pronouncing Dictionary (CMUdict) is a database of common and uncommon English words, proper names, loanwords, etc., which gives common pronunciations for each word. The pronunciations are given in the ARPAbet phoneme system; in fact, the entries use a variant of ARPAbet that includes stress markers.
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
4) A version of CMUdict with syllable markers added, used in early experiments. This syllabification is crucial to classifying the meter of the lines in #2.
https://webdocs.cs.ualberta.ca/~kondrak/cmudict.html
5) TensorFlow, Keras, and Google Colaboratory
TensorFlow (TF) is a library (really an operating system packaged as a library) for doing machine learning. Keras is an abstraction layer over TF and similar libraries.
https://colab.research.google.com/notebooks/welcome.ipynb
6) An example notebook that wraps #1 in a convenient Keras package for experimenting with the features of all the items listed in #5.
https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/
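To make #1 concrete, here is a minimal sketch of encoding two sentences and comparing their thought vectors. It assumes the TF1-era Session API that Colab ran at the time, and the version suffix in the TF-Hub handle is a guess:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Public TF-Hub handle for the USE; the "/2" version suffix is an assumption.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

sentences = ["I hate my new iPhone",
             "This new mobile is killing me."]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(sentences))  # shape (2, 512)

a, b = vectors
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)  # close to 1.0, i.e. the cosine distance is small
```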

Project:
The USE is distributed as only an encoder: it does not generate English sentences from its vectors.

This project creates a neural network that decodes vectors into sentences. The project plays a sneaky trick: it trains the network only on sentences that are in iambic pentameter. (What's i-p? Poetry in the stress format "du-DUH du-DUH du-DUH du-DUH du-DUH". Rhyme doesn't matter; ten syllables with a hard alternating rhythm is all that counts.) Since the network only knows how to output ten syllables in i-p form, and since the USE turns any sentence into an abstract thought vector, this network should be able to restate any short sentence (or sentence clause) in iambic pentameter.
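For the curious, a decoder along these lines might look something like the Keras sketch below. The layer sizes, the syllable vocabulary size, and the architecture itself are illustrative assumptions, not the actual notebook's network:

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # size of the syllable vocabulary (assumed)
N_SYLLABLES = 10     # iambic pentameter: exactly ten output slots

# In: a 512-number USE thought vector. Out: a syllable distribution per slot.
inputs = keras.Input(shape=(512,), name="use_vector")
x = layers.Dense(1024, activation="relu")(inputs)
x = layers.RepeatVector(N_SYLLABLES)(x)           # one copy per syllable slot
x = layers.LSTM(512, return_sequences=True)(x)    # let slots condition on each other
outputs = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation="softmax"))(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```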

Current status: I've written a few tools for data-wrangling (the bane of any machine learning project).
  • a library of utilities for parsing #4
  • code that reads lines of text and saves those that are in i-p, along with their syllabification according to CMUdict (a toy version is sketched after this list)
  • a Jupyter notebook (based on #6) that reads the above data. 
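As promised above, here's a toy version of the i-p filter. This is not the project's actual classifier (which relies on the syllabified CMUdict in #4); it uses NLTK's copy of CMUdict and treats one-syllable words as fitting either a stressed or an unstressed slot:

```python
import re
from nltk.corpus import cmudict  # requires nltk.download('cmudict') once

PRON = cmudict.dict()

def stress_patterns(word):
    """Candidate stress strings for a word, e.g. 'hero' -> {'10'}."""
    patterns = set()
    for pron in PRON.get(word.lower(), []):
        digits = [ph[-1] for ph in pron if ph[-1].isdigit()]
        if len(digits) == 1:
            patterns.update({"0", "1"})  # monosyllables fit either slot
        else:
            # Count secondary stress (2) as stressed; real tools may differ.
            patterns.add("".join("1" if d in "12" else "0" for d in digits))
    return patterns

def is_iambic_pentameter(line, target="0101010101"):
    words = re.findall(r"[a-zA-Z']+", line)
    prefixes = {""}  # all stress strings buildable from the words so far
    for w in words:
        pats = stress_patterns(w)
        if not pats:
            return False  # out-of-vocabulary word: reject the line
        prefixes = {p + s for p in prefixes for s in pats
                    if target.startswith(p + s)}
    return target in prefixes

print(is_iambic_pentameter("A spear the hero bore of wondrous strength"))  # True
```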
The experiment is showing positive signs. The network does some interesting generation: it often finds the right word, or an associated word. Since it works by syllable, it has to generate syllables in sequence that together form words. In one case it found a three-syllable sequence ('eternal').

These samples are cherry-picked for quality and illumination. There are a lot of failures. There's also a lot of stuttering: repetition of source words, and of synonyms of a dominant theme of the line. (These themes are often common in the corpus: stories of ancient heroes, etc.) In each pair below, the first line is the original sentence and the second is the hand-interpreted output of an interesting success. Capitalized tokens are loose syllables that didn't combine with any surrounding syllables into a word.

Stuttering.
  • A spear the hero bore of wondrous strength
  • a mighty sword the spear a sword of spear

A common occurrence is a single syllable that is clearly part of a word or of a synonym: 'annoy' is the only word that matches NOY. Synonyms are common, too.
  • And noise, and tumult rises from the crowd
  • and crowd a loud the NOY the in the air

It's like an infomercial.
  • Forgot, nutritious, grateful to the taste
  • and health the goodness sweet the LEE the taste
'Cheery' for 'happy'. Trying for 'country'.
  • A happy nation, and a happy king.
  • the cheery proud TREE and his happy state

'AR' - army. 'joust' could be a cool association with army.
  • Of Greeks a mighty army, all in vain
  • of all the AR the Greeks the joust of Greeks
'SPIH' is 'spirit'. 'TER' is part of 'eternal', which it got correctly later in the line. This was the only three-syllable success.
  • With this eternal silence; more a god
  • with TER IH SPIH IH god eternal god

Both outputs end in 'DURE': it wants 'endure' for 'anguish' and 'torment'. WIH is as in 'with', DIH as in 'did'.
  • 'And suffer me in anguish to depart'
  • and leave for WIH and ang in EH DIH DUR

  • Cannot devise a torment, so it be
  • cannot a not by by it by DIH DURE
In short, there are many examples of finding a two-syllable word that is either the right word or an associated one (a synonym or a more distant association), and one example of a three-syllable word.

Google Colaboratory is a free online service that hosts Jupyter notebooks and includes a free GPU (badly needed!). You have to sign up for Colab first, and then you can open any Jupyter notebook file from your Google Drive. The Deep Meter notebook here checks out the GitHub project and uses the CMUdict code and cached Project Gutenberg poetry files that I classified by meter. If you sign up for Colab and upload this notebook from this branch, it might actually work. On a GPU runtime it takes maybe half an hour to train. The VMs can be very slow, but GPU speed does not suffer. Don't try it on CPU or TPU; it will take forever!

If you have your own GPU farm, the only Colaboratory-specific code is the GitHub check-out directory dance at the top (a Colab notebook starts out in /content on a virtual machine). Everything else should be reproducible. The code uses a cached version of CMUdict.
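That dance is roughly the following (a sketch of the idea; the notebook's actual cell may differ):

```python
import os

# Clone the blog-post branch into the Colab VM, then work from inside it.
if not os.path.exists("/content/deep_meter"):
    !git clone -b First-blog-post https://github.com/LanceNorskog/deep_meter.git /content/deep_meter
os.chdir("/content/deep_meter")
```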

https://github.com/LanceNorskog/deep_meter/tree/First-blog-post
