Problems
The project has some serious flaws:
- The symbol distribution has a severe "Zipf problem". The majority of syllables occur only once in the corpus, while common ones appear tens of thousands of times. TensorFlow has a feature to counter this problem; I will try it out soon.
- More important: the encoding itself. I wanted to keep the output to a fixed number of syllables to avoid using sequential neural networks. However, the syllable format causes the severe spread above and a huge dictionary size (15k symbols for a larger corpus, 6k for this one). CMUdict is itself in ARPAbet phonemes, and there are only about 50 of them. Phonemes will also have a severe Zipf distribution, but should not have the ridiculously long tail. They would require a variable-size output, but it should take at most 10*4 phonemes to encode a ten-syllable sentence. That's the same information stored in 10*4*50 bits instead of 10*15k bits. (A phoneme-encoding sketch appears after this list.)
- It is possible to keep the current encoding and hash the syllables in the long tail. If the hashed symbols only ever appear as parts of longer words, the decoder can hunt for words of the form 'syllable-?' or 'syllable-?-syllable' when turning the 10 one-hots into a sentence. This feels like a hack rather than a real solution. (A bucketing sketch appears after this list.)
- Not enough training data. There are many resources for well-formed sentences on the web, and simply scanning any text for "10 syllables with the right meter" should turn up more sentences (see the scanning sketch after this list). Also, the Paraphrase Database project supplies millions of word and phrase pairs that mean the same thing; it should be possible to generate variations of a sentence by swapping in paraphrases with a different meter.
- Another option is to split larger sentences into clauses. This is very slow, though, and can't really scale to the sizes we need.
- Storing training data. I tried pre-generating the USE vectors for my sentences and the storage ran into the gigabytes quickly. This is why the model re-generates the USE vectors for each epoch, and I believe that is the gating factor, since adding model layers did not slow training down appreciably. The Keras "read from directory" feature might be what I need, though I'm not sure it will run faster from disk, and that feature is designed for image processing. (A caching sketch using tf.data appears after this list.)
- Source data for short sentences is hard to find. The MSCOCO database is a good place to start; it has 5 summaries apiece for 29k images.
- Evaluation functions: the loss function is wrong. There is no built-in loss/eval function pair for "all the 1s must be 1 and all the 0s must be 0, with the two kinds of failure weighted equally". (A custom-loss sketch appears after this list.)
- Decoder for CMUdict: I need to write the "[[syllable, weight], ...] -> word set" decoder, which searches for possible sentences and scores them based on the one-hot value for each syllable inside its syllable slot. (A search sketch appears after this list.)
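
On the phoneme encoding point above, here is a minimal sketch of what that representation looks like, assuming the copy of CMUdict that ships with NLTK (the project may read the raw dictionary file instead):

```python
# Sketch: encode a sentence as ARPAbet phonemes using NLTK's copy of CMUdict.
# Assumes `pip install nltk` and `nltk.download('cmudict')` have been run.
from nltk.corpus import cmudict

pron = cmudict.dict()  # word -> list of pronunciations, each a list of phonemes

def to_phonemes(sentence):
    """Flatten a sentence into ARPAbet phonemes, taking the first pronunciation
    of each word. Raises KeyError for out-of-vocabulary words."""
    phones = []
    for word in sentence.lower().split():
        phones.extend(pron[word][0])
    return phones

print(to_phonemes("the quick brown fox jumps over the lazy dog"))

# The symbol inventory is tiny compared to a 6k-15k syllable dictionary:
inventory = {p for prons in pron.values() for p in prons[0]}
print(len(inventory), "distinct phoneme symbols (including stress digits)")
```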
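On the hashing workaround: a minimal sketch of folding the long tail into shared buckets. MIN_COUNT and N_BUCKETS are made-up values, not anything the project uses:

```python
# Sketch: keep frequent syllables as distinct one-hot indices and fold the
# long tail into shared hash buckets. MIN_COUNT and N_BUCKETS are illustrative.
from collections import Counter
import hashlib

MIN_COUNT, N_BUCKETS = 5, 256

def build_vocab(all_syllables):
    """all_syllables: iterable of syllable strings from the training corpus."""
    counts = Counter(all_syllables)
    kept = sorted(s for s, c in counts.items() if c >= MIN_COUNT)
    return {s: i for i, s in enumerate(kept)}

def encode(syllable, vocab):
    """Frequent syllables keep their own index; rare ones share hashed slots
    appended after the real vocabulary."""
    if syllable in vocab:
        return vocab[syllable]
    digest = int(hashlib.md5(syllable.encode()).hexdigest(), 16)
    return len(vocab) + digest % N_BUCKETS
```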
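On scanning text for "10 syllables with the right meter": the stress digits in CMUdict are enough for a rough filter. A sketch, again assuming NLTK's cmudict; treating secondary stress as unstressed is a simplification, and CMUdict marks most monosyllables as stressed, so the stress check would need loosening in practice:

```python
# Sketch: check syllable count and stress pattern with CMUdict stress digits.
import re
from nltk.corpus import cmudict

pron = cmudict.dict()

def stress_pattern(sentence):
    """Return a string of 0/1 stress marks, or None if any word is unknown."""
    pattern = ""
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word not in pron:
            return None
        for phone in pron[word][0]:
            if phone[-1].isdigit():        # vowel phonemes carry the stress digit
                pattern += "1" if phone[-1] == "1" else "0"
    return pattern

def has_ten_syllables(sentence):
    p = stress_pattern(sentence)
    return p is not None and len(p) == 10

print(stress_pattern("But soft what light through yonder window breaks"))
```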
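On the storage/speed problem: one option, sketched below rather than taken from the project's code, is to compute the USE vectors once, save them as a NumPy array, and stream batches with tf.data instead of re-embedding every epoch:

```python
# Sketch: pre-compute the USE vectors once, save them, and stream them with
# tf.data. Assumes tensorflow, tensorflow_hub, and numpy are installed;
# `sentences` and `targets` stand in for the project's own data.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cache_embeddings(sentences, path="use_vectors.npy", batch_size=256):
    chunks = [embed(sentences[i:i + batch_size]).numpy()
              for i in range(0, len(sentences), batch_size)]
    vectors = np.concatenate(chunks)          # shape: (n_sentences, 512)
    np.save(path, vectors)
    return vectors

def make_dataset(path, targets, batch_size=64):
    vectors = np.load(path)
    return (tf.data.Dataset.from_tensor_slices((vectors, targets))
            .shuffle(len(vectors))
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))
```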
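On the loss function: one way to express "all the 1s and all the 0s matter equally" is to average the cross-entropy over the positive positions and over the negative positions separately, then average the two terms. A sketch of such a custom Keras loss (not what the project currently uses):

```python
# Sketch: a loss that weights errors on the 1-positions and the 0-positions
# equally, regardless of how many of each the target vector contains.
import tensorflow as tf

def balanced_bce(y_true, y_pred, eps=1e-7):
    y_true = tf.cast(y_true, y_pred.dtype)
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    pos_loss = -tf.math.log(y_pred)          # penalty where the target is 1
    neg_loss = -tf.math.log(1.0 - y_pred)    # penalty where the target is 0
    pos_term = tf.reduce_sum(y_true * pos_loss, axis=-1) / (
        tf.reduce_sum(y_true, axis=-1) + eps)
    neg_term = tf.reduce_sum((1.0 - y_true) * neg_loss, axis=-1) / (
        tf.reduce_sum(1.0 - y_true, axis=-1) + eps)
    return 0.5 * (pos_term + neg_term)

# model.compile(optimizer="adam", loss=balanced_bce)
```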
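On the decoder: one possible shape for the search, entirely a sketch with made-up names (word_syllables is assumed to map each word to its syllable tuple):

```python
# Sketch of the "[[syllable, weight], ...] -> word set" decoder: a depth-first
# search over a (length, syllable tuple) -> words index, scoring each candidate
# word by the activation of its syllables in the slots it would occupy.
# A real decoder would need beam search or pruning; this search is exponential.
from collections import defaultdict

def build_index(word_syllables):
    index = defaultdict(lambda: defaultdict(list))
    for word, syls in word_syllables.items():
        index[len(syls)][tuple(syls)].append(word)
    return index

def decode(slots, index, max_word_len=4):
    """slots: one dict per syllable position, mapping syllable -> activation."""
    best = (None, float("-inf"))

    def search(pos, words, score):
        nonlocal best
        if pos == len(slots):
            if score > best[1]:
                best = (" ".join(words), score)
            return
        for length in range(1, min(max_word_len, len(slots) - pos) + 1):
            for syls, candidates in index[length].items():
                s = sum(slots[pos + i].get(syl, 0.0) for i, syl in enumerate(syls))
                if s <= 0.0:
                    continue                  # prune words with no support at all
                for word in candidates:
                    search(pos + length, words + [word], score + s)

    search(0, [], 0.0)
    return best
```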
Larger vision
The concept is to generate various autoencoders for different meters. Since the decoder phase has 3 hidden layers, it might be possible to freeze the first two and swap in a separate final hidden layer and decoder weight set for each meter. This rests on the supposition that the inner layers store higher abstractions while the outer layers deal with word generation. Dubious, but worth trying; a Keras sketch follows.
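
A sketch of that weight-sharing arrangement in Keras; layer sizes, names, and the output dimension are placeholders, not the project's actual architecture:

```python
# Sketch: a shared (frozen) decoder trunk plus a swappable head per meter.
import tensorflow as tf
from tensorflow.keras import layers, Model

USE_DIM, VOCAB, SLOTS = 512, 6000, 10   # placeholder dimensions

# Shared trunk: trained once, then frozen before training other meters.
shared_1 = layers.Dense(1024, activation="relu", name="shared_1")
shared_2 = layers.Dense(1024, activation="relu", name="shared_2")
shared_1.trainable = False
shared_2.trainable = False

def build_meter_decoder(meter_name):
    """A fresh final hidden layer and output layer on top of the frozen trunk."""
    inputs = layers.Input(shape=(USE_DIM,))
    x = shared_2(shared_1(inputs))
    x = layers.Dense(1024, activation="relu", name=f"{meter_name}_hidden")(x)
    outputs = layers.Dense(SLOTS * VOCAB, activation="sigmoid",
                           name=f"{meter_name}_out")(x)
    return Model(inputs, outputs, name=f"decoder_{meter_name}")

iambic = build_meter_decoder("iambic_pentameter")
iambic.compile(optimizer="adam", loss="binary_crossentropy")
```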
And then find a host & make a website where you can interactively try it.