tag:blogger.com,1999:blog-30950493722196747922023-11-15T22:31:13.529-08:00Uncle Lance's Ultra Whiz BangLance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.comBlogger28125tag:blogger.com,1999:blog-3095049372219674792.post-75038401960129529262018-10-27T19:12:00.000-07:002018-10-27T19:12:02.627-07:00Deep Meter - #0b (Post-mortem on first attempt)<h3>
Problems</h3>
The project has some serious flaws:<br />
<br />
<ol>
<li>The symbol distribution has a severe "Zipf problem". The majority of syllables occur only once in the corpus, while common ones appear tens of thousands of times. Tensorflow has a feature to counter this problem; I will try it out soon.</li>
<li>More important is the encoding itself. I wanted to keep it to a fixed number of syllables to avoid using sequential neural networks. However, the syllable format causes the severe spread above, and a huge dictionary size (15k for a larger corpus, 6k for this one). The CMUdict is itself in ARPAbet phonemes, and there are only 50 of those. They will also have a severe Zipf distribution, but not the ridiculously long tail. A phoneme encoding will require a variable-size output, but it should take at most 10*4 phonemes to encode a ten-syllable sentence. That's the same information stored in 10*4*50 bits, instead of 10*15k bits.</li>
<ol>
<li>It is possible to keep the current encoding and hash syllables from the long tail. If they are only syllables that are part of longer words, the decoder can hunt for words of the form 'syllable-?' or 'syllable-?-syllable' when turning 10 one-hots into a sentence. This feels like hacking instead of solving the problem well.</li>
</ol>
<li>Not enough training data. There are many resources for well-formed sentences on the web. Just scanning any text for "10 syllables with the right meter" should flush out more sentences. Also, the "Paraphrase Database" project supplies millions of word & phrase pairs that mean the same thing. It should be possible to generate variations of a sentence by swapping in paraphrases that have a different meter.</li>
<ol>
<li>Another option is to split larger sentences into clauses. This is very slow, though, and can't really scale to the sizes we need.</li>
</ol>
<li>Storing training data. I tried pre-generating the USE vectors for the sentences and it ran into the gigabytes quickly. This is why the notebook re-generates the USE vectors for each epoch. I believe this is the gating factor, since adding model layers did not slow training down appreciably. The Keras "read from directory" feature might be what I need, though I'm not sure it will run faster from disk; that feature is designed for image processing.</li>
<li>Source data for short sentences is hard to find. The MSCOCO database is a good place to start: it has 5 captions apiece for 29k images.</li>
<li>Evaluation functions: the loss function is wrong. There is no loss/eval function pair for "all 1s must be correct, all 0s must be 0, treating the two failures with equal weight".</li>
<li>CMUdict decoder: I need to write the "[[syllable, weight], ...] -> word set" decoder, which searches for possible sentences and scores them based on the one-hot value for the given syllable inside each syllable slot.</li>
</ol>
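The loss described in the evaluation-functions item above doesn't exist off the shelf, but a version can be sketched by hand: score the 1-positions and the 0-positions separately, then weight the two failure modes equally. This is an illustrative pure-Python sketch, not the project's actual code:

```python
import math

def balanced_multihot_loss(y_true, y_pred, eps=1e-7):
    """Hypothetical loss: penalize wrong 1s and wrong 0s with equal weight,
    regardless of how many of each appear in the target vector."""
    ones = [p for t, p in zip(y_true, y_pred) if t == 1]
    zeros = [p for t, p in zip(y_true, y_pred) if t == 0]
    # average cross-entropy over the 1-positions and the 0-positions separately
    loss_ones = -sum(math.log(max(p, eps)) for p in ones) / max(len(ones), 1)
    loss_zeros = -sum(math.log(max(1.0 - p, eps)) for p in zeros) / max(len(zeros), 1)
    return 0.5 * (loss_ones + loss_zeros)
```

A Keras version would wrap the same arithmetic in tensor ops and pass it to `model.compile` as a custom loss.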
<h3>
Larger vision</h3>
<div>
The concept is to generate various autoencoders for different meters. Since the decoder phase has 3 hidden layers, it might be possible to freeze the first two, and swap in a separate final hidden and decoder weight set for each different meter. This is on the supposition that the inner layers store higher abstractions and the outer layers deal with word generation. Dubious, but worth trying.</div>
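The freeze-and-swap idea can be sketched in plain Python; all the shapes, weights, and meter names below are made up for illustration:

```python
# Sketch of the proposed decoder: two shared hidden layers whose weights are
# frozen after initial training, plus a swappable final layer per meter.

def dense(x, w):
    # toy dense layer: w is a list of per-output-unit weight vectors
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in w]

shared_w1 = [[0.1, 0.2], [0.3, 0.4]]   # frozen (higher abstractions)
shared_w2 = [[0.5, 0.6], [0.7, 0.8]]   # frozen

meter_heads = {
    "iambic_pentameter": [[1.0, 0.0], [0.0, 1.0]],
    "dactylic_hexameter": [[0.0, 1.0], [1.0, 0.0]],
}

def decode(thought_vector, meter):
    h = dense(dense(thought_vector, shared_w1), shared_w2)
    return dense(h, meter_heads[meter])   # only this layer differs per meter
```

In Keras the same effect comes from setting `trainable = False` on the shared layers and compiling one model per meter head.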
<div>
<br /></div>
<div>
And then find a host & make a website where you can interactively try it.</div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-12380096359086718592018-10-27T18:29:00.001-07:002018-11-04T16:21:02.440-08:00Deep Meter - #0a (Sneak Peek!)It's time to unveil my little toy project in Natural Language Processing (NLP). "Deep Meter" is a deep learning project which rephrases arbitrary English text in various poetic meters. The raw materials for this fever dream are as follows:<br />
1) The Universal Sentence Encoder. This is a set of deep models which transform a clause, a sentence, or a paragraph into a "thought vector". That is, it turns the sentence "I hate my new iphone" into a set of 512 numbers that (very opaquely) encode these concepts: "recent purchase, mobile phone, strong negative sentiment, present tense". The USE also turns "This new mobile is killing me." into a different set of 512 numbers, but the cosine distance between the two vectors is very small. Since it encodes a set of concepts and not just a sequence of words, the USE could be the basis of an English-German translator. The USE is hosted and updated at Google in their "Tensorflow-Hub" model library project.<br />
<a href="https://arxiv.org/abs/1803.11175" style="color: #3465a4;">https://arxiv.org/abs/1803.11175</a><br />
2) The Gutenberg Poetry Corpus. This is a conveniently packaged archive of most of the body text of every book in Project Gutenberg that is a compilation of poetry.<br />
<a href="https://github.com/aparrish/gutenberg-poetry-corpus" style="color: #3465a4;">https://github.com/aparrish/gutenberg-poetry-corpus</a><br />
3) The CMU Pronunciation Dictionary (CMUdict) is a database of common and uncommon English words, proper names, loanwords etc. which gives common pronunciations for each word. The pronunciations are given in the ARPAbet phoneme system. The entries are in fact in a variant of the ARPAbet that includes word stresses.<br />
<a href="http://www.speech.cs.cmu.edu/cgi-bin/cmudict" style="color: #3465a4;">http://www.speech.cs.cmu.edu/cgi-bin/cmudict</a><br />
4) A version of the CMUdict which has syllable markers added. Used for early experiments. This is crucial to classifying the meter of lines from #2.<br />
<a href="https://webdocs.cs.ualberta.ca/~kondrak/cmudict.html" style="color: #3465a4;">https://webdocs.cs.ualberta.ca/~kondrak/cmudict.html</a><br />
5) Tensorflow, Keras and Google Colaboratory<br />
Tensorflow (TF) is a library (really an operating system packaged as a library) for doing machine learning. Keras is an abstraction layer over TF and similar projects.<br />
<a href="https://colab.research.google.com/notebooks/welcome.ipynb" style="color: #3465a4;">https://colab.research.google.com/notebooks/welcome.ipynb</a><br />
6) An example notebook that wraps #1 in a convenient package for experimenting with the features of all the items listed above.<br />
<a href="https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/" style="color: #3465a4;">https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/</a><br />
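The "small cosine distance" claim in #1 is easy to check once the 512-number vectors are in hand. The short vectors below are stand-ins for real USE output, which requires the Tensorflow-Hub module to fetch:

```python
import math

def cosine_similarity(u, v):
    # 1.0 means same direction (same "thought"), 0.0 means unrelated
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# illustrative stand-ins for USE vectors of two sentences with the same meaning
hate_iphone = [0.9, 0.1, 0.4]
mobile_killing_me = [0.8, 0.2, 0.5]
print(cosine_similarity(hate_iphone, mobile_killing_me))   # close to 1.0
```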
<br />
Project:<br />
The USE is distributed as only an encoder: it does not generate English sentences from its vectors.<br />
<br />
This project creates a neural network that decodes vectors into sentences. The project plays a sneaky trick: it only trains the network on sentences which are in iambic pentameter. (What's i-p? Poetry in the stress format "du-DUH du-DUH du-DUH du-DUH du-DUH". Rhyme doesn't matter; ten syllables with a hard rhythm is all that counts.) Since the network only knows how to output ten syllables in i-p form, and since the USE turns any sentence into an abstract thought vector, this network should be able to restate any short sentence (or sentence clause) in iambic pentameter.<br />
<br />
Current status: I've written a few tools for data-wrangling (the bane of any machine learning project).<br />
<ul>
<li dir="ltr">a library of utilities for parsing #4</li>
<li dir="ltr">code that reads lines of text, keeps those lines which are in i-p, and records their syllable-ization according to CMUdict. </li>
<li dir="ltr">a Jupyter notebook (based on #6) that reads the above data. </li>
</ul>
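The i-p filter in the second bullet can be sketched as follows. CMUdict marks each vowel phoneme with a stress digit (0 unstressed, 1 primary, 2 secondary); the tiny dictionary here is a stand-in for the real CMUdict lookup:

```python
# Toy stress dictionary; the real filter derives these strings from CMUdict.
TOY_STRESSES = {
    "a": "0", "spear": "1", "the": "0", "hero": "10", "bore": "1",
    "of": "0", "wondrous": "10", "strength": "1",
}

def is_iambic_pentameter(words):
    stresses = "".join(TOY_STRESSES[w] for w in words)
    if len(stresses) != 10:          # exactly ten syllables
        return False
    # du-DUH five times: even positions unstressed, odd positions stressed
    return (all(s == "0" for s in stresses[0::2])
            and all(s in "12" for s in stresses[1::2]))
```

For example, `is_iambic_pentameter("a spear the hero bore of wondrous strength".split())` accepts one of the corpus lines quoted later in this post.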
<div>
The experiment is showing positive signs. The network does some interesting generation: it often finds the right word, or an associated word. Since it works by syllable, it has to generate syllables in sequence that together form words. In one case it found a three-syllable sequence ('eternal'). </div>
<div>
<br /></div>
These samples are cherry-picked for quality and illumination. There are a lot of failures. There's also a lot of stuttering, both of source words and of synonyms of a dominant theme of the line. (These themes are often common in the corpus: stories of ancient heroes, etc.) Here are the original sentences, and the hand-interpreted output of some interesting successes. Capitalized words are loose syllables that didn't match any surrounding words.<br />
<br />
Stuttering.<br />
<ul>
<li>A spear the hero bore of wondrous strength</li>
<li>a mighty sword the spear a sword of spear</li>
</ul>
<br />
A common occurrence is single syllables that are clearly part of a word or a synonym: 'annoy' is the only word that matches NOY.<br />
<ul>
<li>And noise, and tumult rises from the crowd</li>
<li>and crowd a loud the NOY the in the air</li>
</ul>
<br />
It's like an infomercial.<br />
<ul>
<li>Forgot, nutritious, grateful to the taste</li>
<li>and health the goodness sweet the LEE the taste</li>
</ul>
'Cheery' for 'happy'. Trying for 'country'. <br />
<ul>
<li>A happy nation, and a happy king.</li>
<li>the cheery proud TREE and his happy state</li>
</ul>
<br />
'AR' - army. 'joust' could be a cool association with army.<br />
<ul>
<li>Of Greeks a mighty army, all in vain</li>
<li>of all the AR the Greeks the joust of Greeks</li>
</ul>
'SPIH' is 'spirit'. 'TER' is part of 'eternal', which it got correctly later in the sentence. This was the only 3-syllable success.<br />
<ul>
<li>With this eternal silence; more a god</li>
<li>with TER IH SPIH IH god eternal god</li>
</ul>
<br />
Both end in 'DURE'; it wants 'endure' for 'anguish' and 'torment'. WIH is as in 'with', DIH as in 'did'.<br />
<ul>
<li>'And suffer me in anguish to depart'</li>
<li>and leave for WIH and ang in EH DIH DUR</li>
<li><br /></li>
<li>Cannot devise a torment, so it be</li>
<li>cannot a not by by it by DIH DURE</li>
</ul>
<div>
<div>
In short, many examples of finding a two-syllable word that is either in place or an associated word (synonym or more distant association). One example of a three-syllable word.<br />
<br /></div>
Google Colaboratory is a free online service that hosts Jupyter Notebooks and includes a free GPU (badly needed!). You have to sign up for Colab first, and then you can open any Jupyter notebook file from your Google Drive. The Deep Meter notebook here checks out the github project and uses the CMUdict code and cached Project Gutenberg poetry files that I classified by meter. If you sign up for Colab and upload this notebook from this branch, it might actually work. On a GPU runtime it takes maybe 1/2 hour to train. The VMs can be very slow, but GPU speed does not suffer. Don't try it on CPU or TPU, it will take forever!<br />
<br />
If you have your own GPU farm, the only Colaboratory-specific code is the github check-out directory dance at the top. (A Colab notebook starts out in /content on a virtual machine). Everything else should be reproducible. The code uses a cached version of CMUdict.<br />
<br />
<a href="https://github.com/LanceNorskog/deep_meter/tree/First-blog-post" style="color: #3465a4;">https://github.com/LanceNorskog/deep_meter/tree/First-blog-post</a></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-21030499150100916172018-09-23T17:28:00.001-07:002018-10-27T18:31:04.509-07:00Backpropagation<br />
<h2>
Backpropagation</h2>
<h4>
UAT</h4>
The Universal Approximation Theorem (UAT) concerns the simplest version of the structure of weights & non-linear functions used in deep learning. It states roughly that this equation can be solved, to an approximation:<br />
<div style="text-align: center;">
logit(numbers[] * input_weights[][]) * hidden_weights[][] ~= other_numbers[]</div>
<h4>
Backpropagation</h4>
The UAT does not tell us how to do this, only that it is possible. Later, the backpropagation (BP) algorithm was created to realize the possibility of the UAT. BP, as created, is only intended to solve one equation of N inputs and M unknowns. Backpropagation uses "hill-climbing", which assumes that the solution of an equation is at the top of a hill. Wherever it starts, if it climbs uphill, it will eventually hit the top.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnI_kkyfxuCco-OigwJjHz6CuuPOxEnj28rQdCYq6spQaAuT7LOGRaX5sMi9koFGbDeZkSpyD1U11_mJ3IX4fMQKOuqCDLP-IdetbMPQENfUeZw5oMtHHzBL0icIdNveKEr-ywvo2eHdA/s1600/Very_Smooth_Solver.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="260" data-original-width="640" height="81" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnI_kkyfxuCco-OigwJjHz6CuuPOxEnj28rQdCYq6spQaAuT7LOGRaX5sMi9koFGbDeZkSpyD1U11_mJ3IX4fMQKOuqCDLP-IdetbMPQENfUeZw5oMtHHzBL0icIdNveKEr-ywvo2eHdA/s200/Very_Smooth_Solver.png" width="200" /></a></div>
<br />
In a typical Deep Learning application, we take thousands of input samples which are similar, and turn them into equations of the above form. For a simple image classifier, this would be typical. The right-hand side corresponds to whether the image is a picture of a cat, a building or a car.<br />
<div style="text-align: center;">
logit(image_pixels[] * input_weight_matrix[]) * hidden_weight_matrix[] ~= [1,0,0]</div>
Wait a minute! In the math of equation-solving, each of these equations is a separate subspace, to be solved independently.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD2U_WQpdLjj307_4x66OKv8vr5Jj0vIVyt4JoqGsY3VfSscU9eyfXDVf9ue-MOw_sLOYS8x51szZlG58-8ORQo6NrRK5iB-fMNPrOpp_eZ1ffdqXB7NwORZsjGvOt3SvcBN1vNI9WRHQ/s1600/Separate_Universes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="310" data-original-width="640" height="155" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD2U_WQpdLjj307_4x66OKv8vr5Jj0vIVyt4JoqGsY3VfSscU9eyfXDVf9ue-MOw_sLOYS8x51szZlG58-8ORQo6NrRK5iB-fMNPrOpp_eZ1ffdqXB7NwORZsjGvOt3SvcBN1vNI9WRHQ/s320/Separate_Universes.png" width="320" /></a></div>
<br />
How did we get from approximating an equation (UAT) to approximating thousands of equations? The key idea of Deep Learning is to pretend that all of these equations are in one subspace, and that there is a mythical equation that is the centroid of all of these equations.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijeuQUUM9I_B_SQlD1T8EDFMELe_idc1daoj5BIM5QaM_1IiwvRCRy3TdC8DrV54OTUuaT0Z4FZL1I8RyTwOuYT-jLleo2eHA7l4i9CBOnqvWS_ULQxbTZDPKVOCRa64hL2ZCDYRWxmfk/s1600/One_Universe.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="311" data-original-width="644" height="153" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijeuQUUM9I_B_SQlD1T8EDFMELe_idc1daoj5BIM5QaM_1IiwvRCRy3TdC8DrV54OTUuaT0Z4FZL1I8RyTwOuYT-jLleo2eHA7l4i9CBOnqvWS_ULQxbTZDPKVOCRa64hL2ZCDYRWxmfk/s320/One_Universe.png" width="320" /></a></div>
Deep Learning takes many (thousand) sample equations which are mathematically different subspaces, and essentially averages the result of the application of BP to each equation. To put it another way, each equation tugs the BP algorithm toward itself.<br />
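The "tugging" can be seen in a toy version: several 1-D equations w * x = y with no single exact solution, solved by averaging the per-equation gradients. The numbers are made up for illustration:

```python
# Three inconsistent equations w * x = y, roughly satisfying y = 2x.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

w = 0.0          # start anywhere on the hill
lr = 0.05        # learning rate
for _ in range(200):
    # gradient of 0.5 * (w*x - y)**2 with respect to w, averaged over samples
    grad = sum((w * x - y) * x for x, y in samples) / len(samples)
    w -= lr * grad

# w settles at the compromise (least-squares) value near 2.04; no single
# equation is solved exactly, but each has tugged w toward itself
```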
<br />
This seems like a misuse of BP, but a misuse that has proven very, very fruitful.<br />
<h4>
References</h4>
See the wikipedia pages for more info on the UAT, Backpropagation, and Gradient Descent:<br />
<br />
<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem" style="color: #3465a4;">https://en.wikipedia.org/wiki/Universal_approximation_theorem</a><br />
<a href="https://en.wikipedia.org/wiki/Backpropagation">https://en.wikipedia.org/wiki/Backpropagation</a><br />
<a href="https://en.wikipedia.org/wiki/Gradient_descent">https://en.wikipedia.org/wiki/Gradient_descent</a><br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-38447329457675817062018-09-23T16:17:00.000-07:002018-09-23T16:17:03.646-07:00Neural Networks Series #1: What's an SVD?<b>Singular Value Decomposition (SVD)</b> is a linear algebra algorithm that decomposes any matrix of numbers into three matrices that, when multiplied together, will recreate the original matrix. There are several of these matrix decomposition techniques, useful for different purposes.<br />
Before we delve into the uses of SVD, we will take a short detour. The <i>rank</i> of a matrix is the number of linearly separable rows (or columns) in the matrix, which means that:<br />
<i>if (row[i] + x) * y + z = row[j] </i><br />
<i> then rows i and j are not linearly separable</i><br />
Computing the rank is done by mutating the matrix into <i>row echelon form</i>, which substitutes a row of zeros for all of the rows which are considered "duplicates" under linear separability. Here is a tutorial on row echelon form:<br />
<a href="https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA">https://stattrek.com/matrix-algebra/echelon-transform.aspx#MatrixA</a><br />
This matrix has a rank of 1 because it is possible to generate any of the other three rows from the first row with the above formula, and likewise for the columns:<br />
[1,1,1,1]<br />
[2,2,2,2]<br />
[3,3,3,3]<br />
[4,4,4,4]<br />
This is called <i>linear separability</i>. Rank gives a primitive measurement for the amount of correlation between rows or columns. SVD is more nuanced. It implements linear separability as a continuous measurement whereas rank is only a discrete measurement. Separability is scored by Euclidean distance, and SVD creates a relative ranking of how close each row is to all other rows, and how close each column is to every other column. This ranking gives a couple of useful measurements about the matrix data:<br />
<br />
<ol>
<li>It gives a more nuanced measurement of separability than just the rank. It gives a vector of numbers which give the relative amount of separability across multiple columns.</li>
<li>By ranking the distances among rows & columns, it gives a scoring of centrality, or "normal vs outlier".</li>
</ol>
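A minimal numpy sketch reproduces this kind of measurement for the rank-1 matrix above (exact tiny values vary by library):

```python
import numpy as np

# The rank-1 matrix from the text: every row is a multiple of the first.
m = np.array([[1, 1, 1, 1],
              [2, 2, 2, 2],
              [3, 3, 3, 3],
              [4, 4, 4, 4]], dtype=float)

print(np.linalg.matrix_rank(m))          # 1
s = np.linalg.svd(m, compute_uv=False)   # singular values, largest first
print(s / s.max())                       # ~[1, 0, 0, 0]: one row's worth of signal
```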
<div>
This matrix has a rank of 2: no row is a plain multiple of another, but any row can be built from two of the others, and all of the rows are close:</div>
<div>
<div>
[1,2,3,4]</div>
<div>
[2,3,4,5]</div>
<div>
[3,4,5,6]</div>
<div>
[4,5,6,7]</div>
</div>
<div>
SVD generates a list of numbers called the "singular values" of a matrix. Here are the singular values for the above 2 matrices:</div>
<div>
<pre style="background-color: white; border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">[1.00000000e+00 8.59975057e-17 2.52825014e-47 3.94185361e-79]</pre>
<pre style="background-color: white; border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><pre style="border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">[9.98565020e-01 5.35527830e-02 1.96745978e-17 9.98144122e-18]</pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">Let's chart these singular value vectors (linear scale):</span></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhONeolOggiZi4ZzsrkWJmx_xeEvdE2ksOG_l2hdPE6I1q99VDA1KFPsR18c8_n-FSrdmcLZnZPR3x9GndzQU_9jh9eSYJ7QKyNzFMioVM9fx-itQu0g5rurQK9w0CG1t2wZJSbVsvKfe4/s1600/svd_4_4.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em; text-align: left;"><span style="font-family: "times" , "times new roman" , serif;"></span></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhONeolOggiZi4ZzsrkWJmx_xeEvdE2ksOG_l2hdPE6I1q99VDA1KFPsR18c8_n-FSrdmcLZnZPR3x9GndzQU_9jh9eSYJ7QKyNzFMioVM9fx-itQu0g5rurQK9w0CG1t2wZJSbVsvKfe4/s1600/svd_4_4.png" imageanchor="1" style="clear: right; display: inline !important; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="288" data-original-width="432" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhONeolOggiZi4ZzsrkWJmx_xeEvdE2ksOG_l2hdPE6I1q99VDA1KFPsR18c8_n-FSrdmcLZnZPR3x9GndzQU_9jh9eSYJ7QKyNzFMioVM9fx-itQu0g5rurQK9w0CG1t2wZJSbVsvKfe4/s200/svd_4_4.png" width="200" /></a></div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
<span style="font-family: "times" , "times new roman" , serif; margin-left: 1em; margin-right: 1em;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCO3ziXlYYZqS073TWZErhPMe5VhJ7Ci57c-q8xsg8meee0Gsinm37qMdtmWAv8pETD8_lxvPCYSdXiUrUDOqTbBB6EYO1b50wLuLPLPidzifyIs2qmKjcLj17rJK0TqLUFMRzG49IIcQ/s1600/svd_4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="432" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCO3ziXlYYZqS073TWZErhPMe5VhJ7Ci57c-q8xsg8meee0Gsinm37qMdtmWAv8pETD8_lxvPCYSdXiUrUDOqTbBB6EYO1b50wLuLPLPidzifyIs2qmKjcLj17rJK0TqLUFMRzG49IIcQ/s200/svd_4.png" width="200" /></a></span></div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px; margin-left: 1em; margin-right: 1em; text-align: left;">
</div>
<div style="font-size: 14px;">
</div>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">These charts tell us that the first matrix has a rank of 1. The second matrix has a rank above one, but does not have two full rows of uniqueness. This chart is a measurement of the correlation between rows and correlation between columns. The singular values of a matrix are a good first look at the amount of "information" or "signal" or "entropy". "Quantity of signal" is a fairly slippery concept with mathematical definitions that border on the metaphysical, but the singular values are a useful (though simplistic) measurement of the amount of information contained in a matrix.</span></pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">Based on the term "entropy" in the previous paragraph, let's try filling a larger matrix with random numbers. These are the measurements of a 20x50 random matrix, with a chart log-scaled on the Y axis:</span></pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><pre style="border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">Rank = 20
Singular values: </pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">[0.35032972 0.322733 0.30419963 0.29269556 0.28137789 0.26201005
0.25742231 0.23075557 0.21945011 0.21763414 0.20245292 0.1976021
0.175207 0.16692658 0.14884194 0.14208821 0.13474842 0.12291441
0.10004981 0.08849001]</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfArilNq7YTW4_U1UelFzePqZTigWTLPfZ9YvnASrvDkam2I7IDdrImr8xEARzIx8nNOKmSv1S55o_eLCw3TITMeqFmK4d3mnc0ybo2Ql480hXUZwfvuIyOmDTmSDVo_ug5pKvk2g8T50/s1600/svd_20_50.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfArilNq7YTW4_U1UelFzePqZTigWTLPfZ9YvnASrvDkam2I7IDdrImr8xEARzIx8nNOKmSv1S55o_eLCw3TITMeqFmK4d3mnc0ybo2Ql480hXUZwfvuIyOmDTmSDVo_ug5pKvk2g8T50/s320/svd_20_50.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "times" , "times new roman" , serif;">
</span></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-family: "times" , "times new roman" , serif;">The graph should be a straight line but is not, due mainly to the fact that "random numbers" are very hard to come by. This matrix has a rank of 20: you cannot multiply & add any row and get another row. In fact, all rows are equidistant under </span><i style="font-family: Times, "Times New Roman", serif;">Euclidean distance</i><span style="font-family: "times" , "times new roman" , serif;">. Let's do something sneaky: we're going to multiply its transpose by it to get a 50x50 matrix of random numbers. Here are the measurements (again, a log-Y chart):</span></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-family: "times" , "times new roman" , serif;">
</span></div>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><pre style="border-radius: 0px; border: 0px; box-sizing: border-box; font-size: 14px; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;">Rank = 20
[1.14271040e+02 9.69770480e+01 8.61587896e+01 7.97653909e+01
7.37160653e+01 6.39172571e+01 6.16984957e+01 4.95777248e+01
4.48387835e+01 4.40997633e+01 3.81619300e+01 3.63550994e+01
2.85815085e+01 2.59437792e+01 2.06268491e+01 1.87974231e+01
1.69055603e+01 1.40665565e+01 9.31997487e+00 7.29072452e+00
1.50274491e-14 1.29473245e-14 1.24897761e-14 1.11334989e-14
1.03937520e-14 9.90176181e-15 9.17746567e-15 8.94074462e-15
8.58805508e-15 7.94287894e-15 6.90962668e-15 6.72135898e-15
6.02639348e-15 5.66356533e-15 5.16429606e-15 4.84762468e-15
4.47711092e-15 4.37266408e-15 4.27780206e-15 3.43884568e-15
3.21440817e-15 2.83815151e-15 2.61710432e-15 2.34685343e-15
2.24348424e-15 1.32915906e-15 9.12392110e-16 5.65320744e-16
2.64346262e-16 1.11941235e-16]</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdhH2oU1EtmxTNTZpM3btKUa5lyzX5xLPkNwuxcnfZMhk20SWxPTHmWnjyVxPeNteEHCjxt6NfiJQAG6D4oaLboBZ-1EKlkYCXA3aj1ZG1BhmbogCA4T4TxwJrZfsM_5TNvOrTV2EfbvE/s1600/svd_50_50.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdhH2oU1EtmxTNTZpM3btKUa5lyzX5xLPkNwuxcnfZMhk20SWxPTHmWnjyVxPeNteEHCjxt6NfiJQAG6D4oaLboBZ-1EKlkYCXA3aj1ZG1BhmbogCA4T4TxwJrZfsM_5TNvOrTV2EfbvE/s320/svd_50_50.png" width="320" /></a></div>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">Huh? Why is the rank still 20? And what's going on with that chart? <i>When we multiplied the matrix by itself, we did not add any information/signal/entropy to the matrix!</i> Therefore it still has the same amount of separability. It's as if we weighed a balloon, filled it with weightless air, then weighed it again.</span></pre>
<pre style="border-radius: 0px; border: 0px; box-sizing: border-box; line-height: inherit; overflow: auto; padding: 1px 0px; vertical-align: baseline; white-space: pre-wrap; word-break: break-all; word-wrap: break-word;"><span style="font-family: "times" , "times new roman" , serif;">In this post, we discussed SVD and the measurement of information in a matrix; this trick of multiplying a random matrix by itself to blow it up was what helped me understand SVD ten years ago. (Thanks to Ted Dunning on the Mahout mailing list.)</span></pre>
</pre>
</pre>
</pre>
</div>
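The balloon experiment takes a few lines of numpy to reproduce; to make the dimensions work, "multiply it by itself" means multiplying the transpose by the matrix (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((20, 50))               # 20x50 matrix of random numbers
print(np.linalg.matrix_rank(a))        # 20: every row independent

b = a.T @ a                            # 50x50, built only from a's 20 rows
print(np.linalg.matrix_rank(b))        # still 20: no information was added
```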
<div>
I've fast-forwarded like mad through days of linear algebra lecture, in order to give intuitive understanding about matrix analysis. We'll be using SVD in a few ways in this series on my explorations in Neural Networks. See this page for a good explanation of SVD and a cool example of how it can be used to clean up photographs. :</div>
<div>
<a href="http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/">http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/</a></div>
<div>
Source code for the above charts & values available at:</div>
<div>
<a href="https://github.com/LanceNorskog/blob_posts/blob/master/SVD%20Basics.ipynb">https://github.com/LanceNorskog/blob_posts/blob/master/SVD%20Basics.ipynb</a></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-89409453075364830442018-04-15T18:36:00.000-07:002018-04-18T18:42:42.966-07:00Fun with SVD: MNIST DigitsSingular Value Decomposition is a very powerful linear algebra operation. It factorizes a matrix into three matrices which have some interesting properties. Among other things, SVD gives you a ranking of how 'typical' or 'normal' a row or column is in comparison to the other rows or columns. I created a 100x784 matrix of gray scale values, where each row is a sample image and each column is a position in the 28x28 gray-scale raster. The following chart gives the 100 digits ranked from "outlier" to "central", with the normalized rank above each digit.<br />
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNcAX2AEhYfpU0IXR89AK9u1j1aSRlp_bv4oFnJZnPfmclOqBXNlRyTbi-HhgEEdGHB_vRVTwGwV4Z8D5ZjphPd067XKBZtrAHXL3GCHxcHIB-QEOSsUpFs1oDCzjBJE5dUD2VZdzB2lE/s1600/foo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1062" data-original-width="851" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNcAX2AEhYfpU0IXR89AK9u1j1aSRlp_bv4oFnJZnPfmclOqBXNlRyTbi-HhgEEdGHB_vRVTwGwV4Z8D5ZjphPd067XKBZtrAHXL3GCHxcHIB-QEOSsUpFs1oDCzjBJE5dUD2VZdzB2lE/s640/foo.png" width="512" /></a></div>
<div>
<br /></div>
<div>
By this ranking, fuzzy/smudgy images are outliers and cleaner lines are central. Or, 4's are central. Here's the code:</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># variation of code from Tariq's book</span><br /><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># python notebook for Make Your Own Neural Network<br /># working with the MNIST data set<br />#<br /># (c) Tariq Rashid, 2016<br /># license is GPLv2</span><br />
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">import numpy as np</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">import matplotlib.pyplot as py</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">%matplotlib inline</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># strangely, does not exist in numpy</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">def normalize(v):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> max_v = -10000000</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> min_v = 10000000</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> for x in v:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> if (x > max_v):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> max_v = x</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> if (x < min_v):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> min_v = x</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> scale = 1.0/(max_v - min_v)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> offset = -min_v</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> for i in range(len(v)):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> v[i] = (v[i] + offset) * scale </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> return v</span></div>
</div>
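For what it's worth, the same min-max scaling can be done with numpy's vectorized min/max in a couple of lines. This is my own equivalent sketch (the `minmax_normalize` name is mine), not code from the book:

```python
import numpy as np

def minmax_normalize(v):
    # vectorized min-max scaling into [0, 1]; equivalent to the loop version above
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min())
```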
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># open the CSV file and read its contents into a list</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">data_file = open("mnist_dataset/mnist_train_100.csv", 'r')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">data_list = data_file.readlines()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">data_file.close()</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">rows = len(data_list)</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">image_mat = np.zeros((rows, 28 * 28))</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">for row in range(rows):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> dig = data_list[row][0]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> all_values = data_list[row].split(',')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> image_vector = np.asfarray(all_values[1:])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> image_mat[row] = (image_vector / 255.0 * 0.99) + 0.01</span></div>
</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">(u, s, v) = np.linalg.svd(image_mat)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">row_features = normalize(u.dot(s))</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"># py.plot(np.sort(row_features))</span></div>
</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">keys = np.argsort(row_features)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">grid=10</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">fig,axes = py.subplots(nrows=rows//grid, ncols=grid)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">fig.set_figheight(15)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">fig.set_figwidth(15)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">py.subplots_adjust(top=1.1)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">for row in range(rows):</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax = axes[row//grid][row%grid]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax.set_title("{0:.2f}".format(row_features[keys[row]]), fontsize=10)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax.set_xticks([])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax.set_yticks([])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ax.imshow(image_mat[keys[row]].reshape(28,28), cmap='Greys', interpolation='None')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">fig.savefig('foo.png', bbox_inches='tight')</span></div>
</div>
<div>
<br /></div>
</div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-587139504812903742018-04-08T19:19:00.001-07:002018-04-08T19:19:57.137-07:00Document Summarization with LSA Appendix B: Software<h3>
The Test Harness</h3>
The scripts for running these tests and the data are in my github repo: <a href="https://github.com/LanceNorskog/lsa">https://github.com/LanceNorskog/lsa</a>.<br />
<h4>
LSA toolkit</h4>
The LSA toolkit is available under <a href="https://github.com/LanceNorskog/lsa/tree/master/research">https://github.com/LanceNorskog/lsa/tree/master/research</a>. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as <a href="https://issues.apache.org/jira/browse/LUCENE-2899" target="_blank">LUCENE-2899</a>.<br />
<h4>
Reuters Corpus</h4>
The Reuters data and scripts for this analysis project are under <a href="https://github.com/LanceNorskog/lsa/tree/master/reuters" target="_blank">https://github.com/LanceNorskog/lsa/tree/master/reuters</a>. <a href="https://github.com/LanceNorskog/lsa/tree/master/reuters/data/raw">..../data/raw</a> contains the preprocessed Reuters article corpus: the articles are reformatted into one sentence per line and are limited to 10+ sentences. The toolkit includes a script to run against the Solr Document Summarizer and save the XML output for each article, and a script to apply XPath expressions to extract a CSV line for each article, collected into one CSV file per algorithm. The per-algorithm keys include both the regularization algorithms and whether parts-of-speech filtering was applied.<br />
<h4>
Analysis</h4>
The analysis phase used KNime to preprocess the CSV data. KNime rules created additional columns calculated from the generated ones, and then a pivot table which summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I had to make the charts in LibreOffice instead, and then copy them into a DOC file in MS Word (and not LibreOffice!) to get plain jpegs out of the charts.<br />
<h3>
Further Reading</h3>
<div>
The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.</div>
<div>
<a href="http://knime.org/">KNime.org</a></div>
<div>
<br /></div>
<div>
The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.</div>
<div>
<a href="http://www.r-project.org/">http://www.r-project.org/</a><br />
<a href="http://rstudio.org/" target="_blank">http://rstudio.org/</a><br />
<a href="http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/">http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/</a></div>
<div>
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-59169763494454643012018-04-08T19:19:00.000-07:002018-04-08T19:19:44.309-07:00Document Summarization with LSA Part 0: Basics<div>
This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper: </div>
<div>
<blockquote class="tr_bq">
<i><a href="http://scholar.google.com/scholar?hl=en&q=%22Creating+Generic+Text+Summaries.%22+gong"><span style="font-size: large;">Creating Generic Text Summaries</span></a></i></blockquote>
</div>
The application is to add summaries to the <a href="http://www.lucidimagination.com/search" target="_blank">LucidFind</a> application, a search engine devoted to open-source text processing projects.<br />
<h3>
Core concepts</h3>
<h4>
Document Corpus</h4>
A corpus is just a collection of documents. There are many standardized text document corpora out there used as data in research, the way standardized strains of white mice are used in laboratory research. In this case, we use the sentences in one document as the corpus for that document. The matrix has sentences for rows and terms for columns.<br />
<h4>
Term Vector</h4>
A <b>term vector</b> is a "bag of words" with numbers attached. We are going to take all of the words in the document corpus and create a matrix where documents are on one side and words are on the other, and add a value for each word where it occurs in a document. The standard ways are to add a count, or just set it to one. This is a matrix where the rows are term vectors for the documents. This gives a numerical representation of the document corpus which has all grammatical information ripped out. This <i>Bayesian</i> representation is surprisingly effective, as we will see.<br />
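As a concrete sketch of this matrix (a toy three-sentence corpus of my own invention, using the set-it-to-one scheme):

```python
import numpy as np

# toy corpus: each "document" is one sentence
sentences = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for s in sentences for w in s.split()})

# binary bag-of-words: 1 if the term occurs in the sentence, else 0
matrix = np.zeros((len(sentences), len(vocab)))
for i, s in enumerate(sentences):
    for w in s.split():
        matrix[i, vocab.index(w)] = 1
```

Each row is the term vector of one sentence; all word order and grammar is gone.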
<h4>
Singular Value Decomposition</h4>
<b>SVD</b> is a linear algebra operation. Given a matrix, it creates three matrices based on the original matrix. The one we care about is the "left feature matrix", commonly called U in the literature. The rows are "document feature vectors", which represent a kind of importance for each document. A <i>feature</i>, in machine learning, refers to signal in noisy data. Applied to a matrix of sentences vs. terms, the document with the largest value in the left-hand column is the most "important" in the corpus. The remaining values in the row also contribute to importance.<br />
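A hedged sketch of ranking sentences with the left feature matrix U, using a tiny hand-made sentence/term matrix (my own construction, not the production code):

```python
import numpy as np

# rows are sentences, columns are terms (binary bag-of-words)
matrix = np.array([[1., 0., 0., 1., 1.],
                   [0., 1., 0., 1., 1.],
                   [1., 0., 1., 0., 1.]])

# U's rows are the "document feature vectors" described above
u, s, vt = np.linalg.svd(matrix, full_matrices=False)

# score each sentence by its weight on the strongest feature (first column of U)
importance = np.abs(u[:, 0])
ranked = np.argsort(-importance)  # most "important" sentence index first
```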
<h4>
Tying it together</h4>
LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-29433319216871967062012-09-06T04:42:00.001-07:002012-09-06T04:42:14.844-07:00Document Summarization with LSA Appendix: Tuning<h2>
Two problems</h2>
<div>
I found a couple of problems with this approach: very long sentences and running time & space.</div>
<h3>
Very Long Sentences</h3>
<div>
Other test data I looked at were technical papers. These often have very long sentences. The problem is not controlling for sentence length, but that there are too many theme words for the sentence to express one theme well. One could try breaking these up into sentence shards: a 50-word sentence would become three 15-word shards. If we use parts-of-speech trimming, we can break the shards within runs of filler words.</div>
<h3>
Optimizations</h3>
It is possible to create an optimized version of LSA that only ranks sentences or terms via Random Indexing. RI deliberately throws away the identity of one side of the sentence/term matrix. It can rank sentences or terms, but not both. This algorithm runs much faster than the full sentence/term version, and is the next step in creating an interactive document summarizer or a high-quality tag cloud generator.<br />
<br />
<div class="p1">
<h4>
Random Projection</h4>
Random Projection is a wonderfully non-intuitive algorithm. If you multiply a matrix with another matrix filled with random numbers, the output matrix will have the same cosine distance ratios across all pairs of vectors. In a document-term matrix, all of the documents will still have the same mutual 'relevance' (measured by cosine distance which magically matches tf-idf relevance ranking).<br />
<br />
If the random matrix is the (transposed) size of the original matrix, the row and column vectors will have the above distance property. If the random matrix has more or fewer dimensions on one side, the resulting matrix will retain the identity of vectors from the constant side, but will lose the identity of the vectors on the varying side. If you have a sentence-term matrix of 50 sentences x 1000 terms, and you multiply it by a random matrix of 1000 x 100, you get a matrix of 50 sentences x 100 "sketch" vectors, where every sketch is a low-quality summarization of the term vectors. The 50 sentences will still have the same cosine ratios; we have merely thrown away the identity of the term vectors. Since the running time of SVD is very non-linear in matrix size, we now have a much smaller matrix that will give us the orthogonal decomposition of the sentences much faster (but the terms are forgotten). We can invert this and calculate the SVD for terms without sentences.<br />
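The cosine-preservation claim is easy to check numerically. This is my own sketch, projecting 50 sentences x 1000 terms down to 100 sketch dimensions and comparing one pairwise cosine before and after:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((50, 1000))             # 50 sentences x 1000 terms
proj = rng.standard_normal((1000, 100))   # random projection matrix
sketch = docs @ proj                      # 50 sentences x 100 "sketch" vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# pairwise cosines survive the projection, up to small distortion
before = cosine(docs[0], docs[1])
after = cosine(sketch[0], sketch[1])
```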
<h4>
Random Indexing</h4>
The above description posits creating the entire sentence/term matrix, then doing the complete multiplication. In fact, you can create the resulting sentence/sketch matrix directly, one term at a time. This considerably cuts memory usage and running time. I include this explanation because there are few online sources describing Random Indexing, and the following references forget to explain RI's roots in Random Projection.<br />
<a href="http://www.sics.se/~mange/random_indexing.html">http://www.sics.se/~mange/random_indexing.html</a><br />
<a href="http://www.sics.se/~mange/papers/RI_intro.pdf">http://www.sics.se/~mange/papers/RI_intro.pdf</a><br />
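A minimal illustration of the one-term-at-a-time construction (entirely my own sketch; the dimension and sparsity choices are arbitrary): give every term a fixed sparse random "index vector" and add it into a sentence's sketch each time the term occurs, never materializing the full sentence/term matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 64  # sketch dimensionality

index_vectors = {}  # one fixed random vector per term, created on demand

def index_vector(term):
    if term not in index_vectors:
        # sparse ternary vector: mostly 0, a few +1/-1 entries
        v = np.zeros(K)
        slots = rng.choice(K, size=8, replace=False)
        v[slots] = rng.choice([-1.0, 1.0], size=8)
        index_vectors[term] = v
    return index_vectors[term]

def sketch(sentence):
    # accumulate term index vectors one term at a time
    acc = np.zeros(K)
    for term in sentence.split():
        acc += index_vector(term)
    return acc

s = sketch("the cat sat on the mat")
```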
<h4>
Fast Random Projection</h4>
Also see Achlioptas et al. for a wicked speed-up of random projection that is very suitable for densely populated data.<br />
<a href="http://users.soe.ucsc.edu/~optas/papers/jl.pdf" target="_blank">http://users.soe.ucsc.edu/~optas/papers/jl.pdf</a></div>
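A sketch of the database-friendly projection from the linked paper, as I understand it: entries are sqrt(3) times +1, 0, or -1 with probabilities 1/6, 2/3, 1/6, so two-thirds of the matrix is zero and multiplication needs no floating-point random numbers. My own transcription, not code from the paper:

```python
import numpy as np

def achlioptas_matrix(d, k, seed=0):
    # entries are sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6
    rng = np.random.default_rng(seed)
    signs = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0) * signs

R = achlioptas_matrix(1000, 100)
projected = np.random.default_rng(1).random((50, 1000)) @ R
```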
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-1346528202862711652012-09-06T04:41:00.002-07:002012-09-06T04:52:08.508-07:00Document Summarization with LSA #5: Software<h3>
The Test Harness</h3>
The scripts for running these tests and the data are in my github repo: <a href="https://github.com/LanceNorskog/lsa">https://github.com/LanceNorskog/lsa</a>.<br />
<h4>
LSA toolkit and Solr DocumentSummarizer class</h4>
The LSA toolkit is available under <a href="https://github.com/LanceNorskog/lsa/tree/master/research">https://github.com/LanceNorskog/lsa/tree/master/research</a>. The Solr code using the LSA library is not yet published. It is a hacked-up terror intertwined with my port of OpenNLP to Solr. I plan to create a generic version of the Solr Summarizer that directly uses a Solr text type rather than its own custom implementation of OpenNLP POS filtering. The OpenNLP port for Lucene/Solr is available as <a href="https://issues.apache.org/jira/browse/LUCENE-2899" target="_blank">LUCENE-2899</a>.<br />
<br />
The Summarizer optionally uses OpenNLP to do sentence parsing and parts-of-speech analysis. It uses the OpenNLP parts-of-speech tool to filter for nouns and verbs, dropping all other words. Previous experiments used both raw sentences and sentences limited to nouns &amp; verbs, and the POS-filtered sentences worked 10-15% better in every algorithm combination. This set of benchmarks did not bother to try the full sentences.<br />
<h4>
Reuters Corpus</h4>
The Reuters data and scripts for this analysis project are under <a href="https://github.com/LanceNorskog/lsa/tree/master/reuters" target="_blank">https://github.com/LanceNorskog/lsa/tree/master/reuters</a>. <a href="https://github.com/LanceNorskog/lsa/tree/master/reuters/data/raw">..../data/raw</a> contains the preprocessed Reuters article corpus: the articles are reformatted into one sentence per line and are limited to 10+ sentences. The toolkit includes a script to run against the Solr Document Summarizer and save the XML output for each article, and a script to apply XPath expressions to extract a CSV line for each article, collected into one CSV file per algorithm. The per-algorithm keys include both the regularization algorithms and whether parts-of-speech filtering was applied.<br />
<h4>
Analysis</h4>
The analysis phase used KNime to preprocess the CSV data. KNime rules created additional columns calculated from the generated ones, and then a pivot table which summarized the data per algorithm. This data was saved into a new CSV file. KNime's charting facilities are very limited, so I used an Excel script to generate the charts. Excel 2010 failed on my Mac, so I had to make the charts in LibreOffice instead, and then copy them into a DOC file in MS Word (and not LibreOffice!) to get plain jpegs out of the charts.<br />
<h3>
Further Reading</h3>
<div>
The KNime data analysis toolkit is my favorite tool for exploring numbers. It is a visual programming UI (based on the Eclipse platform) which allows you to hook up statistics jobs, file I/O, Java code scriptlets and interactive graphs. Highly recommended for amateurs and the occasional user who cannot remember all of R.</div>
<div>
<a href="http://knime.org/">KNime.org</a></div>
<div>
<br /></div>
<div>
The R platform is a massive library for scientific computing. I did the SVD example charts on the first page with R. R is big; I suggest using RStudio and the 'R Project Template' software.</div>
<div>
<a href="http://www.r-project.org/">http://www.r-project.org/</a><br />
<a href="http://rstudio.org/" target="_blank">http://rstudio.org/</a><br />
<a href="http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/">http://www.johnmyleswhite.com/notebook/2010/08/26/projecttemplate/</a></div>
<div>
</div>
<h3>
</h3>
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa.html">further tuning</a></h3>
<br />Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-30707565216324444602012-09-06T04:41:00.000-07:002012-09-15T19:42:54.467-07:00Document Summarization with LSA #4: Individual measurements<h3>
The Measurements</h3>
<div class="p1">
In this post we review the individual measures. These charts show each measure applied to all the algorithms, and also the basic statistics summary of the full dataset (not the per-algorithm aggregates). The measures are MRR, Rating, Sentence Length, and Non-zero. MRR is a common method for evaluating search results and other kinds of recommendations. The other three are fabricated for this analysis.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<h4>
Mean Reciprocal Rank</h4>
<div>
MRR is a common measure for search results. It attempts to model the unforgiving way in which users react to mistakes in the order of search results. It measures the position of a preferred search result in a result list. If the "right" answer is third in the list, the MRR is 1/3. </div>
<div>
<br /></div>
<div>
This statistic is the mean of three MRRs, one for each result based on how far it is from where it should be. If the second sentence is #2, that is a 1. If it is #1 or #3, that is 1/2. If the third is #3, that is a 1. If it is #2, that is 1/2 and if it is #1, that is 1/3. The measures go down to the 5th sentence.</div>
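As I read that description, each tracked result scores the reciprocal of one plus its displacement from its ideal slot. A sketch of my formulation (not the actual harness code):

```python
def mrr_variant(recommended, expected):
    """Mean of per-result reciprocal ranks.

    recommended: sentence ids in the order the algorithm ranked them
    expected: sentence ids in their ideal order
    A result in its ideal slot scores 1; each slot of displacement adds
    1 to the denominator (one slot off scores 1/2, two slots off 1/3).
    """
    scores = []
    for ideal_pos, sent in enumerate(expected):
        if sent in recommended:
            distance = abs(recommended.index(sent) - ideal_pos)
            scores.append(1.0 / (1 + distance))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```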
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-ki58ZJ1AZiQ/UECKC5lTjZI/AAAAAAAAAMo/4S85XG7NzZc/s1600/MRR.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="193" src="http://2.bp.blogspot.com/-ki58ZJ1AZiQ/UECKC5lTjZI/AAAAAAAAAMo/4S85XG7NzZc/s400/MRR.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h4 style="text-align: start;">
Rating (0-5)</h4>
<div style="text-align: start;">
Rating is a heavily mutated form of "Precision@3". It tries to model how the user reacts in a UI that shows snippets of the top three recommended sentences. 0 means no interest. 1 means at least one sentence placed (the Non-zero measure). 2-5 measure how well the first three recommendations match the first and second spots. In detail:</div>
<div style="text-align: start;">
5: first and second results are lede and subordinate<br />
4: first result is lede<br />
3: second result is lede<br />
2: first or second are within first three sentences<br />
1: third result is within first three sentences<br />
0: anything else</div>
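Those rules can be transcribed directly. This sketch is my own reading of them (0-based sentence indices, where sentence 0 is the lede and sentence 1 its subordinate), not the harness implementation:

```python
def rating(results):
    """results: ranked sentence indices (0-based) recommended by the algorithm."""
    first, second, third = results[0], results[1], results[2]
    if first == 0 and second == 1:
        return 5   # first and second results are lede and subordinate
    if first == 0:
        return 4   # first result is lede
    if second == 0:
        return 3   # second result is lede
    if first in (0, 1, 2) or second in (0, 1, 2):
        return 2   # first or second lands within the first three sentences
    if third in (0, 1, 2):
        return 1   # third result lands within the first three sentences
    return 0
```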
<div style="text-align: start;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-754Q6FZynZ8/UECKDtaUmOI/AAAAAAAAAM0/aeOn7g36H88/s1600/Rating.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="193" src="http://1.bp.blogspot.com/-754Q6FZynZ8/UECKDtaUmOI/AAAAAAAAAM0/aeOn7g36H88/s400/Rating.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br class="Apple-interchange-newline" />
MRR and Rating (green and yellow) correlate very consistently in the graphic on Post #3. Rating tracks with MRR, but is more extreme. Note the wider standard deviation. This indicates that the Rating formula is a good statistic for modeling unforgiving users.</div>
<h4 style="text-align: start;">
Sentence Length</h4>
<div style="text-align: start;">
This measures the mean sentence length of the top two recommended sentences. The sentence length is the number of nouns and verbs in the sentence, not the number of words. This indicates how well the algorithm compensates for the dominance of longer sentences.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-0UZorVQtxm0/UECKDvvlz9I/AAAAAAAAAM8/RxRosIB9Mzw/s1600/Sentence_Length.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="196" src="http://4.bp.blogspot.com/-0UZorVQtxm0/UECKDvvlz9I/AAAAAAAAAM8/RxRosIB9Mzw/s400/Sentence_Length.jpg" width="400" /></a></div>
<h4>
Non-Zero</h4>
<div>
Every result that recommended the first, second or third sentence for one of the three top spots, by percentage.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-Lnt5zXy8BBU/UECKC8FBsNI/AAAAAAAAAMk/QKFwAVS0glA/s1600/Nonzero_percentage.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="193" src="http://1.bp.blogspot.com/-Lnt5zXy8BBU/UECKC8FBsNI/AAAAAAAAAMk/QKFwAVS0glA/s400/Nonzero_percentage.jpg" width="400" /></a></div>
The mean article length in the corpus is (I think) 22 sentences. A mean non-zero rate of 60% for placing 3 recommendations out of 22 sentences is much better than random recommendations.<br />
<h3>
Precision vs. Recall</h3>
<h4>
<span style="font-weight: normal;">In Information Retrieval jargon, precision is the accuracy of a ranking algorithm, and recall is the ability to find results. The three success measurements are precision measures. The "dartboard" measure is a recall measure.</span></h4>
<h4>
<span style="font-weight: normal;">From reading the actual sentences and recommendations, binary+normal and augNorm+normal had pretty good precision. These two also achieved the best recall at around 65%. This level would not be useful in a document summarization UI. I would pair this with a looser strategy to find related sentences by searching with the theme words. </span></h4>
<h3>
Previous Example</h3>
<div>
In the example in <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-2-test.html">Post #2</a>, the three top-rated sentences were 4, 3, and 6. Since only one of three made the cut, the rating algorithm gave this a three. Note that the example was not processed with Parts-of-Speech removal and used the binary algorithm, and still hit the dartboard. This article is the first in the dataset, and was chosen effectively at random.</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: start;">
<h3>
Further reading</h3>
</div>
<div style="text-align: start;">
<a href="http://en.wikipedia.org/wiki/Mean_reciprocal_rank">http://en.wikipedia.org/wiki/Mean_reciprocal_rank</a></div>
<br />
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-5.html">software details</a></h3>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-28179573638616594942012-09-06T04:34:00.001-07:002012-09-06T05:03:55.117-07:00Document Summarization with LSA #3: Preliminary results<h3>
Overall Analysis</h3>
<h4>
Measures are tuned for Interactive UI</h4>
This implementation is targeted for interactive use in search engines. A search UI usually has the first few results shown in ranked order, with the option to go to the next few results. This UI is intended to show the first three ranked sentences at the top of an entry with the theme words highlighted. Users are not forgiving of mistakes in these situations. The first result is much more important than the second, and so forth. People rarely click through to the second page.<br />
<br />
The measures of effectiveness are formulated with this in mind. We used three:<br />
<ol>
<li>A variant of Mean Reciprocal Rank (MRR).</li>
<li>"Rating" is a measure we created to model the user's behavior in a summarization UI. Our MRR variant and Rating are defined in the next post. </li>
<li>Non-zero counts whether the algorithm placed any recommendations in the top three. <i>"Did we even hit the dartboard?"</i></li>
</ol>
A separate problem is that sentences with more words can dominate the rankings. Sentence Length measures the length of the highest-rated sentences; in this chart, "Inverse Length" measures how well the algorithm countered the effects of sentence length.<br />
<div class="p1">
<h4>
Overall Comparison Chart</h4>
</div>
<div class="p1">
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-dcHI2mVXxlQ/UDNGUmhdVuI/AAAAAAAAAKE/9D6HR9nqtqc/s1600/Overall.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-dcHI2mVXxlQ/UDNGUmhdVuI/AAAAAAAAAKE/9D6HR9nqtqc/s1600/Overall.jpg" /></a></div>
<br />
<ul>
<li>Key to algorithm names: "binary_normal" means that "binary" was used to create each cell, while "normal" multiplied each term vector with the mean normalized term vector. If there is no second key, the global value was 1. See post #1 for the full list of algorithms.</li>
</ul>
This display is a normalized version of the mean results for all 24 algorithm pairs, with four different measures. In all four, higher is better. "Inverse Length" means "how well it suppresses the length effect", "Rating" is the rating algorithm described above, "MRR" is our variant implementation of Mean Reciprocal Rank, and >0 is the number of results where any of the first three were in the first three sentences. None of these are absolutes, and the scales do not translate between measures. They simply show a relative ranking for each algorithm pair in the four measures: compare green to green, etc. The next post gives the detailed measurements in real units.<br />
<h4>
Grand Revelation</h4>
The grand revelation is: <i>always normalize the term vector!</i> All 5 local algorithms worked best with normal as the global algorithm. The binary function truncates the term counts to 1. Binary plus normalizing the term vector was by far the best in all three success measurements, and was middling in counteracting sentence length. AugNorm + normal was the highest achiever which compensates well for sentence length. TF + normal was the best overall for sentence length, but was only average for the three effectiveness measures.<br />
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-4.html">detailed analysis</a></h3>
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-85409114602031335262012-09-06T04:33:00.003-07:002012-09-15T19:43:01.863-07:00Document Summarization with LSA #2: Test with newspaper articles<h3>
The Experiment</h3>
This analysis evaluates many variants of the LSA algorithm against some measurements appropriate for an imaginary document summarization UI. This UI displays the most important two sentences with the important theme words highlighted. The measurements try to match the expectations of the user.<br />
<h4>
Supervised vs. Unsupervised Learning</h4>
<div class="p1">
Machine Learning algorithms are classified as <i>supervised</i>, <i>unsupervised</i>, and <i>semi-supervised</i>.<br />
A supervised algorithm creates a <i>model</i> (usually statistical) from <i>training data</i>, then applies <i>test data</i> against the model. An unsupervised algorithm is applied directly against test data without a pre-created model. A semi-supervised algorithm uses tagged and untagged data; this works surprisingly well in a lot of contexts.<br />
<h3>
The Data</h3>
We are going to use unsupervised learning. The test data is a corpus of newspaper articles called Reuters-21578, which was collected and published by the Reuters news agency to assist text analysis research. This dataset is not really tagged, but it is appropriate for this experiment. Newspaper articles are written in a particular style which is essentially pre-summarized. In a well-written newspaper article, the first sentence (the <i>lede</i>) is the most important sentence, and the second sentence is complementary and usually shares few words with the first. The rest of the article is structured in order from abstraction to detail, a form called the Inverted Pyramid. Newspaper articles are also the right length to summarize effectively. </div>
<div class="p2">
<h4>
Example Data</h4>
</div>
<div class="p1">
We limited the tests to articles which were from 15 to 75 sentences long. The entire corpus is 21 thousand articles; this limit narrows the test space to just under 2,000. Here is a sample newspaper article:<br />
<blockquote class="tr_bq">
<blockquote class="tr_bq">
<span style="font-size: x-small;"><b>26-FEB-1987 15:26:26.78</b></span></blockquote>
<blockquote class="tr_bq">
<span style="font-size: x-small;"><b>DEAN FOODS <DF> SEES STRONG 4TH QTR EARNINGS</b></span></blockquote>
<blockquote class="tr_bq">
<b><span style="font-size: x-small;">Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. </span><span style="font-size: x-small;">In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share. </span><span style="font-size: x-small;">Douglas also said the year's sales should exceed 1.4 billion dlrs, up from 1.27 billion dlrs the prior year. </span><span style="font-size: x-small;">He repeated an earlier projection that third-quarter earnings "will probably be off slightly" from last year's 40 cts a share, falling in the range of 34 cts to 36 cts a share. </span><span style="font-size: x-small;">Douglas said it was too early to project whether the anticipated fourth quarter performance would be "enough for us to exceed the prior year's overall earnings" of 1.53 dlrs a share. </span><span style="font-size: x-small;">In 1988, Douglas said Dean should experience "a 20 pct improvement in our bottom line from effects of the tax reform act alone."</span></b></blockquote>
<blockquote class="tr_bq">
<b><span style="font-size: x-small;">President Howard Dean said in fiscal 1988 the company will derive benefits of various dairy and frozen vegetable acquisitions from Ryan Milk to the Larsen Co. </span><span style="font-size: x-small;">Dean also said the company will benefit from its acquisition in late December of Elgin Blenders Inc, West Chicago. </span><span style="font-size: x-small;">He said the company is a major shareholder of E.B.I. Foods Ltd, a United Kingdom blender, and has licensing arrangements in Australia, Canada, Brazil and Japan. </span><span style="font-size: x-small;">"It provides an entry to McDonalds Corp <MCD> we've been after for years," Douglas told analysts. Reuter &#3;</span></b></blockquote>
</blockquote>
</div>
<div class="p1">
As you can see, the text matches the concept of the inverted pyramid. The first two sentences are complementary, have no repeated words, and have few words in common. Repeated concepts are described with different words: "Dean Foods Co" in the first sentence is echoed as "the food processor" in the second, while "expects earnings" is matched by "reported earnings". This style seems real-world enough to be good test data for this algorithm suite. We did not comb the data for poorly structured articles or garbled text.<br />
<h3>
The Code</h3>
<div>
There are two bits of code involved: Singular Value Decomposition (explained previously) and Matrix "Regularization", or "Conditioning". The latter refers to applying non-linear algorithms to the document-term matrix which make the data somewhat more amenable to analysis. Several algorithms have been investigated for this purpose. </div>
<div>
<br /></div>
<div>
The raw term counts supplied by a document/term matrix are not always the best representation of the data. Matrix Regularization algorithms are functions which use the entire dataset to adjust each cell in the matrix. The contents of a document/term matrix after regularization are referred to as "term weights".<br />
<br />
There are two classes of algorithm for creating term weights. <i>Local</i> algorithms alter each cell, while <i>global</i> algorithms create a global vector of values per term that are applied to all documents with that term, and likewise a global vector of values per document which is applied to all terms in that document. Local algorithms include <i>term frequency</i> (tf) which uses the raw matrix, <i>binary</i> which replaces each term count greater than one with a one, and some others which find a unique value for a cell based on the document and term vectors which cross at that cell.<br />
<br />
Global algorithms for term vectors include normalizing the term vector and finding the inverse document frequency (idf) of the term. The LSA implementation includes implementations of these local and global algorithms, and any pair can be used together. Thus, tf-idf is achieved by combining the local tf algorithm and the global idf algorithm. The literature recommends a few of these combinations (tf-idf and log-entropy) as the most effective, but this test found a few other combinations which were superior. For document vectors, cosine normalization and a new "pivoted cosine" normalization are recommended for counteracting the dominance of longer sentences. These are not yet implemented. Existing combinations of local and term vector algorithms do a good job of suppressing sentence length problems. We will see later on that:<br />
<ul>
<li>one of the term vector algorithms is by far the best at everything, </li>
<li>one of the local algorithms is the overall winner, and </li>
<li>one of the others does a fantastic job at sentence length problems but is an otherwise average performer.</li>
</ul>
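As a concrete illustration, here is a minimal numpy sketch of how a local algorithm and a global term-vector algorithm combine into term weights. This is not the actual Solr/Lucene code, and the matrix values are made up; it shows the pairing the experiment found strongest (binary + term-vector normalization):

```python
import numpy as np

# Toy document/term count matrix: rows are sentences, columns are terms.
counts = np.array([
    [2, 1, 0, 0],
    [0, 3, 1, 0],
    [1, 0, 0, 4],
], dtype=float)

def binary(m):
    # Local "binary" weighting: truncate every count greater than one to one.
    return (m > 0).astype(float)

def tf(m):
    # Local "term frequency" weighting: the raw counts, unchanged.
    return m.copy()

def normalize_terms(m):
    # Global weighting: scale each term (column) vector to unit length.
    norms = np.linalg.norm(m, axis=0)
    return m / np.where(norms == 0, 1.0, norms)

def idf(m):
    # Global weighting: inverse document frequency per term.
    df = (m > 0).sum(axis=0)
    return m * np.log(m.shape[0] / df)

# The winning pair from these tests: binary local + normalized term vectors.
weights = normalize_terms(binary(counts))
# tf-idf, for comparison, is tf local + idf global:
tfidf = idf(tf(counts))
```

Any local function can be composed with any global function this way, which is how the test harness enumerated all 24 pairs.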
<h3>
The Result</h3>
The above article gave the following result for the 'binary' algorithm:</div>
<div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><lst name="analysis"><br /> <lst name="summary"><br /> <lst name="stats"><br /> <lst name="sentences"><br /> <int name="count">12</int><br /> </lst><br /> <lst name="terms"><br /> <int name="count">150</int><br /> </lst><br /> <lst name="stems"><br /> <int name="count">150</int><br /> </lst><br /> </lst><br /> <lst name="terms"><br /> <lst name="term"><br /> <str name="text">the</str><br /> <double name="strength">1.0</double><br /> </lst><br /> <lst name="term"><br /> <str name="text">of</str><br /> <double name="strength">0.942809041582063</double><br /> </lst><br /> <lst name="term"><br /> <str name="text">said</str><br /> <double name="strength">0.8164965809277259</double><br /> </lst><br /> <lst name="term"><br /> <str name="text">from</str><br /> <double name="strength">0.7453559924999301</double><br /> </lst><br /> <lst name="term"><br /> <str name="text">Douglas</str><br /> <double name="strength">0.7453559924999298</double><br /> </lst><br /> </lst><br /> <lst name="sentences"><br /> <lst name="sentence"><br /> <str name="text">Douglas said it was too early to project whether the anticipated fourth quarter performance would be "enough for us to exceed the prior year's overall earnings" of 1.53 dlrs a share.</str><br /> <int name="index">4</int><br /> <int name="start">533</int><br /> <int name="end">715</int><br /> <double name="strength">1.0</double><br /> <int name="terms">29</int><br /> </lst><br /> <lst name="sentence"><br /> <str name="text">He repeated an earlier projection that third-quarter earnings "will probably be off slightly" from last year's 40 cts a share, falling in the range of 34 cts to 36 cts a share.</str><br /> <int name="index">3</int><br /> <int name="start">355</int><br /> <int name="end">531</int><br /> <double name="strength">0.999999999999999</double><br /> <int name="terms">29</int><br /> </lst><br /> <lst name="sentence"><br /> <str 
name="text">President Howard Dean said in fiscal 1988 the company will derive benefits of various dairy and frozen vegetable acquisitions from Ryan Milk to the Larsen Co.</str><br /> <int name="index">6</int><br /> <int name="start">847</int><br /> <int name="end">1006</int><br /> <double name="strength">0.9284766908852594</double><br /> <int name="terms">25</int><br /> </lst><br /> </lst><br /> <lst name="highlighted"><br /> <lst name="sentence"><br /> <str name="text">&lt;em&gt;Douglas&lt;/em&gt; &lt;em&gt;said&lt;/em&gt; it was too early to project whether &lt;em&gt;the&lt;/em&gt; anticipated fourth quarter performance would be "enough for us to exceed &lt;em&gt;the&lt;/em&gt; prior year's overall earnings" &lt;em&gt;of&lt;/em&gt; 1.53 dlrs a share.</str><br /> <int name="index">4</int><br /> </lst><br /> <lst name="sentence"><br /> <str name="text">He repeated an earlier projection that third-quarter earnings "will probably be off slightly" &lt;em&gt;from&lt;/em&gt; last year's 40 cts a share, falling in &lt;em&gt;the&lt;/em&gt; range &lt;em&gt;of&lt;/em&gt; 34 cts to 36 cts a share.</str><br /> <int name="index">3</int><br /> </lst><br /> <lst name="sentence"><br /> <str name="text">President Howard Dean &lt;em&gt;said&lt;/em&gt; in fiscal 1988 &lt;em&gt;the&lt;/em&gt; company will derive benefits &lt;em&gt;of&lt;/em&gt; various dairy and frozen vegetable acquisitions &lt;em&gt;from&lt;/em&gt; Ryan Milk to &lt;em&gt;the&lt;/em&gt; Larsen Co.</str><br /> <int name="index">6</int><br /> </lst><br /> </lst><br /> </lst><br /></lst></span></blockquote>
</div>
<div>
Because the analyzer does not remove stop words, extraneous words such as "the" and "of" rank as the most common theme words. But notice that "Douglas said" is a common phrase, and these two words are in the top 5. Theme words are not merely common words; they are words which are used together. Theme sentences are effectively those with the most and strongest theme words.<br />
<br /></div>
<div>
The Summarizer tool can return the most important sentences, and can highlight the most important theme words. This can be used to show highlighted snippets in a UI.</div>
<div>
<h3>
Further Reading</h3>
</div>
<a href="http://en.wikipedia.org/wiki/Machine_learning" target="_blank">Machine Learning</a><br />
<a href="http://en.wikipedia.org/wiki/Supervised_learning" target="_blank">Supervised Learning</a><br />
<a href="http://en.wikipedia.org/wiki/Unsupervised_learning" target="_blank">Unsupervised Learning</a><br />
<a href="http://en.wikipedia.org/wiki/Semi-supervised_learning" target="_blank">Semi-Supervised Learning</a><br />
<a href="http://en.wikipedia.org/wiki/Inverted_pyramid" target="_blank">Inverted Pyramid</a><br />
<a href="http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection" target="_blank">Reuters 21578 text corpus</a><br />
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-3.html" target="">the overall results</a></h3>
<div>
<br /></div>
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-30353261887081578462012-09-06T04:33:00.002-07:002012-09-15T19:37:21.064-07:00Document Summarization with LSA #1: Introduction<h3>
Document Summarization with LSA</h3>
<div class="p1">
<div class="p1">
</div>
This is a several-part series on document summarization using Latent Semantic Analysis (LSA). I wrote a document summarizer and did an exhaustive measurement pass using it to summarize newspaper articles from the first Reuters corpus. The code is structured as a web service in Solr, using Lucene for text analysis and the OpenNLP package for tuning the algorithm with Parts-of-Speech analysis.<br />
<h4>
Introduction</h4>
<div class="p1">
</div>
Document summarization is about finding the "themes" in a document: the important words and sentences that contain the core concepts. There are many algorithms for document summarization. This one uses Latent Semantic Analysis, which applies linear algebra to analyze how words and sentences are used in common. LSA is based on the "bag of words" concept of "term vectors": a list of all words and how often each is used in each document. LSA uses Singular Value Decomposition (SVD) to tease out which words are used the most with other words, and which sentences use the most theme words together. Document summarization with LSA uses SVD to give us main and secondary sentences which have the strongest collections of theme words and yet a minimal number of theme words in common.</div>
<div class="p1">
<br />
Every document has words which express the themes of the document. These words are frequent, but they are also used together in sentences. We want to find the most important sentences and words; the main sentences should be shown as the summary, and the most important words are good search words for this document.<br />
<br /></div>
<div class="p2">
<h4>
Orthogonal Sentences</h4>
</div>
<div class="p2">
A key idea is that the most important and second most important sentences in a document are independent: they tend to share few words. The most important sentence expresses the main theme of the document, and the second most important uses other theme words to elaborate on it. When we express the sentences and words in a bag-of-words matrix, SVD can analyze the sentences and words in relation to each other, by how the terms are used together in sentences. It creates a sorted list of documents which are as orthogonal as possible, meaning that their collective theme words are as different as possible.<br />
<br />
Note that since this technique treats documents and terms symmetrically, it also creates a sorted list of terms by how important they are; this makes for a good tag cloud.<br />
<br />
<h3>
Example</h3>
<div>
Here is an example of the first two sentences from a financial newswire article.</div>
<blockquote class="tr_bq">
<b><span style="font-size: x-small;">Dean Foods Co expects earnings for the fourth quarter ending May 30 to exceed those of the same year-ago period, Chairman Kenneth Douglas told analysts. </span><span style="font-size: x-small;">In the fiscal 1986 fourth quarter the food processor reported earnings of 40 cts a share.</span></b></blockquote>
<div>
The first sentence is long, and expresses the theme of the article. The second sentence elaborates on the first, and shares no substantial words with it except <i>earnings.</i> Note that in order to avoid repeating the company name, and to give more information, the second sentence refers to Dean Foods as <i>the food processor</i>. </div>
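To make the orthogonality idea concrete, here is a small hypothetical sketch. The vocabulary and vectors are simplified stand-ins for the two sentences above (not real analyzer output), overlapping only on "earnings"; the low cosine similarity is what "near-orthogonal" means here:

```python
import numpy as np

# Hypothetical bag-of-words vectors for the lede and the second sentence
# over a tiny made-up vocabulary; only "earnings" appears in both.
vocab  = ["dean", "expects", "earnings", "analysts",
          "processor", "reported", "cts", "share"]
lede   = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
second = np.array([0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Cosine similarity: near zero means the sentences are nearly orthogonal,
# i.e. they elaborate on each other with mostly different theme words.
cos = lede @ second / (np.linalg.norm(lede) * np.linalg.norm(second))
```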
<div>
<br /></div>
<h4>
Further Reading</h4>
</div>
<div class="p1">
This is an important paper on using SVD to summarize documents. It appears to be the original proposal for this technique:</div>
<div class="p1">
</div>
<div class="p1">
<i>Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis</i></div>
<div class="p1">
<i>Gong</i> <i>and</i> <i>Liu, 2002</i></div>
<div class="p1">
<a href="http://www.cs.bham.ac.uk/~pxt/IDA/text_summary.pdf">http://www.cs.bham.ac.uk/~pxt/IDA/text_summary.pdf</a><br />
<br />
Here are two good tutorials on Singular Value Decomposition and LSA:<br />
<a href="http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html">Singular Value Decomposition (SVD) Tutorial</a><br />
<a href="http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html">Latent Semantic Analysis (LSA) Tutorial</a><br />
<h3>
Next post: <a href="http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-2-test.html" target="_blank">the experiment explained</a></h3>
</div>
<br />
<div class="p1">
<br /></div>
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-29632202985908662472012-08-21T01:08:00.001-07:002012-08-21T01:08:05.164-07:00SentenceLength<div style="margin: 0 0 10px 0; padding: 0; font-size: 0.8em; line-height: 1.6em;"><a href="http://www.flickr.com/photos/54866255@N00/7829695468/" title="SentenceLength"><img src="http://farm9.staticflickr.com/8287/7829695468_43e751cef8.jpg" alt="SentenceLength by Norskhaus" /></a><br/><span style="margin: 0;"><a href="http://www.flickr.com/photos/54866255@N00/7829695468/">SentenceLength</a>, a photo by <a href="http://www.flickr.com/photos/54866255@N00/">Norskhaus</a> on Flickr.</span></div><p></p>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-31105373476876077462012-02-20T00:19:00.000-08:002018-09-23T16:17:50.305-07:00Summarizing Documents with Machine Learning: Part 1<h2>
Summarizing Documents with Machine Learning: Part 1</h2>
<div>
This is an investigation into analyzing documents with Latent Semantic Analysis, or LSA. In particular, we will find the key sentences and tag clouds in a technical paper with Singular Value Decomposition, inspired by this paper: </div>
<div>
<blockquote class="tr_bq">
<in><a href="http://scholar.google.com/scholar?hl=en&q=%22Creating+Generic+Text+Summaries.%22+gong"><span style="font-size: large;">Creating Generic Text Summaries</span></a></in></blockquote>
</div>
The application is to add summaries to the <a href="http://www.lucidimagination.com/search" target="_blank">LucidFind</a> application, a search engine devoted to open-source text processing projects.<br />
<h3>
Core concepts</h3>
<h4>
Document Corpus</h4>
A corpus is just a collection of documents. There are many standardized text corpuses used as data in research, much as standardized strains of white mice are used in laboratory research. In this case, we use the sentences in one document as the corpus for that document. The matrix has sentences for rows and terms for columns.<br />
<h4>
Term Vector</h4>
A <b>term vector</b> is a "bag of words" with numbers attached. We are going to take all of the words in the document corpus and create a matrix where documents are on one side and words are on the other, and add a value for each word where it occurs in a document. The standard ways are to add a count, or just set it to one. This is a matrix where the rows are term vectors for the documents. This gives a numerical representation of the document corpus which has all grammatical information ripped out. This <i>Bayesian</i> representation is surprisingly effective, as we will see.<br />
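For illustration, a toy version of this matrix can be built in a few lines of Python. This is a sketch, not the Lucene analysis chain, and the sentences paraphrase the Dean Foods example:

```python
import re
from collections import Counter

sentences = [
    "Dean Foods expects earnings to exceed the year-ago period.",
    "The food processor reported earnings of 40 cts a share.",
]

# Tokenize each sentence; grammar and word order are thrown away.
tokens = [re.findall(r"[a-z0-9-]+", s.lower()) for s in sentences]
vocab = sorted({w for t in tokens for w in t})

# Rows are sentences, columns are terms, cells are raw counts
# (use 1/0 instead of counts for the "binary" variant).
matrix = [[Counter(t)[w] for w in vocab] for t in tokens]
```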
<h4>
Singular Value Decomposition</h4>
<b>SVD</b> is a linear algebra operation. Given a matrix, it creates three matrices based on the original matrix. The one we care about is the "left feature matrix", commonly called U in the literature. The rows are "document feature vectors", which represent a kind of importance for each document. A <i>feature</i>, in machine learning, refers to signal in noisy data. Applied to a matrix of sentences v.s. terms, the document with the largest value in the left-hand column is the most "important" in the corpus. The remaining values in the row also contribute to importance.<br />
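Here is a minimal numpy sketch of the idea (the matrix is invented, and this is not the project's code): factor the sentence/term matrix, then score each sentence by its row in the left feature matrix, scaled by the singular values as in the notes below:

```python
import numpy as np

# Toy sentence/term matrix: rows are sentences, columns are terms.
A = np.array([
    [1, 1, 1, 1, 0],   # uses many theme words together
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],   # off-theme sentence
], dtype=float)

# SVD factors A into U @ diag(s) @ Vt; U is the "left feature matrix",
# whose rows are the document feature vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank sentences by the norm of their feature vectors multiplied by S;
# the sentence with the largest score is the most "important".
scores = np.linalg.norm(U * s, axis=1)
ranked = np.argsort(-scores)
```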
<h4>
Tying it together</h4>
LSA makes a bag-of-words matrix from the document corpus, and then tries to find the documents that most reflect the "essence" of the corpus. In this project we try to find the most essential sentences in a technical paper by doing a linear algebra analysis using both the sentences and individual words.<br />
<h3>
Preliminary Results</h3>
Prelims are interesting, but nowhere near perfect. I've done this on one test document, and it clearly homes in on important declarative sentences. The test document is a poor fit, though: it's filled with formulae, and do you make terms out of Greek letters?<br />
<h4>
Raw Data</h4>
I've attached the raw text used in this exercise, and the sentences ordered by their SVD norm. I've tried several variations on the algorithm, and the first few sentences always turn up, so it's a pretty strong signal. Here are the sorted sentences, with the sentence # in the paper and the sorting value. <br />
The most interesting:<br />
<blockquote>
<span style="font-size: x-small;">{216, 1.0000000000000173} PERFORMANCE OF AVERAGED CLASSIFIERS 1719 and Mason [9] show how to break down the problem of learning alternating decision trees (a class of rules which generalizes decision trees and boosted decision trees) into a sequence of simpler learning problems using boosting.</span><br />
<span style="font-size: x-small;">{36, 1.0000000000000162} Second, we choose a class of prediction rules (mappings from the input to the binary output) and assume that there are prediction rules in that class whose probability of error (with respect to the distribution D) is small, but not necessarily equal to zero.</span><br />
<span style="font-size: x-small;">{44, 1.0000000000000142} On the other hand, if we allow the algorithm to output a zero, we can hope that the algorithm will output zero on about 4% of the input, and will be incorrect on about 1% of the data.</span><br />
<span style="font-size: x-small;">{26, 1.0000000000000129} Instead the basic argument is that the Bayesian method is always the best method, and therefore, the only important issues are how to choose a good prior distribution and how to efficiently calculate the posterior average.</span><br />
<span style="font-size: x-small;">{123, 1.0000000000000127} Unlike the event of a classification mistake, which depends both on the predicted label and the actual label, the event of predicting 0 does not depend on the actual label.</span><br />
<span style="font-size: x-small;">{179, 1.0000000000000122} For example, if the goal is to detect a rare type of instance within a large set, the correct method might be to sort all instances according to their log- ratio score and output the instances with the highest scores.</span></blockquote>
The middle:<br />
<blockquote>
<span style="font-size: x-small;">{202, 1.0000000000000022} In other words, the behavior of our algorithm is, in fact, similar to that of large margin classifiers.</span><br />
<span style="font-size: x-small;">{11, 1.000000000000002} By minimiz- ing this cost, the learning algorithm attempts to minimize both the training error and the amount of overfitting.</span><br />
<span style="font-size: x-small;">{20, 1.000000000000002} PERFORMANCE OF AVERAGED CLASSIFIERS 1699 considerable experimental evidence that such averaging can significantly reduce the amount of overfitting suffered by the learning algorithm.</span><br />
<span style="font-size: x-small;">{21, 1.000000000000002} However, there is, we believe, a lack of theory for explaining this reduction.</span><br />
<span style="font-size: x-small;">{42, 1.000000000000002} MANSOUR AND R.</span><br />
<span style="font-size: x-small;">{55, 1.000000000000002} Of course, if the generated classifier outputs zero most of the time, then there is no benefit from having it.</span><br />
<span style="font-size: x-small;">{139, 1.000000000000002} Doing this also improves the guaranteed performance bounds because it reduces |H |.</span><br />
<span style="font-size: x-small;">{210, 1.000000000000002} Obviously, calculating the error of all of the hypotheses in the class is at least as hard as finding the best hypothesis and probably much harder.</span><br />
<span style="font-size: x-small;">{9, 1.0000000000000018} To overcome this problem, one usually uses either model-selection or reg- ularization terms.</span><br />
<span style="font-size: x-small;">{41, 1.0000000000000018} FREUND, Y.</span><br />
<span style="font-size: x-small;">{88, 1.0000000000000018} Note that the statements of Theorems 1 and 2 have no dependence on the hypothesis class H .</span><br />
<span style="font-size: x-small;">{92, 1.0000000000000018} The second inequality uses the fact that changing one example can change the empirical error by at most 1/m.</span><br />
<span style="font-size: x-small;">{95, 1.0000000000000018} Given x, we partition the hypothesis set H into two.</span></blockquote>
And the least:<br />
<blockquote>
<span style="font-size: x-small;">{6, 1.0} Consider a binary classification learning problem.</span><br />
<span style="font-size: x-small;">{16, 1.0} 1Supported in part by a grant from the Israel Academy of Science.</span><br />
<span style="font-size: x-small;">{50, 1.0} Each of these makes two mistakes on this data set.</span><br />
<span style="font-size: x-small;">{149, 1.0} 1714 THEOREM 6.</span><br />
<span style="font-size: x-small;">{150, 1.0} FREUND, Y.</span><br />
<span style="font-size: x-small;">{238, 1.0} FREUND, Y.</span><br />
<span style="font-size: x-small;">{112, 0.9999999999999999} FREUND, Y.</span><br />
<span style="font-size: x-small;">{155, 0.9999999999999998} MANSOUR AND R.</span><br />
<span style="font-size: x-small;">{208, 0.9999999999999998} Consider first the computational issue.</span><br />
<span style="font-size: x-small;">{105, 0.999999999999999} MANSOUR AND R.</span><br />
<span style="font-size: x-small;">{239, 0.999999999999999} MANSOUR AND R.</span><br />
<span style="font-size: x-small;">{5, 0.9999999999999972} Introduction.</span></blockquote>
Notes to experts:<br />
<ul>
<li>The terms were simple counts.</li>
<li>The sentences are sorted by the norms of the feature vectors. With the vectors multiplied by S, the longest sentences went straight to the top. Without S, length was nowhere near as strong.</li>
<li>The input data came from block copy&paste out of a PDF reader, processed with the OpenNLP Sentence Detector. It is suitably messy.</li>
<li>MANSOUR and FREUND are the authors; these terms appear in headers and footers and are thus important terms.</li>
</ul>
<h3>
Next steps<br />
</h3>
<h4>
Data cleaning with OpenNLP</h4>
Many of the sentences are so long that perhaps breaking longer sentences into clauses will give a more tractable dataset. Sentences with "A, however, B" and "C?" are probably not central to the discussion. OpenNLP is an Apache project for Natural-Language Processing. It has a sentence parsing toolkit that can help us split long sentences.<br />
<h4>
Tuning</h4>
The literature recommends various formulae for the term counts.<br />
<h4>
Tag Clouds</h4>
The important words should be the tag cloud, right?Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-56575192426422021662011-08-15T02:03:00.000-07:002011-08-17T01:49:24.124-07:00Singular vectors for recommendationsThis is a project to research:<br />
<ol><li>Reproducing these results: <a href="http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/">http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/</a></li>
<li>Correlating the feature vectors and singular values from an SVD-based recommender to the generated vectors for users and items. </li>
</ol>This was inspired by a lecture by one of the top five finishers in the Netflix contest: the speaker demonstrated axes of interest: chick flicks vs. Star Trek, Harry Potter vs. Stanley Kubrick, etc. These clusters in the full item space are at the endpoints of vectors which can be recovered from the feature vectors and singular values.<br />
<br />
<a href="https://github.com/LanceNorskog/LSH-Hadoop/blob/master/extras/mahout/test/java/org/apache/mahout/cf/taste/impl/common/TestOpposites.java">TestOpposites</a>.java<br />
<br />
This program and the following chart are my recreation of the raw data from the article above. BTJF are the original users whose ratings created the projection: Ben, Tom, Jeff, and Fred. Bob is the Bob from the article; Love and Hate are users who love and hate all six seasons.<br />
<br />
"Singular" and "Singular Div" are the first two feature vector columns of the SVD left-hand matrix. They are orthogonal. In the later chart, "Shifted data", we will use them to find "axes of interest" for the different items.<br />
<br />
<span style="font-size: large;">Raw data:</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqP8Nw4tHi-Ahml88WUpLLyxaoyFEUa7_bvIE-2zjdvSpjTlMW7_zuTkmJYCkUJSLl3hbEKQ4UYOq3Go-HAbcRVFPNFPzk9ZmcRWIvO5pAzmtz7sjMlZJAAkOV8y02lkMPIkqW_yTLUrI/s1600/Raw_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="270" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqP8Nw4tHi-Ahml88WUpLLyxaoyFEUa7_bvIE-2zjdvSpjTlMW7_zuTkmJYCkUJSLl3hbEKQ4UYOq3Go-HAbcRVFPNFPzk9ZmcRWIvO5pAzmtz7sjMlZJAAkOV8y02lkMPIkqW_yTLUrI/s400/Raw_data.jpg" width="400" /></a></div><div></div><span style="font-size: large;">Shifted Data:</span><br />
And now, the magic. The space of users is centered at 0. The two feature vectors are mirrored across (0,0), and the two orthogonal singular/feature vectors are scaled down by their singular values. With the singular values normalized, the first is 0.6 and the second is 0.25. In this chart, we take the original positions of the feature vectors and multiply them by 0.6 (yellow triangles) and 0.25 (red asterisks). And, I've drawn lines between the downsized versions. Now we have the 4 original users who established this space, three new users who are projected into this space, and the two major axes of interest.<br />
<br />
<div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWa3_M7p0-q_iPJTLatYdif8vQ6BH7A4LPVvseHr9tZjBHTwhiOZJl9_wowA67MBra8s_fhdM7BtbbLEANMJSs9IWI_ca0zoB_uzn-agP9YpJnsbMdA4AIEAJoiYqQZOdbASGVRvxtQ7g/s1600/Shifted_data.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="270" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWa3_M7p0-q_iPJTLatYdif8vQ6BH7A4LPVvseHr9tZjBHTwhiOZJl9_wowA67MBra8s_fhdM7BtbbLEANMJSs9IWI_ca0zoB_uzn-agP9YpJnsbMdA4AIEAJoiYqQZOdbASGVRvxtQ7g/s400/Shifted_data.jpg" width="400" /></a></div>Observations:<br />
<ol><li>The four original users (Billy-Bob, Trimolchio, Jenga and Ferdinand) all had somewhat orthogonal item ratings, and come out in an arc. One of them is far from the others on a large circle, and the feature vectors make sense given the "gravity" of the four users on the circle.</li>
<li>Bob also had an item rating vector with the same style of pluses&minuses as BJTF, and appears at an expected place.</li>
<li>The singular feature vectors (yellow triangles, red asterisks) do give "axes of interest" that make sense v.s. BTJF and Bob.</li>
<li>Love and Hate are the nearest to the two ends of the dominant axis of interest. They are also nearly between the endpoints of the axes of interest.</li>
</ol>Conclusions:<br />
<ul><li>This technique gives two results:</li>
<ul><li>It supplies axes of interest.</li>
<li>It allows a new user to describe himself based on the major axes of interest.</li>
</ul><li>There is a fine yellow line between Love and Hate.</li>
</ul>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-54724125509051449512011-08-14T20:37:00.000-07:002011-08-14T20:39:09.814-07:00Sorted recommender dataThese are some images from experiments in sorting a ratings data model. This post has a few sorted versions of the GroupLens 1million sample database. Green and red are sequence values in the sorted output; they help visualize dense to sparse. Red is dense, green is sparse. 5% of men have red-green colorblindness.<br />
<br />
Sorted by user:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh80yj4X-DtN1RAI3wxQtlPPxwXCQSicDZZgkZts5EyEzS3cfjq5wHECCa4RxzMr6t0a6bblx5JtIas-NVmk1UNG_5ym5ocEssJ9DHCPSPXDmXBebjbcmN9dbzcPPDKvXS_deWcc0Hskfc/s1600/UsersSorted_100k_of_1m_GL.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh80yj4X-DtN1RAI3wxQtlPPxwXCQSicDZZgkZts5EyEzS3cfjq5wHECCa4RxzMr6t0a6bblx5JtIas-NVmk1UNG_5ym5ocEssJ9DHCPSPXDmXBebjbcmN9dbzcPPDKvXS_deWcc0Hskfc/s320/UsersSorted_100k_of_1m_GL.png" width="315" /></a></div>Sorted by item:<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeUi9mG-x79D-e3rnXU4RCUB7sigtUuZxa-k-Whik6QSvsLZyFxs-iXhdsAOJpj-LkqezIhETy6iQ56q9vxmx-u7Zw0D1h2LIM99ywxSRgXt7BumAUwM0KZAQg_OOm61McP1wwCiH4lNQ/s1600/UsersSorted_100k_of_1m_GL.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeUi9mG-x79D-e3rnXU4RCUB7sigtUuZxa-k-Whik6QSvsLZyFxs-iXhdsAOJpj-LkqezIhETy6iQ56q9vxmx-u7Zw0D1h2LIM99ywxSRgXt7BumAUwM0KZAQg_OOm61McP1wwCiH4lNQ/s320/UsersSorted_100k_of_1m_GL.png" width="315" /></a></div>Sorted by user and item:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWyt6pm2PuQby5gFAGgSleeC6kFQoc73JaabSYYJ5yyggn_ZDubRmiDmdZp9glj2Z0_GEC6WJoie3V6ocFhUFNICYKCBW0KfnOJ115Oa1XzsZ6mgGsgWchxoPQIMofECgArnFtdQgVa_k/s1600/DualSorted_500k_of_1m_GL.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="313" naa="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWyt6pm2PuQby5gFAGgSleeC6kFQoc73JaabSYYJ5yyggn_ZDubRmiDmdZp9glj2Z0_GEC6WJoie3V6ocFhUFNICYKCBW0KfnOJ115Oa1XzsZ6mgGsgWchxoPQIMofECgArnFtdQgVa_k/s320/DualSorted_500k_of_1m_GL.png" width="320" /></a></div><span style="font-size: large;">Why is this interesting?</span><br />
The lower left, red corner, has popular items. The upper right has the "long tail". I am a long-tail guy; I don't really care about popular movies. Japanese female-assassin movies, BBC comedy, and Brazilian horror movies are on my list. An acquaintance received recommendations from Amazon for French Post-Structuralist literature (I don't know either) and pornographic comic books. "Someone finally understands me!"<br />
A recommendation for me should be biased to the upper right. This sorting gives one algorithm to add to the pile.<br />
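The density sort itself is simple; here is a hedged sketch (toy data, not the SortDataModel.java implementation linked below) that ranks users and items from dense to sparse and re-indexes ratings for the dual-sorted view:

```python
from collections import Counter

# Toy (user, item, rating) triples; real data would come from the GroupLens files.
ratings = [(1, 'a', 4), (1, 'b', 5), (2, 'a', 3), (3, 'a', 2), (3, 'c', 4), (1, 'c', 1)]

# Count ratings per user and per item.
user_counts = Counter(u for u, i, r in ratings)
item_counts = Counter(i for u, i, r in ratings)

# Rank users and items from most-rated (dense) to least-rated (sparse);
# the dense-dense corner holds popular items, the sparse corner the long tail.
user_rank = {u: n for n, (u, _) in enumerate(user_counts.most_common())}
item_rank = {i: n for n, (i, _) in enumerate(item_counts.most_common())}

# Re-index each rating by (user rank, item rank) for the dual-sorted plot.
dual_sorted = sorted(ratings, key=lambda t: (user_rank[t[0]], item_rank[t[1]]))
```

Plotting `(user_rank, item_rank)` for each rating gives the dense-to-sparse pictures above; biasing recommendations toward high ranks biases them toward the long tail.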
<br />
<span style="font-size: large;">Details</span><br />
Program: SortDataModel.java, ModelComparator.java and maybe some other things in this repository.<br />
<div class="separator" style="clear: both; text-align: center;"></div><a href="https://github.com/LanceNorskog/LSH-Hadoop/blob/master/extras/mahout/test/java/org/apache/mahout/cf/taste/impl/common/SortDataModel.java">https://github.com/LanceNorskog/LSH-Hadoop/blob/master/extras/mahout/test/java/org/apache/mahout/cf/taste/impl/common/SortDataModel.java</a><br />
<br />
Visuals by KNime.Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1tag:blogger.com,1999:blog-3095049372219674792.post-84473129753769403272011-08-14T20:19:00.000-07:002011-08-14T20:19:56.187-07:00Koren on recommenders - comments about temporal changes in rating datahttp://videolectures.net/kdd09_koren_cftd/<br />
<br />
Koren talks about how the timestamps of watching/rating events are important. In general, items change slowly in rating value, while users change their rating behavior suddenly.<br />
<br />
Slides stolen:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://carbon.videolectures.net/2009/contrib/kdd09_paris/koren_cftd/kdd09_koren_cftd.zip.slides/kdd09_koren_cftd_Page_18.480.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="360" src="http://carbon.videolectures.net/2009/contrib/kdd09_paris/koren_cftd/kdd09_koren_cftd.zip.slides/kdd09_koren_cftd_Page_18.480.jpg" width="480" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://carbon.videolectures.net/2009/contrib/kdd09_paris/koren_cftd/kdd09_koren_cftd.zip.slides/kdd09_koren_cftd_Page_19.480.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="360" src="http://carbon.videolectures.net/2009/contrib/kdd09_paris/koren_cftd/kdd09_koren_cftd.zip.slides/kdd09_koren_cftd_Page_19.480.jpg" width="480" /></a></div><br />
<br />
<br />
<br />
Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-87072017375524913222011-07-10T19:39:00.000-07:002011-07-10T19:39:13.355-07:00Dimensional Reduction via Random Projection, Part 4: Better Living Through Rotation!Thanks to a <a href="http://www.lucidimagination.com/search/document/d222a84092b4c854/dimensional_reduction_via_random_projection_investigations#90657a5825e03ad2">tip</a> from <a href="http://tdunning.blogspot.com/">Ted Dunning</a>, I have rotated the four distributions via factorizing the distributions. <br />
<br />
<table><tbody>
<tr><td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji3ouEA1xNADgjKtwhTTLuu2hzKC6FNFhWreN5LIvnulGgdJqKrcNnNn2ZbrKW3k7DVzFevn6u4RTldHHx5wtN94pJu4vZGuDmuZgAQ4tc7M_DIuJj_7O78jFrcbGohm5UePwGmzkKSQI/s1600/SVR_200x2_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEji3ouEA1xNADgjKtwhTTLuu2hzKC6FNFhWreN5LIvnulGgdJqKrcNnNn2ZbrKW3k7DVzFevn6u4RTldHHx5wtN94pJu4vZGuDmuZgAQ4tc7M_DIuJj_7O78jFrcbGohm5UePwGmzkKSQI/s320/SVR_200x2_gaussian.png" width="240" /></a></div></td><td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8AVMa-M99CmNs-FZVzKTo3-6OMvs3iH-7KJSfBCirm2FcDKwHUGGkBYN6D30dJcKxDL7QiZJohIXnge4DP_IGfWifncFDSaXGD6HmhxW3aQwZFZKDxFEoJiVWy0zbAWRfYr6XLWhLPWQ/s1600/SVR_200x2_plusminus.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8AVMa-M99CmNs-FZVzKTo3-6OMvs3iH-7KJSfBCirm2FcDKwHUGGkBYN6D30dJcKxDL7QiZJohIXnge4DP_IGfWifncFDSaXGD6HmhxW3aQwZFZKDxFEoJiVWy0zbAWRfYr6XLWhLPWQ/s320/SVR_200x2_plusminus.png" width="240" /></a></div></td></tr>
<tr><td><em>Gaussian</em></td><td><em>+1/-1</em></td></tr>
<tr><td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj26ReSE9eLIwxRN42F3usJdP8B1A3mj7W0-8xECfXqa8c6eFwdtw6cpyI8Hc8TnPFFCPmXsbLsfpD0XEd_aV6fm5XI6xIePlg6zPcjywB5oLZXAlaKTVFk1xcP1q-cXYMrDXUawFye_3E/s1600/SVR_200x2_sqrt3.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj26ReSE9eLIwxRN42F3usJdP8B1A3mj7W0-8xECfXqa8c6eFwdtw6cpyI8Hc8TnPFFCPmXsbLsfpD0XEd_aV6fm5XI6xIePlg6zPcjywB5oLZXAlaKTVFk1xcP1q-cXYMrDXUawFye_3E/s320/SVR_200x2_sqrt3.png" width="240" /></a></div></td><td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFwcc1tm4xec4d4cu1t4CmN93PwTLpLuW76bNAQneKNeaxjWQb0Wr1P568ywR__slRTtFQiWU9t3K3VkQGMmv3t9CJVFJaTV9hEqO5nQN2yQa6OvkWa0NwxZT-UQ0ocs2-jF-Gt8o1OiM/s1600/SVR_200x2_linear.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFwcc1tm4xec4d4cu1t4CmN93PwTLpLuW76bNAQneKNeaxjWQb0Wr1P568ywR__slRTtFQiWU9t3K3VkQGMmv3t9CJVFJaTV9hEqO5nQN2yQa6OvkWa0NwxZT-UQ0ocs2-jF-Gt8o1OiM/s320/SVR_200x2_linear.png" width="240" /></a></div></td></tr>
<tr><td><em>Sqrt3</em></td><td><em>Linear</em></td></tr>
</tbody></table><br />
These look mighty funky. I've taken this as far as I can without being able to browse the item metadata (movie names).Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-23824918273575392882011-07-07T20:13:00.000-07:002011-07-07T20:13:41.186-07:00R Programming Resources #1<a href="http://www.johndcook.com/R_language_for_programmers.html">The R type system</a>: I knew there was something weird going on.<br />
<a href="http://www.r-bloggers.com/select-operations-on-r-data-frames/">Operations on data frames </a><br />
<a href="http://www.r-bloggers.com/r-tutorial-series-r-beginners-guide-and-r-bloggers-updates/">R Tutorial Series</a><br />
<a href="http://www.r-bloggers.com/" target="_blank">R Bloggers</a>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-67853633761136792532011-07-01T02:33:00.000-07:002011-07-01T02:33:31.145-07:00Dimensional Reduction via Random Projection, Part 1: Huh?This recounts a short research project in using random projection for visualization.<br />
<br />
I started with vectors which represent items in a data model. (See my Semantic Vectors posts.) They are from 100k entries in the GroupLens film ratings dataset, representing ratings by 669 users on 3274 items. The 200 vectors are semi-randomly chosen from the item pool. All are 200-dimensional and dense. Euclidean distances signify nearness for item-item recommendation. The item-item distances give good ratings in the 150-200 dimension range, so it is reasonable to treat 200 dimensions as having rich support. The subset of item vectors has strong user-user bonds, so the recommendation structure is somewhat visible. This was a mistake, but I like it.<br />
<br />
The project uses simple 2D visualization to decide whether random projection is as good as other, older methods of dimensional reduction. I used KNime's MDS implementation. MDS (Multi-Dimensional Scaling) is an older algorithm for creating a projection from a high-dimensional space to a low-dimensional one while retaining pairwise distances. MDS is tunable for the amount of information dropped.<br />
<br />
Each test run uses the same 200 vectors in the same order. Each vector has the same color each time (from vivid green to black to vivid red).<br />
<br />
The main axis of investigation is between three projection methods: complete Random Projection, partial Random Projection, and complete MDS. The following scatter plots show the vectors reduced to 2D (X and Y are the generated dimensions; color identifies the vector).<br />
<ol><li>Random projection from 200 dimensions to 2 dimensions.</li>
<li>Random projection from 200 dimensions to 20 dimensions, then use MDS to boil down to 2 dimensions.</li>
<li>Use MDS to convert from 200 dimensions to two dimensions.</li>
</ol><table><tbody>
<tr> <td><b><i>1.</i></b><br />
<i>Full Random Projection from 200 dimensions to 2 dimensions (Gaussian)</i></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4Yr4hyphenhyphenhnS6UR4ZCHe4WKiHo62mPnnj0JH-0-DjWYDpzs79PNCn-NmgkiavZRlIey8mbwhzC2nv-k9k6Zelgaz5F_b0EO-TLCavXJg1n-_khaBBra9rd_SNbDkEQ79SnND8mnWadWyQSo/s1600/SVR_200x2_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4Yr4hyphenhyphenhnS6UR4ZCHe4WKiHo62mPnnj0JH-0-DjWYDpzs79PNCn-NmgkiavZRlIey8mbwhzC2nv-k9k6Zelgaz5F_b0EO-TLCavXJg1n-_khaBBra9rd_SNbDkEQ79SnND8mnWadWyQSo/s320/SVR_200x2_gaussian.png" width="320" /></a></div></td></tr>
<tr> <td><b><i>2.</i></b><br />
<i>Random Projection from 200D to 20D, then MDS to 2D (Gaussian, KNime MDS implementation)</i></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja8uixE65QpxiTm557WRxHyA7lCrH5j5yWMkzVIZCWApmBx19BJXiGKUObZ7QYVf0RAb3yBSLwl6IgzYNJiI4EXBhAaWejxVa919OdvPsFMVonU85BrMGZyM1tBt-flGFV8H4RqWSBuqA/s1600/SVR_200x20_MDS_40_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEja8uixE65QpxiTm557WRxHyA7lCrH5j5yWMkzVIZCWApmBx19BJXiGKUObZ7QYVf0RAb3yBSLwl6IgzYNJiI4EXBhAaWejxVa919OdvPsFMVonU85BrMGZyM1tBt-flGFV8H4RqWSBuqA/s320/SVR_200x20_MDS_40_gaussian.png" width="320" /></a></div></td></tr>
<tr> <td><b><i>3.</i></b><br />
<i>Full MDS from 200d to 2d (KNime MDS Implementation)</i></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRwbndKYEJCvNEI1qshBaRxrPkabI1_ZXfbmQMufNGgdSl4C2t0BnpRtYUDf5Yg7_3Utph-8ujy3TKUHgqCJH8ada_AbH3B_HnCjvrGUpVBKujkxWeMy10ACOcBmMTVBAhytTORkD9Yrg/s1600/SVR_200_full_MDS_40.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRwbndKYEJCvNEI1qshBaRxrPkabI1_ZXfbmQMufNGgdSl4C2t0BnpRtYUDf5Yg7_3Utph-8ujy3TKUHgqCJH8ada_AbH3B_HnCjvrGUpVBKujkxWeMy10ACOcBmMTVBAhytTORkD9Yrg/s320/SVR_200_full_MDS_40.png" width="320" /></a></div></td></tr>
</tbody> </table>Conclusions:<br />
<ul><li>Using full random projection does not completely trash the data, but it's challenging to interpret.</li>
<li> Downsizing the data to 20 dimensions and then using MDS the rest of the way is somewhat cleaner, and good enough to work with. It's also faster.</li>
<li>Full MDS shows a strikingly clear picture of the item-item structure.</li>
</ul>With later work it has become clear that these are the same projection, rotated. Later I would like to reduce to 3D and do animations of rotating these vector sets. <br />
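The three reductions compared above can be sketched in a few lines of numpy (a hedged sketch on synthetic vectors, not the KNime pipeline that produced the charts; the MDS stages would be supplied by a separate library):

```python
import numpy as np

rng = np.random.default_rng(0)
items = rng.normal(size=(200, 200))   # stand-in for 200 dense 200-d item vectors

# Method 1: full Gaussian random projection straight to 2-d.
# Entries are N(0, 1/k) so expected squared norms are preserved.
R2 = rng.normal(size=(200, 2)) / np.sqrt(2)
proj_2d = items @ R2

# Method 2, first stage: random projection to 20-d.
# (MDS would then boil these 20-d vectors down to 2-d.)
R20 = rng.normal(size=(200, 20)) / np.sqrt(20)
proj_20d = items @ R20

# Method 3 is full MDS on the original 200-d vectors, with no RP stage.
```

The 1/sqrt(k) scaling is the standard Johnson-Lindenstrauss convention; it keeps pairwise distances approximately unchanged in expectation.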
<br />
<i>Onward to Part 2: there is more than one random projection.</i><br />
<br />
I did these diagrams with the KNime visual programming app for data mining. <a href="http://www.knime.org/">All Hail KNime!</a>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-62131739681780941022011-07-01T02:32:00.001-07:002011-07-01T02:33:07.202-07:00Dimensional Reduction via Random Projection, Part 2: DistributionsThere is more than one random distribution, and there are some surprises in store.<br />
<br />
Achlioptas (2001) claims that random projection does not need fully Gaussian values: +1/-1 chosen with equal probability suffices. Even better, a uniform draw from [0, 0, 0, 0, sqrt(3), -sqrt(3)] also works; it throws away 4 out of 6 input values.<br />
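The four distributions can be sampled like this (a minimal sketch; the variable names are mine, and "linear" follows this post's usage meaning a uniform draw):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

gaussian  = rng.normal(size=n)                           # standard normal
plusminus = rng.choice([1.0, -1.0], size=n)              # +1/-1, equal probability
sqrt3     = rng.choice([0.0, np.sqrt(3), -np.sqrt(3)],   # Achlioptas' sparse variant:
                       size=n, p=[2/3, 1/6, 1/6])        # 4 of 6 entries are zero
linear    = rng.uniform(-1.0, 1.0, size=n)               # uniform ("linear") draw

# gaussian, plusminus and sqrt3 all have mean 0 and variance 1,
# which is what makes them interchangeable as projection entries.
```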
<br />
This post explores applying these four different distributions to the same random projection. To recap, here is the best I've got: a full MDS projection from 200 dimensions to 2 dimensions.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgINHdGprVUyVztwUAqVkBYYqJuCTkr4lwy0mc3ukOA-hxIHat1INk0HGGR2sctmPqxCVQ1zgN8mt93YZxkEiHoFLq2mgx49bJDXuRen9-8gezk8iwt_nCVOURR3wR5p1ENFYMt9pFQdAQ/s1600/SVR_200_full_MDS_40.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgINHdGprVUyVztwUAqVkBYYqJuCTkr4lwy0mc3ukOA-hxIHat1INk0HGGR2sctmPqxCVQ1zgN8mt93YZxkEiHoFLq2mgx49bJDXuRen9-8gezk8iwt_nCVOURR3wR5p1ENFYMt9pFQdAQ/s320/SVR_200_full_MDS_40.png" width="320" /></a></div><br />
Here are four versions of the same dataset, reducing 200 dimensions to 2 dimensions via random projection:<br />
<br />
<table><tbody>
<tr> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoc_SN903oxTI_AbZRMTNr4K5yAedXsZYalx67n_rF6aV3AkwLlplmqwxA_1yfqswryUqaoMfe933TsoriTgyEPjAkBkfiiE8xh3GbauGxmwht_3uFrw7L7SAKFtF6DsN25EuGJ9Yzb5U/s1600/SVR_200x2_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoc_SN903oxTI_AbZRMTNr4K5yAedXsZYalx67n_rF6aV3AkwLlplmqwxA_1yfqswryUqaoMfe933TsoriTgyEPjAkBkfiiE8xh3GbauGxmwht_3uFrw7L7SAKFtF6DsN25EuGJ9Yzb5U/s320/SVR_200x2_gaussian.png" width="240" /></a></div></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipFkQkce3J2TvlvuYkY96dfwYH4cJRwAtyKtkLwJtNb-1FrADBFfCmeTtKsrskyxiRg7PdsN32fM8yC4ar3UOZHg5xXMLccTbzPbQM0MgYn5HHw0bmS5aEDgK0KkgQxz9_gHT55VIs2gs/s1600/SVR_200x2_half.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipFkQkce3J2TvlvuYkY96dfwYH4cJRwAtyKtkLwJtNb-1FrADBFfCmeTtKsrskyxiRg7PdsN32fM8yC4ar3UOZHg5xXMLccTbzPbQM0MgYn5HHw0bmS5aEDgK0KkgQxz9_gHT55VIs2gs/s320/SVR_200x2_half.png" width="240" /></a></div></td> </tr>
<tr><td><i>Gaussian</i></td><td><i>+1/-1</i></td></tr>
<tr> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPADX9IV0ZB6kpFVOVdaQyAeeBIjOQ-_TuC4CQk3IPHw_L7toLC9uKrz1cM7w3RLxiErvK9SQAdElC52BjfYaoQWzsV8deEOoSadPPxGZmDUmIQ_DnIMTOP9A7zdTjxXRmXRktG9CVkYY/s1600/SVR_200x2_sqrt3.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPADX9IV0ZB6kpFVOVdaQyAeeBIjOQ-_TuC4CQk3IPHw_L7toLC9uKrz1cM7w3RLxiErvK9SQAdElC52BjfYaoQWzsV8deEOoSadPPxGZmDUmIQ_DnIMTOP9A7zdTjxXRmXRktG9CVkYY/s320/SVR_200x2_sqrt3.png" width="240" /></a></div></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7f2U4JPuJMENFoODBKb6ZoYr0o6oXz80pwEHsqqPJd528LuB8FyCDTMAvFUKJ2dCuN26zRo4lFGPiFcy-xFlgxQ1C4b1mjHH4PIH6ERMPz1nWMvNSDPWI4wiH-r2JYj0pLV8c-CoAZjQ/s1600/SVR_200x2_linear.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7f2U4JPuJMENFoODBKb6ZoYr0o6oXz80pwEHsqqPJd528LuB8FyCDTMAvFUKJ2dCuN26zRo4lFGPiFcy-xFlgxQ1C4b1mjHH4PIH6ERMPz1nWMvNSDPWI4wiH-r2JYj0pLV8c-CoAZjQ/s320/SVR_200x2_linear.png" width="240" /></a></div></td> </tr>
<tr><td><i>Sqrt(3)</i></td><td><i>Linear</i></td></tr>
</tbody></table><br />
The full MDS version is certainly the most pleasant to look at. The Gaussian and the two Achlioptas distributions all seem to be different rotations of a cylinder. The Linear distribution is useless in this situation.<br />
<br />
Given these results, to do a quick visualization of your data, I would try all four random distributions; you may get lucky like I did with somewhat similar rotations. And, I recommend the colorizing trick; it really helps show what's going on here. After that, I would get a good dimensional reduction algorithm and do the 2-stage process:<br />
<br />
high-dimensional -> RP -> low-dimensional -> formal dimensional reduction -> 2D or 3D.<br />
<br />
On to part 3 for a discussion of noise. <br />
<br />
<i><b>Achlioptas, 2001</b></i><br />
<i>Database-friendly random projections: Johnson-Lindenstrauss with binary coins</i><br />
<br />
PDF available online at various places: <br />
<ul><li><a href="http://www.mendeley.com/research/databasefriendly-random-projections">Mendeley</a></li>
<li><a href="http://www.google.com/search?q=database+friendly+random+projections">You-know-where</a></li>
</ul>I did these diagrams with the KNime visual programming app for data mining. <a href="http://www.knime.org/">All Hail KNime!</a> <br />
<ul></ul>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-51683589302965169522011-07-01T02:32:00.000-07:002011-07-01T02:32:24.694-07:00Dimensional Reduction via Random Projection, Part 3: NoiseNow for a third axis of investigation: distances.<br />
<br />
Random Projection preserves pairwise distances; this is its claim to fame. To measure this, I created matrices of pairwise distances before and after "full RP": 200-d -> RP -> 2d. Here are the spreads of distances, one color per vector: the X axis is the 200-d distance, and the Y axis is the 2-d distance.<br />
<table><tbody>
<tr> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Fvy059mXQ_c9lvv2FbmbOJe2TiuHYBboOGSKtdJywAcyCaYJ1EQtd3UwHfWhRmKKv99y07OEo-c4nJ59rZlPQc6d_MFFG-chY0RrDTOEJLQYrnL1VS1TVEIsWJKzsF6C2ebep5lB-Vg/s1600/SVR_200_distances_gaussian.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7Fvy059mXQ_c9lvv2FbmbOJe2TiuHYBboOGSKtdJywAcyCaYJ1EQtd3UwHfWhRmKKv99y07OEo-c4nJ59rZlPQc6d_MFFG-chY0RrDTOEJLQYrnL1VS1TVEIsWJKzsF6C2ebep5lB-Vg/s320/SVR_200_distances_gaussian.png" width="240" /></a></div></td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFJvgvU-5l39zNsZ5eUvHaP49dQS__jSGJdyvNdP-UIQo4narf-NS0oBQB7i9GLoauSc4I3JJRSdQMBVZD-9dmB2ddUMVUb1XYaqpsiUdtuLFGTWe9B4ZzBqLpMhkuTF6G6y7sSGizYjU/s1600/SVR_200_distances_plusminus.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFJvgvU-5l39zNsZ5eUvHaP49dQS__jSGJdyvNdP-UIQo4narf-NS0oBQB7i9GLoauSc4I3JJRSdQMBVZD-9dmB2ddUMVUb1XYaqpsiUdtuLFGTWe9B4ZzBqLpMhkuTF6G6y7sSGizYjU/s320/SVR_200_distances_plusminus.png" width="240" /></a></div></td> </tr>
<tr> <td><i>Gaussian</i></td><td><i>+1/-1</i></td> </tr>
<tr> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTRpntHT0WumXW5xNDKm2MqSfN5x-U5VlIi4a8TEsvlxdHbrHNKNYgCjAMHfAdH1knByybDYzfv-C6F0aDWKiKQEOSy1lXQ8MQo3aNIGDedF6_2zOFt6VxvtNXMaMZn4Gzcq8ejSDtL5M/s1600/SVR_200_distances_sqrt3.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTRpntHT0WumXW5xNDKm2MqSfN5x-U5VlIi4a8TEsvlxdHbrHNKNYgCjAMHfAdH1knByybDYzfv-C6F0aDWKiKQEOSy1lXQ8MQo3aNIGDedF6_2zOFt6VxvtNXMaMZn4Gzcq8ejSDtL5M/s320/SVR_200_distances_sqrt3.png" width="240" /></a></div><br />
<br />
</td> <td><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhbipCETFaNxP4eeKtFyKzL5m3yExI6-dmViQaPgUdTKVzpr-ngbdQPbWNlMkPO99EX41CIISalop1w5K2BlAGXpDr_JsxhJLjzmXFMIG29GYRH-28Oh3btFW2bZJsR3myQ0QVdnJYJ-A/s1600/SVR_200_distances_linear.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhbipCETFaNxP4eeKtFyKzL5m3yExI6-dmViQaPgUdTKVzpr-ngbdQPbWNlMkPO99EX41CIISalop1w5K2BlAGXpDr_JsxhJLjzmXFMIG29GYRH-28Oh3btFW2bZJsR3myQ0QVdnJYJ-A/s320/SVR_200_distances_linear.png" width="240" /></a></div><br />
<br />
</td> </tr>
<tr> <td><i>Sqrt(3)</i></td><td><i>Linear</i></td> </tr>
</tbody></table><br />
From tightest to loosest spreads, it's Linear, +1/-1, Sqrt(3) and Gaussian. Linear is so clean because it is overconstrained in this example. +1/-1 and Sqrt(3) are usable, and Gaussian looks like a bomb with smallpox. The +1/-1 and Sqrt(3) projectors look best for this case. If I wanted to work harder, I would compare the distance matrices directly: compute matrix norms, take standard deviations of the distances, etc. But for visualization, these worked well.<br />
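The spread charts can be quantified with a sketch like this (plain numpy on synthetic vectors, standing in for the KNime workflow; `distortion` is the relative distance error the scatter plots show as spread):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 200))            # 200 vectors in 200-d

def pairwise(M):
    # Euclidean distance matrix via the squared-norm expansion.
    sq = (M * M).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * M @ M.T
    return np.sqrt(np.clip(d2, 0, None))   # clip guards against float error

# Project to k dimensions and compare distance matrices before and after.
k = 20
R = rng.normal(size=(200, k)) / np.sqrt(k)
Y = X @ R

before, after = pairwise(X), pairwise(Y)
mask = ~np.eye(200, dtype=bool)            # ignore the zero diagonal
distortion = np.abs(after[mask] / before[mask] - 1.0)
mean_error = distortion.mean()             # average relative distance error
```

Swapping the Gaussian `R` for the +1/-1 or Sqrt(3) entries and comparing `mean_error` (or its standard deviation) gives a number to back the "tightest to loosest" ordering above.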
<br />
I did these diagrams with the KNime visual programming app for data mining. <a href="http://www.knime.org/">All Hail KNime!</a>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-2599810179081656572010-11-27T18:57:00.000-08:002010-12-26T01:26:44.562-08:00Semantic Vectors for Recommenders<div class="MsoNormal" style="font-family: "Trebuchet MS",sans-serif;"><b>Semantic Vectors for Recommenders</b></div><div class="Section1" style="font-family: "Trebuchet MS",sans-serif;"><br />
<div class="MsoNormal"><span style="font-size: 10pt;"> Start with a User->Item preference matrix. The values range from -0.5 to 0.5 in “linear” interpretation: 0 is neutral. -0.5 is twice as negative as -0.25. Blank values default to 0.</span></div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableList4" style="border-collapse: collapse; border: medium none; width: 287px;"><tbody>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: black -moz-use-text-color black black; border-style: solid none solid solid; border-width: 1.5pt medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48pt;" valign="top" width="64"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">Item1</span></div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">Item2</span></div></td> <td nowrap="nowrap" style="border-color: black black black -moz-use-text-color; border-style: solid solid solid none; border-width: 1.5pt 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 69.65pt;" valign="top" width="93"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">Item3</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48pt;" valign="top" width="64"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.6</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 69.65pt;" valign="top" width="93"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.9</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48pt;" valign="top" width="64"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User2</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 69.65pt;" valign="top" width="93"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.4</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1.5pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48pt;" valign="top" width="64"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 48.8pt;" valign="top" width="65"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1.5pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 69.65pt;" valign="top" width="93"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.7</span></div></td> </tr>
</tbody></table><br />
</div><div class="MsoNormal"><span style="font-size: 10pt;">Now, let's do a semantic vectors projection. The User vector is random:</span></div><br />
<div align="center"><table border="1" cellpadding="0" cellspacing="0" class="MsoTableList4" style="border-collapse: collapse; border: medium none; width: 95px;"><tbody>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: black -moz-use-text-color black black; border-style: solid none solid solid; border-width: 1.5pt medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 44.25pt;" valign="top" width="59"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: black black black -moz-use-text-color; border-style: solid solid solid none; border-width: 1.5pt 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 27pt;" valign="top" width="36"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">Random</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 44.25pt;" valign="top" width="59"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 27pt;" valign="top" width="36"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.2</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 44.25pt;" valign="top" width="59"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User2</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 27pt;" valign="top" width="36"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.6</span></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1.5pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 44.25pt;" valign="top" width="59"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1.5pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 27pt;" valign="top" width="36"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.9</span></div></td> </tr>
</tbody></table><br />
</div><div class="MsoNormal"><span style="font-size: 10pt;">The formula for the Item outputs is:</span></div><div align="center" class="MsoNormal" style="text-align: center;">Item(i) = ((sum(U) + sum(pref(u,i))/#U)/2)/#U</div><br />
<div align="center" class="MsoNormal" style="text-align: center;">Where #U = the number of users expressing a preference for Item(i)</div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-collapse: collapse; border: medium none;"><tbody>
<tr style="height: 13.4pt;"> <td style="border: 1pt solid windowtext; height: 13.4pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I1</span></b></div></td> <td style="border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 13.4pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal" style="margin-right: -3.95in;"><b><span style="font-size: 10pt;">Not enough preferences</span></b></div></td> </tr>
<tr style="height: 13.4pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.4pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I2</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.4pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">(((U1 + U2) + ((pref(u1,i2) + pref(u2,i2))/#U))/2)/#U</span></b></div></td> </tr>
<tr style="height: 13.4pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.4pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I3</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.4pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">(((U1 + U2 + U3) + ((pref(u1,i3) + pref(u2,i3) + pref(u3,i3))/#U))/2)/#U</span></b></div></td> </tr>
</tbody></table><br />
</div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-collapse: collapse; border: medium none;"><tbody>
<tr style="height: 13.2pt;"> <td style="border: 1pt solid windowtext; height: 13.2pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I1</span></b></div></td> <td style="border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 13.2pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">No vector</span></b></div></td> </tr>
<tr style="height: 13.2pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.2pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I2</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.2pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">(((0.2 + 0.6) + (0.6 + 0.1)/2)/2)/2</span></b></div></td> </tr>
<tr style="height: 13.2pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.2pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I3</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.2pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">(((0.2 + 0.6 + 0.9) + (0.9 + 0.4 + 0.7)/3)/2)/3</span></b></div></td> </tr>
</tbody></table><br />
</div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-collapse: collapse; border: medium none;"><tbody>
<tr style="height: 13.15pt;"> <td style="border: 1pt solid windowtext; height: 13.15pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I1</span></b></div></td> <td style="border-color: windowtext windowtext windowtext -moz-use-text-color; border-style: solid solid solid none; border-width: 1pt 1pt 1pt medium; height: 13.15pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b>No vector</b></div></td> </tr>
<tr style="height: 13.15pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.15pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I2</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.15pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b>0.2875 </b></div></td> </tr>
<tr style="height: 13.15pt;"> <td style="border-color: -moz-use-text-color windowtext windowtext; border-style: none solid solid; border-width: medium 1pt 1pt; height: 13.15pt; padding: 0in 5.4pt; width: 76.95pt;" valign="top" width="103"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">I3</span></b></div></td> <td style="border-color: -moz-use-text-color windowtext windowtext -moz-use-text-color; border-style: none solid solid none; border-width: medium 1pt 1pt medium; height: 13.15pt; padding: 0in 5.4pt; width: 389.15pt;" valign="top" width="519"><br />
<div class="MsoNormal"><b>0.3944…</b></div></td>
</tbody></table><div class="MsoNormal"><br />
</div></div><br />
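As a sketch of the arithmetic above (the function and variable names here are mine, not from any library), the item projections can be reproduced in a few lines of Python:

```python
def item_vector(user_pos, prefs):
    """Project one item: ((sum(U) + sum(pref)/#U)/2)/#U, where U are the
    random positions of the users who rated the item and #U their count."""
    n = len(prefs)
    if n < 2:
        return None  # not enough preferences to place the item
    sum_u = sum(user_pos[u] for u in prefs)  # sum(U)
    sum_p = sum(prefs.values())              # sum(pref(u, i))
    return ((sum_u + sum_p / n) / 2) / n

user_pos = {"User1": 0.2, "User2": 0.6, "User3": 0.9}  # random User vector
prefs = {
    "Item1": {"User1": 0.3},
    "Item2": {"User1": 0.6, "User2": 0.1},
    "Item3": {"User1": 0.9, "User2": 0.4, "User3": 0.7},
}
for item, p in prefs.items():
    print(item, item_vector(user_pos, p))
```

Item1 gets no vector, Item2 comes out at 0.2875 and Item3 at 0.3944…, matching the tables.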
<div class="MsoNormal"><span style="font-size: 10pt;">The resulting semantic vectors projection:</span></div><div align="center"><br />
<table border="1" cellpadding="0" cellspacing="0" class="MsoTableList4" style="border-collapse: collapse; border: medium none; width: 407px;"><tbody>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: black -moz-use-text-color black black; border-style: solid none solid solid; border-width: 1.5pt medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><span style="font-size: 10pt;">Item1</span></div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><span style="font-size: 10pt;">Item2</span></div></td> <td nowrap="nowrap" style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><span style="font-size: 10pt;">Item3</span></div></td> <td style="border-color: black -moz-use-text-color; border-style: solid none; border-width: 1.5pt medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><br />
</div></td> <td style="border-color: black black black -moz-use-text-color; border-style: solid solid solid none; border-width: 1.5pt 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="center" class="MsoNormal" style="text-align: center;"><b><span style="font-size: 10pt;">User Vector</span></b></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.6</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.9</span></div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.2</span></b></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User2</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.1</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.4</span></div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.6</span></b></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><span style="font-size: 10pt;">User3</span></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><span style="font-size: 10pt;">0.7</span></div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.9</span></b></div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div class="MsoNormal"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> </tr>
<tr style="height: 12.75pt;"> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black black; border-style: none none solid solid; border-width: medium medium 1.5pt 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 71.25pt;" valign="top" width="95"><br />
<div class="MsoNormal"><b><span style="font-size: 10pt;">Item<br />
Vector</span></b></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 45pt;" valign="top" width="60"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.2875</span></b></div></td> <td nowrap="nowrap" style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.75in;" valign="top" width="72"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><b><span style="font-size: 10pt;">0.3944…</span></b></div></td> <td style="border-color: -moz-use-text-color -moz-use-text-color black; border-style: none none solid; border-width: medium medium 1.5pt; height: 12.75pt; padding: 0in 5.4pt; width: 0.25in;" valign="top" width="24"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> <td style="border-color: -moz-use-text-color black black -moz-use-text-color; border-style: none solid solid none; border-width: medium 1.5pt 1.5pt medium; height: 12.75pt; padding: 0in 5.4pt; width: 1in;" valign="top" width="96"><br />
<div align="right" class="MsoNormal" style="text-align: right;"><br />
</div></td> </tr>
</tbody></table><br />
<div style="text-align: left;"><span style="font-size: small;">Here is a very difficult-to-read graph of the relationships:</span></div><div style="text-align: left;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdZvwog8Ms_lICFtGAu_Hphj4y8jMPKYuqp38P0DwS3LSR0cp9bqa2FwogiPclbSCAkYiKCGkbA2_wyebIflWRG_r6pZqY_xYY47Yhu6IInMP9I6h6RlhrobOlTkUTyjP2UzsmeFeFvkQ/s1600/trash.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdZvwog8Ms_lICFtGAu_Hphj4y8jMPKYuqp38P0DwS3LSR0cp9bqa2FwogiPclbSCAkYiKCGkbA2_wyebIflWRG_r6pZqY_xYY47Yhu6IInMP9I6h6RlhrobOlTkUTyjP2UzsmeFeFvkQ/s320/trash.jpg" width="320" /></a></div><div style="text-align: left;"><br />
</div></div><div class="MsoNormal"><span style="font-size: small;"><b>Recommendations</b></span></div><span style="font-size: small;"><br />
</span><br />
<div class="MsoNormal"><span style="font-size: small;">The recommendations for the users are given by each user vector’s distance from each item vector:</span></div><div class="MsoNormal" style="margin-left: 0.25in; text-indent: -0.25in;"><span style="font-size: small;">·<span style="font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal;"> <br />
</span></span><span style="font-size: small;">User1 would be most interested in Item3, then Item2.</span></div><div class="MsoNormal" style="margin-left: 0.25in; text-indent: -0.25in;"><span style="font-size: small;">·<span style="font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal;"> <br />
</span></span><span style="font-size: small;">User2’s interests are in the same order, but User2 would find the whole list less interesting.</span></div><div class="MsoNormal" style="margin-left: 0.25in; text-indent: -0.25in;"><span style="font-size: small;">·<span style="font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal;"> <br />
</span></span><span style="font-size: small;">User3 would be most interested in Item2, then Item3.</span></div><div class="MsoNormal"><span style="font-size: small;"><br />
</span></div><div class="MsoNormal"><span style="font-size: small;"><b>Item-Item Similarity </b></span></div><span style="font-size: small;"><br />
</span><br />
<div class="MsoNormal"><span style="font-size: small;">The Item-Item similarities are the distances between the item vectors. Unfortunately, since Item1 has only one preference, it has no vector projection.</span></div><div class="MsoNormal" style="margin-left: 0.25in; text-indent: -0.25in;"><span style="font-size: small;"><span style="font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; line-height: normal;"> <br />
</span></span><span style="font-size: small;">Item2:Item3 = 0.11</span></div><div class="MsoNormal"><span style="font-size: small;"><b>User Vectors</b></span></div><div><span style="font-size: small;"><br />
</span></div><div class="MsoNormal"><span style="font-size: small;">The User-User distances are random; there is nothing to be learned from them.</span></div><span style="font-size: small;"><br />
</span><br />
<div class="MsoNormal"><span style="font-size: small;"><b>Summary</b></span></div><span style="font-size: small;"><br />
</span><br />
<div class="MsoPlainText"><span style="font-size: small;">The Semantic Vectors algorithm takes a matrix of Row->Column relationships and creates a set of Row-Column distances and a set of Column-Column relationships in a new, common numerical space. In this example, we created two recommenders: a User->Item recommender and an Item->Item recommender.</span></div><br />
</div>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com0tag:blogger.com,1999:blog-3095049372219674792.post-81584032297849768772010-11-26T02:47:00.000-08:002010-11-26T17:42:43.110-08:00Semantic Vectors - Part 1<span style="font-family: "Trebuchet MS", sans-serif;">A matrix represents a set of relationships between different rows and different columns. For </span><span style="font-family: "Trebuchet MS", sans-serif;">some purposes it is useful to derive from the matrix a set of relationships between the rows, or between the columns. </span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The <strong>Semantic Vectors</strong> algorithm projects a matrix onto two sets of vectors, one set for the rows and one set for the columns.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<ul type="disc"><li class="MsoNormal" style="mso-layout-grid-align: none; mso-list: l0 level1 lfo2; tab-stops: list .5in;"><span style="font-family: "Trebuchet MS", sans-serif;">Each vector corresponds to one row or column of the matrix.</span></li>
<li class="MsoNormal" style="mso-layout-grid-align: none; mso-list: l0 level1 lfo2; tab-stops: list .5in;"><span style="font-family: "Trebuchet MS", sans-serif;">Each vector is a point in space.</span></li>
<li class="MsoNormal" style="mso-layout-grid-align: none; mso-list: l0 level1 lfo2; tab-stops: list .5in;"><span style="font-family: "Trebuchet MS", sans-serif;">The distance between any two vectors corresponds to the delta between those rows or columns.</span></li>
</ul><div class="MsoNormal" style="mso-layout-grid-align: none; mso-list: l0 level1 lfo2; tab-stops: list .5in;"><span style="font-family: "Trebuchet MS", sans-serif;">The two sets of vectors exist in two “parallel universes” but in</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">matching positions.</span></div><span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">Semantic Vectors uses <b>Random Projection</b> (see other posts) to achieve this. The algorithm is one-pass, simple, and intuitive: each row vector "tugs" each column vector into a position that matches its relationship to all of the rows. The relative distances of the column vectors encode the average of their respective relationships with each row. (If the matrix has rank one, all of the projected vectors will be the same; all pairs will have a distance of zero. I am not sure about this.)</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif;">Definition</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The formal definition is: for all (Row, Column, Value) in R(0 -> 1), create a random number for each Row and each Column. These two sets of numbers are positions in two 1-dimensional parallel universes. Now, let the random position of each Row attract or repel the random position of each</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">Column, with the Value as the magnitude of attraction or repulsion.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<div style="text-align: center;"><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">C (in the parallel universe) = C + SUM for all r: (value * (r - C))/2</span></div><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;"><br />
</span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">In calculating the new C, the old C is ignored. If C = 0</span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">, then the function is:</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;"><br />
</span><br />
<div style="text-align: center;"><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">C = SUM for all r:R (value * r)/2</span></div><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;"><br />
</span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The informal explanation </span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">is that each row position tugs or repels the column position by one-half of its value for C: C moves half of the value-weighted distance between itself and each R. This guarantees that "lowest R" < C < "highest R". Each C is moved into a separate position in the C universe that corresponds to all of the R values for that C. The relative positions of all C express a (very) diluted representation of the relationships of all R vs. all C.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">User 1, +1.5, Item 1</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">User 2, +1.0, Item 1</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">User 1 at random position 0.1</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">User 2 at random position 0.6</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">Item position = </span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">(1.5 · 0.1)/2 + (1.0 · 0.6)/2 = 0.075 + 0.3 = <span style="font-size: large;">0.375</span></span><br />
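The worked example above can be sketched in a few lines of Python. This is a minimal sketch of the projection step; `project_column` is an illustrative name, not part of any library:

```python
def project_column(row_positions, values, c=0.0):
    """Project one column into the C universe: each row position tugs C
    by half of (value * distance to C); with c = 0 this reduces to
    SUM (value * r) / 2."""
    return c + sum(v * (r - c) / 2.0 for r, v in zip(row_positions, values))

# The example above: User 1 at 0.1 with value 1.5, User 2 at 0.6 with value 1.0
item_position = project_column([0.1, 0.6], [1.5, 1.0])  # 0.075 + 0.3 = 0.375
```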
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjv7nnthdao62PFJjwIiuhFp2sMn19q3uprQ9Gihne9lP-dO8TksXKnGDqreBaaD44sjcfdrfS97SfEVqExyUAKuskbfeOfFM99BmQhWq7ISx6L8pbmdH4NG72TC9Yg8pIUqoHej7yiuM/s1600/SemVecSketch.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="138" ox="true" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjv7nnthdao62PFJjwIiuhFp2sMn19q3uprQ9Gihne9lP-dO8TksXKnGDqreBaaD44sjcfdrfS97SfEVqExyUAKuskbfeOfFM99BmQhWq7ISx6L8pbmdH4NG72TC9Yg8pIUqoHej7yiuM/s320/SemVecSketch.jpg" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div><span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Resolution</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">Regard each R and C position as a 1-dimensional vector. In the above algorithm, the projected vectors are of length one, so the relationships are represented at poor resolution. The secret to making Semantic Vectors useful is to repeat the above operation many times, creating longer and longer vectors. Each index in the vector uses a different set of random positions for R, so the projected vectors contain uncorrelated placements at each index of the C vectors. The values at any single index N encode the distances at the same poor resolution, but combined they provide a quite serviceable representation of the original matrix.</span><br />
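The repetition can be sketched as follows. This is a hedged sketch, not Mahout's API: `semantic_vectors` and the sparse-triple input format are assumptions. Each index of the output vectors draws a fresh, independent set of random row positions:

```python
import random

def semantic_vectors(triples, n_rows, n_cols, dims, seed=0):
    """triples: (row, col, value) entries of a sparse matrix.
    Builds dims-dimensional column vectors; all columns share the same
    random row positions at index d, and each d draws fresh ones."""
    rng = random.Random(seed)
    col_vecs = [[0.0] * dims for _ in range(n_cols)]
    for d in range(dims):
        row_pos = [rng.random() for _ in range(n_rows)]
        for r, c, v in triples:
            col_vecs[c][d] += v * row_pos[r] / 2.0
    return col_vecs
```

Columns with identical row/value patterns land at identical positions at every index, so their distance is zero; unrelated columns drift apart as the vectors grow.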
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif; font-size: large;">Applications</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Recommendation systems</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">If the rows are users, the columns are items, and the matrix is populated with preferences, the projected vectors will encode item similarity in their relative distances. This is the </span><span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">basis of the SemanticVectorDataModel class: it stores all row vectors and column vectors, and the distance between two column vectors is their relative similarity. Overlaying the two universes, the item vector closest to a user's vector is the most interesting item for that user. Item-Item distances give item-to-item similarity, and the Item vectors can be clustered.</span><br />
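A minimal sketch of that lookup, assuming Euclidean distance (`nearest_item` is an illustrative name, not the SemanticVectorDataModel API):

```python
import math

def nearest_item(user_vec, item_vecs):
    """Overlay the two universes: return the index of the item vector
    closest to the user's vector."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(item_vecs)), key=lambda i: dist(user_vec, item_vecs[i]))
```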
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Word collocation</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">The rows are documents, the columns are words, and the matrix contents are the “collocation function”. A simple function is the number of times that word (C) appears in document (R). The resulting C Vectors encode the commonality of words in different documents. For example: if the rows represent titles of Disney</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">movies, and the columns represent the words in those titles, the C Vector for "Snow" will be nearest the vector for "White". (Assuming there are no other movies with 'snow' or 'white' in the title; at this point</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">there may be.) To find the similarity of documents, use words for the rows and documents for the columns.</span><br />
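Building the collocation matrix for that Disney-titles example might look like this. A sketch under the simple "count of word in document" function; the helper name is made up:

```python
from collections import Counter

def collocation_triples(documents):
    """Rows = documents, columns = words, value = count of the word in
    the document; returns sparse (row, col, value) triples plus the vocabulary."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    triples = []
    for row, doc in enumerate(documents):
        for word, count in Counter(doc.lower().split()).items():
            triples.append((row, index[word], float(count)))
    return triples, vocab
```

Feeding these triples through the projection above gives word vectors; "snow" and "white" share a row, so they receive the same tug at every index.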
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Number of Dimensions</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">To find the right number of dimensions for your application by hand, pick an</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">epsilon and drop dimensions while the vector distances still satisfy the epsilon.</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The smallest dimension count that still satisfies it is your lowest usable dimension.</span><br />
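One hand-rolled way to run that search (a sketch; it compares per-index-normalized Euclidean distances against the full-dimension distances, and this particular epsilon criterion is an assumption):

```python
import math

def lowest_usable_dimension(vectors, epsilon):
    """Drop trailing indices while every pairwise (normalized) distance
    stays within epsilon of its full-dimension value."""
    def dist(a, b, d):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:d], b[:d])) / d)

    full = len(vectors[0])
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    ref = {(i, j): dist(vectors[i], vectors[j], full) for i, j in pairs}
    d = full
    while d > 1 and all(abs(dist(vectors[i], vectors[j], d - 1) - ref[(i, j)]) <= epsilon
                        for i, j in pairs):
        d -= 1
    return d
```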
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif; font-size: large;">Random Projection</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">Random Projection projects a matrix onto another matrix (including vectors as one-row</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">matrices.) The algorithm populates a matrix with random numbers, and multiplies</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">the source matrix with it. The result is a projected matrix with (roughly) the</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">same rank as the source matrix. This allows the information encoded in the</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">source matrix to be projected with surprisingly good faithfulness into a</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">projected matrix of any shape. That is, if the source matrix is very low rank</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">(with an epsilon) the projected matrix will have a similar but degraded rank.</span><br />
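A bare-bones version of that multiplication (a pure-Python sketch; a Gaussian random matrix is one common choice for the multiplier):

```python
import random

def random_projection(matrix, out_cols, seed=0):
    """Project each row of `matrix` (a list of rows) into out_cols
    dimensions by multiplying with a seeded Gaussian random matrix."""
    rng = random.Random(seed)
    in_cols = len(matrix[0])
    proj = [[rng.gauss(0.0, 1.0) for _ in range(out_cols)] for _ in range(in_cols)]
    return [[sum(row[k] * proj[k][j] for k in range(in_cols)) for j in range(out_cols)]
            for row in matrix]
```

Because the projection is linear, exact structure such as "row 2 = row 0 + row 1" survives exactly; finer rank structure survives only approximately, which is the "similar but degraded rank" above.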
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Uses</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">If the projected matrix is smaller than the source matrix, this serves as a form</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">of dimensional reduction. If a random vector projects a source matrix onto a vector,</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">the projected vector will encode the information with very poor resolution. But</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">the resolution will be greater than zero, and this may be enough for some</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">algorithms.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif; font-size: large;">Implementation</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Order</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">Computation order is O(dimensions * rows * columns). Interior coefficients <em>(never forget the coefficients!)</em> are the time to iterate over the rows and columns, and the time to generate random numbers.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<b><span style="font-family: "Trebuchet MS", sans-serif;">Random Distributions and Distance</span></b><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">Euclidean and Manhattan distance measurements are both acceptable. Sets of random numbers</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">added or multiplied quickly converge on a normal distribution. If you use a</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">linear distribution for the original random R positions, the distances between</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">vectors will be in a normal distribution anyway, because the distance measures</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">add and multiply hundreds of random numbers. It will be easier to look at your</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">numbers in small test cases if you start with a normal distribution for your R</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">positions.</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif; font-size: large;">Mahout</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif;">RandomVector/RandomMatrix classes</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: &quot;Trebuchet MS&quot;, sans-serif;">The RandomVector and RandomMatrix classes (in a Mahout patch) supply the random row positions above. Only the seed values need be stored to reproduce the random multipliers.</span><br />
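The seed trick can be illustrated in a couple of lines (an illustration of the idea, not the Mahout patch itself):

```python
import random

def random_row_positions(seed, n_rows):
    """Regenerate identical random positions from a stored seed, so the
    positions themselves never need to be persisted."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_rows)]
```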
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<strong><span style="font-family: "Trebuchet MS", sans-serif;">SemanticVectors classes: DataModel/Recommenders/clusters</span></strong><br />
<span style="font-family: "Trebuchet MS", sans-serif;"><br />
</span><br />
<span style="font-family: "Trebuchet MS", sans-serif;">These create recommendations from a stored db of User and Item vector sets. The User vectors are created as needed from the seed configuration values.</span>Lance N.http://www.blogger.com/profile/05485293639426402989noreply@blogger.com1