Saturday, November 27, 2010

Semantic Vectors for Recommenders

Start with a User->Item preference matrix. The values are read in a “linear” interpretation: 0 is neutral, and -0.5 is twice as negative as -0.25. Blank values default to 0.




         Item1    Item2    Item3
User1    0.3      0.6      0.9
User2             0.1      0.4
User3                      0.7

Now, let's do a semantic vectors projection. The User vector is random:




         Random
User1    0.2
User2    0.6
User3    0.9

The formula for the Item outputs is:
Item(i) = ((sum(U) + sum(pref(u,i))/#U) / 2) / #U

where #U is the number of users expressing a preference for Item(i), sum(U) is the sum of those users' random positions, and sum(pref(u,i)) is the sum of their preferences for Item(i).


I1    Not enough preferences
I2    (((U1 + U2) + (pref(U1,I2) + pref(U2,I2))/#U)/2)/#U
I3    (((U1 + U2 + U3) + (pref(U1,I3) + pref(U2,I3) + pref(U3,I3))/#U)/2)/#U

I1    No vector
I2    (((0.2 + 0.6) + (0.6 + 0.1)/2)/2)/2
I3    (((0.2 + 0.6 + 0.9) + (0.9 + 0.4 + 0.7)/3)/2)/3

I1    No vector
I2    0.2875
I3    0.3944...
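
Here is a small Python sketch of that computation (the data structures and names are mine, purely for illustration); it reproduces the two item values above:

    # Item(i) = ((sum(U) + sum(pref(u,i))/#U) / 2) / #U,
    # where the sums run only over users with a preference for item i.
    prefs = {
        "Item2": {"User1": 0.6, "User2": 0.1},
        "Item3": {"User1": 0.9, "User2": 0.4, "User3": 0.7},
        # Item1 is omitted: with only one preference it gets no vector
    }
    user_random = {"User1": 0.2, "User2": 0.6, "User3": 0.9}

    def item_vector(item):
        users = prefs[item]
        n = len(users)                                # #U
        sum_u = sum(user_random[u] for u in users)    # sum(U)
        sum_pref = sum(users.values())                # sum(pref(u,i))
        return ((sum_u + sum_pref / n) / 2) / n

    for item in prefs:
        print(item, item_vector(item))    # Item2 0.2875, Item3 0.3944...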


The resulting semantic vectors projection:




               Item1    Item2     Item3      User Vector
User1          0.3      0.6       0.9        0.2
User2                   0.1       0.4        0.6
User3                             0.7        0.9
Item Vector             0.2875    0.3944...





Here is a very difficult-to-read graph of the relationships:


Recommendations


The recommendations for the users are their vector’s distance from each item vector:
· User1 would be most interested in Item3, then Item2.
· User2 has interests in the same order, but would find the whole list less interesting.
· User3 would be most interested in Item2, then Item3.

Item-Item Similarity


The Item-Item similarity set is the distances between the item vectors. Unfortunately, since Item1 has only one preference, it has no vector projection.

Item2:Item3 = 0.11
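
A quick check in the same illustrative Python as before confirms the number (the item values come from the table above):

    item_vectors = {"Item2": 0.2875, "Item3": 0.3944}

    # 1-dimensional projection, so the distance is just the absolute difference
    print(abs(item_vectors["Item3"] - item_vectors["Item2"]))    # ~0.11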
User Vectors

The User-User distances are random, so there is nothing to be learned from them.


Summary


The Semantic Vectors algorithm takes a matrix of Row->Column relationships and creates a set of Row-Column distances and a set of Column-Column distances in a new, common numerical space. In this example, we created two recommenders: a User->Item recommender and an Item->Item recommender.

Friday, November 26, 2010

Semantic Vectors - Part 1

A matrix represents a set of relationships between different rows and different columns. For some purposes it is useful to derive from the matrix a set of relationships between the rows, or between the columns.


The Semantic Vectors algorithm projects a matrix onto two sets of vectors, one set for the rows and one set for the columns.


  • Each vector corresponds to one row or column of the matrix.
  • Each vector is a point in space.
  • The distance between any two vectors corresponds to the delta between those rows or columns.
The two sets of vectors exist in two “parallel universes” but in matching positions.


Semantic Vectors uses Random Projection (see other posts) to achieve this. The algorithm is one-pass and is very simple and intuitive: each row vector "tugs" each column vector into a position that matches its relationship to all of the rows. The relative distances of the column vectors encode the average of their respective relationships with each row. (If the matrix has a rank of one, all of the projected vectors will be the same, and all pairs will have a distance of zero. I'm not sure about this.)


Definition


The formal definition is: for all (Row, Column, Value) in R(0 -> 1), create a random number for each Row and each Column. These two sets of numbers are positions in two 1-dimensional parallel universes. Now let the random position of each Row attract or repel the random position of each Column, with the Value as the magnitude of the attraction or repulsion.


C (in the parallel universe) = C + SUM over all r in R: (value * (r - C)) / 2

In calculating the new C, the old C is ignored (it is taken as 0), so the function reduces to:

C = SUM over all r in R: (value * r) / 2

The informal explanation is that each row position tugs or repels the column position according to 1/2 of its value for C: C moves one-half of the distance between C and each R, scaled by that value. This guarantees that "lowest R" < C < "highest R". Each C is moved into a separate position in the C universe that corresponds to all of the R values for that C. The relative positions of all C express a (very) diluted representation of the relationships of all R vs. all C.


User 1, +1.5, Item 1
User 2, +1.0, Item 1


User 1 at random position 0.1
User 2 at random position 0.6
Item position = (1.5 * 0.1)/2 + (1.0 * 0.6)/2 = 0.075 + 0.3 = 0.375
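
The same calculation in a few lines of Python (names are mine; this assumes, as the worked numbers do, that each term is value * (r - C) / 2 with C starting at 0):

    triples = [("User1", 1.5, "Item1"), ("User2", 1.0, "Item1")]
    positions = {"User1": 0.1, "User2": 0.6}

    def project_item(item, c0=0.0):
        # every tug is computed from the starting position c0 (the old C is ignored)
        return c0 + sum(value * (positions[user] - c0) / 2
                        for user, value, it in triples if it == item)

    print(project_item("Item1"))    # 0.375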







Resolution


Regard each R and C position as a 1-dimensional vector. In the above algorithm, the projected vectors have length one, so the relationships are represented with poor resolution. The secret to making Semantic Vectors useful is to repeat the above operation many times to create longer and longer vectors. Each index in the vector uses different random positions for R. With different random positions for R, the projected vectors will contain uncorrelated placements at each index of the C vectors. Each individual index holds the distances at the same poor resolution, but combined they provide a quite serviceable representation of the original matrix.
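
A sketch of that repetition in Python (my own function names, assuming the single-pass formula from the definition above, applied once per index with a different random seed):

    import random

    def project_columns(matrix, rows, cols, dims):
        # matrix maps (row, col) -> value; returns a dims-length vector per column
        vectors = {c: [] for c in cols}
        for d in range(dims):
            rng = random.Random(d)                     # a different seed per index
            r_pos = {r: rng.random() for r in rows}    # fresh random row positions
            for c in cols:
                tug = sum(v * r_pos[r] / 2
                          for (r, cc), v in matrix.items() if cc == c)
                vectors[c].append(tug)
        return vectors

    matrix = {("User1", "Item1"): 1.5, ("User2", "Item1"): 1.0}
    print(project_columns(matrix, ["User1", "User2"], ["Item1"], dims=4))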


Applications


Recommendation systems


If the rows are users, the columns are items, and the matrix is populated with preferences, the projected vectors will encode, in their respective distances, the similarity of two items. This is the basis of the SemanticVectorDataModel class: it includes all row vectors and column vectors, and the distance between two column vectors is their relative similarity. Overlaying the universes, the closest item vector is the most interesting item for a user. Item-Item distances give that similarity. The Item vectors can be clustered.
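
As a sketch of that nearest-item lookup, using the values from the worked recommender example (this is an illustration only, not the actual SemanticVectorDataModel API):

    user_vectors = {"User1": 0.2, "User2": 0.6, "User3": 0.9}
    item_vectors = {"Item2": 0.2875, "Item3": 0.3944}

    def nearest_item(user):
        # the closest item vector is the most interesting item for the user
        return min(item_vectors,
                   key=lambda i: abs(user_vectors[user] - item_vectors[i]))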



Word collocation


The rows are documents, the columns are words, and the matrix contents are the “collocation function”. A simple function is the number of times that word (C) appears in document (R). The resulting C vectors encode the commonality of words in different documents. For example: if the rows represent titles of Disney movies, and the columns represent the words in those titles, the C vector for "Snow" will be nearest the vector for "White". (Assuming there are no other movies with 'snow' or 'white' in the title; at this point there may be.) To find the similarity of documents, use words for the rows and documents for the columns.
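
A sketch of building such a matrix in Python (the titles are just placeholders for illustration):

    titles = ["Snow White", "The Jungle Book", "Beauty and the Beast"]

    # rows are documents (titles), columns are words,
    # and the value is the count of that word in that document
    matrix = {}
    for doc, title in enumerate(titles):
        for word in title.lower().split():
            matrix[(doc, word)] = matrix.get((doc, word), 0) + 1

The projection step is then the same as in the earlier sketch, with documents as rows and words as columns.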


Number of Dimensions


To find the right number of dimensions for your application by hand, pick an epsilon and reduce the number of dimensions while the results still satisfy that epsilon. The smallest dimension that does is your lowest usable dimension.
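
One way to read that procedure (my interpretation; the post does not spell out the error measure) is to compare the column-to-column distances at a candidate dimension against a high-dimension reference, and keep lowering the dimension while the worst-case error stays under epsilon:

    def lowest_usable_dimension(distances_at, reference_dim, epsilon):
        # distances_at(d) must return comparable (e.g. per-dimension normalized)
        # pairwise column distances computed with d dimensions
        ref = distances_at(reference_dim)
        d = reference_dim
        while d > 1:
            trial = distances_at(d - 1)
            if max(abs(a - b) for a, b in zip(trial, ref)) > epsilon:
                break
            d -= 1
        return d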


Random Projection


Random Projection projects a matrix onto another matrix (including vectors, treated as one-row matrices). The algorithm populates a matrix with random numbers and multiplies the source matrix by it. The result is a projected matrix with (roughly) the same rank as the source matrix. This allows the information encoded in the source matrix to be projected, with surprisingly good faithfulness, into a projected matrix of any shape. That is, if the source matrix is very low rank (within an epsilon), the projected matrix will have a similar but degraded rank.
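
A minimal sketch of that multiplication in plain Python (no libraries; a Gaussian random matrix is one common choice, though the post does not prescribe the distribution):

    import random

    def random_projection(source, target_cols, seed=42):
        # source is a list of rows (lists); project each row into target_cols dimensions
        rng = random.Random(seed)
        n_cols = len(source[0])
        r = [[rng.gauss(0, 1) for _ in range(target_cols)] for _ in range(n_cols)]
        return [[sum(row[i] * r[i][j] for i in range(n_cols))
                 for j in range(target_cols)]
                for row in source]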


Uses


If the projected matrix is smaller than the source matrix, this serves as a form of dimensional reduction. If a random vector projects the source matrix onto a single vector, the projected vector will encode the information with very poor resolution. But the resolution will be greater than zero, and this may be enough for some algorithms.


Implementation


Order


Computation order is O(dimensions * rows * columns). The interior coefficients (never forget the coefficients!) are the time to iterate over the rows and columns and the time to generate random numbers.


Random Distributions and Distance


Euclidean and Manhattan distance measurements may both be equally acceptable. Sets of random numbers that are added or multiplied quickly converge on a normal distribution. If you use a linear distribution for the original random R positions, the distances between vectors will be in a normal distribution anyway, because the distance measures add and multiply hundreds of random numbers. It will be easier to look at your numbers in small test cases if you start with a normal distribution for your R positions.


Mahout


RandomVector/RandomMatrix classes


The RandomVector and RandomMatrix classes (in a Mahout patch) supply the random row positions above. Only the seed values need be stored to reproduce the random multipliers.
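
The seed idea in miniature (an illustration only, not the actual API from the Mahout patch):

    import random

    class SeededRandomVector:
        # only the seed and size are stored; the values are regenerated on demand
        def __init__(self, seed, size):
            self.seed, self.size = seed, size

        def values(self):
            rng = random.Random(self.seed)
            return [rng.random() for _ in range(self.size)]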


SemanticVectors classes: DataModel/Recommenders/clusters


These create recommendations from a stored db of User and Item vector sets. The User vectors are created as needed from the seed configuration values.

Monday, November 15, 2010

KMeans cluster testing - part the second

Now we come to the ulterior motive for this project: Locality-Sensitive Hashing. This means creating a set of regularly spaced "corners" in an N-dimensional space, and mapping all points to those corners. Think of it as slicing N-dimensional space into boxes: N-dimensional cubes are the easiest to imagine.
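
A sketch of mapping a point to the "lower left corner" of its box, assuming axis-aligned cubes of a fixed width:

    import math

    def lower_left_corner(point, box_width):
        # snap each coordinate down to the nearest multiple of box_width
        return tuple(math.floor(x / box_width) * box_width for x in point)

    print(lower_left_corner((0.73, 1.28, -0.4), 0.5))    # (0.5, 1.0, -0.5)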

When I "hash" all of my training set against the "lower left corners" of the N-dimensional boxes, I get the same number of points as the original. I then clustered these "corners" via KMeans. The output looked as I would expect:


(I had to use KNime's KMeans for this chart.)



But now the fun begins. When I cluster the corner points using the Canopies from the training points, I get this bizarre result:



 It clearly means something.

Wednesday, November 10, 2010

KMeans cluster testing

This is from one set of tests for the validity of a data manipulation exercise. All four charts are from 2D projections (via MDS) from a 150-dimension space.

The raw data was generated from training data, test data, and random data. A Canopy clustering algorithm generated 40 clusters from the training set. This creates a rough approximation for the clustering. All three data sets were clustered with KMeans starting with this set. These charts are the clustered output.

This first chart is the KMeans output from the canopy vector set applied against the training data itself.



The raw data all have a normal distribution, and so one would expect the training-data KMeans output to also have a normal distribution. And so it does. Oddly, it also spirals outward. Is this a quirk of Canopy, of KMeans, or of the combination?

The second chart is the test data clustered using KMeans, but with the Canopy starting set from the training data.



The third chart is the KMeans/canopy process applied to random vectors with the same distribution.



The fourth chart is the canopy point set itself. The canopy set is 33 vectors, while the KMeans outputs are all 40 vectors. The above charts are all in the same space, but these vectors came out scaled differently.




I interpret all of this to mean that my training vectors have real information. The test data run through the training Canopy/KMeans grinder came out with roughly the same shape. The random data came out a little more scrambled.

I'm intrigued by the canopy distribution. Does the KMeans algorithm essentially spin the canopy output?

[The Canopy and KMeans are Mahout's in-memory versions. The center vectors from KMeans are plotted (instead of centroid vectors). The charts and MDS reduction were done in KNime (www.knime.org).]

Why?

Because technical mailing lists don't allow picture attachments.