Friday, November 26, 2010

Semantic Vectors - Part 1

A matrix represents a set of relationships between different rows and different columns. For some purposes it is useful to derive from the matrix a set of relationships between the rows, or between the columns.


The Semantic Vectors algorithm projects a matrix onto two sets of vectors, one set for the rows and one set for the columns.


  • Each vector corresponds to one row or column of the matrix.
  • Each vector is a point in space.
  • The distance between any two vectors corresponds to the delta between those rows or columns.
  • The two sets of vectors exist in two “parallel universes”, but in matching positions.


Semantic Vectors uses Random Projection (see other posts) to achieve this. The algorithm is one-pass and is very simple and intuitive: each row vector "tugs" each column vector into a position that matches its relationship to all of the rows. The relative distances of the column vectors encode the average of their respective relationships with each row. (If the matrix has a rank of one, all of the projected vectors will be the same and all pairs will have a distance of zero; I'm not sure about this.)


Definition


The formal definition is: for every (Row, Column, Value) in the matrix, create a random number in the range (0, 1) for each Row and each Column. These two sets of numbers are positions in two one-dimensional parallel universes. Now allow the random position of each Row to attract or repel the random position of each Column, with the Value as the magnitude of attraction or repulsion.


C (in the parallel universe) = C + SUM for all r: (value * (r - C))/2

In calculating the new C, the old C is ignored. Treating C as 0, the function becomes:


C = SUM for all r in R: (value * r)/2

The informal explanation is that each row position tugs or repels the column position according to one-half of its value for that C, so C moves one-half of the distance between itself and each R. This guarantees that "lowest R" < C < "highest R". Each C is moved into a separate position in the C universe that corresponds to all of the R values for that C, and the relative positions of all of the Cs express a (very) diluted representation of the relationships of all R vs. all C.


User 1, +1.5, Item 1
User 2, +1.0, Item 1


User 1 at random position 0.1
User 2 at random position 0.6
Item position = (1.5 * 0.1)/2 + (1.0 * 0.6)/2 = 0.075 + 0.3 = 0.375
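
Here is a minimal sketch of that single pass in Java, using the numbers from the example above; the class and method names are illustrative and are not taken from any Mahout code.

// Sketch of the one-dimensional projection described above. Each row (user)
// gets a random position; each column (item) position is the sum over all
// rows of value * rowPosition / 2. Names are illustrative only.
public class OneDimensionalProjection {

    // values[r][c] is the matrix entry for row r, column c (0 where absent).
    static double[] projectColumns(double[][] values, double[] rowPositions) {
        int columns = values[0].length;
        double[] columnPositions = new double[columns];
        for (int r = 0; r < values.length; r++) {
            for (int c = 0; c < columns; c++) {
                columnPositions[c] += values[r][c] * rowPositions[r] / 2.0;
            }
        }
        return columnPositions;
    }

    public static void main(String[] args) {
        double[][] prefs = { {1.5}, {1.0} };      // User 1 and User 2, one item
        double[] userPositions = { 0.1, 0.6 };    // the random positions from the example
        double[] itemPositions = projectColumns(prefs, userPositions);
        System.out.println("Item 1 position = " + itemPositions[0]);  // 0.375
    }
}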







Resolution


Regard each R and C position as a one-dimensional vector. In the above algorithm the projected vectors have length one, so the relationships are represented with poor resolution. The secret to making Semantic Vectors useful is to repeat the above operation many times to create longer and longer vectors. Each index in the vector uses different random positions for R; with different random positions, the projected vectors contain uncorrelated placements at each index of the C vectors. The values at any single index encode the distances at the same poor resolution, but combined they provide a quite serviceable representation of the original matrix.
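
A sketch of how that repetition might look in Java; the class name, the Gaussian row positions, and the seeding scheme are assumptions for illustration, not the Mahout patch's actual code.

import java.util.Random;

// Sketch: build N-dimensional column vectors by running the one-dimensional
// projection once per index, each time with fresh random row positions.
public class MultiPassProjection {

    static double[][] columnVectors(double[][] values, int dimensions, long seed) {
        int rows = values.length, columns = values[0].length;
        double[][] vectors = new double[columns][dimensions];
        Random rng = new Random(seed);              // only the seed needs to be stored
        for (int d = 0; d < dimensions; d++) {
            double[] rowPositions = new double[rows];
            for (int r = 0; r < rows; r++) {
                rowPositions[r] = rng.nextGaussian();   // different random positions per index
            }
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < columns; c++) {
                    vectors[c][d] += values[r][c] * rowPositions[r] / 2.0;
                }
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        double[][] prefs = { {1.5, 0.0}, {1.0, 2.0} };   // two users, two items
        double[][] items = columnVectors(prefs, 100, 42L);
        System.out.println("Item 0, first index: " + items[0][0]);
    }
}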


Applications


Recommendation systems


If the rows are users, the columns are items, and the matrix is populated with preferences, the projected vectors will encode, in their respective distances, the similarity of any two items. This is the basis of the SemanticVectorDataModel class: it includes all row vectors and column vectors, and the distance between two column vectors is their relative similarity. Overlaying the two universes, the item vector closest to a user's vector is the most interesting item for that user, and Item-Item distances give item similarity. The Item vectors can also be clustered.
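
A sketch of that nearest-item lookup in Java; the class and method names below are illustrative and are not the actual SemanticVectorDataModel API, and Euclidean distance is just one reasonable choice of measure.

// Sketch: recommend the item whose vector is nearest to the user's vector
// when the two universes are overlaid. Names are illustrative only.
public class NearestItemRecommender {

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the item vector closest to the user vector.
    static int recommend(double[] userVector, double[][] itemVectors) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < itemVectors.length; i++) {
            double d = euclidean(userVector, itemVectors[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] user = { 0.3, 0.5 };
        double[][] items = { { 0.9, 0.1 }, { 0.25, 0.55 } };
        System.out.println("Recommend item " + recommend(user, items));   // prints 1
    }
}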



Word collocation


The rows are documents, the columns are words, and the matrix contents are the “collocation function”. A simple function is the number of times that word (C) appears in document (R). The resulting C vectors encode the commonality of words across documents. For example: if the rows represent titles of Disney movies, and the columns represent the words in those titles, the C vector for "Snow" will be nearest the vector for "White". (Assuming there are no other movies with 'snow' or 'white' in the title; at this point there may be.) To find the similarity of documents, use words for the rows and documents for the columns.
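
A sketch of that simple collocation function in Java; the movie titles and the map-based layout are only for illustration.

import java.util.HashMap;
import java.util.Map;

// Sketch: build the document-by-word count matrix used as the collocation function.
public class CollocationCounts {
    public static void main(String[] args) {
        String[] titles = { "Snow White", "White Fang" };   // documents are the rows
        // counts.get(doc).get(word) is the matrix value for (row = doc, column = word)
        Map<Integer, Map<String, Integer>> counts = new HashMap<>();
        for (int doc = 0; doc < titles.length; doc++) {
            Map<String, Integer> row = new HashMap<>();
            for (String word : titles[doc].toLowerCase().split("\\s+")) {
                row.merge(word, 1, Integer::sum);
            }
            counts.put(doc, row);
        }
        // prints the per-title word counts, e.g. {0={snow=1, white=1}, 1={white=1, fang=1}}
        System.out.println(counts);
    }
}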


Number of Dimensions


To find the right number of dimensions for your application by hand, pick an epsilon and lower the number of dimensions while the results still satisfy the epsilon. The smallest dimension that still satisfies the epsilon is your lowest usable dimension.
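
One way that hand procedure could be automated, sketched in Java; the worst-case relative-error measure and the sqrt(D/d) rescaling are my assumptions, not part of the original description.

// Sketch: find the lowest dimension whose pairwise distances stay within
// epsilon of the full-length distances.
public class LowestUsableDimension {

    static int lowestUsableDimension(double[][] vectors, double epsilon) {
        int full = vectors[0].length;
        for (int d = 1; d <= full; d++) {
            if (worstRelativeError(vectors, d) <= epsilon) {
                return d;                      // the lowest dimension that satisfies epsilon
            }
        }
        return full;
    }

    // Compare distances using only the first d indices against full-length distances.
    static double worstRelativeError(double[][] vectors, int d) {
        int full = vectors[0].length;
        double scale = Math.sqrt((double) full / d);   // rescale truncated distances for comparison
        double worst = 0;
        for (int i = 0; i < vectors.length; i++) {
            for (int j = i + 1; j < vectors.length; j++) {
                double fullDistance = distance(vectors[i], vectors[j], full);
                double shortDistance = distance(vectors[i], vectors[j], d) * scale;
                if (fullDistance > 0) {
                    worst = Math.max(worst, Math.abs(fullDistance - shortDistance) / fullDistance);
                }
            }
        }
        return worst;
    }

    static double distance(double[] a, double[] b, int length) {
        double sum = 0;
        for (int i = 0; i < length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}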


Random Projection


Random Projection projects a matrix onto another matrix (including vectors, as one-row matrices). The algorithm populates a matrix with random numbers and multiplies
the source matrix with it. The result is a projected matrix with (roughly) the
same rank as the source matrix. This allows the information encoded in the
source matrix to be projected with surprisingly good faithfulness into a
projected matrix of any shape. That is, if the source matrix is very low rank
(with an epsilon) the projected matrix will have a similar but degraded rank.
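
A minimal sketch of that multiplication in Java; the Gaussian random numbers and the toy rank-1 matrix are assumptions for illustration.

import java.util.Random;

// Sketch: project an m-by-n source matrix down to m-by-k by multiplying it
// with an n-by-k matrix of random numbers.
public class RandomProjectionSketch {

    static double[][] project(double[][] source, int k, long seed) {
        int m = source.length, n = source[0].length;
        Random rng = new Random(seed);
        double[][] random = new double[n][k];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                random[i][j] = rng.nextGaussian();
            }
        }
        double[][] projected = new double[m][k];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < k; j++) {
                for (int x = 0; x < n; x++) {
                    projected[i][j] += source[i][x] * random[x][j];
                }
            }
        }
        return projected;
    }

    public static void main(String[] args) {
        double[][] source = { {1, 2, 3, 4}, {2, 4, 6, 8} };   // rank-1 toy matrix
        double[][] smaller = project(source, 2, 7L);          // projected down to 2 columns
        // the second projected row is still exactly twice the first: the rank-1 structure survives
        System.out.println(smaller[0][0] + " " + smaller[1][0]);
    }
}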


Uses


If the projected matrix is smaller than the source matrix, this serves as a form
of dimensional reduction. If a random vector is used to project a source matrix onto a single vector, the projected vector will encode the information with very poor resolution. But
the resolution will be greater than zero, and this may be enough for some
algorithms.


Implementation


Order


Computation order is O(dimensions * rows * columns). Interior coefficients (never forget the coefficients!) are the time to iterate over the rows and columns, and the time to generate random numbers.


Random Distributions and Distance


Euclidean or Manhattan distance measurements may both be equally acceptable. Sets of random numbers added or multiplied quickly converge on a normal distribution. If you use a linear distribution for the original random R positions, the distances between vectors will be in a normal distribution anyway, because the distance measures add and multiply hundreds of random numbers. It will be easier to look at your numbers in small test cases if you start with a normal distribution for your R positions.
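
In Java, for example, the two starting distributions are just different calls on java.util.Random.

import java.util.Random;

// Sketch: linear (uniform) versus normal starting distributions for the R positions.
public class RowPositionDistributions {
    public static void main(String[] args) {
        Random rng = new Random(42L);
        double linearPosition = rng.nextDouble();     // linear distribution in [0, 1)
        double normalPosition = rng.nextGaussian();   // normal distribution, mean 0, std dev 1
        System.out.println(linearPosition + " " + normalPosition);
    }
}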


Mahout


RandomVector/RandomMatrix classes


The RandomVector and RandomMatrix classes (in a Mahout patch) supply the random row positions above. Only the seed values need be stored to reproduce the random multipliers.
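
That reproducibility only requires a seeded generator; a sketch in Java follows (the real RandomVector constructor and API may differ).

import java.util.Arrays;
import java.util.Random;

// Sketch: with a stored seed, the same random multipliers can be regenerated
// on demand instead of being stored themselves.
public class SeededPositions {

    static double[] positions(int count, long seed) {
        Random rng = new Random(seed);
        double[] p = new double[count];
        for (int i = 0; i < count; i++) {
            p[i] = rng.nextGaussian();
        }
        return p;
    }

    public static void main(String[] args) {
        double[] a = positions(5, 1234L);
        double[] b = positions(5, 1234L);
        System.out.println(Arrays.equals(a, b));   // true: only the seed needs storing
    }
}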


SemanticVectors classes: DataModel/Recommenders/clusters


These create recommendations from a stored db of User and Item vector sets. The User vectors are created as needed from the seed configuration values.

1 comment:

  1. This is the Semantic Vectors implementation that inspired this project:

    Code repository
    http://code.google.com/p/semanticvectors/

    Dominic Widdows, Kathleen Ferraro
    General Description:
    http://www.google.com/research/pubs/archive/33477.pdf

    Package Documentation for Google code:
    http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en/us/research/pubs/archive/33477.pdf
