Monday, August 15, 2011

Singular vectors for recommendations

This is a project to research:
  1. Reproducing these results: http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/
  2. Correlating the feature vectors and singular values from an SVD-based recommender to the generated vectors for users and items.
This was inspired by a lecture by one of the top 5 in the Netflix contest, who demonstrated axes of interest: chick flicks vs. Star Trek, Harry Potter vs. Stanley Kubrick, etc. These clusters in the full item space sit at the endpoints of vectors which can be derived from the feature vectors and singular values.

TestOpposites.java

This program and the following chart are my recreation of the raw data from the article above. BTJF are the original user/item values used to create the projection: Ben, Tom, Jeff, Fred. Bob, Love and Hate are three additional users: Bob is the new user from the article, and Love and Hate are users who love and hate all six seasons.

"Singular" and "Singular Div" are the first two feature vector columns of the SVD left-hand matrix. They are orthogonal. In the later chart, "Shifted data", we will use them to find "axes of interest" for the different items.

Raw data:
Shifted Data:
And now, the magic. The space of users is centered on 0. The two feature vectors are mirrored across 0,0, and the two orthogonal singular/feature vectors are downsized by their singular values. Normalized to add up to one, the first singular value is 0.6 and the second is 0.25. In this chart, we take the original positions of the feature vectors and multiply them by 0.6 (yellow triangles) and 0.25 (red asterisks), and I've drawn lines between the downsized versions. Now we have the four original users who established this space, three new users who are projected into this space, and the two major axes of interest.
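Here is a minimal sketch of the arithmetic behind that chart, assuming the singular values are rescaled to sum to one and that each axis runs from the scaled feature-vector position to its mirror image across 0,0. It is not the code in TestOpposites.java: the class and method names are hypothetical, and the numbers in main are made up (the singular values are chosen only so that the first two normalize to 0.6 and 0.25).

```java
import java.util.Arrays;

final class AxesOfInterest {

    /** Rescales the singular values so that they sum to one. */
    static double[] normalize(double[] singularValues) {
        double sum = Arrays.stream(singularValues).sum();
        return Arrays.stream(singularValues).map(s -> s / sum).toArray();
    }

    /** Subtracts the mean point so the cloud of users is centered on 0,0. */
    static double[][] center(double[][] points) {
        int dims = points[0].length;
        double[] mean = new double[dims];
        for (double[] p : points) {
            for (int d = 0; d < dims; d++) {
                mean[d] += p[d] / points.length;
            }
        }
        double[][] centered = new double[points.length][dims];
        for (int i = 0; i < points.length; i++) {
            for (int d = 0; d < dims; d++) {
                centered[i][d] = points[i][d] - mean[d];
            }
        }
        return centered;
    }

    /** Returns both ends of one axis: the downsized point and its mirror image across 0,0. */
    static double[][] axis(double[] point, double weight) {
        double[] plus = new double[point.length];
        double[] minus = new double[point.length];
        for (int d = 0; d < point.length; d++) {
            plus[d] = weight * point[d];
            minus[d] = -weight * point[d];
        }
        return new double[][] {plus, minus};
    }

    public static void main(String[] args) {
        // Made-up singular values and a made-up feature-vector position.
        double[] weights = normalize(new double[] {6.0, 2.5, 1.0, 0.5});
        double[][] dominantAxis = axis(new double[] {0.5, -0.3}, weights[0]);
        System.out.println(Arrays.deepToString(dominantAxis));
    }
}
```

Drawing a line between the two points returned by axis gives one "axis of interest"; calling it again with the second weight gives the other.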

Observations:
  1. The four original users (Billy-Bob, Trimolchio, Jenga and Ferdinand) all had somewhat orthogonal item ratings, and come out in an arc. One of them is far from the others on a large circle, and the feature vectors make sense given the "gravity" of the four users on the circle.
  2. Bob also had an item rating vector with the same style of pluses and minuses as BTJF, and appears at an expected place.
  3. The singular feature vectors (yellow triangles, red asterisks) do give "axes of interest" that make sense vs. BTJF and Bob.
  4. Love and Hate are the nearest to the two ends of the dominant axis of interest. They are also nearly between the endpoints of the axes of interest.
Conclusions:
  • This technique gives two results:
    • It supplies axes of interest.
    • It allows a new user to describe himself based on the major axes of interest.
  • There is a fine yellow line between Love and Hate.
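
The second result, a new user describing himself along the axes, comes down to folding that user's ratings into the reduced space. A minimal sketch, assuming the usual fold-in formula (ratings times the first k feature columns, divided by the matching singular values), which as far as I can tell is also how the linked article places Bob. The class name, the method name, and every number below are hypothetical; the real example has six seasons, not three items.

```java
import java.util.Arrays;

final class FoldIn {

    /**
     * Projects one user's ratings into the k-dimensional feature space.
     * itemFeatures is items x k (the first k columns of the left-hand matrix);
     * singularValues holds the matching k singular values.
     */
    static double[] project(double[] ratings, double[][] itemFeatures, double[] singularValues) {
        int k = singularValues.length;
        double[] position = new double[k];
        for (int f = 0; f < k; f++) {
            double dot = 0.0;
            for (int item = 0; item < ratings.length; item++) {
                dot += ratings[item] * itemFeatures[item][f];
            }
            position[f] = dot / singularValues[f];
        }
        return position;
    }

    public static void main(String[] args) {
        // Made-up 3-item, 2-feature space with orthonormal columns.
        double[][] itemFeatures = {{0.6, 0.8}, {0.8, -0.6}, {0.0, 0.0}};
        double[] singularValues = {2.0, 1.0};
        double[] love = {5, 5, 5}; // assumes "loves everything" means the top rating everywhere
        double[] hate = {1, 1, 1}; // assumes "hates everything" means the bottom rating everywhere
        System.out.println(Arrays.toString(project(love, itemFeatures, singularValues)));
        System.out.println(Arrays.toString(project(hate, itemFeatures, singularValues)));
    }
}
```

Under these assumptions Love and Hate have proportional rating vectors, so they land on the same line through 0,0, at different distances from it.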

Sunday, August 14, 2011

Sorted recommender data

These are some images from experiments in sorting a ratings data model. This post has a few sorted versions of the GroupLens 1 million sample database. Green and red are sequence values in the sorted output; they help visualize dense to sparse. Red is dense, green is sparse. 5% of men have red-green colorblindness.

Sorted by user:

Sorted by item:


Sorted by user and item:

Why is this interesting?
The lower left, red corner, has popular items. The upper right has the "long tail". I am a long tail guy; I don't really care about popular movies. Japanese female assassin movies, BBC comedy, Brazilian horror movies are on the list. An acquaintance received recommendations from Amazon for French Post-Structuralist literature (I don't know either) and pornographic comic books. "Someone finally understands me!"
A recommendation for me should be biased to the upper right. This sorting gives one algorithm to add to the pile.
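
A rough sketch of the sorting idea, assuming "dense" simply means "has many ratings": count the ratings per user and per item, then order each from most-rated to least-rated. This is not the code in SortDataModel.java; the class, the method, and the tiny ratings table are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DensitySort {

    /**
     * Returns the distinct ids found in the given column of {user, item} pairs,
     * most-rated first; ties are broken by ascending id.
     */
    static List<Long> sortByCount(long[][] ratings, int column) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long[] rating : ratings) {
            counts.merge(rating[column], 1, Integer::sum);
        }
        List<Long> ids = new ArrayList<>(counts.keySet());
        ids.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a)); // descending count
            return byCount != 0 ? byCount : Long.compare(a, b);          // ascending id on ties
        });
        return ids;
    }

    public static void main(String[] args) {
        // Made-up {user, item} pairs; column 0 is the user, column 1 is the item.
        long[][] ratings = {{1, 10}, {1, 11}, {1, 12}, {2, 10}, {2, 11}, {3, 10}};
        System.out.println("users by density: " + sortByCount(ratings, 0));
        System.out.println("items by density: " + sortByCount(ratings, 1));
    }
}
```

An id's position in the returned list can serve as its coordinate in the sorted plot, so a long-tail bias would favor items with a high position.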

Details
Program: SortDataModel.java, ModelComparator.java and maybe some other things in this repository:
https://github.com/LanceNorskog/LSH-Hadoop/blob/master/extras/mahout/test/java/org/apache/mahout/cf/taste/impl/common/SortDataModel.java

Visuals by KNime.

Koren on recommenders - comments about temporal changes in rating data

http://videolectures.net/kdd09_koren_cftd/

Koren talks about why the timestamps of watching/rating events are important. In general, an item's ratings change slowly over time, while a user's ratings can change suddenly.

Slides stolen: