I started with vectors which represent items in a data model. (See my Semantic Vectors posts.) They are from 100k entries in the GroupLens film ratings dataset, representing ratings by 669 users on 3274 items. The 200 vectors are semi-randomly chosen from the item pool. All are 200 dimensions, dense. Euclidean distances signify nearness for Item-Item recommendation. The item-item distances give good ratings in the 150-200 dimension range, so it is reasonable to pick 200 dimensions as having rich support. The subset of item vectors have strong user-user bonds, so the recommendation structure is somewhat visible. This was a mistake, but I like it.
The project uses simple 2D visualization to decide whether random projection is as good as other, older methods of dimensional reduction. I used KNime's MDS implementation. MDS (Multi-Dimensional Scaling) is an older algorithm for creating a projection from a high dimensional space to a low dimensional, retaining pairwise distances. MDS is tuneable for the amount of information dropped.
Each test run uses the same 200 vectors in the same order. Each vector has the same color each time (from vivid green to black to vivid red).
The main axis of investigation is between three projection methods: complete Random Projection, partial Random Projection, and complete MDS. The following scatter charts show the vectors transformed to 2d vectors, and shown in a scatter-plot (X and Y are generated dimensions, color is the vector).
- Random projection from 200 dimensions to 2 dimensions.
- Random projection from 200 dimensions to 20 dimensions, then use MDS to boil down to 2 dimensions.
- Use MDS to convert from 200 dimensions to two dimensions.
Full Random Projection from 200 dimensions to 2 dimensions (Gaussian)
Random Projection from 200D to 20D, then MDS to 2D (Gaussian, KNime MDS implementation)
Full MDS from 200d to 2d (KNime MDS Implementation)
- Using full random projection does not completely trash the data, but it's challenging to interpret.
- Downsizing the data to 20 dimensions and then using MDS the rest of the way is somewhat cleaner, and good enough to work with. It's also faster.
- Full MDS shows a strikingly clear picture of the item-item structure.
Onward to Part 2: there is more than one random projection.
I did these diagrams with the KNime visual programming app for data mining. All Hail KNime!