Wednesday, November 10, 2010

KMeans cluster testing

This is from one set of tests for the validity of a data manipulation exercise. All four charts are 2D projections (via MDS) of a 150-dimensional space.
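For readers who have not met MDS, here is a minimal sketch of the idea in plain Python. It uses classical (Torgerson) MDS with power iteration, which is an assumption made for illustration: it is not KNIME's implementation, and the function name `classical_mds` and the stand-in random data are hypothetical.

```python
import math
import random


def classical_mds(points, dims=2, iterations=500):
    """Project points to `dims` coordinates while trying to keep pairwise distances."""
    n = len(points)
    # Squared Euclidean distance between every pair of input points.
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    row_mean = [sum(row) / n for row in d2]
    total_mean = sum(row_mean) / n
    # Double centering turns squared distances into an inner-product matrix.
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(0)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the current leading eigenvector of b.
        # Convergence is slow when the leading eigenvalues are close together.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        bv = [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
        eigenvalue = max(0.0, sum(x * y for x, y in zip(v, bv)))
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(eigenvalue)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Stand-in data: 40 normally distributed vectors in 150 dimensions.
data_rng = random.Random(1)
centers_150d = [[data_rng.gauss(0.0, 1.0) for _ in range(150)] for _ in range(40)]
centers_2d = classical_mds(centers_150d)
```

Each row of `centers_2d` is an (x, y) pair that could be fed to a scatter plot. With 150 dimensions squeezed into 2, the distances are only approximated, so shapes in such a chart are worth reading with some caution.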

The raw data comes in three sets: training data, test data, and random data. A Canopy clustering algorithm generated 40 clusters from the training set, which gives a rough first approximation of the clustering. All three data sets were then clustered with KMeans, starting from this canopy set. These charts are the clustered output.
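The pipeline above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Mahout's code: the thresholds `t1` and `t2`, the 5-dimensional stand-in data, and all function names are hypothetical, and this toy KMeans always returns exactly one center per seed.

```python
import math
import random


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def canopy_centers(points, t1, t2):
    """Single-pass canopy clustering with loose threshold t1 > tight threshold t2."""
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Every point within t1 of the seed contributes to this canopy's center.
        members = [p for p in remaining if distance(seed, p) <= t1]
        centers.append(mean(members))
        # Points within t2 of the seed are removed and cannot start a new canopy.
        remaining = [p for p in remaining if distance(seed, p) > t2]
    return centers


def kmeans(points, initial_centers, iterations=10):
    """Lloyd's algorithm, started from the given centers instead of random ones."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


rng = random.Random(0)


def normal_vectors(count, dims):
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(count)]


training = normal_vectors(200, 5)
test = normal_vectors(200, 5)

# Canopies come from the training set only; both runs start from them.
seeds = canopy_centers(training, t1=3.0, t2=2.0)
training_centers = kmeans(training, seeds)
test_centers = kmeans(test, seeds)
```

The point of seeding from canopies is that KMeans starts from centers that already roughly cover the data, rather than from random picks.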

This first chart is the KMeans output from the canopy vector set applied against the training data itself.



The raw data all have a normal distribution, so one would expect the training data KMeans output to have a normal distribution as well. And so it does. Oddly, it also has an outward spiral shape. Is this a quirk of Canopy, of KMeans, or of the combination?

The second chart is the test data clustered using KMeans, but with the Canopy starting set from the training data.



The third chart is the KMeans/canopy process applied to random vectors with the same distribution.



The fourth chart is the canopy point set itself. The canopy set is 33 vectors, while the KMeans outputs are all 40 vectors. The above charts are all in the same space, but these vectors came out scaled differently.




I interpret all of this to mean that my training vectors have real information. The test data run through the training Canopy/KMeans grinder came out with roughly the same shape. The random data came out a little more scrambled.
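"Roughly the same shape" is an eyeball judgment. One possible way to put a number on it, offered only as a sketch, is to measure how far each KMeans center moved between two runs that started from the same seed. The function name `mean_center_shift` and the toy values are hypothetical, and the sketch assumes the two center lists are in the same seed order, which I have not verified for Mahout's output.

```python
import math


def mean_center_shift(centers_a, centers_b):
    """Average Euclidean distance between centers that grew from the same seed."""
    if len(centers_a) != len(centers_b):
        raise ValueError("center lists must be the same length")
    shifts = [math.dist(a, b) for a, b in zip(centers_a, centers_b)]
    return sum(shifts) / len(shifts)


# Toy values, not the real data: two centers per run, in 2D.
training_run = [[0.0, 0.0], [4.0, 0.0]]
test_run = [[0.0, 1.0], [4.0, 1.0]]
random_run = [[3.0, 4.0], [4.0, 5.0]]

test_shift = mean_center_shift(training_run, test_run)      # 1.0
random_shift = mean_center_shift(training_run, random_run)  # 5.0
```

A smaller shift for the test run than for the random run would be consistent with the reading above, though a single number like this is no substitute for looking at the charts.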

I'm intrigued by the canopy distribution. Does the KMeans algorithm essentially spin the canopy output?

[The Canopy and KMeans are Mahout's in-memory versions. The center vectors from KMeans are plotted (instead of centroid vectors). The charts and MDS reduction were done in KNIME (www.knime.org).]
