K-Means Clustering Tutorial
This guide will show how to use one of Tribuo’s clustering models to find clusters in a toy dataset drawn from a mixture of Gaussians. We'll look at Tribuo's K-Means implementation and also discuss how evaluation works for clustering tasks.
Setup
We'll load in some jars and import a few packages.
%jars ./tribuo-clustering-kmeans-4.3.0-jar-with-dependencies.jar
import org.tribuo.*;
import org.tribuo.util.Util;
import org.tribuo.clustering.*;
import org.tribuo.clustering.evaluation.*;
import org.tribuo.clustering.example.GaussianClusterDataSource;
import org.tribuo.clustering.kmeans.*;
import org.tribuo.clustering.kmeans.KMeansTrainer.Initialisation;
import org.tribuo.math.distance.DistanceType;
var eval = new ClusteringEvaluator();
Dataset
Tribuo's clustering package comes with a simple data source that emits data sampled from a mixture of 5 2-dimensional Gaussians (the dimensionality of the Gaussians can be in the range 1 to 4, and the means and variances can also be set arbitrarily). This source sets the ground truth cluster IDs, so it can be used to measure clustering performance for demos like this. You can also use any of the standard data loaders to pull in clustering data.
Because it conforms to the standard Trainer and Model interfaces used for the rest of Tribuo, training a clustering algorithm doesn't produce visible cluster assignments. To recover the assignments we need to call model.predict(trainData).
We're going to sample two datasets (using different seeds), one for fitting the cluster centroids, and one to measure clustering performance.
var data = new MutableDataset<>(new GaussianClusterDataSource(500, 1L));
var test = new MutableDataset<>(new GaussianClusterDataSource(500, 2L));
The defaults for the data source are:
N([ 0.0,0.0], [[1.0,0.0],[0.0,1.0]])
N([ 5.0,5.0], [[1.0,0.0],[0.0,1.0]])
N([ 2.5,2.5], [[1.0,0.5],[0.5,1.0]])
N([10.0,0.0], [[0.1,0.0],[0.0,0.1]])
N([-1.0,0.0], [[1.0,0.0],[0.0,0.1]])
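To make the N(mean, covariance) notation above concrete, here is a minimal sketch of drawing one point from a single 2-dimensional Gaussian using a Cholesky factor of the covariance. It uses only the Java standard library; the class and method names are hypothetical and this is not how GaussianClusterDataSource is implemented.

```java
import java.util.Random;

final class GaussianSampleSketch {
    /**
     * Draws one point from N(mean, cov), where cov is a 2x2 symmetric
     * positive-definite covariance matrix. Hypothetical helper for illustration.
     */
    static double[] sample(Random rng, double[] mean, double[][] cov) {
        // Cholesky factor L (lower triangular) such that L * L^T == cov.
        double l11 = Math.sqrt(cov[0][0]);
        double l21 = cov[1][0] / l11;
        double l22 = Math.sqrt(cov[1][1] - l21 * l21);
        // Two independent standard normal draws.
        double z1 = rng.nextGaussian();
        double z2 = rng.nextGaussian();
        // x = mean + L * z has the requested mean and covariance.
        return new double[] {
            mean[0] + l11 * z1,
            mean[1] + l21 * z1 + l22 * z2
        };
    }

    public static void main(String[] args) {
        Random rng = new Random(1L);
        // Third default component: mean (2.5, 2.5) with correlated features.
        double[] mean = {2.5, 2.5};
        double[][] cov = {{1.0, 0.5}, {0.5, 1.0}};
        double[] point = sample(rng, mean, cov);
        System.out.println(point[0] + ", " + point[1]);
    }
}
```

A mixture sample would first pick one of the five components and then call a sampler like this one; the mixing weights are not listed above, so they are left out of the sketch.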
Model Training
We'll first fit a K-Means model using 5 centroids, a maximum of 10 iterations, the Euclidean distance and a single computation thread.
var l2Dist = DistanceType.L2.getDistance();
var trainer = new KMeansTrainer(5, /* centroids */
                                10, /* iterations */
                                l2Dist, /* distance function */
                                1, /* number of compute threads */
                                1 /* RNG seed */
                               );
var startTime = System.currentTimeMillis();
var model = trainer.train(data);
var endTime = System.currentTimeMillis();
System.out.println("Training with 5 clusters took " + Util.formatDuration(startTime,endTime));
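For intuition about what a K-Means trainer does with the centroid count and the iteration limit, here is a minimal sketch of the textbook Lloyd's algorithm on plain arrays: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. This is an illustration under stated assumptions, not Tribuo's implementation; the class name, the method names and the choice to leave an empty cluster's centroid where it was are all hypothetical.

```java
import java.util.Arrays;

final class LloydSketch {
    static double squaredL2(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }

    /** Assignment step: index of the nearest centroid for each point. */
    static int[] assign(double[][] points, double[][] centroids) {
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double dist = squaredL2(points[i], centroids[c]);
                if (dist < best) {
                    best = dist;
                    assignment[i] = c;
                }
            }
        }
        return assignment;
    }

    /** Update step: each centroid becomes the mean of its assigned points. */
    static double[][] update(double[][] points, int[] assignment, double[][] old) {
        int k = old.length;
        int d = old[0].length;
        double[][] sums = new double[k][d];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[assignment[i]]++;
            for (int j = 0; j < d; j++) {
                sums[assignment[i]][j] += points[i][j];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int j = 0; j < d; j++) {
                // Assumption: an empty cluster keeps its previous centroid.
                sums[c][j] = counts[c] == 0 ? old[c][j] : sums[c][j] / counts[c];
            }
        }
        return sums;
    }

    /** Runs at most maxIterations rounds, stopping early once assignments stop changing. */
    static double[][] fit(double[][] points, double[][] initial, int maxIterations) {
        double[][] centroids = new double[initial.length][];
        for (int c = 0; c < initial.length; c++) {
            centroids[c] = initial[c].clone();
        }
        int[] previous = null;
        for (int it = 0; it < maxIterations; it++) {
            int[] current = assign(points, centroids);
            if (Arrays.equals(current, previous)) {
                break;
            }
            centroids = update(points, current, centroids);
            previous = current;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] initial = {{0, 0}, {10, 0}};
        // prints [[0.0, 1.0], [10.0, 1.0]]
        System.out.println(Arrays.deepToString(fit(points, initial, 10)));
    }
}
```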
We can inspect the centroids by querying the model.
var centroids = model.getCentroids();
for (var centroid : centroids) {
    System.out.println(centroid);
}
These centroids line up pretty well with the means of the Gaussians. The predicted centroids map to the true ones as follows:
Predicted | True |
---|---|
1 | 5 |
2 | 3 |
3 | 1 |
4 | 2 |
5 | 4 |
Though the first one is a bit far out, as its "A" feature should be -1.0 not -1.7, and there is a little wobble in the rest. Still, it's pretty good considering K-Means assumes spherical Gaussians and our data generator has a covariance matrix per Gaussian.
K-Means++
Tribuo also includes the K-Means++ initialisation algorithm, which we can run on our toy problem as follows:
var plusplusTrainer = new KMeansTrainer(5,10,l2Dist,Initialisation.PLUSPLUS,1,1);
var startTime = System.currentTimeMillis();
var plusplusModel = plusplusTrainer.train(data);
var endTime = System.currentTimeMillis();
System.out.println("Training with 5 clusters took " + Util.formatDuration(startTime,endTime));
The training time isn't much different in this case, but the K-Means++ initialisation does take longer than the default on larger datasets. However, the resulting clusters are usually better.
We can check the centroids from this model using the same method as before.
var ppCentroids = plusplusModel.getCentroids();
for (var centroid : ppCentroids) {
    System.out.println(centroid);
}
We can see in this case that the K-Means++ initialisation has warped the centroids slightly, so the fit isn't quite as nice as the default initialisation, but that's why we have evaluation data and measure model fit. K-Means++ usually improves the fit of a K-Means clustering, but it might be too complicated for this simple toy dataset.
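As a rough picture of what a K-Means++ style initialisation does, the sketch below picks the first centroid uniformly at random and then picks each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far. It follows the standard published description of the algorithm rather than Tribuo's source, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

final class PlusPlusSketch {
    static double squaredL2(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }

    /** Chooses k initial centroids from the points using squared-distance weighting. */
    static double[][] initialise(double[][] points, int k, Random rng) {
        int n = points.length;
        double[][] centroids = new double[k][];
        // First centroid: uniform random choice.
        centroids[0] = points[rng.nextInt(n)].clone();
        // minDist[i] tracks the squared distance from point i to its nearest chosen centroid.
        double[] minDist = new double[n];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i], squaredL2(points[i], centroids[c - 1]));
                total += minDist[i];
            }
            // Sample an index with probability minDist[i] / total.
            double threshold = rng.nextDouble() * total;
            double cumulative = 0.0;
            int chosen = n - 1; // fallback if every remaining distance is zero
            for (int i = 0; i < n; i++) {
                cumulative += minDist[i];
                if (cumulative > threshold) {
                    chosen = i;
                    break;
                }
            }
            centroids[c] = points[chosen].clone();
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 0}, {0, 0}, {7, 7}};
        System.out.println(Arrays.deepToString(initialise(points, 2, new Random(1L))));
    }
}
```

Because points that coincide with an already chosen centroid have zero weight, the initial centroids tend to be spread out, which is the usual explanation for why this initialisation often gives a better final fit.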
Model evaluation
Tribuo uses the normalized mutual information to measure the agreement between two clusterings. This measure is invariant to relabelling, so swapping the id numbers of the centroids doesn't change the score as long as the overall clustering is the same. We're going to compare against the ground truth cluster labels from the data generator.
First for the training data:
var trainEvaluation = eval.evaluate(model,data);
trainEvaluation.toString();
Then for the unseen test data:
var testEvaluation = eval.evaluate(model,test);
testEvaluation.toString();
As expected, the clustering agrees well with the ground truth labels. K-Means (of the kind implemented in Tribuo) is similar to a Gaussian mixture using spherical Gaussians, and our data generator uses Gaussians with full rank covariances, so it won't be perfect.
We can also check the K-Means++ model in the same way:
var testPlusPlusEvaluation = eval.evaluate(plusplusModel,test);
testPlusPlusEvaluation.toString();
As expected given the slightly poorer centroids this initialisation produced, the fit is not quite as good. However, we emphasise that K-Means++ usually improves the quality of the clustering, so it's worth testing out if you're clustering data with Tribuo.
Multithreading
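To show why normalized mutual information does not care which id number a cluster is given, here is a small standard-library sketch that computes it from two integer labellings. It assumes labels are 0-based, uses natural logarithms and normalises by the arithmetic mean of the two entropies; Tribuo's evaluator may use a different normaliser, and the class and method names here are hypothetical.

```java
final class NmiSketch {
    /** Shannon entropy (in nats) of a label distribution given by counts. */
    static double entropy(int[] counts, int n) {
        double h = 0.0;
        for (int count : counts) {
            if (count > 0) {
                double p = (double) count / n;
                h -= p * Math.log(p);
            }
        }
        return h;
    }

    /** Normalized mutual information between two labellings of the same points. */
    static double normalizedMI(int[] predicted, int[] truth, int numPredicted, int numTrue) {
        int n = predicted.length;
        int[][] joint = new int[numPredicted][numTrue];
        int[] predCounts = new int[numPredicted];
        int[] trueCounts = new int[numTrue];
        for (int i = 0; i < n; i++) {
            joint[predicted[i]][truth[i]]++;
            predCounts[predicted[i]]++;
            trueCounts[truth[i]]++;
        }
        // MI = sum over (a, b) of p(a, b) * log(p(a, b) / (p(a) * p(b))).
        double mi = 0.0;
        for (int a = 0; a < numPredicted; a++) {
            for (int b = 0; b < numTrue; b++) {
                if (joint[a][b] > 0) {
                    double pab = (double) joint[a][b] / n;
                    double ratio = (double) joint[a][b] * n / ((double) predCounts[a] * trueCounts[b]);
                    mi += pab * Math.log(ratio);
                }
            }
        }
        double denominator = entropy(predCounts, n) + entropy(trueCounts, n);
        // Assumption: two single-cluster labellings are treated as identical.
        return denominator == 0.0 ? 1.0 : 2.0 * mi / denominator;
    }

    public static void main(String[] args) {
        int[] truth = {0, 0, 1, 1, 2, 2};
        int[] relabelled = {2, 2, 0, 0, 1, 1}; // same grouping, different ids
        System.out.println(normalizedMI(relabelled, truth, 3, 3));
    }
}
```

In the example the two labellings group the points identically but use different ids, so the score is 1 up to floating point rounding, while two statistically independent labellings score 0.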
Tribuo's K-Means supports multi-threading of both the expectation and maximisation steps in the algorithm (i.e., the assignment of points to centroids, and the finding of the new centroids). We'll run the same experiment as before, both with 5 centroids and with 20 centroids, using 4 threads, though this time we'll use 2000 points for training.
var mtData = new MutableDataset<>(new GaussianClusterDataSource(2000, 1L));
var mtTrainer = new KMeansTrainer(5,10,l2Dist,4,1);
var mtStartTime = System.currentTimeMillis();
var mtModel = mtTrainer.train(mtData);
var mtEndTime = System.currentTimeMillis();
System.out.println("Training with 5 clusters on 4 threads took " + Util.formatDuration(mtStartTime,mtEndTime));
Now with 20 centroids:
var overTrainer = new KMeansTrainer(20,10,l2Dist,4,1);
var overStartTime = System.currentTimeMillis();
var overModel = overTrainer.train(mtData);
var overEndTime = System.currentTimeMillis();
System.out.println("Training with 20 clusters on 4 threads took " + Util.formatDuration(overStartTime,overEndTime));
We can evaluate the two models as before, using our ClusteringEvaluator. First with 5 centroids:
var mtTestEvaluation = eval.evaluate(mtModel,test);
mtTestEvaluation.toString();
Then with 20:
var overTestEvaluation = eval.evaluate(overModel,test);
overTestEvaluation.toString();
We see that the multi-threaded versions run in about the same time as the single-threaded trainer, despite having 4 times the training data. The 20 centroid model has a tighter fit of the test data, though it is overparameterised. This is common in clustering tasks, where it's hard to balance model fit against complexity. In future releases we'll look at adding more performance metrics so users can diagnose such issues.
Conclusion
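The assignment of points to centroids parallelises easily because each point can be handled independently. The sketch below shows one way to do that with a fixed-size ForkJoinPool and a parallel stream, and checks the idea against a sequential loop. It is an illustration only, not Tribuo's threading code; the names are hypothetical, and running a parallel stream inside a custom pool relies on commonly used but not formally specified behaviour of the stream implementation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

final class ParallelAssignSketch {
    /** Index of the centroid closest to the point under squared Euclidean distance. */
    static int nearest(double[] point, double[][] centroids) {
        int bestIndex = 0;
        double best = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centroids[c][j];
                dist += diff * diff;
            }
            if (dist < best) {
                best = dist;
                bestIndex = c;
            }
        }
        return bestIndex;
    }

    static int[] assignSequential(double[][] points, double[][] centroids) {
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            assignment[i] = nearest(points[i], centroids);
        }
        return assignment;
    }

    /** Same result as assignSequential, computed on a pool of numThreads workers. */
    static int[] assignParallel(double[][] points, double[][] centroids, int numThreads) {
        ForkJoinPool pool = new ForkJoinPool(numThreads);
        try {
            // Each index is independent, so the map needs no synchronisation.
            Callable<int[]> task = () -> IntStream.range(0, points.length)
                    .parallel()
                    .map(i -> nearest(points[i], centroids))
                    .toArray();
            return pool.submit(task).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] centroids = {{0, 1}, {10, 1}};
        int[] assignment = assignParallel(points, centroids, 4);
        System.out.println(java.util.Arrays.toString(assignment));
    }
}
```

The centroid update can be parallelised too, but it needs per-thread partial sums that are merged afterwards, which is why it is left out of this short sketch.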
We looked at clustering using Tribuo's K-Means implementation, experimented with different initialisations, and compared both the single-threaded and multi-threaded versions. Then we looked at the performance metrics available when there are ground truth clusterings.
We added HDBSCAN in Tribuo v4.2, and we plan to further expand Tribuo's clustering functionality to incorporate other algorithms in the future. If you want to help, or have specific algorithmic requirements, file an issue on our GitHub page.