K-Means Clustering Tutorial

This guide will show how to use one of Tribuo’s clustering models to find clusters in a toy dataset drawn from a mixture of Gaussians. We'll look at Tribuo's K-Means implementation and also discuss how evaluation works for clustering tasks.

Setup

We'll load in some jars and import a few packages.

In [1]:
%jars ./tribuo-clustering-kmeans-4.3.0-jar-with-dependencies.jar
In [2]:
import org.tribuo.*;
import org.tribuo.util.Util;
import org.tribuo.clustering.*;
import org.tribuo.clustering.evaluation.*;
import org.tribuo.clustering.example.GaussianClusterDataSource;
import org.tribuo.clustering.kmeans.*;
import org.tribuo.clustering.kmeans.KMeansTrainer.Initialisation;
import org.tribuo.math.distance.DistanceType;
In [3]:
var eval = new ClusteringEvaluator();

Dataset

Tribuo's clustering package comes with a simple data source that emits data sampled from a mixture of five 2-dimensional Gaussians (the dimensionality of the Gaussians can be set in the range 1 - 4, and the means & variances can also be set arbitrarily). This source sets the ground truth cluster IDs, so it can be used to measure clustering performance for demos like this. You can also use any of the standard data loaders to pull in clustering data.

As K-Means conforms to the standard Trainer and Model interfaces used in the rest of Tribuo, training a clustering model doesn't directly expose the cluster assignments; to recover the assignments we need to call model.predict(trainData).

We're going to sample two datasets (using different seeds): one for fitting the cluster centroids, and one for measuring clustering performance.

In [4]:
var data = new MutableDataset<>(new GaussianClusterDataSource(500, 1L));
var test = new MutableDataset<>(new GaussianClusterDataSource(500, 2L));

The defaults for the data source are:

  1. N([ 0.0,0.0], [[1.0,0.0],[0.0,1.0]])
  2. N([ 5.0,5.0], [[1.0,0.0],[0.0,1.0]])
  3. N([ 2.5,2.5], [[1.0,0.5],[0.5,1.0]])
  4. N([10.0,0.0], [[0.1,0.0],[0.0,0.1]])
  5. N([-1.0,0.0], [[1.0,0.0],[0.0,0.1]])
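As a rough illustration of what this source produces, here's a minimal plain-Java sketch of sampling from a 2-dimensional Gaussian mixture. This is not Tribuo's GaussianClusterDataSource; it assumes uniform mixing weights and diagonal covariances, so it only covers components like 1 and 2 above (component 3's off-diagonal covariance would need a Cholesky factor of the covariance matrix to sample correctly).

```java
import java.util.Random;

// Minimal sketch of a 2-d Gaussian mixture sampler (hypothetical helper, not part of Tribuo).
public class GaussianMixture {
    // Means and diagonal standard deviations for the first two default components above.
    static final double[][] MEANS = {{0.0, 0.0}, {5.0, 5.0}};
    static final double[][] STDDEVS = {{1.0, 1.0}, {1.0, 1.0}};

    // Draw one point: pick a component uniformly, then sample each dimension independently.
    static double[] sample(Random rng) {
        int c = rng.nextInt(MEANS.length);
        return new double[] {
            MEANS[c][0] + STDDEVS[c][0] * rng.nextGaussian(),
            MEANS[c][1] + STDDEVS[c][1] * rng.nextGaussian()
        };
    }

    public static void main(String[] args) {
        Random rng = new Random(1L);
        for (int i = 0; i < 3; i++) {
            double[] p = sample(rng);
            System.out.printf("(%.3f, %.3f)%n", p[0], p[1]);
        }
    }
}
```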

Model Training

We'll first fit a K-Means model using 5 centroids, a maximum of 10 iterations, the Euclidean (L2) distance, and a single computation thread.

In [5]:
var l2Dist = DistanceType.L2.getDistance();
var trainer = new KMeansTrainer(5, /* centroids */
                                10, /* iterations */
                                l2Dist, /* distance function */
                                1, /* number of compute threads */
                                1 /* RNG seed */
                               );
var startTime = System.currentTimeMillis();
var model = trainer.train(data);
var endTime = System.currentTimeMillis();
System.out.println("Training with 5 clusters took " + Util.formatDuration(startTime,endTime));
Training with 5 clusters took (00:00:00:071)

We can inspect the centroids by querying the model.

In [6]:
var centroids = model.getCentroids();
for (var centroid : centroids) {
    System.out.println(centroid);
}
[(A, -1.729407), (B, -0.019281)]
[(A, 2.740410), (B, 2.873769)]
[(A, 0.051021), (B, 0.075766)]
[(A, 5.174978), (B, 5.088150)]
[(A, 9.938804), (B, -0.020702)]

These centroids line up pretty well with the true Gaussian means, with the predicted ones mapping to the true ones as follows:

Predicted   True
1           5
2           3
3           1
4           2
5           4

Though the first one is a bit off, as its "A" feature should be -1.0 rather than -1.7, and there is a little wobble in the rest. Still, it's pretty good considering K-Means assumes spherical Gaussians and our data generator has a full covariance matrix per Gaussian.
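The mapping above can be recovered mechanically by assigning each fitted centroid to its nearest true mean under the Euclidean distance. A small plain-Java sketch (not a Tribuo API), with the centroid values copied from the output above:

```java
public class CentroidMatch {
    // Squared Euclidean distance between two 2-d points.
    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Index of the mean nearest to p under Euclidean distance.
    static int nearest(double[] p, double[][] means) {
        int best = 0;
        for (int j = 1; j < means.length; j++) {
            if (sqDist(p, means[j]) < sqDist(p, means[best])) best = j;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] fitted = {{-1.729407, -0.019281}, {2.740410, 2.873769},
                             {0.051021, 0.075766}, {5.174978, 5.088150},
                             {9.938804, -0.020702}};
        double[][] trueMeans = {{0.0, 0.0}, {5.0, 5.0}, {2.5, 2.5},
                                {10.0, 0.0}, {-1.0, 0.0}};
        // Prints the Predicted -> True mapping from the table above.
        for (int i = 0; i < fitted.length; i++) {
            System.out.println("Predicted " + (i + 1) + " -> True "
                               + (nearest(fitted[i], trueMeans) + 1));
        }
    }
}
```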

K-Means++

Tribuo also includes the K-Means++ initialisation algorithm, which we can run on our toy problem as follows:

In [7]:
var plusplusTrainer = new KMeansTrainer(5,10,l2Dist,Initialisation.PLUSPLUS,1,1);
var startTime = System.currentTimeMillis();
var plusplusModel = plusplusTrainer.train(data);
var endTime = System.currentTimeMillis();
System.out.println("Training with 5 clusters took " + Util.formatDuration(startTime,endTime));
Training with 5 clusters took (00:00:00:038)

The training time isn't much different in this case, but the K-Means++ initialisation does take longer than the default initialisation on larger datasets. However, the resulting clusters are usually better.
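For intuition, K-Means++ chooses the first centroid uniformly at random from the data, then picks each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far, which tends to spread the initial centroids out. A minimal plain-Java sketch of this seeding step (not Tribuo's implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PlusPlusSeed {
    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Choose k initial centroids from points using D^2 weighting.
    static List<double[]> seed(double[][] points, int k, Random rng) {
        List<double[]> centroids = new ArrayList<>();
        centroids.add(points[rng.nextInt(points.length)]); // first centroid: uniform
        while (centroids.size() < k) {
            // Squared distance from each point to its nearest chosen centroid.
            double[] d2 = new double[points.length];
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                double best = Double.POSITIVE_INFINITY;
                for (double[] c : centroids) best = Math.min(best, sqDist(points[i], c));
                d2[i] = best;
                total += best;
            }
            // Sample a point with probability proportional to its squared distance.
            double r = rng.nextDouble() * total;
            int chosen = 0;
            double acc = d2[0];
            while (acc < r && chosen < points.length - 1) {
                chosen++;
                acc += d2[chosen];
            }
            centroids.add(points[chosen]);
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0.1, 0}, {5, 5}, {5.1, 5}, {10, 0}};
        for (double[] c : seed(pts, 3, new Random(42L))) {
            System.out.printf("(%.1f, %.1f)%n", c[0], c[1]);
        }
    }
}
```

Because the weighting favours points far from the existing centroids, widely separated clusters are much more likely to each receive an initial centroid than under uniform random initialisation.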

We can check the centroids from this model using the same method as before.

In [8]:
var ppCentroids = plusplusModel.getCentroids();
for (var centroid : ppCentroids) {
    System.out.println(centroid);
}
[(A, -1.567863), (B, -0.029534)]
[(A, 9.938804), (B, -0.020702)]
[(A, 3.876203), (B, 3.930657)]
[(A, 0.399868), (B, 0.330537)]
[(A, 5.520480), (B, 5.390406)]

We can see in this case that the K-Means++ initialisation has shifted the centroids slightly, so the fit isn't quite as good as with the default initialisation, but that's why we hold out evaluation data and measure model fit. K-Means++ usually improves the fit of a K-Means clustering, but it may not help on a simple toy dataset like this one.

Model evaluation

Tribuo uses the normalized mutual information to measure the agreement between two clusterings. This measure is invariant to permutations of the cluster ids, which matters because swapping the id numbers of two centroids doesn't change the overall clustering. We're going to compare against the ground truth cluster labels from the data generator.
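To make the metric concrete, here's a minimal plain-Java sketch of normalized mutual information between two flat label assignments. Note that several normalizations exist; this one uses the geometric mean of the entropies, and Tribuo's implementation may differ in its details.

```java
import java.util.HashMap;
import java.util.Map;

public class NMI {
    // Shannon entropy (in nats) of a discrete label assignment.
    static double entropy(int[] labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0, n = labels.length;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * Math.log(p);
        }
        return h;
    }

    // Mutual information between two label assignments over the same points.
    static double mutualInformation(int[] a, int[] b) {
        int n = a.length;
        Map<Integer, Integer> ca = new HashMap<>(), cb = new HashMap<>();
        Map<Long, Integer> joint = new HashMap<>();
        for (int i = 0; i < n; i++) {
            ca.merge(a[i], 1, Integer::sum);
            cb.merge(b[i], 1, Integer::sum);
            // Pack the label pair into one long key for the joint counts.
            joint.merge(((long) a[i] << 32) | (b[i] & 0xFFFFFFFFL), 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : joint.entrySet()) {
            int ai = (int) (e.getKey() >> 32);
            int bi = (int) (long) e.getKey();
            double pxy = e.getValue() / (double) n;
            double px = ca.get(ai) / (double) n;
            double py = cb.get(bi) / (double) n;
            mi += pxy * Math.log(pxy / (px * py));
        }
        return mi;
    }

    // Normalized MI using the geometric mean of the entropies.
    static double nmi(int[] a, int[] b) {
        double denom = Math.sqrt(entropy(a) * entropy(b));
        return denom == 0.0 ? 1.0 : mutualInformation(a, b) / denom;
    }

    public static void main(String[] args) {
        int[] truth = {0, 0, 1, 1, 2, 2};
        int[] sameUpToRelabel = {2, 2, 0, 0, 1, 1}; // ids swapped, same clustering
        System.out.println(nmi(truth, sameUpToRelabel)); // ~1.0: invariant to id swaps
    }
}
```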

First for the training data:

In [9]:
var trainEvaluation = eval.evaluate(model,data);
trainEvaluation.toString();
Out[9]:
Clustering Evaluation
Normalized MI = 0.8128096132028937
Adjusted MI = 0.8113314999600724

Then for the unseen test data:

In [10]:
var testEvaluation = eval.evaluate(model,test);
testEvaluation.toString();
Out[10]:
Clustering Evaluation
Normalized MI = 0.8154291916732408
Adjusted MI = 0.8139169341974347

We see that, as expected, there's a pretty good correspondence with the ground truth labels. K-Means (of the kind implemented in Tribuo) is similar to a Gaussian mixture using spherical Gaussians, while our data generator uses Gaussians with full rank covariances, so it won't be perfect.

We can also check the K-Means++ model in the same way:

In [11]:
var testPlusPlusEvaluation = eval.evaluate(plusplusModel,test);
testPlusPlusEvaluation.toString();
Out[11]:
Clustering Evaluation
Normalized MI = 0.7881995472105396
Adjusted MI = 0.7864797287891137

As expected given the slightly poorer quality centroids this initialisation produced, the fit isn't quite as good. However, we emphasise that K-Means++ usually improves the quality of the clustering, and so it's worth testing out if you're clustering data with Tribuo.

Multithreading

Tribuo's K-Means supports multi-threading of both the expectation and maximisation steps in the algorithm (i.e., the assignment of points to centroids, and the computation of the new centroids). We'll run the same experiment as before, both with 5 centroids and with 20 centroids, using 4 threads, though this time we'll use 2000 points for training.
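To illustrate the kind of parallelism involved (this is a sketch, not Tribuo's actual implementation), the assignment step is embarrassingly parallel across points, so it can be expressed with a parallel stream:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelAssign {
    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Assign every point to its nearest centroid, in parallel across points.
    static int[] assign(double[][] points, double[][] centroids) {
        return IntStream.range(0, points.length).parallel().map(i -> {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (sqDist(points[i], centroids[c]) < sqDist(points[i], centroids[best])) {
                    best = c;
                }
            }
            return best;
        }).toArray();
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {5, 5}};
        double[][] points = {{0.2, -0.1}, {4.9, 5.2}, {0.1, 0.3}};
        System.out.println(Arrays.toString(assign(points, centroids))); // [0, 1, 0]
    }
}
```

The centroid update step parallelises similarly, by accumulating per-thread partial sums for each cluster and combining them at the end.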

In [12]:
var mtData = new MutableDataset<>(new GaussianClusterDataSource(2000, 1L));
var mtTrainer = new KMeansTrainer(5,10,l2Dist,4,1);
var mtStartTime = System.currentTimeMillis();
var mtModel = mtTrainer.train(mtData);
var mtEndTime = System.currentTimeMillis();
System.out.println("Training with 5 clusters on 4 threads took " + Util.formatDuration(mtStartTime,mtEndTime));
Training with 5 clusters on 4 threads took (00:00:00:053)

Now with 20 centroids:

In [13]:
var overTrainer = new KMeansTrainer(20,10,l2Dist,4,1);
var overStartTime = System.currentTimeMillis();
var overModel = overTrainer.train(mtData);
var overEndTime = System.currentTimeMillis();
System.out.println("Training with 20 clusters on 4 threads took " + Util.formatDuration(overStartTime,overEndTime));
Training with 20 clusters on 4 threads took (00:00:00:034)

We can evaluate the two models as before, using our ClusteringEvaluator. First with 5 centroids:

In [14]:
var mtTestEvaluation = eval.evaluate(mtModel,test);
mtTestEvaluation.toString();
Out[14]:
Clustering Evaluation
Normalized MI = 0.8104463467727057
Adjusted MI = 0.8088941747417295

Then with 20:

In [15]:
var overTestEvaluation = eval.evaluate(overModel,test);
overTestEvaluation.toString();
Out[15]:
Clustering Evaluation
Normalized MI = 0.8647317143685641
Adjusted MI = 0.8603214693630152

We see that the multi-threaded versions run in about the same time as the single-threaded trainer, despite having 4 times as much training data. The 20 centroid model has a tighter fit to the test data, even though it is overparameterised. This is common in clustering tasks, where it's hard to balance model fit against model complexity. We'll look at adding more performance metrics so users can diagnose such issues in future releases.
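One common diagnostic that doesn't need ground truth labels is the within-cluster sum of squared distances (often called inertia): it always decreases as the number of centroids grows, so rather than minimising it outright you look for the "elbow" where adding centroids stops helping much. A minimal plain-Java sketch (a hypothetical helper, not a Tribuo API):

```java
public class Inertia {
    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Within-cluster sum of squared distances: lower means a tighter fit,
    // but it always decreases as k grows, so look for an "elbow" over k.
    static double inertia(double[][] points, double[][] centroids, int[] assignments) {
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            total += sqDist(points[i], centroids[assignments[i]]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {5, 5}};
        double[][] points = {{1, 0}, {5, 6}};
        int[] assignments = {0, 1};
        System.out.println(inertia(points, centroids, assignments)); // 2.0
    }
}
```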

Conclusion

We looked at clustering using Tribuo's K-Means implementation, experimented with different initialisations, and compared both the single-threaded and multi-threaded versions. Then we looked at the performance metrics available when there are ground truth clusterings.

We plan to further expand Tribuo's clustering functionality to incorporate other algorithms in the future; HDBSCAN was added in Tribuo v4.2. If you want to help, or have specific algorithmic requirements, file an issue on our GitHub page.