Anomaly Detection Tutorial

This guide shows how to use Tribuo's anomaly detection models to find anomalous events in a toy dataset drawn from a mixture of Gaussians. We'll cover the options available in the LibSVM anomaly detection algorithm (which uses a one-class nu-SVM) and discuss how anomaly detection tasks are evaluated.

Setup

We'll load in a jar and import a few packages.

In [1]:
%jars ./tribuo-anomaly-libsvm-4.2.0-jar-with-dependencies.jar
In [2]:
import org.tribuo.*;
import org.tribuo.util.Util;
import org.tribuo.anomaly.*;
import org.tribuo.anomaly.evaluation.*;
import org.tribuo.anomaly.example.GaussianAnomalyDataSource;
import org.tribuo.anomaly.libsvm.*;
import org.tribuo.common.libsvm.*;
In [3]:
var eval = new AnomalyEvaluator();

Dataset

Tribuo's anomaly detection package comes with a simple data source that samples data points from a mixture of two spherical Gaussians. One Gaussian is expected and the other is anomalous. The fraction of anomalous points in any given data source is controlled by the fractionAnomalous constructor argument. The means and variances of the expected and anomalous distributions can also be set on construction or via configuration (see the Configuration tutorial for more details on Tribuo's configuration system). All the data points sampled from the second distribution are labelled ANOMALOUS, and the rest are labelled EXPECTED; these form the two classes in Tribuo's anomaly detection system. You can also use any of the standard data loaders to pull in anomaly detection data.
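To build intuition for what this data source produces, here is a minimal standalone sketch of sampling from a two-component spherical Gaussian mixture and labelling each point. This is a hypothetical illustration, not Tribuo's actual implementation; the class names, means, and variances are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class MixtureSketch {
    // A labelled point: a feature vector plus an EXPECTED/ANOMALOUS tag.
    record LabelledPoint(double[] features, String label) { }

    static List<LabelledPoint> sample(int numSamples, double fractionAnomalous, long seed) {
        var rng = new Random(seed);
        var points = new ArrayList<LabelledPoint>();
        for (int i = 0; i < numSamples; i++) {
            // Pick the component for this point.
            boolean anomalous = rng.nextDouble() < fractionAnomalous;
            // Spherical (unit-variance) Gaussians: the anomalous mean is offset
            // from the expected one. These values are illustrative only.
            double mean = anomalous ? 5.0 : 0.0;
            double[] features = new double[] {
                mean + rng.nextGaussian(),
                mean + rng.nextGaussian()
            };
            points.add(new LabelledPoint(features, anomalous ? "ANOMALOUS" : "EXPECTED"));
        }
        return points;
    }

    public static void main(String[] args) {
        var data = sample(2000, 0.2, 1L);
        long anomalies = data.stream().filter(p -> p.label().equals("ANOMALOUS")).count();
        System.out.println(data.size() + " points, " + anomalies + " anomalous");
    }
}
```

Setting the anomalous fraction to 0.0 produces a purely expected dataset, which is exactly what the training set below needs.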

The LibSVM anomaly detection algorithm requires that the training data contain no anomalies, but this is not a requirement of Tribuo's anomaly detection infrastructure in general.

We'll sample 2000 points for each dataset: the training data will be free of anomalies (to keep LibSVM happy), while 20% of the test set will be anomalous.

In [4]:
var data = new MutableDataset<>(new GaussianAnomalyDataSource(2000, /* number of examples */
                                                              0.0f, /* fraction anomalous */
                                                              1L    /* RNG seed */));
var test = new MutableDataset<>(new GaussianAnomalyDataSource(2000,0.2f,2L));

Model Training

We'll fit a one-class SVM to our training data, then use it to determine which points in our test set are anomalous. We'll use an RBF kernel, with the kernel width set to 1.0.

In [5]:
var params = new SVMParameters<>(new SVMAnomalyType(SVMAnomalyType.SVMMode.ONE_CLASS), KernelType.RBF);
params.setGamma(1.0);
params.setNu(0.1); 
var trainer = new LibSVMAnomalyTrainer(params);
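To see what setGamma controls, recall that the RBF kernel is k(x, z) = exp(-γ‖x − z‖²): a larger γ makes similarity fall off faster with distance, shrinking the region the one-class SVM treats as expected. A small self-contained sketch of the kernel function itself (a hypothetical helper for illustration, not part of Tribuo's API):

```java
public class RBFKernel {
    // k(x, z) = exp(-gamma * ||x - z||^2)
    static double rbf(double gamma, double[] x, double[] z) {
        double sqDist = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - z[i];
            sqDist += d * d;
        }
        return Math.exp(-gamma * sqDist);
    }

    public static void main(String[] args) {
        double[] a = {0.0, 0.0};
        double[] b = {1.0, 0.0};
        // Identical points always score 1.0; increasing gamma shrinks
        // the similarity between distinct points.
        System.out.println(RBFKernel.rbf(1.0, a, a));
        System.out.println(RBFKernel.rbf(1.0, a, b));
        System.out.println(RBFKernel.rbf(10.0, a, b));
    }
}
```

The nu parameter (set via setNu) bounds the fraction of training points treated as outliers, so 0.1 allows up to roughly 10% of the training data to fall outside the learned region.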

Training is the same as in Tribuo's other prediction tasks: simply call train, passing in the training data.

In [6]:
var startTime = System.currentTimeMillis();
var model = trainer.train(data);
var endTime = System.currentTimeMillis();
System.out.println();
System.out.println("Training took " + Util.formatDuration(startTime,endTime));
*
optimization finished, #iter = 653
obj = 289.5926348816893, rho = 3.144570476807895
nSV = 296, nBSV = 114

Training took (00:00:00:147)

Unfortunately the upstream LibSVM implementation is a little chatty and insists on writing to standard out, but after that we can see training took about 150ms (on a 2020 16" MacBook Pro; you may see slightly different runtimes). We can check how many support vectors the SVM uses out of the training set of 2000:

In [7]:
((LibSVMAnomalyModel)model).getNumberOfSupportVectors()
Out[7]:
296

So the model uses 296 of the 2000 training data points to model the density of the expected data.

Model evaluation

Tribuo's infrastructure treats anomaly detection as a binary classification problem with the fixed label set {EXPECTED, ANOMALOUS}. When we have ground truth labels we can thus measure the true positives (anomalous things predicted as anomalous), false positives (expected things predicted as anomalous), false negatives (anomalous things predicted as expected), and true negatives (expected things predicted as expected), though the last of these is not usually very informative. We can also calculate the usual summary statistics: the precision, recall, and F1 of the anomalous class. We're going to compare against the ground truth labels from the data generator.

In [8]:
var testEvaluation = eval.evaluate(model,test);
System.out.println(testEvaluation.toString());
System.out.println(testEvaluation.confusionString());
AnomalyEvaluation(tp=405 fp=232 tn=1363 fn=0 precision=0.635793 recall=1.000000 f1=0.777351)
              EXPECTED  ANOMALOUS
EXPECTED         1,363        232
ANOMALOUS            0        405

We can see that the model has no false negatives, giving it perfect recall, but its precision is about 0.64, so roughly 64% of the positive predictions are true anomalies. This can be tuned by changing the width of the Gaussian kernel, which changes the range of values considered expected. The confusion matrix presents the same results in the form more commonly used for classification tasks.
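The summary statistics follow directly from the confusion counts. A small sketch recomputing precision, recall, and F1 from the counts reported above (the formulas are standard; the class and method names here are just for illustration):

```java
public class AnomalyMetrics {
    // Precision: what fraction of points predicted ANOMALOUS really are.
    static double precision(long tp, long fp) { return (double) tp / (tp + fp); }

    // Recall: what fraction of true anomalies the model found.
    static double recall(long tp, long fn) { return (double) tp / (tp + fn); }

    // F1: the harmonic mean of precision and recall.
    static double f1(long tp, long fp, long fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Counts from the evaluation above: tp=405, fp=232, fn=0.
        System.out.printf("precision=%.6f recall=%.6f f1=%.6f%n",
                precision(405, 232), recall(405, 0), f1(405, 232, 0));
    }
}
```

Plugging in tp=405, fp=232, fn=0 reproduces the precision, recall, and F1 figures printed by the AnomalyEvaluation above.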

The 4.1 release added support for LibLinear's anomaly detection algorithm, which is similar to LibSVM's anomaly detector with a linear kernel. We expect to add to Tribuo's set of anomaly detection algorithms over time, and we welcome contributions to expand them on our GitHub page.