Anomaly Detection Tutorial

This guide shows how to use Tribuo's anomaly detection models to find anomalous events in a toy dataset drawn from a mixture of Gaussians. We'll discuss the options in the LibSVM anomaly detection algorithm (a one-class nu-SVM) and how to evaluate anomaly detection tasks.

Setup

We'll load in a jar and import a few packages.

In [1]:
%jars ./tribuo-anomaly-libsvm-4.0.1-jar-with-dependencies.jar
In [2]:
import org.tribuo.*;
import org.tribuo.util.Util;
import org.tribuo.anomaly.*;
import org.tribuo.anomaly.evaluation.*;
import org.tribuo.anomaly.example.AnomalyDataGenerator;
import org.tribuo.anomaly.libsvm.*;
import org.tribuo.common.libsvm.*;
In [3]:
var eval = new AnomalyEvaluator();

Dataset

Tribuo's anomaly detection package comes with a simple data generator that emits pairs of datasets containing 5 features. The training data is free from anomalies, and each example is sampled from a 5-dimensional Gaussian with a fixed mean and diagonal covariance. The test data is sampled from a mixture of two distributions: the first is the same as the training distribution, and the second uses a different mean for the Gaussian (keeping the same covariance for simplicity). All the data points sampled from the second distribution are marked ANOMALOUS, and the other points are marked EXPECTED. These form the two classes for Tribuo's anomaly detection system. You can also use any of the standard data loaders to pull in anomaly detection data.

The LibSVM anomaly detection algorithm requires that the training data contain no anomalies, but Tribuo's anomaly detection infrastructure does not require this in general.

We'll sample 2000 points for each dataset, and 20% of the test set will be anomalies.

In [4]:
var pair = AnomalyDataGenerator.gaussianAnomaly(2000,0.2);
var data = pair.getA();
var test = pair.getB();
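
To make the sampling scheme concrete, here is a rough plain-Java sketch of a test-set generator of the kind described above. It is not Tribuo's implementation: the class name, the means, and the variances are all hypothetical values chosen for illustration, and the sketch assumes each point is independently assigned to the anomalous component with the given probability, so the number of anomalies varies around 20% rather than being exactly 20%.

```java
import java.util.Random;

final class ToyAnomalyData {
    // Hypothetical parameters for illustration; Tribuo's generator uses its own fixed values.
    static final double[] EXPECTED_MEAN = {1.0, 2.0, 1.0, 2.0, 5.0};
    static final double[] ANOMALOUS_MEAN = {-2.0, 2.0, -2.0, 2.0, -10.0};
    static final double[] VARIANCE = {1.0, 0.5, 0.25, 1.0, 0.1};

    /** Draws one point from a Gaussian with the given mean and the shared diagonal covariance. */
    static double[] sample(Random rng, double[] mean) {
        double[] x = new double[mean.length];
        for (int i = 0; i < x.length; i++) {
            // Diagonal covariance means each dimension is sampled independently.
            x[i] = mean[i] + Math.sqrt(VARIANCE[i]) * rng.nextGaussian();
        }
        return x;
    }

    /**
     * Samples a test set from the two-component mixture.
     * isAnomalous must have at least size entries; entry i is set to true
     * when point i was drawn from the anomalous component.
     */
    static double[][] sampleTestSet(Random rng, int size, double fractionAnomalous, boolean[] isAnomalous) {
        double[][] points = new double[size][];
        for (int i = 0; i < size; i++) {
            isAnomalous[i] = rng.nextDouble() < fractionAnomalous;
            points[i] = sample(rng, isAnomalous[i] ? ANOMALOUS_MEAN : EXPECTED_MEAN);
        }
        return points;
    }
}
```

A training set under the same assumptions would call sample with EXPECTED_MEAN for every point, since the training data contains no anomalies.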

Model Training

We'll fit a one-class SVM to our training data, and then use it to determine which points in our test set are anomalous. We'll use an RBF kernel, setting the kernel width parameter (gamma) to 1.0 and nu to 0.1.

In [5]:
var params = new SVMParameters<>(new SVMAnomalyType(SVMAnomalyType.SVMMode.ONE_CLASS), KernelType.RBF);
params.setGamma(1.0);
params.setNu(0.1);
var trainer = new LibSVMAnomalyTrainer(params);
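
For reference, LibSVM's RBF kernel has the form exp(-gamma * ||u - v||^2), so gamma acts as an inverse width: larger values make the kernel narrower, and similarity falls off faster with distance. The standalone sketch below illustrates that formula only. The class and method names are hypothetical and are not part of Tribuo or LibSVM.

```java
final class RbfKernel {
    /** Computes exp(-gamma * ||u - v||^2) for two vectors of equal length. */
    static double rbf(double[] u, double[] v, double gamma) {
        if (u.length != v.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double squaredDistance = 0.0;
        for (int i = 0; i < u.length; i++) {
            double diff = u[i] - v[i];
            squaredDistance += diff * diff;
        }
        return Math.exp(-gamma * squaredDistance);
    }
}
```

Identical points always have kernel value 1.0, and for any two distinct points the value shrinks as gamma grows.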

Training works the same way as for other Tribuo prediction tasks: call train and pass in the training data.

In [6]:
var startTime = System.currentTimeMillis();
var model = trainer.train(data);
var endTime = System.currentTimeMillis();
System.out.println();
System.out.println("Training took " + Util.formatDuration(startTime,endTime));
*
optimization finished, #iter = 692
obj = 293.8182352369252, rho = 3.201748862633537
nSV = 301, nBSV = 120

Training took (00:00:00:169)

Unfortunately the LibSVM implementation is a little chatty and insists on writing to standard out, but after that we can see it took about 170ms to run (on my 2020 16" MacBook Pro; you may get slightly different runtimes). We can check how many support vectors are used by the SVM, from the training set of 2000:

In [7]:
((LibSVMAnomalyModel)model).getNumberOfSupportVectors()
Out[7]:
301

So the model uses 301 of the 2000 training points as support vectors to model the density of the expected data.

Model Evaluation

Tribuo's infrastructure treats anomaly detection as a binary classification problem with the fixed label set {EXPECTED,ANOMALOUS}. When we have ground truth labels we can thus measure the true positives (anomalous things predicted as anomalous), false positives (expected things predicted as anomalous), false negatives (anomalous things predicted as expected) and true negatives (expected things predicted as expected), though the latter number is not usually that useful. We can also calculate the usual summary statistics: precision, recall and F1 of the anomalous class. We're going to compare against the ground truth labels from the data generator.

In [8]:
var testEvaluation = eval.evaluate(model,test);
System.out.println(testEvaluation.toString());
System.out.println(testEvaluation.confusionString());
AnomalyEvaluation(tp=421 fp=250 tn=1329 fn=0 precision=0.627422 recall=1.000000 f1=0.771062)
              EXPECTED  ANOMALOUS
EXPECTED         1,329        250
ANOMALOUS            0        421

We can see that the model has no false negatives, and so has perfect recall, but it has a precision of 0.63, so approximately 63% of the positive predictions are true anomalies. This can be tuned by changing the width of the Gaussian kernel, which changes the range of values considered to be expected. The confusion matrix presents the same results in a more common form for classification tasks.
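
The summary statistics follow directly from the confusion counts, which makes them easy to check by hand. The sketch below recomputes precision, recall and F1 for the anomalous class from the tp, fp and fn values printed above. The class name is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than a statement about how Tribuo's AnomalyEvaluator handles that case.

```java
import java.util.Locale;

final class AnomalyMetrics {
    /** Fraction of predicted anomalies that are true anomalies: tp / (tp + fp). */
    static double precision(long tp, long fp) {
        return (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Fraction of true anomalies that were predicted as anomalous: tp / (tp + fn). */
    static double recall(long tp, long fn) {
        return (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall, written as 2tp / (2tp + fp + fn). */
    static double f1(long tp, long fp, long fn) {
        long denominator = 2 * tp + fp + fn;
        return denominator == 0 ? 0.0 : (2.0 * tp) / denominator;
    }

    public static void main(String[] args) {
        // Counts taken from the evaluation output above.
        long tp = 421, fp = 250, fn = 0;
        System.out.println(String.format(Locale.ROOT,
                "precision=%.6f recall=%.6f f1=%.6f",
                precision(tp, fp), recall(tp, fn), f1(tp, fp, fn)));
        // prints: precision=0.627422 recall=1.000000 f1=0.771062
    }
}
```

The true negative count does not appear in any of the three formulas, which is one reason it is usually the least informative of the four numbers.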

We plan to further expand Tribuo's anomaly detection functionality to incorporate other algorithms in the future.