Regression Tutorial

This guide shows how to use Tribuo's regression models to predict wine quality based on the UCI Wine Quality data set. We'll experiment with several different regression trainers: two for training linear models (SGD and AdaGrad), one for training a single regression tree (CART), and one for training a tree ensemble via Tribuo's wrapper on XGBoost (note: Tribuo's XGBoost support relies upon the Maven Central XGBoost jar, which only contains binaries for x86_64 platforms). We'll run these experiments by simply swapping in different implementations of Tribuo's Trainer interface. We'll also show how to evaluate regression models and describe some common evaluation metrics.

Setup

First you'll need to download the winequality dataset from UCI:

wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Then we'll load in some jars and import a few packages.

In [1]:
%jars ./tribuo-json-4.3.0-jar-with-dependencies.jar
%jars ./tribuo-regression-sgd-4.3.0-jar-with-dependencies.jar
%jars ./tribuo-regression-xgboost-4.3.0-jar-with-dependencies.jar
%jars ./tribuo-regression-tree-4.3.0-jar-with-dependencies.jar
In [2]:
import java.nio.file.Path;
import java.nio.file.Paths;
In [3]:
import org.tribuo.*;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.datasource.ListDataSource;
import org.tribuo.evaluation.TrainTestSplitter;
import org.tribuo.math.optimisers.*;
import org.tribuo.regression.*;
import org.tribuo.regression.evaluation.*;
import org.tribuo.regression.sgd.RegressionObjective;
import org.tribuo.regression.sgd.linear.LinearSGDTrainer;
import org.tribuo.regression.sgd.objectives.SquaredLoss;
import org.tribuo.regression.rtree.CARTRegressionTrainer;
import org.tribuo.regression.xgboost.XGBoostRegressionTrainer;
import org.tribuo.util.Util;

Loading the data

In Tribuo, all the prediction types have an associated OutputFactory implementation, which can create the appropriate Output subclasses from an input. Here we're going to use RegressionFactory as we're performing regression. In Tribuo, both single and multidimensional regression use the Regressor and RegressionFactory classes. We then pass the regressionFactory into the simple CSVLoader, which reads all the columns into a DataSource. The winequality dataset uses ; to separate the columns rather than the standard , so we change the default separator character. Note that if your CSV file isn't purely numeric, or you wish to use a subset of the columns as features, you should use CSVDataSource, which allows fine-grained control over the loading and featurisation of your CSV file. There's a columnar data tutorial which details the flexibility and power of our columnar processing infrastructure.

In [4]:
var regressionFactory = new RegressionFactory();
var csvLoader = new CSVLoader<>(';',regressionFactory);
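
To make the separator change concrete, here is a minimal plain-Java sketch of how one semicolon-separated row can be paired with its header names and split into features and a response. This is not how CSVLoader is implemented; the class name `SemicolonRowSketch`, the `parseRow` helper, and the three-column excerpt are all hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class SemicolonRowSketch {
    /** Splits one row on the separator and pairs each numeric value with its header name. */
    static Map<String, Double> parseRow(String header, String row, char separator) {
        // Quote the separator so characters like '|' or '.' are treated literally.
        String sep = Pattern.quote(String.valueOf(separator));
        String[] names = header.split(sep);
        String[] values = row.split(sep);
        if (names.length != values.length) {
            throw new IllegalArgumentException(
                "Row has " + values.length + " fields, expected " + names.length);
        }
        Map<String, Double> parsed = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            // Strip optional surrounding quotes from the column name.
            String name = names[i].replace("\"", "").trim();
            parsed.put(name, Double.parseDouble(values[i].trim()));
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Hypothetical three-column excerpt, not the full dataset schema.
        String header = "\"alcohol\";\"pH\";\"quality\"";
        String row = "9.4;3.51;5";
        Map<String, Double> parsed = parseRow(header, row, ';');
        // Remove the response column; what remains are the features.
        double response = parsed.remove("quality");
        System.out.println("features = " + parsed + ", response = " + response);
    }
}
```

Every column here must parse as a number, which mirrors the "purely numeric" restriction described above.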

We don't have a pre-defined train/test split, so we take 70% as the training data, and 30% as the test data. The data is randomised using an RNG seeded by the final constructor argument (0L here). Then we feed the split data sources into the training and testing datasets. These MutableDatasets manage all the metadata (e.g., feature & output domains), and the mapping from feature names to feature id numbers.

In [5]:
var wineSource = csvLoader.loadDataSource(Paths.get("winequality-red.csv"),"quality");
var splitter = new TrainTestSplitter<>(wineSource, 0.7f, 0L);
Dataset<Regressor> trainData = new MutableDataset<>(splitter.getTrain());
Dataset<Regressor> evalData = new MutableDataset<>(splitter.getTest());
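
The idea behind a seeded split can be sketched in plain Java as follows. This is an illustration of the general technique (shuffle with a seeded RNG, then cut at the training fraction), not Tribuo's TrainTestSplitter implementation; the class name `SeededSplitSketch` and the `split` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SeededSplitSketch {
    /** Shuffles a copy of the data with a seeded RNG and returns [train, test]. */
    static <T> List<List<T>> split(List<T> data, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0,1]");
        }
        // Copy first so the caller's list is left untouched.
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        // Assumption: the training size is rounded down.
        int trainSize = (int) (trainFraction * shuffled.size());
        List<T> train = new ArrayList<>(shuffled.subList(0, trainSize));
        List<T> test = new ArrayList<>(shuffled.subList(trainSize, shuffled.size()));
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.7, 0L);
        System.out.println("train = " + parts.get(0));
        System.out.println("test  = " + parts.get(1));
    }
}
```

Because the RNG is seeded, calling `split` twice with the same seed yields the same partition, which is what makes the experiments in this tutorial repeatable.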

Regression in Tribuo

Unlike most ML packages, regression in Tribuo is multidimensional by default. This means that each Regressor contains a vector of named values which, like the Feature objects, are always kept sorted in lexicographic order. However, unlike features, Regressor objects are dense: they always include all the dimensions, even if some are zero. In practice this isn't a strong restriction for an output type, because if you're working in a multidimensional space then all dimensions will usually be present, and there are many fewer output dimensions than there are features.

Given this difference from other libraries, you might ask why Tribuo does it this way. It's because it makes operating on probability distributions using regression algorithms significantly simpler. We do this in Tribuo's implementation of LIME for classification model explanations, and it will make future implementations of gradient boosting and similar algorithms much easier.

How does this affect users of Tribuo's regression package? Well, each Regressor is an Iterable<DimensionTuple>, and a DimensionTuple represents the name of a dimension, along with its regressed value and a variance if present (unknown variances are set to Double.NaN). If you don't name the dimensions during data loading then they are automatically named DIM-0, ... DIM-N where N is one less than the number of dimensions. This means that in the common case of single dimensional regression you'll access the value either by taking the first element of the various state accessors, or by asking the Regressor for DIM-0 or index 0.

In [6]:
Regressor r = trainData.getExample(0).getOutput();
System.out.println("Num dimensions = " + r.size());

String[] dimNames = r.getNames();
System.out.println("Dimension name: " + dimNames[0]);

double[] regressedValues = r.getValues();
System.out.println("Dimension value: " + regressedValues[0]);

// getDimension(String) returns an Optional<DimensionTuple>
Regressor.DimensionTuple tuple = r.getDimension("DIM-0").get();
System.out.println("Tuple = [" + tuple +"]");

// getDimension(int) throws IndexOutOfBoundsException if you give it a negative index
// or one greater than or equal to r.size()
Regressor.DimensionTuple tupleI = r.getDimension(0);
System.out.println("Regressor[0] = " + tupleI);
Num dimensions = 1
Dimension name: DIM-0
Dimension value: 6.0
Tuple = [DIM-0=6.0]
Regressor[0] = DIM-0=6.0

The prediction objects produced by Tribuo's regression models contain a single Regressor with a value for each output dimension the model knows about. As each Regressor represents a full vector there is no need for a collection of them to represent the full output space, unlike Label in multi-class classification.

Training the models

We're going to define a quick training function which accepts a trainer and a training dataset. It times the training and also prints the performance metrics. Evaluating on the training data is useful for debugging: if the model performs poorly on the training data, then we know something is wrong with either our model or our data.

In [7]:
public Model<Regressor> train(String name, Trainer<Regressor> trainer, Dataset<Regressor> trainData) {
    // Train the model
    var startTime = System.currentTimeMillis();
    Model<Regressor> model = trainer.train(trainData);
    var endTime = System.currentTimeMillis();
    System.out.println("Training " + name + " took " + Util.formatDuration(startTime,endTime));
    // Evaluate the model on the training data
    // This is a useful debugging tool to check the model actually learned something
    RegressionEvaluator eval = new RegressionEvaluator();
    var evaluation = eval.evaluate(model,trainData);
    // We create a dimension here to aid pulling out the appropriate statistics.
    // You can also produce the String directly by calling "evaluation.toString()"
    var dimension = new Regressor("DIM-0",Double.NaN);
    System.out.printf("Evaluation (train):%n  RMSE %f%n  MAE %f%n  R^2 %f%n",
            evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));
    return model;
}

Now we're going to define an equivalent testing function which accepts a model and a test dataset, printing the performance to standard output.

In [8]:
public void evaluate(Model<Regressor> model, Dataset<Regressor> testData) {
    // Evaluate the model on the test data
    RegressionEvaluator eval = new RegressionEvaluator();
    var evaluation = eval.evaluate(model,testData);
    // We create a dimension here to aid pulling out the appropriate statistics.
    // You can also produce the String directly by calling "evaluation.toString()"
    var dimension = new Regressor("DIM-0",Double.NaN);
    System.out.printf("Evaluation (test):%n  RMSE %f%n  MAE %f%n  R^2 %f%n",
            evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));
}

Now we'll define the four trainers we're going to compare.

  • A linear regression trained using linear decay SGD.
  • A linear regression trained using SGD and AdaGrad.
  • A regression tree using the CART algorithm with a maximum depth of 6.
  • An XGBoost trainer using 50 rounds of boosting.
In [9]:
var lrsgd = new LinearSGDTrainer(
    new SquaredLoss(), // loss function
    SGD.getLinearDecaySGD(0.01), // gradient descent algorithm
    10,                // number of training epochs
    trainData.size()/4,// logging interval
    1,                 // minibatch size
    1L                 // RNG seed
);
var lrada = new LinearSGDTrainer(
    new SquaredLoss(),
    new AdaGrad(0.01),
    10,
    trainData.size()/4,
    1,
    1L 
);
var cart = new CARTRegressionTrainer(6);
var xgb = new XGBoostRegressionTrainer(50);

First we'll train the linear regression with SGD:

In [10]:
var lrsgdModel = train("Linear Regression (SGD)",lrsgd,trainData);
Training Linear Regression (SGD) took (00:00:00:051)
Evaluation (train):
  RMSE 0.979522
  MAE 0.741870
  R^2 -0.471611

Evaluating the models

Using our evaluation function this is pretty straightforward.

In [11]:
evaluate(lrsgdModel,evalData);
Evaluation (test):
  RMSE 0.967450
  MAE 0.720619
  R^2 -0.439255

Those numbers seem poor, but what do these evaluation metrics mean?

RMSE

The root-mean-square error (RMSE) summarizes the magnitude of errors between our regression model's predictions and the values we observe in our data. Roughly speaking, RMSE is the standard deviation of the model's prediction errors on a given dataset (exactly so when the errors have zero mean).

$$RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }$$

Lower is better: a perfect model for the wine data would have RMSE=0. Because errors are squared before averaging, large errors dominate the sum, so RMSE is sensitive to outliers. RMSE is also expressed in the units of the target variable, so it can be used to compare different models on the same dataset but not across different datasets, as a "good" RMSE value on one dataset might be larger than a "good" RMSE value on a different dataset. See Wikipedia for more info on RMSE.

MAE

The mean absolute error (MAE) is another summary of model error. Unlike RMSE, each error contributes to MAE in proportion to its absolute value, so MAE is less sensitive to outliers.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
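
As a small worked illustration of the difference between the two metrics (the error values are made up for the arithmetic and are not taken from the wine data), take four absolute errors of 1, 1, 1 and 5:

$$MAE = \frac{1 + 1 + 1 + 5}{4} = 2, \qquad RMSE = \sqrt{\frac{1^2 + 1^2 + 1^2 + 5^2}{4}} = \sqrt{7} \approx 2.65$$

The single large error pulls RMSE above MAE. If all four errors were equal to 2 instead, both metrics would be exactly 2.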

R^2

The R-squared metric (also called the "coefficient of determination") summarizes how much of the variation in observed outcomes can be explained by our model.

Let $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ be the mean of the observed values. R^2 is given by:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

A value of R^2=1 means that the model accounts for all of the variation in a set of observations -- in other words, it fits a dataset perfectly. Note that R^2 can turn negative when the sum-of-squared model errors (numerator) is greater than the sum-of-squared differences between observed data points and the observed mean (denominator). In other words, when R^2 is negative, the model fits the data worse than simply using the observed mean to predict values.
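
The three formulas above can be sketched directly in plain Java. This is an illustrative implementation of the definitions, not the code inside Tribuo's RegressionEvaluator; the class name `RegressionMetricsSketch` and the toy arrays are hypothetical.

```java
final class RegressionMetricsSketch {
    /** Root-mean-square error: sqrt of the mean squared difference. */
    static double rmse(double[] observed, double[] predicted) {
        double sumSquares = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double error = observed[i] - predicted[i];
            sumSquares += error * error;
        }
        return Math.sqrt(sumSquares / observed.length);
    }

    /** Mean absolute error: mean of the absolute differences. */
    static double mae(double[] observed, double[] predicted) {
        double sumAbs = 0.0;
        for (int i = 0; i < observed.length; i++) {
            sumAbs += Math.abs(observed[i] - predicted[i]);
        }
        return sumAbs / observed.length;
    }

    /** R^2: one minus (sum of squared errors / sum of squared deviations from the mean). */
    static double r2(double[] observed, double[] predicted) {
        double mean = 0.0;
        for (double y : observed) {
            mean += y;
        }
        mean /= observed.length;
        double residualSum = 0.0;
        double totalSum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            residualSum += (observed[i] - predicted[i]) * (observed[i] - predicted[i]);
            totalSum += (observed[i] - mean) * (observed[i] - mean);
        }
        return 1.0 - residualSum / totalSum;
    }

    public static void main(String[] args) {
        double[] observed = {5.0, 6.0, 7.0};
        double[] alwaysMean = {6.0, 6.0, 6.0}; // predicts the observed mean everywhere
        double[] reversed = {7.0, 6.0, 5.0};   // worse than predicting the mean
        System.out.println("R^2 (mean predictor)     = " + r2(observed, alwaysMean));
        System.out.println("R^2 (reversed predictor) = " + r2(observed, reversed));
        System.out.println("RMSE (reversed) = " + rmse(observed, reversed));
        System.out.println("MAE  (reversed) = " + mae(observed, reversed));
    }
}
```

On this toy data the mean predictor scores R^2 = 0, while the reversed predictor has a numerator of 8 against a denominator of 2 and so scores R^2 = 1 - 4 = -3, which is the negative case described above.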

See Wikipedia and the Minitab blog for more detailed discussion of R^2.

Improving over standard SGD with AdaGrad

It's not surprising the SGD results are bad: in linear decay SGD, the step size used for parameter updates changes over time (training iterations) but is uniform across all model parameters. This means that we use the same step size for a noisy/irrelevant feature as we would for an informative feature. There are many more sophisticated approaches to stochastic gradient descent.

One of these is AdaGrad, which modifies the "global" learning rate for each parameter $p$ using the sum-of-squares of past gradients w.r.t. $p$, up to time $t$.

"... the secret sauce of AdaGrad is not on necessarily accelerating gradient descent with a better step size selection, but making gradient descent more stable to not-so-good ($\eta$) choices." (Anastasios Kyrillidis, Note on AdaGrad)
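
The per-parameter scaling can be sketched in plain Java as below. This follows the textbook AdaGrad update, $w_p \leftarrow w_p - \frac{\eta}{\sqrt{G_p} + \epsilon} g_p$ where $G_p$ is the running sum of squared gradients for parameter $p$; it is not Tribuo's AdaGrad class, and details such as where $\epsilon$ is added may differ there. The class name `AdaGradSketch` and the gradient values are hypothetical.

```java
final class AdaGradSketch {
    private final double initialRate;       // the "global" learning rate, eta
    private final double epsilon;           // small constant guarding against division by zero
    private final double[] gradSquaredSums; // per-parameter sum of squared past gradients

    AdaGradSketch(int numParams, double initialRate, double epsilon) {
        this.initialRate = initialRate;
        this.epsilon = epsilon;
        this.gradSquaredSums = new double[numParams];
    }

    /** Applies one AdaGrad update to the weights in place. */
    void step(double[] weights, double[] gradient) {
        for (int p = 0; p < weights.length; p++) {
            gradSquaredSums[p] += gradient[p] * gradient[p];
            // Each parameter gets its own step size, shrunk by its gradient history.
            double rate = initialRate / (Math.sqrt(gradSquaredSums[p]) + epsilon);
            weights[p] -= rate * gradient[p];
        }
    }

    public static void main(String[] args) {
        AdaGradSketch optimiser = new AdaGradSketch(2, 0.01, 1e-6);
        double[] weights = {0.0, 0.0};
        // Hypothetical gradients: parameter 0 sees gradients 100x larger than parameter 1.
        double[] gradient = {10.0, 0.1};
        for (int t = 1; t <= 3; t++) {
            optimiser.step(weights, gradient);
            System.out.println("t=" + t + " weights = " + java.util.Arrays.toString(weights));
        }
    }
}
```

Even though the two gradients differ by a factor of 100, both weights move by roughly the same amount at each step, because each step is normalised by that parameter's own gradient history. A single uniform step size would instead move the first weight 100 times further than the second.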

Let's try training for the same number of epochs using AdaGrad instead of LinearDecaySGD:

In [12]:
var lradaModel = train("Linear Regression (AdaGrad)",lrada,trainData);
evaluate(lradaModel,evalData);
Training Linear Regression (AdaGrad) took (00:00:00:024)
Evaluation (train):
  RMSE 0.735311
  MAE 0.575096
  R^2 0.170709
Evaluation (test):
  RMSE 0.737994
  MAE 0.585709
  R^2 0.162497

Using a more robust optimizer got us a better fit in the same number of epochs. However, both the train and test R^2 scores are still substantially less than 1 and, as before, the train and test RMSE scores are very similar.

See here and here for more on AdaGrad. Also, there are many other implementations of various well-known optimizers in Tribuo, including Adam and RMSProp. See the math.optimisers package.

At this point, we've shown that we can improve our model by using a more robust optimizer; however, we're still using a linear model. If there are informative, non-linear relationships among wine quality features, then our current model won't be able to take advantage of them. We'll finish this tutorial by showing how to use a couple of popular non-linear models, CART and XGBoost.

Trees and ensembles

Next we'll train the CART tree:

In [13]:
var cartModel = train("CART",cart,trainData);
evaluate(cartModel,evalData);
Training CART took (00:00:00:092)
Evaluation (train):
  RMSE 0.544516
  MAE 0.405062
  R^2 0.545236
Evaluation (test):
  RMSE 0.658722
  MAE 0.494395
  R^2 0.332754

Finally we'll train the XGBoost ensemble:

In [14]:
var xgbModel = train("XGBoost",xgb,trainData);
evaluate(xgbModel,evalData);
Training XGBoost took (00:00:00:194)
Evaluation (train):
  RMSE 0.143871
  MAE 0.097167
  R^2 0.968252
Evaluation (test):
  RMSE 0.599478
  MAE 0.426673
  R^2 0.447378

Using gradient boosting via XGBoost improved results by a lot. Not only are the train and test fits better, but the train and test RMSE have diverged (roughly 0.14 on train versus 0.60 on test), indicating that the XGBoost model isn't underfitting like the two linear models were, although a gap this large also suggests some overfitting to the training data. XGBoost won't always be the best model for your data, but it's often a great baseline model to try when facing a new problem or dataset.
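
One simple way to compare the four models is to compute the difference between test and train RMSE from the numbers printed in the cells above. The sketch below does only that arithmetic; the class name `GeneralizationGapSketch` and the `gap` helper are hypothetical and not part of Tribuo.

```java
final class GeneralizationGapSketch {
    /** Test RMSE minus train RMSE; a large positive value means a worse fit on unseen data. */
    static double gap(double trainRmse, double testRmse) {
        return testRmse - trainRmse;
    }

    public static void main(String[] args) {
        // RMSE values copied from the training and evaluation output in this tutorial.
        String[] names = {"Linear (SGD)", "Linear (AdaGrad)", "CART", "XGBoost"};
        double[] train = {0.979522, 0.735311, 0.544516, 0.143871};
        double[] test = {0.967450, 0.737994, 0.658722, 0.599478};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-17s gap = %+.6f%n", names[i], gap(train[i], test[i]));
        }
    }
}
```

The two linear models have gaps close to zero (about -0.012 and +0.003), CART's is about 0.114, and XGBoost's is about 0.456, even though XGBoost still has the lowest test RMSE of the four.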

Conclusion

In this tutorial, we showed how to experiment with several different regression trainers (linear decay SGD, AdaGrad, CART, XGBoost). It was easy to experiment with different trainers and models by simply swapping in different implementations of the Tribuo Trainer interface. We also showed how to evaluate regression models and described some common evaluation metrics.