Regression Tutorial

This guide shows how to use Tribuo's regression models to predict wine quality using the UCI Wine Quality data set. We'll experiment with four regression trainers: two for training linear models (linear decay SGD and AdaGrad), one for training a single regression tree (CART), and one for training a tree ensemble via Tribuo's wrapper on XGBoost. Note: Tribuo's XGBoost support relies upon the Maven Central XGBoost jar, which contains macOS and Linux binaries; to run this tutorial on Windows, please compile DMLC's XGBoost jar from source and rebuild Tribuo. We'll run these experiments by simply swapping in different implementations of Tribuo's Trainer interface. We'll also show how to evaluate regression models and describe some common evaluation metrics.

Setup

First you'll need to download the winequality dataset from UCI:

wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Then we'll load in some jars and import a few packages.

%jars ./tribuo-json-4.0.2-jar-with-dependencies.jar
%jars ./tribuo-regression-sgd-4.0.2-jar-with-dependencies.jar
%jars ./tribuo-regression-xgboost-4.0.2-jar-with-dependencies.jar
%jars ./tribuo-regression-tree-4.0.2-jar-with-dependencies.jar

import java.nio.file.Path;
import java.nio.file.Paths;

import org.tribuo.*;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.datasource.ListDataSource;
import org.tribuo.evaluation.TrainTestSplitter;
import org.tribuo.math.optimisers.*;
import org.tribuo.regression.*;
import org.tribuo.regression.evaluation.*;
import org.tribuo.regression.sgd.RegressionObjective;
import org.tribuo.regression.sgd.linear.LinearSGDTrainer;
import org.tribuo.regression.sgd.objectives.SquaredLoss;
import org.tribuo.regression.rtree.CARTRegressionTrainer;
import org.tribuo.regression.xgboost.XGBoostRegressionTrainer;
import org.tribuo.util.Util;

Loading the data

In Tribuo, all the prediction types have an associated OutputFactory implementation, which can create the appropriate Output subclasses from an input. Here we're going to use RegressionFactory as we're performing regression. In Tribuo both single and multidimensional regression use the Regressor and RegressionFactory classes. We then pass the regressionFactory into the simple CSVLoader, which reads all the columns into a DataSource. The winequality dataset uses ; to separate the columns rather than the standard , so we change the default separator character. Note that if your CSV file isn't purely numeric, or you wish to use a subset of the columns as features, then you should use CSVDataSource, which allows fine-grained control over the loading and featurisation process of your CSV file. There's a columnar data tutorial which details the flexibility and power of our columnar processing infrastructure.

var regressionFactory = new RegressionFactory();
var csvLoader = new CSVLoader<>(';',regressionFactory);

We don't have a pre-defined train/test split, so we take 70% as the training data and 30% as the test data. The data is shuffled using an RNG seeded by the final constructor argument (0L here). Then we feed the split data sources into the training and testing datasets. These MutableDatasets manage all the metadata (e.g., feature & output domains), and the mapping from feature names to feature id numbers.

var wineSource = csvLoader.loadDataSource(Paths.get("winequality-red.csv"),"quality");
var splitter = new TrainTestSplitter<>(wineSource, 0.7f, 0L);
Dataset<Regressor> trainData = new MutableDataset<>(splitter.getTrain());
Dataset<Regressor> evalData = new MutableDataset<>(splitter.getTest());
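To make the role of the seed concrete, here is a minimal plain-Java sketch of a seeded shuffle-and-cut split. It is an illustration under our own assumptions, not Tribuo's implementation: the class name SeededSplit and its split method are hypothetical, and TrainTestSplitter may differ in detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: a seeded shuffle-and-cut split, not Tribuo's TrainTestSplitter. */
final class SeededSplit {
    /** Returns [train, test], where train holds floor(n * trainFraction) items. */
    static <T> List<List<T>> split(List<T> data, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        // The same seed always produces the same ordering, so the split is reproducible.
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * trainFraction);
        List<T> train = new ArrayList<>(shuffled.subList(0, cut));
        List<T> test = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<Integer> rows = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<List<Integer>> parts = split(rows, 0.5, 0L);
        System.out.println("train = " + parts.get(0));
        System.out.println("test  = " + parts.get(1));
    }
}
```

Fixing the seed means every trainer in the comparison sees exactly the same training and test examples, so differences in the metrics come from the models rather than from the split.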

Training the models

We're going to define a quick training function which accepts a trainer and a training dataset. It times the training and also prints the performance metrics. Evaluating on the training data is useful for debugging: if the model performs poorly on the training data, then we know something is wrong.

public Model<Regressor> train(String name, Trainer<Regressor> trainer, Dataset<Regressor> trainData) {
    // Train the model
    var startTime = System.currentTimeMillis();
    Model<Regressor> model = trainer.train(trainData);
    var endTime = System.currentTimeMillis();
    System.out.println("Training " + name + " took " + Util.formatDuration(startTime,endTime));
    // Evaluate the model on the training data (this is a useful debugging tool)
    RegressionEvaluator eval = new RegressionEvaluator();
    var evaluation = eval.evaluate(model,trainData);
    // We create a dimension here to aid pulling out the appropriate statistics.
    // You can also produce the String directly by calling "evaluation.toString()"
    var dimension = new Regressor("DIM-0",Double.NaN);
    System.out.printf("Evaluation (train):%n  RMSE %f%n  MAE %f%n  R^2 %f%n",
            evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));
    return model;
}

Now we're going to define an equivalent testing function which accepts a model and a test dataset, printing the performance to standard output.

public void evaluate(Model<Regressor> model, Dataset<Regressor> testData) {
    // Evaluate the model on the test data
    RegressionEvaluator eval = new RegressionEvaluator();
    var evaluation = eval.evaluate(model,testData);
    // We create a dimension here to aid pulling out the appropriate statistics.
    // You can also produce the String directly by calling "evaluation.toString()"
    var dimension = new Regressor("DIM-0",Double.NaN);
    System.out.printf("Evaluation (test):%n  RMSE %f%n  MAE %f%n  R^2 %f%n",
            evaluation.rmse(dimension), evaluation.mae(dimension), evaluation.r2(dimension));
}

Now we'll define the four trainers we're going to compare.

A linear regression trained using linear decay SGD.
A linear regression trained using SGD and AdaGrad.
A regression tree using the CART algorithm with a maximum depth of 6.
An XGBoost trainer using 50 rounds of boosting.

var lrsgd = new LinearSGDTrainer(
    new SquaredLoss(), // loss function
    SGD.getLinearDecaySGD(0.01), // gradient descent algorithm
    10,                // number of training epochs
    trainData.size()/4,// logging interval
    1,                 // minibatch size
    1L                 // RNG seed
);
var lrada = new LinearSGDTrainer(
    new SquaredLoss(),
    new AdaGrad(0.01),
    10,
    trainData.size()/4,
    1,
    1L 
);
var cart = new CARTRegressionTrainer(6);
var xgb = new XGBoostRegressionTrainer(50);

First we'll train the linear regression with SGD:

var lrsgdModel = train("Linear Regression (SGD)",lrsgd,trainData);

Training Linear Regression (SGD) took (00:00:00:097)
Evaluation (train):
  RMSE 0.979522
  MAE 0.741870
  R^2 -0.471611

Evaluating the models

Using our evaluation function this is pretty straightforward.

evaluate(lrsgdModel,evalData);

Evaluation (test):
  RMSE 0.967450
  MAE 0.720619
  R^2 -0.439255

Those numbers seem poor, but what do these evaluation metrics mean?

RMSE

The root-mean-square error (RMSE) summarizes the magnitude of errors between our regression model's predictions and the values we observe in our data. Roughly speaking, RMSE is the standard deviation of model prediction errors on a given dataset (exactly so when the errors average to zero).

$$RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }$$

Lower is better: a perfect model for the wine data would have RMSE=0. Because each error is squared before averaging, large errors dominate, which makes RMSE sensitive to outliers. RMSE is also expressed in the units of the target variable, so it can be used to compare different models on the same dataset but not across different datasets: a "good" RMSE value on one dataset might be larger than a "good" RMSE value on a different dataset. See Wikipedia for more info on RMSE.
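As a sanity check on the formula, here is a small plain-Java sketch that computes RMSE directly. It is not Tribuo's RegressionEvaluator; the class name RmseSketch and the example arrays are made up for illustration.

```java
/** Illustrative only: plain-Java RMSE, not Tribuo's RegressionEvaluator. */
final class RmseSketch {
    static double rmse(double[] observed, double[] predicted) {
        if (observed.length != predicted.length || observed.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and the same length");
        }
        double sumSquared = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double error = observed[i] - predicted[i];
            sumSquared += error * error; // squaring gives large errors extra weight
        }
        return Math.sqrt(sumSquared / observed.length);
    }

    public static void main(String[] args) {
        double[] quality = {5.0, 6.0, 7.0, 5.0};      // hypothetical observed scores
        double[] evenErrors = {5.5, 5.5, 6.5, 5.5};   // every prediction is off by 0.5
        double[] oneBigError = {5.0, 6.0, 7.0, 7.0};  // one prediction is off by 2.0
        System.out.println("even errors:   " + rmse(quality, evenErrors));
        System.out.println("one big error: " + rmse(quality, oneBigError));
    }
}
```

Both prediction sets have the same total absolute error (2.0), but the RMSE is 0.5 for the evenly spread errors and 1.0 when the error is concentrated in a single prediction.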

MAE

The mean absolute error (MAE) is another summary of model error. Unlike RMSE, where squaring gives large errors extra weight, each error in MAE contributes in proportion to its absolute value.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
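The same kind of check works for MAE. This is again a plain-Java sketch with a hypothetical class name (MaeSketch) and made-up data, not the code Tribuo runs internally.

```java
/** Illustrative only: plain-Java MAE, not Tribuo's RegressionEvaluator. */
final class MaeSketch {
    static double mae(double[] observed, double[] predicted) {
        if (observed.length != predicted.length || observed.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and the same length");
        }
        double sumAbsolute = 0.0;
        for (int i = 0; i < observed.length; i++) {
            // Each error contributes linearly, whatever its size.
            sumAbsolute += Math.abs(observed[i] - predicted[i]);
        }
        return sumAbsolute / observed.length;
    }

    public static void main(String[] args) {
        double[] quality = {5.0, 6.0, 7.0, 5.0};      // hypothetical observed scores
        double[] evenErrors = {5.5, 5.5, 6.5, 5.5};   // every prediction is off by 0.5
        double[] oneBigError = {5.0, 6.0, 7.0, 7.0};  // one prediction is off by 2.0
        System.out.println("even errors:   " + mae(quality, evenErrors));
        System.out.println("one big error: " + mae(quality, oneBigError));
    }
}
```

Here both prediction sets give an MAE of 0.5: a single error of 2.0 counts exactly as much as four errors of 0.5, which is why MAE is less affected by outliers than a squared-error metric.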

R^2

The R-squared metric (also called the "coefficient of determination") summarizes how much of the variation in observed outcomes can be explained by our model.

Let $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, i.e., the mean of the observed values. R^2 is given by:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

A value of R^2=1 means that the model accounts for all of the variation in a set of observations -- in other words, it fits a dataset perfectly. Note that R^2 can turn negative when the sum-of-squared model errors (numerator) is greater than the sum-of-squared differences between observed data points and the observed mean (denominator). In other words, when R^2 is negative, the model fits the data worse than simply using the observed mean to predict values.
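The three regimes (perfect fit, no better than the mean, worse than the mean) are easy to reproduce by hand. The sketch below is plain Java written for this explanation; the class name RSquaredSketch and the data are hypothetical and it is not Tribuo's evaluator.

```java
/** Illustrative only: plain-Java R^2, not Tribuo's RegressionEvaluator. */
final class RSquaredSketch {
    static double r2(double[] observed, double[] predicted) {
        if (observed.length != predicted.length || observed.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and the same length");
        }
        double mean = 0.0;
        for (double y : observed) {
            mean += y;
        }
        mean /= observed.length;

        double residualSum = 0.0; // numerator: squared model errors
        double totalSum = 0.0;    // denominator: squared deviations from the observed mean
        for (int i = 0; i < observed.length; i++) {
            residualSum += (observed[i] - predicted[i]) * (observed[i] - predicted[i]);
            totalSum += (observed[i] - mean) * (observed[i] - mean);
        }
        // Note: totalSum is zero when every observed value is identical,
        // in which case this ratio is undefined.
        return 1.0 - residualSum / totalSum;
    }

    public static void main(String[] args) {
        double[] observed = {4.0, 5.0, 6.0, 7.0}; // hypothetical scores, mean 5.5
        System.out.println("perfect:       " + r2(observed, new double[]{4.0, 5.0, 6.0, 7.0}));
        System.out.println("predict mean:  " + r2(observed, new double[]{5.5, 5.5, 5.5, 5.5}));
        System.out.println("predict 8.0:   " + r2(observed, new double[]{8.0, 8.0, 8.0, 8.0}));
    }
}
```

With these values the perfect predictions give R^2 = 1, always predicting the observed mean gives R^2 = 0, and always predicting 8.0 gives R^2 = -5, since its squared error (30) is six times the squared deviation from the mean (5).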

See Wikipedia and the Minitab blog for more detailed discussion of R^2.

Improving over standard SGD with AdaGrad

It's not surprising the SGD results are bad: in linear decay SGD, the step size used for parameter updates changes over time (training iterations) but is uniform across all model parameters. This means that we use the same step size for a noisy/irrelevant feature as we would for an informative feature. There are many more sophisticated approaches to gradient descent.

One of these is AdaGrad, which modifies the "global" learning rate for each parameter $p$ using the sum-of-squares of past gradients w.r.t. $p$, up to time $t$.

"... the secret sauce of AdaGrad is not on necessarily accelerating gradient descent with a better step size selection, but making gradient descent more stable to not-so-good ($\eta$) choices." (Anastasios Kyrillidis, Note on AdaGrad)
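To see what "per-parameter" means in practice, here is a minimal sketch of the textbook AdaGrad update in plain Java. It assumes the standard form (divide the global learning rate by the square root of the accumulated squared gradients, plus a small epsilon for numerical stability); the class AdaGradSketch is hypothetical and Tribuo's own AdaGrad optimiser may differ in detail.

```java
/** Illustrative only: textbook AdaGrad on a dense parameter vector, not Tribuo's optimiser. */
final class AdaGradSketch {
    private final double learningRate;          // the "global" learning rate (eta)
    private final double epsilon;               // avoids division by zero
    private final double[] sumSquaredGradients; // one running sum per parameter

    AdaGradSketch(int numParameters, double learningRate, double epsilon) {
        this.learningRate = learningRate;
        this.epsilon = epsilon;
        this.sumSquaredGradients = new double[numParameters];
    }

    /** Step size currently applied to parameter p. */
    double effectiveStepSize(int p) {
        return learningRate / (Math.sqrt(sumSquaredGradients[p]) + epsilon);
    }

    /** Applies one AdaGrad update to the parameters in place. */
    void step(double[] parameters, double[] gradient) {
        for (int p = 0; p < parameters.length; p++) {
            sumSquaredGradients[p] += gradient[p] * gradient[p];
            parameters[p] -= effectiveStepSize(p) * gradient[p];
        }
    }

    public static void main(String[] args) {
        AdaGradSketch optimiser = new AdaGradSketch(2, 0.1, 1e-8);
        double[] weights = {0.0, 0.0};
        double[] gradient = {10.0, 0.1}; // parameter 0 sees much larger gradients
        for (int t = 1; t <= 3; t++) {
            optimiser.step(weights, gradient);
            System.out.printf("t=%d step sizes: %.5f, %.5f%n",
                    t, optimiser.effectiveStepSize(0), optimiser.effectiveStepSize(1));
        }
    }
}
```

The parameter with consistently large gradients ends up with a much smaller step size than the one with small gradients, and both step sizes shrink as squared gradients accumulate. This is in contrast to linear decay SGD, where one step size is shared by every parameter.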

Let's try training for the same number of epochs using AdaGrad instead of LinearDecaySGD:

var lradaModel = train("Linear Regression (AdaGrad)",lrada,trainData);
evaluate(lradaModel,evalData);

Training Linear Regression (AdaGrad) took (00:00:00:079)
Evaluation (train):
  RMSE 0.735311
  MAE 0.575096
  R^2 0.170709
Evaluation (test):
  RMSE 0.737994
  MAE 0.585709
  R^2 0.162497

Using a more robust optimizer got us a better fit in the same number of epochs. However, both the train and test R^2 scores are still substantially less than 1 and, as before, the train and test RMSE scores are very similar.

See here and here for more on AdaGrad. Also, there are many other implementations of various well-known optimizers in Tribuo, including Adam and RMSProp. See the math.optimisers package.

So far, we have shown that we can improve our model by using a more robust optimizer; however, we're still using a linear model. If there are informative, non-linear relationships among wine quality features, then our current model won't be able to take advantage of them. We'll finish this tutorial by showing how to use a couple of popular non-linear models, CART and XGBoost.

Trees and ensembles

Next we'll train the CART tree:

var cartModel = train("CART",cart,trainData);
evaluate(cartModel,evalData);

Training CART took (00:00:00:066)
Evaluation (train):
  RMSE 0.545205
  MAE 0.406670
  R^2 0.544085
Evaluation (test):
  RMSE 0.657900
  MAE 0.494812
  R^2 0.334420

Finally we'll train the XGBoost ensemble:

var xgbModel = train("XGBoost",xgb,trainData);
evaluate(xgbModel,evalData);

Training XGBoost took (00:00:01:135)
Evaluation (train):
  RMSE 0.143871
  MAE 0.097167
  R^2 0.968252
Evaluation (test):
  RMSE 0.599478
  MAE 0.426673
  R^2 0.447378

Using gradient boosting via XGBoost improved the results considerably: the test RMSE fell to 0.599, from 0.738 for AdaGrad and 0.658 for CART, and the test R^2 rose to 0.447. The train and test RMSE have also started to diverge (0.144 versus 0.599), indicating that the XGBoost model isn't underfitting like the two linear models were, although a gap this large also suggests some overfitting to the training data. XGBoost won't always be the best model for your data, but it's often a great baseline model to try when facing a new problem or dataset.

Conclusion

In this tutorial, we showed how to experiment with several different regression trainers (linear decay SGD, AdaGrad, CART, XGBoost). It was easy to experiment with different trainers and models by simply swapping in different implementations of the Tribuo Trainer interface. We also showed how to evaluate regression models and described some common evaluation metrics.