Configuration Tutorial
This tutorial will show how to use Tribuo's configuration and provenance systems to build models on MNIST (because we wouldn't be doing ML without an MNIST demo). We'll focus on logistic regression, show how many different trainers can be stored in the same configuration, and how the provenance system allows the configuration for a specific run to be regenerated. We'll also briefly look at Tribuo's feature transformation system and see how that integrates into configuration and provenance.
Setup
You'll need to get a copy of the MNIST dataset in the original IDX format.
First the training data:
wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Then the test data:
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Tribuo's IDX loader natively reads gzipped files so you don't need to unzip them.
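To make the file format concrete, here is a minimal sketch of reading the header of a gzipped IDX file using only the JDK. It is not Tribuo's loader: the class name IdxHeaderSketch and both of its methods are hypothetical, and the sketch builds a tiny header in memory so it runs without the MNIST downloads. It relies on the IDX layout, where two zero bytes are followed by a data type byte, a dimension count byte, and one big-endian 32-bit size per dimension.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative IDX header reader (hypothetical, not Tribuo's IDXDataSource). */
final class IdxHeaderSketch {
    /** Reads the dimension sizes from the header of a gzipped IDX stream. */
    static int[] readShape(InputStream gzipped) {
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(gzipped))) {
            in.readUnsignedShort();              // first two bytes are always zero
            in.readUnsignedByte();               // data type code, 0x08 means unsigned byte
            int numDims = in.readUnsignedByte(); // number of dimensions
            int[] shape = new int[numDims];
            for (int i = 0; i < numDims; i++) {
                shape[i] = in.readInt();         // big-endian 32-bit dimension size
            }
            return shape;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a gzipped IDX header in memory so the sketch needs no files. */
    static byte[] fakeHeader(int... shape) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeShort(0);
            out.writeByte(0x08);
            out.writeByte(shape.length);
            for (int size : shape) {
                out.writeInt(size);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        int[] shape = readShape(new ByteArrayInputStream(fakeHeader(60000, 28, 28)));
        System.out.println(Arrays.toString(shape));
    }
}
```

Wrapping the stream in GZIPInputStream is all that is needed to read the compressed file directly, which is why there is no separate unzip step.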
It's Java, so first we load in the necessary Tribuo jars. Here we're using the classification experiments jar, along with the json interop jar to read and write the provenance information.
%jars ./tribuo-classification-experiments-4.1.0-jar-with-dependencies.jar
%jars ./tribuo-json-4.1.0-jar-with-dependencies.jar
Now let's import the packages we need. We'll use a few file manipulation things from Java, and then Tribuo's core packages, the transformation packages, the classification package, classification evaluation package, and then a few things that relate to the provenance system.
import java.nio.file.Files;
import java.nio.file.Paths;
import org.tribuo.*;
import org.tribuo.util.Util;
import org.tribuo.transform.*;
import org.tribuo.transform.transformations.LinearScalingTransformation;
import org.tribuo.classification.*;
import org.tribuo.classification.evaluation.*;
import com.oracle.labs.mlrg.olcut.config.Configurable;
import com.oracle.labs.mlrg.olcut.config.ConfigurationManager;
import com.oracle.labs.mlrg.olcut.config.DescribeConfigurable;
import com.oracle.labs.mlrg.olcut.provenance.*;
import com.oracle.labs.mlrg.olcut.provenance.primitives.*;
import com.oracle.labs.mlrg.olcut.config.json.JsonConfigFactory;
By default OLCUT's ConfigurationManager only understands XML files; the snippet below adds JSON support to every ConfigurationManager in the running JVM. If you're using OLCUT's CLI options processing, support can also be added dynamically on the command line by supplying --config-file-format <fully-qualified-class-name>, where the class name is, for example, com.oracle.labs.mlrg.olcut.config.json.JsonConfigFactory.
ConfigurationManager.addFileFormatFactory(new JsonConfigFactory())
How does configuration work?
Tribuo uses a configuration system originally built in Sun Labs and open sourced in the OLCUT library. Classes which can be configured must implement the Configurable interface, and may optionally implement a public void postConfig() method, which can be used to check invariants after a class has been configured but before it's visible. Configurable classes mark which of their fields are available for configuration using the @Config annotation, which accepts three arguments: boolean mandatory, which says whether the configuration system should error out when the field is not configured; String description, a description of the field used as a comment and in the DescribeConfigurable system seen below; and boolean redact, which controls whether the field's value is saved into configuration files or written into provenance objects.
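The mechanism can be illustrated with a toy version built on plain Java reflection. This is a sketch of the idea only, not OLCUT's implementation: the ToyConfig annotation, ToyTrainer, and ToyConfigurer are hypothetical names invented for this example, and only int and long fields are handled.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;

/** Hypothetical stand-in for OLCUT's @Config, for illustration only. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ToyConfig {
    boolean mandatory() default false;
    String description() default "";
    boolean redact() default false; // would be consulted when saving; unused in this sketch
}

/** Hypothetical configurable class with one mandatory and one optional field. */
final class ToyTrainer {
    @ToyConfig(mandatory = true, description = "Number of training epochs.")
    private int epochs;

    @ToyConfig(description = "RNG seed.")
    private long seed = 1L;

    /** Invariant check that runs after every field has been set. */
    void postConfig() {
        if (epochs < 1) {
            throw new IllegalArgumentException("epochs must be positive, found " + epochs);
        }
    }

    int epochs() { return epochs; }
    long seed() { return seed; }
}

/** Hypothetical configurer: fills annotated private fields from string properties. */
final class ToyConfigurer {
    static ToyTrainer configure(Map<String, String> properties) {
        ToyTrainer trainer = new ToyTrainer();
        try {
            for (Field field : ToyTrainer.class.getDeclaredFields()) {
                ToyConfig config = field.getAnnotation(ToyConfig.class);
                if (config == null) {
                    continue; // not a configurable field
                }
                String value = properties.get(field.getName());
                if (value == null) {
                    if (config.mandatory()) {
                        throw new IllegalArgumentException("Missing mandatory field: " + field.getName());
                    }
                    continue; // optional field keeps its default
                }
                field.setAccessible(true);
                if (field.getType() == int.class) {
                    field.setInt(trainer, Integer.parseInt(value));
                } else if (field.getType() == long.class) {
                    field.setLong(trainer, Long.parseLong(value));
                }
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        trainer.postConfig(); // check invariants before the object is handed out
        return trainer;
    }

    public static void main(String[] args) {
        ToyTrainer trainer = configure(Map.of("epochs", "5"));
        System.out.println("epochs = " + trainer.epochs() + ", seed = " + trainer.seed());
    }
}
```

Because the fields are private and set reflectively, nothing about them appears in the class's public API, which is the reason a separate inspection tool is useful.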
As configuration is part of the class file rather than the public documented API (because it operates on private fields), OLCUT ships with a CLI utility for inspecting a configurable class and generating an example configuration in any supported configuration format. To use this utility from the command line you can run:
$ java -cp <path-to-jars-including-olcut-core> com.oracle.labs.mlrg.olcut.config.DescribeConfigurable -n <class-name> -o -e xml
where the -n argument denotes the class to describe, -o denotes that an example configuration should be generated, and -e gives the file format in which to emit the example configuration.
You can also use the REPL to inspect a configurable class, like so:
var className = "org.tribuo.classification.sgd.linear.LinearSGDTrainer";
var clazz = (Class<? extends Configurable>) Class.forName(className);
Map map = DescribeConfigurable.generateFieldInfo(clazz);
var output = DescribeConfigurable.generateDescription(map);
System.out.println("Class: " + clazz.getCanonicalName() + "\n");
System.out.println(DescribeConfigurable.formatDescription(output));
And also to print out an example config file:
ByteArrayOutputStream writer = new ByteArrayOutputStream();
DescribeConfigurable.writeExampleConfig(writer,"json",clazz,map);
System.out.println(writer.toString("UTF-8"));
At the moment DescribeConfigurable.generateFieldInfo loses some type information when used from the REPL; we'll fix that in the next OLCUT release.
Using a configuration file
We're going to read in an example configuration file, in JSON format. This configuration knows about a bunch of different trainers, and also the training and testing MNIST data sources. In the tutorials directory we supply both the JSON and XML versions of this file, and the remainder of this tutorial is completely agnostic to which one is used.
var configPath = Paths.get("configuration","example-config.json");
String.join("\n",Files.readAllLines(configPath));
Now we'll make a ConfigurationManager and hand it the configuration file to load. Our configuration system also supports CLI options which can load things out of the supplied configuration files. We have examples of this in the simple TrainTest demo classes in each prediction backend.
var cm = new ConfigurationManager(configPath.toString());
First we'll load in the training and testing DataSources (as instances of IDXDataSource), pass them into two Datasets to aggregate the appropriate metadata, and make the evaluator for later use.
DataSource<Label> mnistTrain = (DataSource<Label>) cm.lookup("mnist-train");
DataSource<Label> mnistTest = (DataSource<Label>) cm.lookup("mnist-test");
var trainData = new MutableDataset<>(mnistTrain);
var testData = new MutableDataset<>(mnistTest);
var evaluator = new LabelEvaluator();
System.out.println(String.format("Training data size = %d, number of features = %d, number of classes = %d",trainData.size(),trainData.getFeatureMap().size(),trainData.getOutputInfo().size()));
System.out.println(String.format("Testing data size = %d, number of features = %d, number of classes = %d",testData.size(),testData.getFeatureMap().size(),testData.getOutputInfo().size()));
Loading in trainers from the configuration
Our configuration file contains a number of different trainers, so let's pull them out and take a look.
The first one we'll see is a CART decision tree, with a max tree depth of 6.
var cart = (Trainer<Label>) cm.lookup("cart");
cart
Next we'll load an XGBoost trainer, using 10 trees, 6 computation threads, and some regularisation parameters. Note: Tribuo's XGBoost support relies upon the Maven Central XGBoost jar from DMLC, which contains macOS and Linux binaries. On Windows, please compile DMLC's XGBoost jar from source and rebuild Tribuo.
var xgb = (Trainer<Label>) cm.lookup("xgboost");
xgb
Finally we'll load in a logistic regression trainer, using AdaGrad as the gradient optimizer.
var logistic = (Trainer<Label>) cm.lookup("logistic");
logistic
We can also load a list containing all the Trainer implementations in this config file. Note: by default the config system returns the same instance when it's queried for the same named config, so the list contains references to the objects we've already loaded.
var trainers = (List<Trainer>) cm.lookupAll(Trainer.class);
System.out.println("Loaded " + trainers.size() + " trainers.");
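The instance-sharing behaviour noted above can be pictured as a name-keyed cache. The following is a toy sketch of that idea, not how ConfigurationManager is implemented; ToyRegistry and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical name-keyed registry (not OLCUT's ConfigurationManager). */
final class ToyRegistry {
    private final Map<String, Supplier<Object>> factories = new HashMap<>();
    private final Map<String, Object> instances = new HashMap<>();

    /** Records how to build the component with the given name. */
    void register(String name, Supplier<Object> factory) {
        factories.put(name, factory);
    }

    /** Builds the named component on first request, then returns the cached instance. */
    Object lookup(String name) {
        Supplier<Object> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown component: " + name);
        }
        return instances.computeIfAbsent(name, key -> factory.get());
    }

    public static void main(String[] args) {
        ToyRegistry registry = new ToyRegistry();
        registry.register("logistic", Object::new);
        // Both lookups return the same object, so the comparison is by reference.
        System.out.println(registry.lookup("logistic") == registry.lookup("logistic"));
    }
}
```

One practical consequence of sharing is that any state held by a trainer is visible through every reference to it, whether it was obtained by name or through the list.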
Training the model and extracting configuration
We're going to focus on the logistic regression trainer now, so let's train a logistic regression model on our MNIST training set.
var lrStartTime = System.currentTimeMillis();
var lrModel = logistic.train(trainData);
var lrEndTime = System.currentTimeMillis();
System.out.println("Training logistic regression took " + Util.formatDuration(lrStartTime,lrEndTime));
We can inspect the trained model for its provenance, as we saw in the Classification tutorial.
The new step is extracting a configuration from that provenance. The ProvenanceUtil.extractConfiguration() call returns a List<ConfigurationData>, which is the object representation of a configuration file. We can see that it has extracted configurations for 5 objects from our single model; we'll look at those after we've written out the file.
var provenance = lrModel.getProvenance();
var provConfig = ProvenanceUtil.extractConfiguration(provenance);
provConfig.size()
The ConfigurationManager is the way we can generate a configuration file from the object representation.
We create a new ConfigurationManager, add the configuration we extracted from the provenance, and then write it out to a new JSON file.
var outputFile = "mnist-logistic-config.json";
var newCM = new ConfigurationManager();
newCM.addConfiguration(provConfig);
newCM.save(new File(outputFile),true);
String.join("\n",Files.readAllLines(Paths.get(outputFile)))
The five elements of the configuration are: the training data "idxdatasource-1", the logistic regression "linearsgdtrainer-0", the training log loss function "logmulticlass-3", the AdaGrad gradient optimizer "adagrad-2", and the label factory "labelfactory-4". The only unexpected part is the LabelFactory, which is the factory that converts Strings into Label instances.
Rebuilding a model from its configuration
Now to reconstruct our model, we can load in the Trainer and DataSource from the new ConfigurationManager, pass the source into a Dataset, and finally call train on the new trainer, supplying the new dataset.
var newTrainer = (Trainer<Label>) newCM.lookup("linearsgdtrainer-0");
var newSource = (DataSource<Label>) newCM.lookup("idxdatasource-1");
var newDataset = new MutableDataset<>(newSource);
var newModel = newTrainer.train(newDataset, Map.of("reconfigured-model",new BooleanProvenance("reconfigured-model",true)));
First we'll confirm that the provenance of the old and new models isn't equal (they have different timestamps, among other differences).
lrModel.getProvenance().equals(newModel.getProvenance())
Now we'll evaluate the first model:
var lrEvaluator = evaluator.evaluate(lrModel,testData);
System.out.println(lrEvaluator.toString());
System.out.println(lrEvaluator.getConfusionMatrix().toString());
It's about what we'd expect for a linear model on MNIST. Not state-of-the-art (SOTA), but it'll do for now.
Now let's check the new model:
var newEvaluator = evaluator.evaluate(newModel,testData);
System.out.println(newEvaluator.toString());
System.out.println(newEvaluator.getConfusionMatrix().toString());
We can see that both models perform identically. This is because our provenance system records the RNG seeds used at all points, and Tribuo is scrupulous about how and when it uses PRNGs. If you find a model reconstruction that gives a different answer (unless you're using XGBoost or TensorFlow, both of which have some non-determinism beyond our control) then file an issue on our GitHub as that's a bug.
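The role of the recorded seed can be shown with a small JDK-only sketch. It is not Tribuo code: SeededShuffleSketch is a hypothetical class, and the shuffle stands in for any step of training that consumes random numbers. Two runs built from the same seed visit the examples in the same order, which is the property that lets a rebuilt model match the original.

```java
import java.util.Arrays;
import java.util.Random;

/** Toy stand-in for a training step whose only randomness is a seeded PRNG. */
final class SeededShuffleSketch {
    /** Returns the order in which a run with this seed would visit n examples. */
    static int[] visitOrder(int n, long seed) {
        Random rng = new Random(seed);
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Fisher-Yates shuffle driven entirely by the seeded generator.
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        return order;
    }

    public static void main(String[] args) {
        int[] first = visitOrder(10, 42L);
        int[] second = visitOrder(10, 42L);
        System.out.println("Same seed, same order: " + Arrays.equals(first, second));
    }
}
```

The sketch also hints at why sharing one generator between unrelated steps is fragile: any extra draw shifts every later value, so reproducibility depends on controlling when the generator is used as well as how it is seeded.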
What else lives in the Provenance?
These evaluations have provenance in the same way the models do, and we can use a pretty printer in OLCUT to make it a little more human readable.
In addition to the configuration information like the gradient optimiser and RNG seed, the provenance includes run specific information like the "reconfigured-model" flag we added, along with a hash of the data, timestamps for the various data files involved, and a timestamp for the model creation and dataset creation.
var evalProvenance = newEvaluator.getProvenance();
System.out.println(ProvenanceUtil.formattedProvenanceString(evalProvenance));
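As an illustration of what a data hash in the provenance is for, here is a JDK-only sketch that fingerprints a byte array. The choice of SHA-256 is an assumption made for this example, and DataHashSketch is a hypothetical class, not part of Tribuo or OLCUT.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that fingerprints data so two runs can be compared. */
final class DataHashSketch {
    /** Returns the lowercase hex SHA-256 digest of the supplied bytes. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] modified = "abd".getBytes(StandardCharsets.UTF_8);
        System.out.println(sha256Hex(original));
        System.out.println(sha256Hex(original).equals(sha256Hex(modified)));
    }
}
```

If a stored hash no longer matches the hash of the file on disk, the data has changed since the model was trained, so a rebuilt model should not be expected to match the original.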
Feature Transformations
We can take the new trainer, wrap it programmatically in a TransformTrainer which rescales the input features into the range [0,1], and still generate provenance and configuration automatically as the model is trained.
var transformations = new TransformationMap(List.of(new LinearScalingTransformation(0,1)));
var transformed = new TransformTrainer(newTrainer,transformations);
var transformStart = System.currentTimeMillis();
var transformedModel = transformed.train(newDataset);
var transformEnd = System.currentTimeMillis();
System.out.println("Training transformed logistic regression took " + Util.formatDuration(transformStart,transformEnd));
Now we'll evaluate the rescaled model. Here we see that rescaling the data into the zero-one range improves the linear model's performance by a couple of percent, as all the data is now on the same scale. As expected it's still not SOTA, but we're not using a huge CNN or some other complex model; for that you can try out our TensorFlow interface, or use the XGBoost trainer we loaded in from the original configuration file.
LabelEvaluation transformedEvaluator = evaluator.evaluate(transformedModel,testData);
System.out.println(transformedEvaluator.toString());
System.out.println(transformedEvaluator.getConfusionMatrix().toString());
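For reference, the arithmetic behind a linear rescaling into a target range can be sketched in a few lines. This is not Tribuo's LinearScalingTransformation: MinMaxScalingSketch is a hypothetical class, it works on a single array of values for one feature, and mapping a constant feature to the lower bound is a choice made for this sketch.

```java
import java.util.Arrays;

/** Hypothetical min-max rescaler for the values of a single feature. */
final class MinMaxScalingSketch {
    /** Maps the observed minimum to targetMin and the observed maximum to targetMax. */
    static double[] rescale(double[] values, double targetMin, double targetMax) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant feature has no range, so it is mapped to targetMin here.
            scaled[i] = range == 0.0
                    ? targetMin
                    : targetMin + (values[i] - min) * (targetMax - targetMin) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Pixel intensities in MNIST range from 0 to 255.
        double[] pixels = {0.0, 127.5, 255.0};
        System.out.println(Arrays.toString(rescale(pixels, 0.0, 1.0)));
    }
}
```

In practice the minimum and maximum would be measured on the training data and then reused unchanged on the test data, so that both are mapped with the same function.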
We can emit a configuration which includes both the transformation trainer and the original trainer pulled from the old configuration. We'll write it out to a byte array rather than putting it on disk, but the process is the same.
var transformedProvConfig = ProvenanceUtil.extractConfiguration(transformedModel.getProvenance());
var baos = new ByteArrayOutputStream();
newCM = new ConfigurationManager();
newCM.addConfiguration(transformedProvConfig);
newCM.save(baos,"json",true);
baos.toString();
Aside from the names (which have different tag numbers) we can see that this configuration is identical to the previous one, but with the addition of the transformtrainer-0 and its dependents.
Conclusion
We've taken a closer look at Tribuo's configuration and provenance systems, showing how to train a model using a configuration file, how to inspect the model's provenance, extract its configuration, and finally how to combine that extracted configuration with other programmatic elements of the Tribuo library (in this case the feature transformation system). We saw that the provenance combines both the configuration of the trainer and the datasource, along with runtime information extracted from the dataset itself (e.g., timestamps and file hashes).
Tribuo's configuration system is integrated into a CLI options/arguments parsing system, which can be used to override elements from the configuration file. The values from the options are then stored in the ConfigurationManager and appear in the provenance and downstream configuration objects as expected. Tribuo also provides a redaction system for configuration files (e.g., to ensure a password isn't stored in the provenance) and for provenance objects themselves (e.g., to remove the data provenance from a trained model), which aids model deployment to untrusted or less trusted systems.