# Reproducibility Tutorial¶

Reproducibility of ML models and evaluations is a common problem across many ML systems. It usually splits into two parts: capturing a description of the computation that was executed, and replaying that computation. In Tribuo we built our provenance system to make our models self-describing, by which we mean they capture a complete description of the computation that produced them, solving the first part. In v4.2 we added an automated reproducibility system which consumes the provenance data and retrains the model. Alongside the reproducibility system we also added a mechanism for diffing provenance objects, allowing easy comparison between the reproduced and original models. This matters because the models are only guaranteed to be identical if the data is the same, and any differences in the data will show up in the data provenance object.

## Setup¶

Before running this tutorial, please run the irises classification tutorial and the ONNX export tutorial to build the two models that we're going to reproduce.

We're going to load in the classification jar, ONNX jar, JSON jar, and the reproducibility jar. Note that the reproducibility jar is written in Java 16, so this tutorial requires Java 16 or later. Then we'll import the necessary classes.

In [1]:
%jars ./tribuo-classification-experiments-4.3.0-jar-with-dependencies.jar
%jars ./tribuo-onnx-4.3.0-jar-with-dependencies.jar
%jars ./tribuo-json-4.3.0-jar-with-dependencies.jar
%jars ./tribuo-reproducibility-4.3.0-jar-with-dependencies.jar

In [2]:
import org.tribuo.*;
import org.tribuo.classification.*;
import org.tribuo.classification.evaluation.*;
import org.tribuo.classification.sgd.fm.*;
import org.tribuo.classification.sgd.linear.*;
import org.tribuo.datasource.*;
import org.tribuo.interop.onnx.*;
import org.tribuo.reproducibility.*;
import com.oracle.labs.mlrg.olcut.provenance.*;
import com.oracle.labs.mlrg.olcut.util.*;
import ai.onnxruntime.*;

import java.io.*;
import java.nio.file.*;
import java.util.*;


## Reproducing a Tribuo Model¶

The reproducibility system works on Tribuo Model or ModelProvenance objects. When given a ModelProvenance, the system loads in the original training data, processes and transforms it according to the recorded columnar processing and transformations, then rebuilds the original trainer (including its RNG state), before passing the data into the train method and returning the reproduced model. When given a Model, it performs the same steps as for a ModelProvenance and then compares the feature and output domains, providing more information about any differences between the domains used by the two models. Over time we plan to expand the validation applied to the reproduced model, e.g., to show if the features have different ranges or histograms.

We're going to load in the Irises logistic regression model trained in the first tutorial.

In [3]:
File irisModelFile = new File("iris-lr-model.ser");
// Allowlist covering Tribuo's classes and the JDK classes they reference;
// everything else in the stream is rejected.
String filterPattern = "org.tribuo.**;java.lang.SimpleDateFormat;com.oracle.labs.mlrg.olcut.**;java.util.**;java.lang.**";
ObjectInputFilter filter = ObjectInputFilter.Config.createFilter(filterPattern);
LinearSGDModel loadedModel;
try (ObjectInputStream ois = new ObjectInputStream(new BufferedInputStream(new FileInputStream(irisModelFile)))) {
    ois.setObjectInputFilter(filter);
    loadedModel = (LinearSGDModel) ois.readObject();
}
System.out.println(loadedModel.toString());


linear-sgd-model - Model(class-name=org.tribuo.classification.sgd.linear.LinearSGDModel,dataset=Dataset(class-name=org.tribuo.MutableDataset,datasource=SplitDataSourceProvenance(className=org.tribuo.evaluation.TrainTestSplitter,innerSourceProvenance=DataSource(class-name=org.tribuo.data.csv.CSVDataSource,headers=[sepalLength, sepalWidth, petalLength, petalWidth, species],rowProcessor=RowProcessor(class-name=org.tribuo.data.columnar.RowProcessor,metadataExtractors=[],fieldProcessorList=[FieldProcessor(class-name=org.tribuo.data.columnar.processors.field.DoubleFieldProcessor,fieldName=petalLength,onlyFieldName=true,throwOnInvalid=true,host-short-name=FieldProcessor), FieldProcessor(class-name=org.tribuo.data.columnar.processors.field.DoubleFieldProcessor,fieldName=petalWidth,onlyFieldName=true,throwOnInvalid=true,host-short-name=FieldProcessor), FieldProcessor(class-name=org.tribuo.data.columnar.processors.field.DoubleFieldProcessor,fieldName=sepalWidth,onlyFieldName=true,throwOnInvalid=true,host-short-name=FieldProcessor), 
FieldProcessor(class-name=org.tribuo.data.columnar.processors.field.DoubleFieldProcessor,fieldName=sepalLength,onlyFieldName=true,throwOnInvalid=true,host-short-name=FieldProcessor)],featureProcessors=[],responseProcessor=ResponseProcessor(class-name=org.tribuo.data.columnar.processors.response.FieldResponseProcessor,uppercase=false,fieldNames=[species],defaultValues=[],displayField=false,outputFactory=OutputFactory(class-name=org.tribuo.classification.LabelFactory),host-short-name=ResponseProcessor),weightExtractor=null,replaceNewlinesWithSpaces=true,regexMappingProcessors={},host-short-name=RowProcessor),quote=",outputRequired=true,outputFactory=OutputFactory(class-name=org.tribuo.classification.LabelFactory),separator=,,dataPath=/local/ExternalRepositories/tribuo/tutorials/bezdekIris.data,resource-hash=SHA-256[0FED2A99DB77EC533A62DC66894D3EC6DF3B58B6A8F3CF4A6B47E4086B7F97DC],file-modified-time=1999-12-14T15:12:39-05:00,datasource-creation-time=2022-10-07T11:20:06.279351-04:00,host-short-name=DataSource),trainProportion=0.7,seed=1,size=150,isTrain=true),transformations=[],is-sequence=false,is-dense=true,num-examples=105,num-features=4,num-outputs=3,tribuo-version=4.3.0),trainer=Trainer(class-name=org.tribuo.classification.sgd.linear.LogisticRegressionTrainer,seed=12345,minibatchSize=1,shuffle=true,epochs=5,optimiser=StochasticGradientOptimiser(class-name=org.tribuo.math.optimisers.AdaGrad,epsilon=0.1,initialLearningRate=1.0,initialValue=0.0,host-short-name=StochasticGradientOptimiser),loggingInterval=1000,objective=LabelObjective(class-name=org.tribuo.classification.sgd.objectives.LogMulticlass,host-short-name=LabelObjective),tribuo-version=4.3.0,train-invocation-count=0,is-sequence=false,host-short-name=Trainer),trained-at=2022-10-07T11:20:06.643297-04:00,instance-values={},tribuo-version=4.3.0,java-version=12,os-name=Linux,os-arch=amd64)
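The allowlist filter used above is standard JDK deserialization filtering (JEP 290), not something Tribuo-specific. Here's a minimal self-contained sketch of the mechanism using plain JDK types; the class and helper names are hypothetical, chosen just for this illustration:

```java
import java.io.*;
import java.util.*;

public class FilterDemo {
    // Serialize obj to bytes, then deserialize it back through the given allowlist filter.
    static Object roundTrip(Object obj, String pattern) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        ObjectInputFilter filter = ObjectInputFilter.Config.createFilter(pattern);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            ois.setObjectInputFilter(filter);
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String pattern = "java.util.ArrayList;java.lang.String;!*";
        // Accepted: ArrayList and String are on the allowlist ('!*' rejects everything else)
        Object ok = roundTrip(new ArrayList<>(List.of("a", "b")), pattern);
        System.out.println("Accepted: " + ok);
        // Rejected: HashMap is not on the allowlist, so deserialization throws
        try {
            roundTrip(new HashMap<String, String>(), pattern);
        } catch (InvalidClassException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The same pattern syntax is used in the cell above, just with Tribuo's packages on the allowlist instead of `java.util.ArrayList`.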


The reproducibility system lives in the ReproUtil class. This class is constructed from a Model, or from a ModelProvenance together with the Class object for the output type (some T extends Output&lt;T&gt;).

In [4]:
var repro = new ReproUtil<>(loadedModel);


Now we can separately rebuild the dataset and the trainer. Note that if you mutate the objects returned by these methods, you won't get exactly the same model back from the reproduction. We're still working on the API for the reproducibility system and expect to make it more robust over time.
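As a sketch of how the pieces fit together, the recovered dataset and trainer can be combined by hand, which (absent any mutation) is roughly what the automated reproduction does internally:

```java
// Sketch: manually combining the recovered pieces (assuming `repro` from the cell above).
var recoveredDataset = repro.recoverDataset();
var recoveredTrainer = repro.recoverTrainer();
// Training the recovered trainer on the recovered dataset yields the reproduced model.
var manualModel = recoveredTrainer.train(recoveredDataset);
```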

In [5]:
var dataset = repro.recoverDataset();

System.out.println(ProvenanceUtil.formattedProvenanceString(dataset.getProvenance()));

MutableDataset(
    class-name = org.tribuo.MutableDataset
    datasource = TrainTestSplitter(
        class-name = org.tribuo.evaluation.TrainTestSplitter
        source = CSVDataSource(
            class-name = org.tribuo.data.csv.CSVDataSource
            headers = List[
                sepalLength
                sepalWidth
                petalLength
                petalWidth
                species
            ]
            rowProcessor = RowProcessor(
                class-name = org.tribuo.data.columnar.RowProcessor
                fieldProcessorList = List[
                    DoubleFieldProcessor(
                        class-name = org.tribuo.data.columnar.processors.field.DoubleFieldProcessor
                        fieldName = petalLength
                        onlyFieldName = true
                        throwOnInvalid = true
                        host-short-name = FieldProcessor
                    )
                    DoubleFieldProcessor(
                        class-name = org.tribuo.data.columnar.processors.field.DoubleFieldProcessor
                        fieldName = petalWidth
                        onlyFieldName = true
                        throwOnInvalid = true
                        host-short-name = FieldProcessor
                    )
                    DoubleFieldProcessor(
                        class-name = org.tribuo.data.columnar.processors.field.DoubleFieldProcessor
                        fieldName = sepalWidth
                        onlyFieldName = true
                        throwOnInvalid = true
                        host-short-name = FieldProcessor
                    )
                    DoubleFieldProcessor(
                        class-name = org.tribuo.data.columnar.processors.field.DoubleFieldProcessor
                        fieldName = sepalLength
                        onlyFieldName = true
                        throwOnInvalid = true
                        host-short-name = FieldProcessor
                    )
                ]
                featureProcessors = List[]
                responseProcessor = FieldResponseProcessor(
                    class-name = org.tribuo.data.columnar.processors.response.FieldResponseProcessor
                    uppercase = false
                    fieldNames = List[
                        species
                    ]
                    defaultValues = List[

                    ]
                    displayField = false
                    outputFactory = LabelFactory(
                        class-name = org.tribuo.classification.LabelFactory
                    )
                    host-short-name = ResponseProcessor
                )
                weightExtractor = FieldExtractor(
                    class-name = org.tribuo.data.columnar.FieldExtractor
                )
                replaceNewlinesWithSpaces = true
                regexMappingProcessors = Map{}
                host-short-name = RowProcessor
            )
            quote = "
            outputRequired = true
            outputFactory = LabelFactory(
                class-name = org.tribuo.classification.LabelFactory
            )
            separator = ,
            dataPath = /local/ExternalRepositories/tribuo/tutorials/bezdekIris.data
            resource-hash = 0FED2A99DB77EC533A62DC66894D3EC6DF3B58B6A8F3CF4A6B47E4086B7F97DC
            file-modified-time = 1999-12-14T15:12:39-05:00
            datasource-creation-time = 2022-10-07T12:03:48.921236415-04:00
            host-short-name = DataSource
        )
        train-proportion = 0.7
        seed = 1
        size = 150
        is-train = true
    )
    transformations = List[]
    is-sequence = false
    is-dense = true
    num-examples = 105
    num-features = 4
    num-outputs = 3
    tribuo-version = 4.3.0
)


Our irises dataset was loaded in using the CSVLoader and split with a 70/30 train/test split, and we can see that the reproduced training dataset has been split just as we expect.

In [6]:
var trainer = repro.recoverTrainer();
System.out.println(ProvenanceUtil.formattedProvenanceString(trainer.getProvenance()));

LogisticRegressionTrainer(
    class-name = org.tribuo.classification.sgd.linear.LogisticRegressionTrainer
    seed = 12345
    minibatchSize = 1
    shuffle = true
    epochs = 5
    optimiser = AdaGrad(
        class-name = org.tribuo.math.optimisers.AdaGrad
        epsilon = 0.1
        initialLearningRate = 1.0
        initialValue = 0.0
        host-short-name = StochasticGradientOptimiser
    )
    loggingInterval = 1000
    objective = LogMulticlass(
        class-name = org.tribuo.classification.sgd.objectives.LogMulticlass
        host-short-name = LabelObjective
    )
    tribuo-version = 4.3.0
    train-invocation-count = 0
    is-sequence = false
    host-short-name = Trainer
)


The irises model is a logistic regression using seed 12345, and it's the first model trained by that trainer (as train-invocation-count is zero).
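The invocation count is how Tribuo distinguishes models trained sequentially by the same trainer. A small sketch using a fresh trainer and the dataset recovered earlier (note this trains two extra models purely for illustration):

```java
// Each call to train increments the trainer's invocation count, which is recorded
// in the resulting model's provenance as train-invocation-count.
var freshTrainer = new LogisticRegressionTrainer();
var firstModel = freshTrainer.train(dataset);   // provenance records train-invocation-count = 0
var secondModel = freshTrainer.train(dataset);  // provenance records train-invocation-count = 1
System.out.println(freshTrainer.getInvocationCount()); // prints 2
```

This is why the reproducibility system must restore the trainer's RNG state: a trainer's output depends on how many times it has already been invoked.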

In [7]:
var reproduction = repro.reproduceFromModel();
var reproducedModel = (LinearSGDModel) reproduction.model();


We can compare this provenance to the one in the original model using our diff tool; however, as Tribuo records construction timestamps, the two provenances will not be identical.

In [8]:
System.out.println(ReproUtil.diffProvenance(loadedModel.getProvenance(),reproducedModel.getProvenance()));

{
  "dataset" : {
    "datasource" : {
      "source" : {
        "datasource-creation-time" : {
          "original" : "2022-10-07T11:20:06.279351-04:00",
          "reproduced" : "2022-10-07T12:03:48.921236415-04:00"
        }
      }
    }
  },
  "java-version" : {
    "original" : "12",
    "reproduced" : "17.0.4.1"
  },
  "trained-at" : {
    "original" : "2022-10-07T11:20:06.643297-04:00",
    "reproduced" : "2022-10-07T12:03:49.150931420-04:00"
  }
}


We can see that the timestamps are a little different; the precise difference will depend on when you ran the irises tutorial. You may also see differences in the JVM or other machine provenance if you ran that tutorial on a different machine. If the irises dataset gains a new feature column or additional rows in the same file, then the diff will show that the datasets have different numbers of features or examples, and that the file has a different hash.

For some models we can easily compare the model contents, e.g., for the logistic regression we can directly compare the model weights.

In [9]:
var originalWeights = loadedModel.getWeightsCopy();
var reproducedWeights = reproducedModel.getWeightsCopy();

System.out.println("Weights are equal = " + originalWeights.equals(reproducedWeights));

Weights are equal = true


## Reproducing an ONNX exported Tribuo Model¶

Tribuo models can be exported into the ONNX format. When a Tribuo model is exported, its provenance is stored as a metadata field in the ONNX file. This doesn't affect other systems which serve the ONNX model, but it allows Tribuo to load the provenance back in if the model is loaded as an ONNXExternalModel, which is Tribuo's class for loading ONNX models.

To load a model in as an ONNXExternalModel we need to define the feature and label mappings which should be written out separately when the ONNX model is exported. We're going to cheat slightly and get them from the MNIST training set itself.

In [10]:
var labelFactory = new LabelFactory();
var mnistTrainSource = new IDXDataSource<>(Paths.get("train-images-idx3-ubyte.gz"),Paths.get("train-labels-idx1-ubyte.gz"),labelFactory);
var mnistTestSource = new IDXDataSource<>(Paths.get("t10k-images-idx3-ubyte.gz"),Paths.get("t10k-labels-idx1-ubyte.gz"),labelFactory);
var mnistTrain = new MutableDataset<>(mnistTrainSource);
var mnistTest = new MutableDataset<>(mnistTestSource);

Map<String, Integer> mnistFeatureMap = new HashMap<>();
for (VariableInfo f : mnistTrain.getFeatureIDMap()) {
    VariableIDInfo id = (VariableIDInfo) f;
    mnistFeatureMap.put(id.getName(), id.getID());
}
Map<Label, Integer> mnistOutputMap = new HashMap<>();
for (Pair<Integer,Label> l : mnistTrain.getOutputIDInfo()) {
    mnistOutputMap.put(l.getB(), l.getA());
}


Now let's load in the ONNX file:

In [11]:
var ortEnv = OrtEnvironment.getEnvironment();
var sessionOpts = new OrtSession.SessionOptions();
var denseTransformer = new DenseTransformer();
var labelTransformer = new LabelTransformer();
var mnistModelPath = Paths.get(".","fm-mnist.onnx");
ONNXExternalModel<Label> onnx = ONNXExternalModel.createOnnxModel(labelFactory, mnistFeatureMap, mnistOutputMap,
        denseTransformer, labelTransformer, sessionOpts, mnistModelPath, "input");


This model has two provenance objects, one from the creation of the ONNXExternalModel, and one from the original training run in Tribuo which is persisted inside the ONNX file.

In [12]:
System.out.println(ProvenanceUtil.formattedProvenanceString(onnx.getProvenance()));

ONNXExternalModel(
    class-name = org.tribuo.interop.onnx.ONNXExternalModel
    dataset = Dataset(
        class-name = org.tribuo.Dataset
        datasource = DataSource(
            description = unknown-external-data
            outputFactory = LabelFactory(
                class-name = org.tribuo.classification.LabelFactory
            )
            datasource-creation-time = 2022-10-07T12:03:57.351723125-04:00
        )
        transformations = List[]
        is-sequence = false
        is-dense = false
        num-examples = -1
        num-features = 717
        num-outputs = 10
        tribuo-version = 4.3.0
    )
    trainer = Trainer(
        class-name = org.tribuo.Trainer
        fileModifiedTime = 2022-10-07T11:46:10.476-04:00
        location = file:/local/ExternalRepositories/tribuo/tutorials/./fm-mnist.onnx
    )
    trained-at = 2022-10-07T12:03:57.349886186-04:00
    instance-values = Map{
        model-domain=org.tribuo.tutorials.onnxexport.fm
        model-graphname=FMClassificationModel
        model-producer=Tribuo
        model-version=0
        input-name=input
    }
    tribuo-version = 4.3.0
    java-version = 17.0.4.1
    os-name = Linux
    os-arch = amd64
)


The ONNXExternalModel provenance has a lot of placeholders in it, as you might expect given the information is not always present in ONNX files.

We can load the Tribuo model provenance using getTribuoProvenance():

In [13]:
var tribuoProvenance = onnx.getTribuoProvenance().get();
System.out.println(ProvenanceUtil.formattedProvenanceString(tribuoProvenance));

FMClassificationModel(
    class-name = org.tribuo.classification.sgd.fm.FMClassificationModel
    dataset = MutableDataset(
        class-name = org.tribuo.MutableDataset
        datasource = IDXDataSource(
            class-name = org.tribuo.datasource.IDXDataSource
            outputFactory = LabelFactory(
                class-name = org.tribuo.classification.LabelFactory
            )
            outputPath = /local/ExternalRepositories/tribuo/tutorials/train-labels-idx1-ubyte.gz
            featuresPath = /local/ExternalRepositories/tribuo/tutorials/train-images-idx3-ubyte.gz
            features-file-modified-time = 2000-07-21T14:20:24-04:00
            output-resource-hash = 3552534A0A558BBED6AED32B30C495CCA23D567EC52CAC8BE1A0730E8010255C
            datasource-creation-time = 2022-10-07T11:45:53.253680-04:00
            output-file-modified-time = 2000-07-21T14:20:27-04:00
            idx-feature-type = UBYTE
            features-resource-hash = 440FCABF73CC546FA21475E81EA370265605F56BE210A4024D2CA8F203523609
            host-short-name = DataSource
        )
        transformations = List[]
        is-sequence = false
        is-dense = false
        num-examples = 60000
        num-features = 717
        num-outputs = 10
        tribuo-version = 4.3.0
    )
    trainer = FMClassificationTrainer(
        class-name = org.tribuo.classification.sgd.fm.FMClassificationTrainer
        seed = 12345
        variance = 0.1
        minibatchSize = 1
        factorizedDimSize = 6
        shuffle = true
        epochs = 5
        optimiser = AdaGrad(
            class-name = org.tribuo.math.optimisers.AdaGrad
            epsilon = 0.1
            initialLearningRate = 0.1
            initialValue = 0.0
            host-short-name = StochasticGradientOptimiser
        )
        loggingInterval = 30000
        objective = LogMulticlass(
            class-name = org.tribuo.classification.sgd.objectives.LogMulticlass
            host-short-name = LabelObjective
        )
        tribuo-version = 4.3.0
        train-invocation-count = 0
        is-sequence = false
        host-short-name = Trainer
    )
    trained-at = 2022-10-07T11:46:09.759423-04:00
    instance-values = Map{}
    tribuo-version = 4.3.0
    java-version = 12
    os-name = Linux
    os-arch = amd64
)


From this provenance we can see that the model is a factorization machine running on MNIST (as expected). So now we can build a ReproUtil and rebuild the model.

In [14]:
var mnistRepro = new ReproUtil<>(tribuoProvenance,Label.class);

var reproducedMNISTModel = mnistRepro.reproduceFromProvenance();


We can diff the two provenances:

In [15]:
System.out.println(ReproUtil.diffProvenance(tribuoProvenance, reproducedMNISTModel.getProvenance()));

{
  "dataset" : {
    "datasource" : {
      "datasource-creation-time" : {
        "original" : "2022-10-07T11:45:53.253680-04:00",
        "reproduced" : "2022-10-07T12:04:03.138366189-04:00"
      }
    }
  },
  "java-version" : {
    "original" : "12",
    "reproduced" : "17.0.4.1"
  },
  "trained-at" : {
    "original" : "2022-10-07T11:46:09.759423-04:00",
    "reproduced" : "2022-10-07T12:04:15.478400652-04:00"
  }
}


As before, it's not very interesting: we're using the same files, so only the creation timestamps differ. Checking the model weights is tricky with an ONNX model, so we instead check that the predictions are the same (though Tribuo computes in doubles while ONNX Runtime uses floats, so the answers differ slightly). We'll borrow the checkPredictions function from the ONNX export tutorial.

In [16]:
public boolean checkPredictions(List<Prediction<Label>> nativePredictions, List<Prediction<Label>> onnxPredictions, double delta) {
    for (int i = 0; i < nativePredictions.size(); i++) {
        Prediction<Label> tribuo = nativePredictions.get(i);
        Prediction<Label> external = onnxPredictions.get(i);
        // Check the predicted label
        if (!tribuo.getOutput().getLabel().equals(external.getOutput().getLabel())) {
            System.out.println("At index " + i + " predictions are not equal - "
                    + tribuo.getOutput().getLabel() + " and "
                    + external.getOutput().getLabel());
            return false;
        }
        // Check the maximum score
        if (Math.abs(tribuo.getOutput().getScore() - external.getOutput().getScore()) > delta) {
            System.out.println("At index " + i + " predictions are not equal - "
                    + tribuo.getOutput() + " and "
                    + external.getOutput());
            return false;
        }
        // Check the score distribution
        for (Map.Entry<String, Label> l : tribuo.getOutputScores().entrySet()) {
            Label other = external.getOutputScores().get(l.getKey());
            if (other == null) {
                System.out.println("At index " + i + " failed to find label " + l.getKey() + " in ORT prediction.");
                return false;
            } else {
                if (Math.abs(l.getValue().getScore() - other.getScore()) > delta) {
                    System.out.println("At index " + i + " predictions are not equal - "
                            + tribuo.getOutputScores() + " and "
                            + external.getOutputScores());
                    return false;
                }
            }
        }
    }
    return true;
}


Now we can make predictions from both models and compare the outputs:

In [17]:
var onnxPredictions = onnx.predict(mnistTest);
var reproducedPredictions = reproducedMNISTModel.predict(mnistTest);

System.out.println("Predictions are equal = " + checkPredictions(reproducedPredictions,onnxPredictions,1e-5));

Predictions are equal = true


## Working with provenance diffs¶

We can use the provenance diff methods to compute diffs for unrelated models too. We're going to train a logistic regression on MNIST and compare the model provenance against the ONNX factorization machine we just used.

In [18]:
var lrTrainer = new LogisticRegressionTrainer();
var lrModel = lrTrainer.train(mnistTrain);

System.out.println(ReproUtil.diffProvenance(tribuoProvenance, lrModel.getProvenance()));

{
  "class-name" : {
    "original" : "org.tribuo.classification.sgd.fm.FMClassificationModel",
    "reproduced" : "org.tribuo.classification.sgd.linear.LinearSGDModel"
  },
  "dataset" : {
    "datasource" : {
      "datasource-creation-time" : {
        "original" : "2022-10-07T11:45:53.253680-04:00",
        "reproduced" : "2022-10-07T12:03:56.006018468-04:00"
      }
    }
  },
  "java-version" : {
    "original" : "12",
    "reproduced" : "17.0.4.1"
  },
  "trained-at" : {
    "original" : "2022-10-07T11:46:09.759423-04:00",
    "reproduced" : "2022-10-07T12:04:24.453627627-04:00"
  },
  "trainer" : {
    "class-name" : {
      "original" : "org.tribuo.classification.sgd.fm.FMClassificationTrainer",
      "reproduced" : "org.tribuo.classification.sgd.linear.LogisticRegressionTrainer"
    },
    "loggingInterval" : {
      "original" : "30000",
      "reproduced" : "1000"
    },
    "optimiser" : {
      "initialLearningRate" : {
        "original" : "0.1",
        "reproduced" : "1.0"
      }
    },
    "factorizedDimSize" : {
      "original" : "6"
    },
    "variance" : {
      "original" : "0.1"
    }
  }
}


This diff is longer than the others we've seen, as expected for two different models with different trainers. The dataset section is mostly empty, as both models were trained on an unmodified MNIST training set. The FMClassificationTrainer and LogisticRegressionTrainer show more differences, but as both are SGD-based models they share many fields: a loss function (both used LogMulticlass), a gradient optimiser (both used AdaGrad), the number of training epochs, and the minibatch size. They used different learning rates, which appear in the diff under optimiser, and the factorization machine has a few extra parameters not found in the logistic regression, factorizedDimSize and variance, which are reported with only an original value, meaning they appear in the first provenance but not the second.

The current diff format is JSON, and designed to be easily human readable. We left designing a navigable diff object which is easily inspectable from code to future work once we have a better understanding of how people want to use the generated diffs.
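Since the diff is a JSON string, it can be navigated with any JSON parser in the meantime. A sketch using Jackson (an assumption: Jackson is pulled in as a dependency of the tribuo-json jar loaded earlier, so it should already be on the classpath in this notebook):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Parse the diff string produced above into a JSON tree
String diff = ReproUtil.diffProvenance(tribuoProvenance, lrModel.getProvenance());
JsonNode root = new ObjectMapper().readTree(diff);
// Pull out a single changed field: the original trainer class name
System.out.println(root.path("trainer").path("class-name").path("original").asText());
```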

## Conclusion¶

We showed how to load in Tribuo models and reproduce them using our automated reproducibility system. The system executes the same computations as the original training, which in most cases results in an identical model. We have noted some differences in gradient-descent-based models trained on ARM versus x86 architectures, due to underlying differences in the JVM, but otherwise the reproductions are exact. Over time we plan to expand this reproducibility system into a full experimental framework, allowing models to be rebuilt using different datasets, data transformations, or training hyperparameters while holding all other parameters constant.