Loading columnar data into Tribuo

This tutorial demonstrates Tribuo's systems for loading and featurising complex columnar data such as CSV files, JSON documents, and SQL database tables.

Tribuo's Example objects are tuples of a feature list and an output. The features are pairs of a String name and a double value. The outputs differ depending on the task. For example, classification tasks use Label objects, which denote named members of a finite categorical set, while regression tasks use Regressor objects, which denote (optionally multiple, optionally named) real values. (Note that this means Tribuo Regressors support multidimensional regression by default.) Unlike standard machine learning data formats like svmlight/libsvm or IDX, tabular data needs to be converted into Tribuo's representation before it can be loaded. In Tribuo this ETL (Extract-Transform-Load) problem is solved by the org.tribuo.data.columnar package, and specifically the RowProcessor and its associated infrastructure.
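To make that structure concrete, here is a plain-JDK sketch of the Example shape described above (these record types are illustrative, not Tribuo's actual classes, where the output would be a Label or Regressor):

```java
import java.util.List;

// A plain-JDK sketch (not Tribuo's actual classes) of the structure described
// above: an Example is a list of (String name, double value) feature tuples
// paired with an output whose type depends on the task.
record SimpleFeature(String name, double value) {}
record SimpleExample<T>(List<SimpleFeature> features, T output) {}

public class ExampleSketch {
    public static void main(String[] args) {
        // A classification-style example; in Tribuo the output would be a Label.
        var example = new SimpleExample<>(
                List.of(new SimpleFeature("height", 1.73),
                        new SimpleFeature("transport@police-box", 1.0)),
                "Good");
        System.out.println(example.features().size() + " features, output=" + example.output());
    }
}
```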

The RowProcessor

Constructing an instance of RowProcessor<T> equips it with the configuration it needs to accept a Map<String,String> representing the keys and values extracted from a single row of data and emit an Example<T>, performing all specified feature and output conversions. In addition to processing values into features or outputs, the RowProcessor can extract other fields from the row and write them into the Example metadata, allowing the storage of timestamps, row ids and other information.

There are four configurable elements to the RowProcessor:

  • FieldProcessor which converts a single cell of a columnar input (read as a String) into potentially many named Feature objects.
  • FeatureProcessor which processes a list of Feature objects (e.g., to remove duplicates, add conjunctions or replace features with other features).
  • ResponseProcessor which converts a single cell of a columnar input (read as a String) into an Output subclass by passing it to the specified OutputFactory.
  • FieldExtractor which extracts a single metadata field while processing the whole row at once.

Each of these is an interface which has multiple implementations available in Tribuo, and you can supply custom ones for specific processing.
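The division of labour in that list can be illustrated with a plain-JDK sketch of what a one-hot FieldProcessor conceptually does (this mimics the behaviour of Tribuo's IdentityProcessor, but the real interface has more methods, including provenance support):

```java
import java.util.Map;

// A plain-JDK sketch of the FieldProcessor concept: convert one cell of a
// column (read as a String) into named features. This mimics the one-hot
// behaviour of Tribuo's IdentityProcessor; the real interface is richer.
public class OneHotSketch {
    // field "transport", cell "police-box" -> feature "transport@police-box" = 1.0
    static Map<String, Double> process(String fieldName, String cellValue) {
        return Map.of(fieldName + "@" + cellValue, 1.0);
    }

    public static void main(String[] args) {
        System.out.println(process("transport", "police-box"));
        // prints {transport@police-box=1.0}
    }
}
```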

The RowProcessor can be used at both training and inference time. If the output isn't available (because the Example is constructed at inference/deployment time), the returned Example will contain a sentinel "Unknown" output of the appropriate type. Thanks to Tribuo's provenance system, the RowProcessor configuration is stored inside the Model object and so can be reconstructed at inference time from just the model; no other code is required.

ColumnarDataSource

The RowProcessor provides the transformation logic that converts rows into features, outputs, and metadata; the remainder of the logic that generates the row and Example objects resides in ColumnarDataSource. In general you'll use one of its subclasses: CSVDataSource, JsonDataSource or SQLDataSource, depending on the format of the input data. If there are other columnar formats you'd like to see, open an issue on our GitHub page and we can discuss how they could fit into Tribuo.

In this tutorial we'll process a couple of example files in csv and json formats to see how to programmatically construct a CSVDataSource and a JsonDataSource, check that the two formats produce the same examples, and then look at how to use the RowProcessor at inference time.

Setup

First we need to pull in the necessary Tribuo jars. We use the classification jar for the small test model we build at the end, and the json jar for json processing (naturally).

In [1]:
%jars ./tribuo-classification-experiments-4.0.2-jar-with-dependencies.jar
%jars ./tribuo-json-4.0.2-jar-with-dependencies.jar

Import the necessary classes from Tribuo and the JDK:

In [2]:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

import com.oracle.labs.mlrg.olcut.config.ConfigurationManager;
import com.oracle.labs.mlrg.olcut.provenance.ProvenanceUtil;

import org.tribuo.*;
import org.tribuo.data.columnar.*;
import org.tribuo.data.columnar.processors.field.*;
import org.tribuo.data.columnar.processors.response.*;
import org.tribuo.data.columnar.extractors.*;
import org.tribuo.data.csv.CSVDataSource;
import org.tribuo.data.text.impl.BasicPipeline;
import org.tribuo.json.JsonDataSource;
import org.tribuo.classification.*;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.util.tokens.impl.BreakIteratorTokenizer;

Reading a CSV file

Tribuo provides two mechanisms for reading CSV files: CSVLoader and CSVDataSource. We saw CSVLoader in the classification and regression tutorials; it's designed for loading CSV files which are purely numeric apart from the response column, where all non-response columns should be used as features. CSVLoader gets a simple file off disk and into Tribuo as quickly as possible, but it is intentionally quite restrictive and doesn't capture most uses of CSV files. CSVDataSource, in contrast, provides full control over how features, outputs and metadata fields are processed and populated from a given CSV file, and is the standard way of loading CSV files into Tribuo.

Now let's look at the first few rows of the example CSV file we're going to use (note these example files live next to this notebook in Tribuo's tutorials directory):

In [3]:
var csvPath = Paths.get("columnar-data","columnar-example.csv");
var csvLines = Files.readAllLines(csvPath, StandardCharsets.UTF_8);
csvLines.stream().limit(5).forEach(System.out::println);
id,timestamp,example-weight,height,description,transport,disposition,extra-a,extra-b,extra-c
1,14/10/2020 16:07,0.50,1.73,"aged, grey-white hair, blue eyes, stern disposition",police-box,Good,4.83881,-0.7685,0.87706
2,15/10/2020 16:07,0.50,1.73,"impish, black hair, blue eyes, unkempt",police-box,Good,1.10026,0.51655,0.9632
3,16/10/2020 16:07,1.00,1.9,"grey curly hair, frilly shirt, cape, blue eyes",police-box,Good,0.78601,0.49001,3.03926
4,17/10/2020 16:07,1.00,1.91,"brown curly hair, long multicoloured scarf, jelly babies",police-box,Good,3.56195,2.11667,-0.55366

There are three metadata fields ("id", "timestamp" and "example-weight"), a numerical field ("height"), a text field ("description"), two categorical fields ("transport" and "disposition"), and a further three numerical fields ("extra-a", "extra-b", "extra-c"). This is a small generated dataset without a clear classification boundary; it's just to demonstrate the features of the RowProcessor. We're going to pick "disposition" as the target field for our classification, with the two possible labels "Good" and "Bad".

We construct the necessary field processors, one that uses the double value directly for height, one which processes the field using a text pipeline emitting bigrams for description, and one which generates a one hot encoded categorical for transport.

In [4]:
var fieldProcessors = new HashMap<String,FieldProcessor>();
fieldProcessors.put("height",new DoubleFieldProcessor("height"));
fieldProcessors.put("description",new TextFieldProcessor("description",new BasicPipeline(new BreakIteratorTokenizer(Locale.US),2)));
fieldProcessors.put("transport",new IdentityProcessor("transport"));

For the remaining three fields we use the regular expression matching mechanism to generate the field processors for us. We supply the regex "extra.*" and the RowProcessor will copy the supplied FieldProcessor for each field which matches the regex. In this case it will generate three DoubleFieldProcessors in total, one each for "extra-a", "extra-b" and "extra-c". Note that the field name supplied to the DoubleFieldProcessor is ignored when the new processors are generated for each matching field.

In [5]:
var regexMappingProcessors = new HashMap<String,FieldProcessor>();
regexMappingProcessors.put("extra.*", new DoubleFieldProcessor("extra.*"));
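The expansion the RowProcessor performs from that regex mapping can be sketched in plain Java (this is an illustration of the idea, not Tribuo's implementation): each header matching the regex gets its own concrete mapping entry.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// A plain-JDK illustration of regex-to-field expansion: for each header
// matching a regex, create a concrete mapping keyed by the real field name.
public class RegexExpansionSketch {
    static Map<String, String> expand(Map<String, String> regexMapping, List<String> headers) {
        var expanded = new HashMap<String, String>();
        for (var entry : regexMapping.entrySet()) {
            var pattern = Pattern.compile(entry.getKey());
            for (var header : headers) {
                if (pattern.matcher(header).matches()) {
                    // The supplied processor is duplicated for the concrete field name.
                    expanded.put(header, entry.getValue());
                }
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        var headers = List.of("id", "height", "extra-a", "extra-b", "extra-c");
        System.out.println(expand(Map.of("extra.*", "DoubleFieldProcessor"), headers));
        // -> one mapping each for extra-a, extra-b and extra-c
    }
}
```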

Now we construct the response processor for the "disposition" field. As it's categorical and we're performing classification, the standard FieldResponseProcessor will do the trick.

In [6]:
var responseProcessor = new FieldResponseProcessor("disposition","UNK",new LabelFactory());

Finally we set up the metadata extraction. This step is optional (the row processor ignores fields that don't have a FieldProcessor or ResponseProcessor mapping), but it's useful to be able to link an example back to the original data when using the predictions downstream.

In [7]:
var metadataExtractors = new ArrayList<FieldExtractor<?>>();
metadataExtractors.add(new IntExtractor("id"));
metadataExtractors.add(new DateExtractor("timestamp","timestamp","dd/MM/yyyy HH:mm"));
Out[7]:
true

In the DateExtractor the first "timestamp" is the name of the field from which we're extracting, while the second is the name to give the extracted date in the metadata store.
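The format string follows java.time.format.DateTimeFormatter patterns, so you can check it independently of Tribuo:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Checking the "dd/MM/yyyy HH:mm" pattern used by the DateExtractor above
// against the first timestamp in the example file.
public class DatePatternCheck {
    public static void main(String[] args) {
        var formatter = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm");
        var parsed = LocalDateTime.parse("14/10/2020 16:07", formatter);
        System.out.println(parsed.toLocalDate()); // prints 2020-10-14
    }
}
```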

We'll also make a weight extractor which reads from the "example-weight" field.

In [8]:
var weightExtractor = new FloatExtractor("example-weight");

Now we can construct the RowProcessor. If you don't want to weight the examples you can set the second argument to null. Similarly we're not doing any feature processing in this example, so we'll supply Collections.emptySet().

In [9]:
var rowProcessor = new RowProcessor<Label>(metadataExtractors,weightExtractor,responseProcessor,fieldProcessors,regexMappingProcessors,Collections.emptySet());

With a row processor built, we can finally construct the CSVDataSource to read our file. Note that the RowProcessor uses Tribuo's configuration system, so we can construct one from a configuration file to save hard coding the csv (or other format) schema. We can also write out the RowProcessor instance we've created into a configuration file for later use. We'll look at this later on when we rebuild this row processor from a trained Model's provenance.

In [10]:
var csvSource = new CSVDataSource<Label>(csvPath,rowProcessor,true); 
// The boolean argument indicates whether the reader should fail if an output value is missing. 
// Typically it is true at train/test time, but false in deployment/live use when true output values are unknown.

var datasetFromCSV = new MutableDataset<Label>(csvSource);

System.out.println("Number of examples = " + datasetFromCSV.size());
System.out.println("Number of features = " + datasetFromCSV.getFeatureMap().size());
System.out.println("Label domain = " + datasetFromCSV.getOutputIDInfo().getDomain());
Number of examples = 20
Number of features = 148
Label domain = [BAD, GOOD]

Let's look at the first example and see what features and metadata are extracted.

In [11]:
System.out.println(datasetFromCSV.getExample(0).toString());
ArrayExample(numFeatures=22,output=GOOD,weight=0.5,metadata={id=1, timestamp=2020-10-14},features=[(description@1-N=,, 1.0), (description@1-N=aged, 1.0), (description@1-N=blue, 1.0), (description@1-N=disposition, 1.0), (description@1-N=eyes, 1.0), (description@1-N=grey-white, 1.0), (description@1-N=hair, 1.0), (description@1-N=stern, 1.0), (description@2-N=,/blue, 1.0), (description@2-N=,/grey-white, 1.0), (description@2-N=,/stern, 1.0), (description@2-N=aged/,, 1.0), (description@2-N=blue/eyes, 1.0), (description@2-N=eyes/,, 1.0), (description@2-N=grey-white/hair, 1.0), (description@2-N=hair/,, 1.0), (description@2-N=stern/disposition, 1.0), (extra-a@value, 4.83881), (extra-b@value, -0.7685), (extra-c@value, 0.87706), (height@value, 1.73), (transport@police-box, 1.0), ])

We can see the two metadata fields have been populated, one for the id and one for the timestamp, and that the output label is GOOD. The weight of 0.5 has also been extracted (otherwise it defaults to 1.0). Next come the text unigrams and bigrams extracted from the description field. Unigrams are named description@1-N=<token> and bigrams are named description@2-N=<token>/<token>, and the value is the number of times that unigram or bigram occurred in the text. After the text features come the three features extracted via the regex expansion, extra-a, extra-b and extra-c, each with its floating point value. Then comes height with its floating point value, and finally transport extracted as a one-hot categorical feature: any given example has either transport@police-box or transport@starship as a feature with the value 1.0, never both.
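The unigram/bigram naming convention can be sketched in plain Java (tokenization here is a naive whitespace split rather than the BreakIteratorTokenizer used above):

```java
import java.util.ArrayList;
import java.util.List;

// A plain-JDK sketch of the feature naming convention described above:
// unigrams become field@1-N=<token> and bigrams become field@2-N=<a>/<b>.
// Tokenization is a naive whitespace split, not the BreakIteratorTokenizer.
public class NgramNamingSketch {
    static List<String> featureNames(String field, String text) {
        var tokens = text.split("\\s+");
        var names = new ArrayList<String>();
        for (var t : tokens) {
            names.add(field + "@1-N=" + t);
        }
        for (int i = 0; i < tokens.length - 1; i++) {
            names.add(field + "@2-N=" + tokens[i] + "/" + tokens[i + 1]);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(featureNames("description", "grey hair"));
        // prints [description@1-N=grey, description@1-N=hair, description@2-N=grey/hair]
    }
}
```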

Reading a JSON file

Tribuo's JsonDataSource supports reading flat JSON objects from a JSON array, which is fairly restrictive. JSON is such a flexible format that it's hard to build a parser for every possible layout, but JsonDataSource should be a good place to start if you need something more complicated.

We'll use a JSON version of the CSV file from above, and again we'll print the first few lines from the JSON file to show the format.

In [12]:
var jsonPath = Paths.get("columnar-data","columnar-example.json");
var jsonLines = Files.readAllLines(jsonPath, StandardCharsets.UTF_8);
jsonLines.stream().limit(14).forEach(System.out::println);
[
 {
   "id": 1,
   "timestamp": "14/10/2020 16:07",
   "example-weight": 0.5,
   "height": 1.73,
   "description": "aged, grey-white hair, blue eyes, stern disposition",
   "transport": "police-box",
   "disposition": "Good",
   "extra-a": 4.83881,
   "extra-b": -0.7685,
   "extra-c": 0.87706
 },
 {

We can re-use the RowProcessor from earlier, as it doesn't know anything about the serialized format of the data, and supply it to the JsonDataSource constructor.

In [13]:
var jsonSource = new JsonDataSource<>(jsonPath,rowProcessor,true);

var datasetFromJson = new MutableDataset<Label>(jsonSource);

System.out.println("Number of examples = " + datasetFromJson.size());
System.out.println("Number of features = " + datasetFromJson.getFeatureMap().size());
System.out.println("Label domain = " + datasetFromJson.getOutputIDInfo().getDomain());
Number of examples = 20
Number of features = 148
Label domain = [BAD, GOOD]

As the CSV file and the JSON file contain the same data, we should get the same examples out, in the same order. Note the DataSource provenances will not be the same (as the hashes, timestamps and file paths are different), and datasets don't implement equals so we need to do this comparison example by example.

In [14]:
boolean isEqual = true;
for (int i = 0; i < datasetFromJson.size(); i++) {
    boolean equals = datasetFromJson.getExample(i).equals(datasetFromCSV.getExample(i));
    if (!equals) {
        isEqual = false; // record the mismatch, otherwise isEqual stays true
        System.out.println("Example " + i + " not equal");
        System.out.println("JSON - " + datasetFromJson.getExample(i).toString());
        System.out.println("CSV - " + datasetFromCSV.getExample(i).toString());
    }
}
System.out.println("isEqual = " + isEqual);
isEqual = true

Now we're going to train a simple model, to show how to rebuild the RowProcessor from the model's provenance object (which allows you to rebuild the data ingest pipeline from the model itself).

First we train the model.

In [15]:
var model = new LogisticRegressionTrainer().train(datasetFromJson);

Then we extract the dataset provenance and convert it into a configuration.

In [16]:
var dataProvenance = model.getProvenance().getDatasetProvenance();
var provConfig = ProvenanceUtil.extractConfiguration(dataProvenance);

Then we feed the configuration to a ConfigurationManager so we can rebuild the data ingest pipeline used at training time.

In [17]:
var cm = new ConfigurationManager();
cm.addConfiguration(provConfig);

Now the ConfigurationManager contains all the configuration necessary to rebuild the DataSource we used to build the model. However all we want is the RowProcessor instance, as the JsonDataSource itself isn't particularly useful at inference time.

In [18]:
RowProcessor<Label> newRowProcessor = (RowProcessor<Label>) cm.lookup("rowprocessor-1");

This row processor has the original regexes inside it rather than the concretely expanded FieldProcessors bound to each field that matched the regex, so first we need to expand the row processor with the headers from the original DataSource (or our inference time data). Then we'll pass it another row and look at the Example produced to check that everything is working. As this is a test time example, we don't have a ground truth output so we pass in false for the boolean outputRequired argument to RowProcessor.generateExample.

In [19]:
Map<String,String> newRow = Map.of("id","21","timestamp","03/11/2020 16:07","height","1.75","description","brown leather trenchcoat, grey hair, grey goatee","transport","police-box","extra-a","0.81754","extra-b","2.56158","extra-c","-1.21636");
var headers = Collections.unmodifiableList(new ArrayList<>(newRow.keySet()));
var row = new ColumnarIterator.Row(21,headers,newRow);
newRowProcessor.expandRegexMapping(headers);
Example<Label> example = newRowProcessor.generateExample(row,false).get();
example.toString();
Out[19]:
ArrayExample(numFeatures=19,output=UNK,weight=1.0,metadata={id=21, timestamp=2020-11-03},features=[(description@1-N=,, 1.0)(description@1-N=brown, 1.0), (description@1-N=goatee, 1.0), (description@1-N=grey, 1.0), (description@1-N=hair, 1.0), (description@1-N=leather, 1.0), (description@1-N=trenchcoat, 1.0), (description@2-N=,/grey, 1.0), (description@2-N=brown/leather, 1.0), (description@2-N=grey/goatee, 1.0), (description@2-N=grey/hair, 1.0), (description@2-N=hair/,, 1.0), (description@2-N=leather/trenchcoat, 1.0), (description@2-N=trenchcoat/,, 1.0), (extra-a@value, 0.81754), (extra-b@value, 2.56158), (extra-c@value, -1.21636), (height@value, 1.75), (transport@police-box, 1.0), ])

We can see the metadata and features have been extracted as before. We didn't supply an "example-weight" field so the weight is set to the default value of 1.0. As there was no disposition field, we can see the output has been set to the sentinel unknown output, shown here as UNK. But we can ask our simple linear model what the disposition for this example should be:

In [20]:
var prediction = model.predict(example);
prediction.toString();
Out[20]:
Prediction(maxLabel=(BAD,0.96797245146932),outputScores={BAD=(BAD,0.96797245146932),GOOD=(GOOD,0.03202754853068013)})

It appears that our model thinks this example is BAD, though personally I'm not so sure that's the right label. Either way, we managed to produce a test time example using only information encoded in our model, so our ETL pipeline is stored safely inside the model, ready whenever we need it.

Conclusion

We've used Tribuo's columnar data infrastructure to process two different kinds of columnar input: CSV files and JSON files. We saw how the central part of the columnar infrastructure, the RowProcessor, can be configured to extract different kinds of features, metadata and outputs, and how it is stored along with the rest of the training metadata in a trained model's provenance object. Finally we saw how to extract the RowProcessor from the model provenance and use it to generate an example at inference time, replicating the input processing used during training.