Loading columnar data into Tribuo¶
This tutorial demonstrates Tribuo's systems for loading and featurising complex columnar data like csv files, json, and SQL database tables.
Tribuo's Example objects are tuples of a feature list and an output. The features are tuples of String names and double values. The outputs differ depending on the task: classification tasks use Label objects, which denote named members of a finite categorical set, while regression tasks use Regressor objects, which denote (optionally multiple, optionally named) real values. (Note that this means Tribuo Regressors support multidimensional regression by default.) Unlike standard Machine Learning data formats like svmlight/libsvm or IDX, tabular data needs to be converted into Tribuo's representation before it can be loaded. In Tribuo this ETL (Extract-Transform-Load) problem is solved by the org.tribuo.data.columnar package, and specifically the RowProcessor and its associated infrastructure.
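To make that representation concrete, here's a minimal sketch (not part of the original tutorial code) which builds an Example by hand using ArrayExample from org.tribuo.impl; the Regressor line assumes the regression classes are on the classpath, which they aren't in this notebook:
import org.tribuo.Example;
import org.tribuo.impl.ArrayExample;
import org.tribuo.classification.Label;
import org.tribuo.regression.Regressor;

// A classification example: a Label output plus named, double-valued features.
Example<Label> handBuilt = new ArrayExample<>(new Label("Good"),
    new String[]{"height","transport@starship"},
    new double[]{1.75, 1.0});

// A multidimensional regression output: two named real-valued dimensions in one Regressor.
// (Requires tribuo-regression-core, which isn't loaded by this notebook's jars.)
Regressor multiDim = new Regressor(new String[]{"dim-y","dim-z"}, new double[]{0.5, -1.2});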
The RowProcessor¶
An instance of RowProcessor<T> is constructed with the configuration it needs to accept a Map<String,String> representing the keys and values extracted from a single row of data and emit an Example<T>, performing all the specified feature and output conversions. In addition to directly processing values into features or outputs, the RowProcessor can extract other fields from the row and write them into the Example metadata, allowing the storage of timestamps, row id numbers and other information.
There are four configurable elements to the RowProcessor:
- FieldProcessor, which converts a single cell of a columnar input (read as a String) into potentially many named Feature objects.
- FeatureProcessor, which processes a list of Feature objects (e.g., to remove duplicates, add conjunctions or replace features with other features).
- ResponseProcessor, which converts a single cell of a columnar input (read as a String) into an Output subclass by passing it to the specified OutputFactory.
- FieldExtractor, which extracts a single metadata field while processing the whole row at once.
Each of these is an interface which has multiple implementations available in Tribuo, and we expect users to write custom implementations for specific processing tasks.
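As a quick sketch of the FieldProcessor contract (these calls aren't in the original tutorial), processing a single cell value returns the ColumnarFeatures it generates:
import org.tribuo.data.columnar.processors.field.DoubleFieldProcessor;
import org.tribuo.data.columnar.processors.field.IdentityProcessor;

// DoubleFieldProcessor reads the cell as a number and emits one real-valued feature.
System.out.println(new DoubleFieldProcessor("height").process("1.75"));

// IdentityProcessor treats the cell as a categorical and emits a one-hot style feature.
System.out.println(new IdentityProcessor("transport").process("starship"));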
The RowProcessor can be used at both training and inference time. If the output isn't available because the Example is constructed at inference/deployment time, then the returned Example will contain a sentinel "Unknown" output of the appropriate type. Thanks to Tribuo's provenance system, the RowProcessor configuration is stored inside the Model object and so can be reconstructed at inference time with just the model; no other code is required.
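That sentinel output comes from the OutputFactory; as a small aside (not in the original notebook), you can inspect the one used for classification directly:
import org.tribuo.classification.LabelFactory;

// The sentinel Label returned when no ground truth output is available.
System.out.println(new LabelFactory().getUnknownOutput());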
ColumnarDataSource¶
The RowProcessor provides the transformation logic that converts rows into features, outputs, and metadata; the remainder of the logic that generates the rows and Example objects resides in ColumnarDataSource. In general you'll use one of its subclasses: CSVDataSource, JsonDataSource or SQLDataSource, depending on the format of the input data. If there are other columnar formats that you'd like to see, open an issue on our GitHub page and we can discuss how that could fit into Tribuo.
In this tutorial we'll process a couple of example files in csv and json formats to see how to programmatically construct a CSVDataSource and a JsonDataSource, check that the two formats produce the same examples, and then look at how to use the RowProcessor at inference time.
Setup¶
First we need to pull in the necessary Tribuo jars. We use the classification jar for the small test model we build at the end, and the json jar for json processing (naturally).
%jars ./tribuo-classification-experiments-4.3.0-jar-with-dependencies.jar
%jars ./tribuo-json-4.3.0-jar-with-dependencies.jar
Import the necessary classes from Tribuo and the JDK:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.stream.*;
import com.oracle.labs.mlrg.olcut.config.ConfigurationManager;
import com.oracle.labs.mlrg.olcut.provenance.ProvenanceUtil;
import org.tribuo.*;
import org.tribuo.data.columnar.*;
import org.tribuo.data.columnar.processors.field.*;
import org.tribuo.data.columnar.processors.response.*;
import org.tribuo.data.columnar.extractors.*;
import org.tribuo.data.csv.CSVDataSource;
import org.tribuo.data.text.impl.BasicPipeline;
import org.tribuo.json.JsonDataSource;
import org.tribuo.classification.*;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.util.tokens.impl.BreakIteratorTokenizer;
Reading a CSV file¶
Tribuo provides two mechanisms for reading CSV files, CSVLoader and CSVDataSource. We saw CSVLoader in the classification and regression tutorials; it's designed for loading CSV files which are purely numeric apart from the response column, where all non-response columns should be used as features. CSVLoader is designed to get a simple file off disk and into Tribuo as quickly as possible, but it doesn't capture most uses of CSV files and is intentionally quite restrictive, which is why CSVDataSource exists. CSVDataSource, in contrast, provides full control over how features, outputs and metadata fields are processed and populated from a given CSV file, and is the standard way of loading CSV files into Tribuo.
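For comparison, the CSVLoader route looks something like the sketch below (the file name here is hypothetical, and this approach wouldn't work on the mixed-type file used in the rest of this tutorial):
import org.tribuo.classification.LabelFactory;
import org.tribuo.data.csv.CSVLoader;

// CSVLoader treats every non-response column as a numeric feature.
var loader = new CSVLoader<>(new LabelFactory());
var numericSource = loader.loadDataSource(Paths.get("purely-numeric.csv"), "disposition");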
Now let's look at the first few rows of the example CSV file we're going to use (note these example files live next to this notebook in Tribuo's tutorials directory):
var csvPath = Paths.get("columnar-data","columnar-example.csv");
var csvLines = Files.readAllLines(csvPath, StandardCharsets.UTF_8);
csvLines.stream().limit(5).forEach(System.out::println);
There are three metadata fields ("id", "timestamp" and "example-weight"), a numerical field ("height"), a text field ("description"), two categorical fields ("transport" and "disposition"), and a further three numerical fields ("extra-a", "extra-b", "extra-c"). This is a small generated dataset without a clear classification boundary; it's just there to demonstrate the features of the RowProcessor. We're going to pick "disposition" as the target field for our classification, with the two possible labels "Good" and "Bad".
We construct the necessary field processors: one that uses the double value directly for "height", one that processes the field with a text pipeline emitting unigrams and bigrams for "description", and one that generates a one-hot encoded categorical for "transport". For more details on the options for processing text see the document classification tutorial, which discusses Tribuo's built-in text processing options.
var textPipeline = new BasicPipeline(new BreakIteratorTokenizer(Locale.US),2);
var fieldProcessors = new ArrayList<FieldProcessor>();
fieldProcessors.add(new DoubleFieldProcessor("height"));
fieldProcessors.add(new TextFieldProcessor("description",textPipeline));
fieldProcessors.add(new IdentityProcessor("transport"));
For the remaining three fields we use the regular expression mapping functionality to generate the field processors for us. We supply the regex "extra.*" and the RowProcessor will copy the supplied FieldProcessor for each field which matches the regex. In this case it will generate three DoubleFieldProcessors in total, one each for "extra-a", "extra-b" and "extra-c". Note that the field name supplied to the DoubleFieldProcessor is ignored when the new processors are generated for each matching field.
var regexMappingProcessors = new HashMap<String,FieldProcessor>();
regexMappingProcessors.put("extra.*", new DoubleFieldProcessor("extra.*"));
Now we construct the response processor for the "disposition" field. As it's a categorical field and we're performing classification, the standard FieldResponseProcessor will do the trick.
var responseProcessor = new FieldResponseProcessor("disposition","UNK",new LabelFactory());
Finally we set up the metadata extraction. This step is optional; the row processor ignores fields that don't have a FieldProcessor or ResponseProcessor mapping, but it's useful to be able to link an example back to the original data when using the predictions downstream.
var metadataExtractors = new ArrayList<FieldExtractor<?>>();
metadataExtractors.add(new IntExtractor("id"));
metadataExtractors.add(new DateExtractor("timestamp","timestamp","dd/MM/yyyy HH:mm"));
In the DateExtractor the first "timestamp" is the name of the field from which we're extracting, while the second is the name to give the extracted date in the metadata store.
We'll also make a weight extractor which reads from the "example-weight" field.
var weightExtractor = new FloatExtractor("example-weight");
Now we can construct the RowProcessor using its Builder. If you don't want to weight the examples you can simply leave out the setWeightExtractor call. Similarly, we're not doing any feature processing in this example, so we don't add any FeatureProcessors to the builder.
var rowProcessor = new RowProcessor.Builder<Label>()
.setMetadataExtractors(metadataExtractors)
.setWeightExtractor(weightExtractor)
.setRegexMappingProcessors(regexMappingProcessors)
.setFieldProcessors(fieldProcessors)
.build(responseProcessor);
With a row processor built, we can finally construct the CSVDataSource to read our file. Note that the RowProcessor uses Tribuo's configuration system, so we can construct one from a configuration file to save hard-coding the csv (or other format) schema. We can also write out the RowProcessor instance we've created into a configuration file for later use. We'll look at this later on when we rebuild this row processor from a trained Model's provenance.
var csvSource = new CSVDataSource<Label>(csvPath,rowProcessor,true);
// The boolean argument indicates whether the reader should fail if an output value is missing.
// Typically it is true at train/test time, but false in deployment/live use when true output values are unknown.
var datasetFromCSV = new MutableDataset<Label>(csvSource);
System.out.println("Number of examples = " + datasetFromCSV.size());
System.out.println("Number of features = " + datasetFromCSV.getFeatureMap().size());
System.out.println("Label domain = " + datasetFromCSV.getOutputIDInfo().getDomain());
Let's look at the first example and see what features and metadata are extracted.
public void printExample(Example<Label> e) {
System.out.println("Output = " + e.getOutput().toString());
System.out.println("Metadata = " + e.getMetadata());
System.out.println("Weight = " + e.getWeight());
System.out.println("Features = [" + StreamSupport.stream(e.spliterator(), false).map(Feature::toString).collect(Collectors.joining(",")) + "]");
}
printExample(datasetFromCSV.getExample(0));
We can see that the output label is GOOD and the two metadata fields have been populated, one for the id and one for the timestamp. We extracted a weight of 0.5 rather than the default of 1.0. Next come the text unigrams and bigrams extracted from the description field. Unigrams are named description@1-N=<token> and bigrams are named description@2-N=<token>,<token>, and the value is the number of times that unigram or bigram occurred in the text. After the text features come the three features extracted via the regex expansion, extra-a, extra-b and extra-c, each with its floating point value. Then comes height with the extracted floating point value, and finally transport, extracted as a one-hot categorical feature. That is, any given example can have either transport@police-box or transport@starship as a feature with the value 1.0, but never both.
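We can sanity check that one-hot encoding by listing the transport features recorded in the dataset's feature map (a small check that isn't in the original tutorial):
// Print the feature infos generated from the "transport" column.
for (var info : datasetFromCSV.getFeatureMap()) {
    if (info.getName().startsWith("transport")) {
        System.out.println(info);
    }
}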
Reading a JSON file¶
Tribuo's JsonDataSource supports reading flat json objects from a json array, which is fairly restrictive. JSON is such a flexible format it's hard to build parsers for everything, but the JsonDataSource should be a good place to start looking if you need to write something more complicated.
We'll use a JSON version of the CSV file from above, and again we'll print the first few lines from the JSON file to show the format.
var jsonPath = Paths.get("columnar-data","columnar-example.json");
var jsonLines = Files.readAllLines(jsonPath, StandardCharsets.UTF_8);
jsonLines.stream().limit(14).forEach(System.out::println);
We can re-use the RowProcessor from earlier, as it doesn't know anything about the serialized format of the data, and supply it to the JsonDataSource constructor.
var jsonSource = new JsonDataSource<>(jsonPath,rowProcessor,true);
var datasetFromJson = new MutableDataset<Label>(jsonSource);
System.out.println("Number of examples = " + datasetFromJson.size());
System.out.println("Number of features = " + datasetFromJson.getFeatureMap().size());
System.out.println("Label domain = " + datasetFromJson.getOutputIDInfo().getDomain());
As the CSV file and the JSON file contain the same data, we should get the same examples out, in the same order. Note the DataSource provenances will not be the same (as the hashes, timestamps and file paths are different) so the datasets themselves won't be equal.
boolean isEqual = true;
for (int i = 0; i < datasetFromJson.size(); i++) {
boolean equals = datasetFromJson.getExample(i).equals(datasetFromCSV.getExample(i));
if (!equals) {
System.out.println("Example " + i + " not equal");
System.out.println("JSON - " + datasetFromJson.getExample(i).toString());
System.out.println("CSV - " + datasetFromCSV.getExample(i).toString());
}
isEqual &= equals;
}
System.out.println("isEqual = " + isEqual);
Now we're going to train a simple logistic regression model, to show how to rebuild the RowProcessor from the model's provenance object (which allows you to rebuild the data ingest pipeline from the model itself).
First we train the model.
var model = new LogisticRegressionTrainer().train(datasetFromJson);
Then we extract the dataset provenance and convert it into a configuration.
var dataProvenance = model.getProvenance().getDatasetProvenance();
var provConfig = ProvenanceUtil.extractConfiguration(dataProvenance);
Then we feed the configuration to a ConfigurationManager so we can rebuild the data ingest pipeline used at training time.
var cm = new ConfigurationManager();
cm.addConfiguration(provConfig);
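At this point we could also write the recovered configuration out to disk for reuse elsewhere (a sketch; the file name is ours, not from the tutorial):
import java.io.File;

// Persist the data pipeline configuration, including the RowProcessor, as an XML config file.
cm.save(new File("columnar-rowprocessor-config.xml"), true);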
Now the ConfigurationManager contains all the configuration necessary to rebuild the DataSource we used to build the model. However all we want is the RowProcessor instance, as the JsonDataSource itself isn't particularly useful at inference time.
RowProcessor<Label> newRowProcessor = (RowProcessor<Label>) cm.lookup("rowprocessor-1");
This row processor has the original regexes inside it rather than the concretely expanded FieldProcessors bound to each field that matched the regex, so first we need to expand the row processor with the headers from the original DataSource (or our inference time data). Then we'll pass it another row and look at the Example produced to check that everything is working. As this is a test time example we don't have a ground truth output, so we pass in false for the boolean outputRequired argument to RowProcessor.generateExample.
Map<String,String> newRow = Map.of("id","21","timestamp","03/11/2020 16:07","height","1.75","description","brown leather trenchcoat, grey hair, grey goatee","transport","police-box","extra-a","0.81754","extra-b","2.56158","extra-c","-1.21636");
var headers = Collections.unmodifiableList(new ArrayList<>(newRow.keySet()));
var row = new ColumnarIterator.Row(21,headers,newRow);
newRowProcessor.expandRegexMapping(headers);
Example<Label> testExample = newRowProcessor.generateExample(row,false).get();
printExample(testExample);
We can see the metadata and features have been extracted as before. We didn't supply an "example-weight" field, so the weight is set to the default value of 1.0. As there was no disposition field, we can see the output has been set to the sentinel unknown output, shown here as UNK. But we can ask our simple linear model what the disposition for this example should be:
var prediction = model.predict(testExample);
prediction.toString();
It appears that our model thinks this example is BAD, though personally I'm not so sure that's the right label. Either way, we managed to produce a test time example using only information encoded in our model, so our ETL pipeline is stored safely inside the model, ready whenever we need it.
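If we want the predicted label programmatically rather than via toString, we can pull it out of the Prediction (a small addition to the tutorial code):
// Extract the most likely Label and its score from the Prediction.
Label predictedLabel = prediction.getOutput();
System.out.println(predictedLabel.getLabel() + " with score " + predictedLabel.getScore());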
Conclusion¶
We've used Tribuo's columnar data infrastructure to process two different kinds of columnar input, csv files and json files. We saw how the central part of the columnar infrastructure, the RowProcessor, can be configured to extract different kinds of features, metadata and outputs, and how it is stored along with the rest of the training metadata in a trained model's provenance object. Finally we saw how to extract the RowProcessor from the model provenance and use it to generate an example at inference time to replicate the input processing.