Tribuo has a modular structure to allow minimal dependencies for any specific deployment. We describe the overall package structure below.
The top level project has core modules which define the API, data interactions, a math library, and common modules shared across prediction types.
- tribuo-core (package root org.tribuo): provides the main classes and interfaces.
- tribuo-data (package root org.tribuo.data): provides classes which deal with sampled data, columnar data, CSV files and text inputs. The user is encouraged to provide their own text processing infrastructure implementation, as the one here is fairly basic.
- tribuo-json (package root org.tribuo.json): provides functionality for loading from JSON data sources, and for stripping provenance out of a model.
- tribuo-math (package root org.tribuo.math): provides a linear algebra library for working with both sparse and dense vectors and matrices.
Multi-class classification is the task of assigning a single label from a set of labels to a test example. The classification module has several submodules:

|Submodule|Description|
|---|---|
|Core|Contains an Output subclass for use with multi-class classification tasks, evaluation code for checking model performance, and an implementation of AdaBoost.SAMME. It also contains simple baseline classifiers.|
|DecisionTree|An implementation of CART decision trees.|
|Experiments|A set of main functions for training & testing models on any supported dataset. This submodule depends on all the classifiers and allows easy comparison between them. It should not be imported into other projects since it is intended purely for development and testing.|
|Explanations|An implementation of LIME for classification tasks. If you use the columnar data loader, LIME can extract more information about the feature domain and provide better explanations.|
|LibLinear|A wrapper around the LibLinear-java library. This provides linear SVMs and other L1- or L2-regularised linear classifiers.|
|LibSVM|A wrapper around the Java version of LibSVM. This provides linear & kernel SVMs with sigmoid, Gaussian and polynomial kernels.|
|Multinomial Naive Bayes|An implementation of a multinomial naive Bayes classifier. Since it aims to store a compact in-memory representation of the model, it only keeps track of weights for observed feature/class pairs.|
|SGD|An implementation of stochastic gradient descent based classifiers. It includes a linear package for logistic regression and linear-SVM (using log and hinge losses, respectively), a kernel package for training a kernel-SVM using the Pegasos algorithm, and a crf package for training a linear-chain CRF. These implementations depend upon the stochastic gradient optimisers in the main math module. The linear and crf packages can use any of the provided gradient optimisers, which enforce various different kinds of regularisation or convergence criteria. This is the preferred package for linear classification and for sequence classification due to the speed and scalability of the SGD approach.|
|XGBoost|A wrapper around the XGBoost Java API. XGBoost requires a C library accessed via JNI. XGBoost is a scalable implementation of gradient boosted trees.|
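The space saving described for the multinomial naive Bayes model can be illustrated with a short sketch: only feature/class pairs that were actually observed get an entry, so the in-memory map stays sparse rather than allocating a full |classes| x |features| table. The class and method names below are illustrative, not Tribuo's API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sparse storage of per-class feature counts: absent entries are implicitly zero.
public final class SparseNaiveBayesCounts {
    // classLabel -> (featureName -> count)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void observe(String label, List<String> features) {
        Map<String, Integer> row = counts.computeIfAbsent(label, k -> new HashMap<>());
        for (String f : features) {
            row.merge(f, 1, Integer::sum);
        }
    }

    public int count(String label, String feature) {
        return counts.getOrDefault(label, Map.of()).getOrDefault(feature, 0);
    }

    // Number of stored pairs; grows with observations, not with the full cross product.
    public int storedPairs() {
        return counts.values().stream().mapToInt(Map::size).sum();
    }

    public static void main(String[] args) {
        SparseNaiveBayesCounts m = new SparseNaiveBayesCounts();
        m.observe("spam", List.of("win", "prize", "win"));
        m.observe("ham", List.of("meeting"));
        System.out.println(m.count("spam", "win"));  // 2
        System.out.println(m.count("ham", "win"));   // 0 (never observed, never stored)
        System.out.println(m.storedPairs());         // 3
    }
}
```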
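To make the SGD approach concrete, here is a minimal plain-Java sketch of logistic regression trained by stochastic gradient descent with the log loss, using a fixed learning rate and no regularisation. This illustrates the technique only; it is not Tribuo's API or its actual optimiser implementation.

```java
// Logistic regression trained one example at a time by SGD.
public final class LogisticSgd {
    final double[] w;  // feature weights
    double b;          // bias
    final double lr;   // fixed learning rate (illustrative choice)

    LogisticSgd(int dims, double lr) {
        this.w = new double[dims];
        this.lr = lr;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    double predictProb(double[] x) {
        double z = b;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        return sigmoid(z);
    }

    // One SGD step on a single example with label y in {0, 1}.
    void step(double[] x, int y) {
        double err = predictProb(x) - y;  // gradient of the log loss w.r.t. the logit
        for (int i = 0; i < w.length; i++) {
            w[i] -= lr * err * x[i];
        }
        b -= lr * err;
    }

    public static void main(String[] args) {
        LogisticSgd model = new LogisticSgd(1, 0.5);
        // Linearly separable toy data: positive class when x > 0.
        double[][] xs = {{-2}, {-1}, {1}, {2}};
        int[] ys = {0, 0, 1, 1};
        for (int epoch = 0; epoch < 200; epoch++) {
            for (int i = 0; i < xs.length; i++) {
                model.step(xs[i], ys[i]);
            }
        }
        System.out.println(model.predictProb(new double[]{2}));   // close to 1
        System.out.println(model.predictProb(new double[]{-2}));  // close to 0
    }
}
```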
Multi-label classification is the task of predicting a set of labels for a test example rather than just a single label. This package provides an Output subclass for multi-label prediction, evaluation code for checking the performance of a multi-label model, and a basic implementation of independent binary predictions. The multi-label support is found in the tribuo-multilabel-core artifact, in the org.tribuo.multilabel package.
The independent binary predictor breaks each multi-label prediction into n binary predictions, one for each possible label. To achieve this, the supplied trainer takes a classification trainer and uses it to build n models, one per label, which are then run in sequence on a test example to produce the final multi-label output.
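The decomposition above can be sketched in a few lines of plain Java. The per-label binary "models" here are stand-in predicates rather than trained Tribuo models; the point is only to show how n binary decisions are run in sequence and their positive outcomes collected into the multi-label prediction.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// One binary decision per label; the final prediction is the set of labels that fire.
public final class IndependentBinary {
    public static Set<String> predict(List<String> labels,
                                      List<Predicate<double[]>> binaryModels,
                                      double[] example) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i < labels.size(); i++) {
            if (binaryModels.get(i).test(example)) {
                out.add(labels.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> labels = List.of("sport", "politics", "tech");
        // Stand-in decision rules, one per label (a real system trains one model per label).
        List<Predicate<double[]>> models = List.of(
                x -> x[0] > 0.5,   // "sport" model
                x -> x[1] > 0.5,   // "politics" model
                x -> x[2] > 0.5);  // "tech" model
        Set<String> pred = predict(labels, models, new double[]{0.9, 0.1, 0.8});
        System.out.println(pred);  // contains "sport" and "tech" (iteration order unspecified)
    }
}
```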
Regression is the task of predicting real-valued outputs for a test example. This package provides several submodules:

|Submodule|Description|
|---|---|
|Core|Contains an Output subclass for use with regression data, as well as evaluation code for checking model performance using standard regression metrics (R^2, explained variance, RMSE, and mean absolute error). The module also contains simple baseline regressions.|
|LibLinear|A wrapper around the LibLinear-java library. This provides linear SVMs and other L1- or L2-regularised linear regressions.|
|LibSVM|A wrapper around the Java version of LibSVM. This provides linear & kernel SVRs with sigmoid, Gaussian and polynomial kernels.|
|Regression Trees|An implementation of two types of CART regression trees. The first type builds a separate tree per output dimension, while the second type builds a single tree for all outputs.|
|SGD|An implementation of stochastic gradient descent for linear regression. It uses the main math module's set of gradient optimisers, which allow for various regularisation and descent algorithms.|
|Sparse Linear Models|An implementation of sparse linear models. It includes a co-ordinate descent implementation of ElasticNet, a LARS implementation, a LASSO implementation using LARS, and a couple of sequential forward selection algorithms.|
|XGBoost|A wrapper around the XGBoost Java API. XGBoost requires a C library accessed via JNI.|
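For reference, the standard regression metrics mentioned above are straightforward to compute. This is a minimal plain-Java sketch of RMSE, mean absolute error, and R^2, not Tribuo's evaluator API:

```java
// Standard regression metrics over parallel arrays of ground truth and predictions.
public final class RegressionMetrics {
    static double rmse(double[] truth, double[] pred) {
        double s = 0;
        for (int i = 0; i < truth.length; i++) {
            double d = truth[i] - pred[i];
            s += d * d;
        }
        return Math.sqrt(s / truth.length);
    }

    static double mae(double[] truth, double[] pred) {
        double s = 0;
        for (int i = 0; i < truth.length; i++) {
            s += Math.abs(truth[i] - pred[i]);
        }
        return s / truth.length;
    }

    // R^2 = 1 - (residual sum of squares / total sum of squares).
    static double r2(double[] truth, double[] pred) {
        double mean = 0;
        for (double t : truth) {
            mean += t;
        }
        mean /= truth.length;
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < truth.length; i++) {
            ssRes += (truth[i] - pred[i]) * (truth[i] - pred[i]);
            ssTot += (truth[i] - mean) * (truth[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        double[] truth = {1, 2, 3, 4};
        double[] pred  = {1.1, 1.9, 3.2, 3.8};
        System.out.printf("RMSE=%.3f MAE=%.3f R^2=%.3f%n",
                rmse(truth, pred), mae(truth, pred), r2(truth, pred));
    }
}
```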
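The sparsity in LASSO/ElasticNet-style models comes from the soft-thresholding operator applied in each coordinate descent update: weights are shrunk toward zero, and those that cross zero are set exactly to zero. A hedged plain-Java sketch of the operator (not Tribuo's implementation):

```java
// Soft-thresholding: S(z, gamma) = sign(z) * max(|z| - gamma, 0).
public final class SoftThreshold {
    static double apply(double z, double gamma) {
        if (z > gamma) {
            return z - gamma;
        }
        if (z < -gamma) {
            return z + gamma;
        }
        return 0.0;  // weight is zeroed out exactly, which is what produces sparsity
    }

    public static void main(String[] args) {
        System.out.println(apply(3.0, 1.0));   // 2.0
        System.out.println(apply(-3.0, 1.0));  // -2.0
        System.out.println(apply(0.5, 1.0));   // 0.0
    }
}
```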
Clustering is the task of grouping input data. The clustering system implemented is single membership – each datapoint is assigned to one and only one cluster. This package provides two submodules:

|Submodule|Description|
|---|---|
|Core|Contains the Output subclass for use with clustering data, as well as the evaluation code for measuring clustering performance.|
|K-Means|An implementation of K-Means using the Java 8 Stream API for parallelisation.|
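The parallelisable heart of a K-Means implementation is the assignment step, which maps each point to its nearest centroid independently of the others. This plain-Java sketch shows how the Stream API can parallelise that step; it is illustrative, not Tribuo's code.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public final class KMeansAssign {
    // Squared Euclidean distance (the square root is unnecessary for argmin).
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Returns, for each point, the index of its nearest centroid.
    static int[] assign(double[][] points, double[][] centroids) {
        return Arrays.stream(points)
                .parallel()  // assignment of each point is independent
                .mapToInt(p -> IntStream.range(0, centroids.length)
                        .boxed()
                        .min((i, j) -> Double.compare(dist2(p, centroids[i]),
                                                      dist2(p, centroids[j])))
                        .orElseThrow())
                .toArray();
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0.2, 0.1}, {5, 5}, {4.9, 5.2}};
        double[][] centroids = {{0, 0}, {5, 5}};
        System.out.println(Arrays.toString(assign(points, centroids)));  // [0, 0, 1, 1]
    }
}
```

A full K-Means loop alternates this step with recomputing each centroid as the mean of its assigned points until the assignments stop changing.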
Anomaly detection is the task of finding outliers or anomalies at prediction time using a model trained on non-anomalous data. This package provides two submodules:

|Submodule|Description|
|---|---|
|Core|Contains the Output subclass for use with anomaly detection data.|
|LibSVM|A wrapper around the Java version of LibSVM, which provides a one-class SVM.|
The common module shares code across multiple prediction types. It provides the base support for LibLinear, LibSVM, nearest neighbour, tree, and XGBoost models. The nearest neighbour submodule is standalone; the rest of the submodules require the prediction-specific implementation modules.
Tribuo supports loading a number of third party models which were trained outside the system (even in other programming languages) and scoring them from Java using Tribuo’s infrastructure. Currently, we support loading ONNX, TensorFlow and XGBoost models.
Support for ONNX models is provided by the tribuo-onnx artifact, in the org.tribuo.interop.onnx package.
Tribuo includes experimental support for TensorFlow 1.14 in the tribuo-tensorflow artifact, in the org.tribuo.interop.tensorflow package. Due to a lack of flexibility in TensorFlow 1.14's Java API, models still need to be specified in Python and have their graph definitions written out as a protobuf; the Java code accepts this protobuf and trains a model that can be used purely from Java. The module includes a Java serialisation system so that all TensorFlow models can be serialised and deserialised in the same way as other Tribuo models.
By default, TensorFlow models run on the GPU if one is available. This support remains experimental while the TensorFlow JVM SIG rewrites the TensorFlow Java API. We participate in that SIG, and upcoming releases from the group will include full Java support for training models, without the need to define the model in Python first.
Tribuo demonstrates the TensorFlow interop by including an example config file, a Python model-generation script, and a protobuf for an MNIST model. In addition to the libraries gathered by the Tribuo TensorFlow jar, you must include libtensorflow_jni.so and libtensorflow_framework.so on your java.library.path.