
Package org.tribuo.transform

Provides infrastructure for applying transformations to a Dataset.


Package org.tribuo.transform Description


This package provides the infrastructure for feature transformations. The workflow is to first build a TransformationMap, which specifies the Transformations and the order in which they are applied to the specified Features. Fitting it to a Dataset produces a TransformerMap containing a set of fitted Transformers. These can be used to apply the transformation to any other Dataset (e.g., to apply the same transformation to training and test sets), or to transform data as it is streamed through at prediction time.
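
The fit-then-apply pattern described above can be sketched in plain Java. This is only a model of the idea, not Tribuo code: the class FitThenApplySketch and the methods fitMinMax and apply are hypothetical names, and a single feature's values stand in for a Dataset. The fitting step plays the role of turning a Transformation into a fitted Transformer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

/** Plain-Java model of the fit-then-apply workflow; none of these names are Tribuo API. */
final class FitThenApplySketch {

    /** Fits a rescaling into [0, 1] from the observed values and returns the fitted function. */
    static DoubleUnaryOperator fitMinMax(List<Double> observed) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : observed) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        final double lo = min;
        final double range = max - min;
        // A constant feature has zero range, so map it to 0 rather than divide by zero.
        return v -> range == 0.0 ? 0.0 : (v - lo) / range;
    }

    /** Applies an already fitted function to any list of values. */
    static List<Double> apply(DoubleUnaryOperator fitted, List<Double> values) {
        List<Double> out = new ArrayList<>();
        for (double v : values) {
            out.add(fitted.applyAsDouble(v));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> train = List.of(2.0, 4.0, 6.0);
        List<Double> test = List.of(3.0, 8.0);
        // The statistics (min and max) come from the training data only.
        DoubleUnaryOperator fitted = fitMinMax(train);
        System.out.println(apply(fitted, train)); // [0.0, 0.5, 1.0]
        System.out.println(apply(fitted, test));  // [0.25, 1.5]
    }
}
```

The second output shows why the fitted object is kept separate from its description: the test values are rescaled with the training statistics, so they may fall outside the training range.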

It also provides a TransformTrainer, which accepts a TransformationMap and an inner Trainer and produces a TransformedModel that automatically transforms its input data at prediction time.
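
The wrapping behaviour can be illustrated with a small plain-Java model. It is a sketch under stated assumptions rather than the TransformTrainer implementation: TransformingTrainerSketch, fitCentering, trainToyModel and train are hypothetical names, the transformation is a simple mean centring, and the inner "model" is a toy function from one number to another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleUnaryOperator;
import java.util.function.Function;

/** Plain-Java model of a trainer that wraps an inner trainer; not Tribuo API. */
final class TransformingTrainerSketch {

    /** Fits a centring transformation that subtracts the mean of the training inputs. */
    static DoubleUnaryOperator fitCentering(List<Double> inputs) {
        double sum = 0.0;
        for (double v : inputs) {
            sum += v;
        }
        final double mean = sum / inputs.size();
        return v -> v - mean;
    }

    /** Toy inner trainer: learns the largest absolute input and predicts input / that value. */
    static DoubleUnaryOperator trainToyModel(List<Double> inputs) {
        double maxAbs = 0.0;
        for (double v : inputs) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        final double scale = maxAbs == 0.0 ? 1.0 : maxAbs;
        return v -> v / scale;
    }

    /** Fits the transformation, trains the inner model on transformed data, and bundles both. */
    static DoubleUnaryOperator train(List<Double> inputs,
                                     Function<List<Double>, DoubleUnaryOperator> innerTrainer) {
        DoubleUnaryOperator fitted = fitCentering(inputs);
        List<Double> transformed = new ArrayList<>();
        for (double v : inputs) {
            transformed.add(fitted.applyAsDouble(v));
        }
        DoubleUnaryOperator innerModel = innerTrainer.apply(transformed);
        // The returned model applies the fitted transformation before the inner model.
        return fitted.andThen(innerModel);
    }

    public static void main(String[] args) {
        DoubleUnaryOperator model =
                train(List.of(8.0, 10.0, 12.0), TransformingTrainerSketch::trainToyModel);
        // Callers pass raw, untransformed inputs at prediction time.
        System.out.println(model.applyAsDouble(11.0)); // 0.5
        System.out.println(model.applyAsDouble(6.0));  // -2.0
    }
}
```

Bundling the fitted transformation with the model means callers cannot forget to transform prediction-time inputs, or transform them with different statistics from those used in training.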

Transformations don't produce new Features; they only modify the values of existing ones. When doing so, they can be instructed to treat Features that are absent due to sparsity either as zero or as not existing at all. Independently, zero-valued Features can be added explicitly by densifying the dataset before the transformation is fit or before it is applied. Once they exist, these Features can be altered by Transformers and are visible to Transformations which are being fit.

The transformation fitting methods have two parameters which alter their behaviour: includeImplicitZeroFeatures and densify. includeImplicitZeroFeatures controls whether the transformation incorporates the implicit zero-valued features (i.e., those absent from an example but present in the dataset's FeatureMap) when computing the transformation statistics. This is important when working with, e.g., IDFTransformation, as it allows correct computation of the inverse document frequency, but it can be detrimental to features which are one-hot encodings of categoricals (as they have many more implicit zeros). densify controls whether the example or dataset has its implicit zero-valued features converted into explicit zero-valued features (i.e., it makes a sparse example into a dense one containing an explicit value for every feature known to the dataset) before the transformation is applied. Because transformations are only applied to feature values which are present, densifying is what allows the zeros to be transformed.

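These parameters interact to form four possibilities:

- Both false: transformations are fit only on explicit feature values and applied only to explicit feature values.
- Both true: transformations are fit on explicit values and implicit zeros, and the implicit zeros are converted into explicit zeros and transformed.
- includeImplicitZeroFeatures true, densify false: the implicit zeros are used to fit the transformation but are not modified when it is applied.
- includeImplicitZeroFeatures false, densify true: the implicit zeros are not used to fit the transformation, but are converted into explicit zeros and transformed.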
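
The four combinations of the two flags can be walked through with a small plain-Java model of sparse examples. This is an illustrative sketch, not Tribuo code: ImplicitZeroSketch, fit and apply are hypothetical names, a sparse example is modelled as a Map from feature name to value, and the transformation is assumed to be a min-max rescaling of a single feature "f".

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.DoubleUnaryOperator;

/** Plain-Java model of how the two flags interact; none of these names are Tribuo API. */
final class ImplicitZeroSketch {

    /** Fits a min-max rescaling for one feature; assumes at least one value is measured. */
    static DoubleUnaryOperator fit(List<Map<String, Double>> examples, String feature,
                                   boolean includeImplicitZeroFeatures) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (Map<String, Double> example : examples) {
            Double value = example.get(feature);
            if (value == null) {
                if (!includeImplicitZeroFeatures) {
                    continue; // absent values are not measured
                }
                value = 0.0; // absent values are measured as zero
            }
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        final double lo = min;
        final double range = max - min;
        return v -> range == 0.0 ? 0.0 : (v - lo) / range;
    }

    /** Applies a fitted function to one sparse example, optionally densifying it first. */
    static Map<String, Double> apply(Map<String, Double> example, Set<String> knownFeatures,
                                     String feature, DoubleUnaryOperator fitted, boolean densify) {
        Map<String, Double> out = new TreeMap<>(example);
        if (densify) {
            for (String name : knownFeatures) {
                out.putIfAbsent(name, 0.0); // implicit zeros become explicit
            }
        }
        // Only values that are explicitly present are transformed.
        out.computeIfPresent(feature, (name, v) -> fitted.applyAsDouble(v));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> absent = Map.of(); // an example in which "f" is implicit
        List<Map<String, Double>> train =
                List.of(Map.of("f", 2.0), Map.of("f", 4.0), absent, absent);
        for (boolean include : new boolean[] {false, true}) {
            for (boolean densify : new boolean[] {false, true}) {
                DoubleUnaryOperator fitted = fit(train, "f", include);
                System.out.println("include=" + include + " densify=" + densify + " -> "
                        + apply(absent, Set.of("f"), "f", fitted, densify) + " "
                        + apply(Map.of("f", 2.0), Set.of("f"), "f", fitted, densify));
            }
        }
    }
}
```

In this model the explicit value 2.0 maps to 0.0 when implicit zeros are ignored during fitting and to 0.5 when they are included, while the absent feature stays absent unless densify is set, in which case it becomes -1.0 or 0.0 respectively.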

One further option is to call MutableDataset.densify() before passing the data to TransformTrainer.train(Dataset<T>, Map<String, Provenance>), which is equivalent to setting includeImplicitZeroFeatures to true and densify to true. To sum up, in the context of transformations, includeImplicitZeroFeatures determines whether (implicit) zero-valued features are measured, and densify determines whether they can be altered.

Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.