Package org.tribuo

Class MutableDataset<T extends Output<T>>

java.lang.Object
org.tribuo.Dataset<T>
org.tribuo.MutableDataset<T>
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>, ProtoSerializable<org.tribuo.protos.core.DatasetProto>

public class MutableDataset<T extends Output<T>> extends Dataset<T>
A MutableDataset is a Dataset backed by a MutableFeatureMap that grows as examples are added. Whenever an Example is added, the dataset observes its output and each of its features, updating the statistics held in the FeatureMap and OutputInfo.
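
The "observe on add" behaviour described above can be pictured with a small standard-library model. This is a sketch only: `ObservingDataset` and its methods are hypothetical names invented for illustration, examples are reduced to a label plus a name-to-value map, and none of it is Tribuo's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a dataset that tracks statistics as examples arrive.
// It is not Tribuo's MutableDataset; it only mirrors the "observe on add" idea.
class ObservingDataset {
    // Feature name -> number of examples in which the feature was observed.
    private final Map<String, Integer> featureCounts = new HashMap<>();
    // Output label -> number of examples carrying that label.
    private final Map<String, Integer> outputCounts = new HashMap<>();
    private int size = 0;

    // Adds one example and updates the feature and output statistics.
    void add(String label, Map<String, Double> features) {
        outputCounts.merge(label, 1, Integer::sum);
        for (String name : features.keySet()) {
            featureCounts.merge(name, 1, Integer::sum);
        }
        size++;
    }

    int size() { return size; }
    int numFeatures() { return featureCounts.size(); }
    int featureCount(String name) { return featureCounts.getOrDefault(name, 0); }
    int outputCount(String label) { return outputCounts.getOrDefault(label, 0); }
}
```

In this model the set of known features grows with each new feature name seen, which is the sense in which the feature map "grows over time".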
  • Field Details

    • CURRENT_VERSION

      public static final int CURRENT_VERSION
      Protobuf serialization version.
    • outputMap

      protected final MutableOutputInfo<T extends Output<T>> outputMap
      Information about the outputs in this dataset.
    • featureMap

      protected final MutableFeatureMap featureMap
      A map from feature names to feature info objects.
    • transformProvenances

      protected final List<com.oracle.labs.mlrg.olcut.provenance.ObjectProvenance> transformProvenances
      The provenances of the transformations applied to this dataset.
    • dense

      protected boolean dense
      Denotes whether this dataset is dense, i.e., contains no implicit zeros.
  • Constructor Details

    • MutableDataset

      public MutableDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
      Creates an empty dataset.
      Parameters:
      sourceProvenance - A description of the input data, including preprocessing steps.
      outputFactory - The output factory.
    • MutableDataset

      public MutableDataset(Iterable<Example<T>> dataSource, DataProvenance provenance, OutputFactory<T> outputFactory)
      Creates a dataset from a data source. This constructor creates the output and feature maps needed for training and evaluating classifiers.
      Parameters:
      dataSource - The examples.
      provenance - A description of the input data, including preprocessing steps.
      outputFactory - The output factory.
    • MutableDataset

      public MutableDataset(DataSource<T> dataSource)
      Creates a dataset from a data source. This constructor creates the output and feature maps needed for training and evaluating classifiers.
      Parameters:
      dataSource - The examples.
  • Method Details

    • deserializeFromProto

      public static MutableDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
      Deserialization factory.
      Parameters:
      version - The serialized object version.
      className - The class name.
      message - The serialized data.
      Returns:
      The deserialized object.
      Throws:
      com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
    • add

      public void add(Example<T> ex)
      Adds an example to the dataset, which observes the output and each feature value.

      It also canonicalises the reference to each feature's name (i.e., replacing the reference to a feature's name with the canonical one stored in this Dataset's VariableInfo). This greatly reduces the memory footprint.

      Parameters:
      ex - The example to add.
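
Canonicalising name references amounts to keeping one shared String instance per distinct feature name, so that equal names in different examples point at the same object. The following is a minimal sketch of that idea using only the standard library; `NameCanonicaliser` is a hypothetical helper and does not reflect how Tribuo stores names internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating name canonicalisation; not part of Tribuo.
class NameCanonicaliser {
    // Maps each distinct name to the first String instance seen with that value.
    private final Map<String, String> canonical = new HashMap<>();

    // Returns the first-seen instance equal to name, storing name itself if it is new.
    String canonicalise(String name) {
        String existing = canonical.putIfAbsent(name, name);
        return existing == null ? name : existing;
    }

    int distinctNames() { return canonical.size(); }
}
```

Once every example holds the canonical reference, the duplicate String objects become unreachable and can be garbage collected, which is where the memory saving comes from.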
    • addAll

      public void addAll(Collection<? extends Example<T>> collection)
      Adds all the Examples in the supplied collection to this dataset.
      Parameters:
      collection - The collection of Examples.
    • setWeights

      public void setWeights(Map<T,Float> weights)
      Sets the weight of each example according to its output.
      Parameters:
      weights - A map of Outputs to float weights.
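
As a rough illustration of weighting by output, the sketch below looks up each example's output in the weight map and stores the result on the example. `WeightedExample` and `WeightSetter` are hypothetical types, outputs are plain strings, and leaving the weight untouched when an output is missing from the map is an assumption of this sketch; the description above does not say what the real method does in that case.

```java
import java.util.List;
import java.util.Map;

// Hypothetical example type with a string output and a mutable weight.
class WeightedExample {
    final String output;
    float weight = 1.0f;

    WeightedExample(String output) { this.output = output; }
}

// Hypothetical helper; not Tribuo's implementation of setWeights.
class WeightSetter {
    // Sets each example's weight from the map, keyed by the example's output.
    // Assumption: examples whose output is absent from the map keep their weight.
    static void setWeights(List<WeightedExample> examples, Map<String, Float> weights) {
        for (WeightedExample ex : examples) {
            Float w = weights.get(ex.output);
            if (w != null) {
                ex.weight = w;
            }
        }
    }
}
```

A typical use is to up-weight a rare class, for example mapping the minority label to a weight above 1.0.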
    • getOutputs

      public Set<T> getOutputs()
      Gets the set of possible outputs in this dataset.

      For regression, returns a Set containing the dimension names.

      Specified by:
      getOutputs in class Dataset<T extends Output<T>>
      Returns:
      The set of possible outputs.
    • getFeatureIDMap

      public ImmutableFeatureMap getFeatureIDMap()
      Description copied from class: Dataset
      Returns or generates an ImmutableFeatureMap.
      Specified by:
      getFeatureIDMap in class Dataset<T extends Output<T>>
      Returns:
      An immutable feature map with id numbers.
    • getFeatureMap

      public MutableFeatureMap getFeatureMap()
      Description copied from class: Dataset
      Returns this dataset's FeatureMap.
      Specified by:
      getFeatureMap in class Dataset<T extends Output<T>>
      Returns:
      The feature map from this dataset.
    • getOutputIDInfo

      public ImmutableOutputInfo<T> getOutputIDInfo()
      Description copied from class: Dataset
      Returns or generates an ImmutableOutputInfo.
      Specified by:
      getOutputIDInfo in class Dataset<T extends Output<T>>
      Returns:
      An immutable output info.
    • getOutputInfo

      public OutputInfo<T> getOutputInfo()
      Description copied from class: Dataset
      Returns this dataset's OutputInfo.
      Specified by:
      getOutputInfo in class Dataset<T extends Output<T>>
      Returns:
      The output info.
    • toString

      public String toString()
      Overrides:
      toString in class Dataset<T extends Output<T>>
    • isDense

      public boolean isDense()
      Returns whether the dataset is dense (i.e., every feature in the domain has a value in each example).
      Returns:
      True if the dataset is dense.
    • transform

      public void transform(TransformerMap transformerMap)
      Applies all the transformations from the TransformerMap to this dataset.
      Parameters:
      transformerMap - The transformations to apply.
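
Conceptually, applying a set of transformations means rewriting feature values in place, feature by feature. The sketch below models that with a map from feature name to a `DoubleUnaryOperator`. This is a deliberate simplification: `InPlaceTransform` is hypothetical, the real `TransformerMap` has its own structure, and any bookkeeping the real method performs (such as recording provenance) is omitted here.

```java
import java.util.List;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hypothetical per-feature transformation helper; examples are name -> value maps.
class InPlaceTransform {
    // Rewrites each value whose feature name has a registered transformer.
    // Features without a transformer are left unchanged.
    static void transform(List<Map<String, Double>> examples,
                          Map<String, DoubleUnaryOperator> transformers) {
        for (Map<String, Double> ex : examples) {
            for (Map.Entry<String, Double> entry : ex.entrySet()) {
                DoubleUnaryOperator op = transformers.get(entry.getKey());
                if (op != null) {
                    // setValue is not a structural change, so it is safe during iteration.
                    entry.setValue(op.applyAsDouble(entry.getValue()));
                }
            }
        }
    }
}
```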
    • densify

      public void densify()
      Iterates through the examples, converting implicit zeros into explicit zeros.
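
To make "implicit zeros" concrete: in a sparse example, a feature that is not mentioned is treated as zero. Densifying writes those zeros out explicitly, after which the dense check described for isDense holds. The following standard-library sketch assumes examples are name-to-value maps and the feature domain is a set of names; `SparseFill` is a hypothetical helper, not Tribuo code.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sparse-example helper; examples are modelled as name -> value maps.
class SparseFill {
    // True if every feature in the domain has an explicit value in every example.
    static boolean isDense(Set<String> domain, List<Map<String, Double>> examples) {
        for (Map<String, Double> ex : examples) {
            if (!ex.keySet().containsAll(domain)) {
                return false;
            }
        }
        return true;
    }

    // Writes an explicit 0.0 for each domain feature an example does not mention.
    static void densify(Set<String> domain, List<Map<String, Double>> examples) {
        for (Map<String, Double> ex : examples) {
            for (String name : domain) {
                ex.putIfAbsent(name, 0.0);
            }
        }
    }
}
```

The trade-off is memory: a densified example stores one value per feature in the domain, whether or not it was originally present.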
    • clear

      public void clear()
      Clears all the examples out of this dataset, and flushes the FeatureMap, OutputInfo, and transform provenances.
    • regenerateOutputInfo

      public void regenerateOutputInfo()
      Rebuilds the output info by inspecting each example.
    • regenerateFeatureInfo

      public void regenerateFeatureInfo()
      Rebuilds the feature info by inspecting each example.
    • getProvenance

      public DatasetProvenance getProvenance()
    • serialize

      public org.tribuo.protos.core.DatasetProto serialize()
      Description copied from interface: ProtoSerializable
      Serializes this object to a protobuf.
      Returns:
      The protobuf.
    • createDeepCopy

      public static <T extends Output<T>> MutableDataset<T> createDeepCopy(Dataset<T> other)
      Creates a deep copy of the supplied Dataset which is mutable.

      Copies the individual examples using their copy method.

      Type Parameters:
      T - The output type.
      Parameters:
      other - The dataset to copy.
      Returns:
      A mutable deep copy of the dataset.
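
The distinction that matters for a deep copy is that each example is itself copied, so later changes to the copy do not leak back into the source. A minimal sketch of that property, using a hypothetical `CopyableExample` type with its own copy method (the real Example type and its copy semantics are not reproduced here):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example type holding an output label and mutable feature values.
class CopyableExample {
    final String output;
    final double[] values;

    CopyableExample(String output, double[] values) {
        this.output = output;
        this.values = values;
    }

    // Returns an independent copy, cloning the value array rather than sharing it.
    CopyableExample copy() {
        return new CopyableExample(output, values.clone());
    }
}

// Hypothetical helper; not Tribuo's createDeepCopy.
class DeepCopy {
    // Builds a new list whose elements are copies of the source examples.
    static List<CopyableExample> deepCopy(List<CopyableExample> source) {
        List<CopyableExample> out = new ArrayList<>(source.size());
        for (CopyableExample ex : source) {
            out.add(ex.copy());
        }
        return out;
    }
}
```

By contrast, `new ArrayList<>(source)` would copy only the list and share the example objects, so mutating an example through either list would affect both.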