Package org.tribuo

Class ImmutableDataset<T extends Output<T>>

java.lang.Object
org.tribuo.Dataset<T>
org.tribuo.ImmutableDataset<T>
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>, ProtoSerializable<org.tribuo.protos.core.DatasetProto>
Direct Known Subclasses:
DatasetView, MinimumCardinalityDataset, SelectedFeatureDataset

public class ImmutableDataset<T extends Output<T>> extends Dataset<T> implements Serializable
A Dataset that uses an ImmutableFeatureMap to store the feature information. Whenever an example is added to this dataset, any of its features that do not exist in the feature map are removed. The dataset is immutable after construction, unless the examples themselves are modified.

This class is mostly for performance optimisations inside the framework, and should not generally be used by external code.
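The feature filtering described above can be illustrated with a small, self-contained sketch. It is a simplified model built on plain JDK collections, not Tribuo's implementation: the class name FeatureFilterSketch, the method keepKnown, and the representation of an example as a map from feature name to value are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of "remove features that do not exist in the feature map".
public class FeatureFilterSketch {

    /** Returns a copy of the features, keeping only names present in the known set. */
    public static Map<String, Double> keepKnown(Map<String, Double> features, Set<String> known) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            if (known.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Double> example = new LinkedHashMap<>();
        example.put("height", 1.8);
        example.put("weight", 75.0);
        example.put("shoe_size", 44.0); // not in the fixed feature set, so it is removed
        System.out.println(keepKnown(example, Set.of("height", "weight")));
    }
}
```

The point of the sketch is that the set of known features is fixed up front, so an example can only ever lose features when it is added, never introduce new ones.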

  • Field Details

    • CURRENT_VERSION

      public static final int CURRENT_VERSION
      Protobuf serialization version.
    • outputIDInfo

      protected ImmutableOutputInfo<T> outputIDInfo
      Output information, and id numbers for outputs found in this dataset.
    • featureIDMap

      protected ImmutableFeatureMap featureIDMap
      A map from feature names to IDs for the features found in this dataset.
    • dropInvalidExamples

      protected final boolean dropInvalidExamples
      If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
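The two behaviours selected by dropInvalidExamples (throw, or log a warning and drop) can be sketched as follows. This is a simplified model, not Tribuo's code: the class DropInvalidSketch, the method collect, and in particular the isValid rule (an example is invalid if it has no features or contains a NaN value) are hypothetical and chosen only to make the example runnable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical model of the dropInvalidExamples flag.
public class DropInvalidSketch {
    private static final Logger logger = Logger.getLogger(DropInvalidSketch.class.getName());

    // Assumed validity rule for this sketch only: non-empty and no NaN values.
    static boolean isValid(Map<String, Double> features) {
        if (features.isEmpty()) {
            return false;
        }
        for (double v : features.values()) {
            if (Double.isNaN(v)) {
                return false;
            }
        }
        return true;
    }

    /** Keeps valid examples; invalid ones are either dropped with a warning or cause an exception. */
    public static List<Map<String, Double>> collect(List<Map<String, Double>> examples,
                                                    boolean dropInvalidExamples) {
        List<Map<String, Double>> kept = new ArrayList<>();
        for (Map<String, Double> ex : examples) {
            if (isValid(ex)) {
                kept.add(ex);
            } else if (dropInvalidExamples) {
                logger.warning("Dropping invalid example: " + ex);
            } else {
                throw new IllegalArgumentException("Invalid example: " + ex);
            }
        }
        return kept;
    }
}
```

With the flag set to true the caller receives a possibly shorter dataset and must check the log to learn that examples were lost; with it set to false a single bad example aborts construction.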
  • Constructor Details

    • ImmutableDataset

      protected ImmutableDataset(DataProvenance description, OutputFactory<T> outputFactory)
      If you call this, it is your job to set up outputIDInfo and featureIDMap, and to fill the dataset with examples.

      Note: Sets dropInvalidExamples to false.

      Parameters:
      description - A description of the input data (including preprocessing steps).
      outputFactory - The factory for this output type.
    • ImmutableDataset

      public ImmutableDataset(DataSource<T> dataSource, Model<T> model, boolean dropInvalidExamples)
      Creates a dataset from a data source. It copies the feature and output maps from the supplied model.
      Parameters:
      dataSource - The examples.
      model - A model to extract feature and output maps from.
      dropInvalidExamples - If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
    • ImmutableDataset

      public ImmutableDataset(DataSource<T> dataSource, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, boolean dropInvalidExamples)
      Creates a dataset from a data source. Creates immutable feature and output maps from the supplied ones.
      Parameters:
      dataSource - The examples.
      featureIDMap - The feature map.
      outputIDInfo - The output map.
      dropInvalidExamples - If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
    • ImmutableDataset

      public ImmutableDataset(Iterable<Example<T>> dataSource, DataProvenance description, OutputFactory<T> outputFactory, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, boolean dropInvalidExamples)
      Creates a dataset from a data source. Creates immutable feature and output maps from the supplied ones.
      Parameters:
      dataSource - The examples.
      description - A description of the input data (including preprocessing steps).
      outputFactory - The output factory.
      featureIDMap - The feature id map, used to remove unknown features.
      outputIDInfo - The output id map.
      dropInvalidExamples - If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
    • ImmutableDataset

      public ImmutableDataset(Iterable<Example<T>> dataSource, DataProvenance description, OutputFactory<T> outputFactory, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, boolean dropInvalidExamples)
      Creates a dataset from a data source.
      Parameters:
      dataSource - The examples.
      description - A description of the input data (including preprocessing steps).
      outputFactory - The factory for this output type.
      featureIDMap - The feature id map, used to remove unknown features.
      outputIDInfo - The output id map.
      dropInvalidExamples - If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
    • ImmutableDataset

      protected ImmutableDataset(DataProvenance description, OutputFactory<T> outputFactory, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
      This is dangerous, and should not be used unless you've overridden everything in ImmutableDataset.

      Note: Sets dropInvalidExamples to false.

      Parameters:
      description - A description of the data you're going to add to this dataset.
      outputFactory - The factory for this output type.
      featureIDMap - The feature id map, used to remove unknown features.
      outputIDInfo - The output id map.
    • ImmutableDataset

      protected ImmutableDataset(DataProvenance provenance, OutputFactory<T> factory, String tribuoVersion, ImmutableFeatureMap fmap, ImmutableOutputInfo<T> outputInfo, List<Example<T>> examples, boolean dropInvalidExamples)
      Deserialization constructor.
      Parameters:
      provenance - The source provenance.
      factory - The output factory.
      tribuoVersion - The Tribuo version.
      fmap - The feature id map.
      outputInfo - The output id info.
      examples - The examples.
      dropInvalidExamples - Should invalid examples be dropped when added?
  • Method Details

    • deserializeFromProto

      public static ImmutableDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
      Deserialization factory.
      Parameters:
      version - The serialized object version.
      className - The class name.
      message - The serialized data.
      Returns:
      The deserialized object.
      Throws:
      com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
    • add

      protected void add(Example<T> ex)
      Adds an Example to the dataset, which will remove features with unknown names.
      Parameters:
      ex - An Example to add to the dataset.
    • add

      protected void add(Example<T> ex, Merger merger)
      Adds an Example to the dataset, which will insert feature ids, remove unknown features and sort the examples by the feature ids (merging duplicate ids).
      Parameters:
      ex - The example to add.
      merger - The Merger to use.
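The steps named in this description (assign ids, remove unknown features, sort by id, merge duplicate ids) can be sketched with JDK types. This is an illustrative model only: the class MergeByIdSketch and the method canonicalise are hypothetical, a DoubleBinaryOperator stands in for the Merger, and a plain Map from name to id stands in for the feature id map.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

// Hypothetical model of adding an example with a merge function.
public class MergeByIdSketch {

    /**
     * Maps each feature name to its id, skips names without an id,
     * and combines values that land on the same id using the merger.
     * The TreeMap keeps the result ordered by id.
     */
    public static SortedMap<Integer, Double> canonicalise(List<Map.Entry<String, Double>> features,
                                                          Map<String, Integer> idMap,
                                                          DoubleBinaryOperator merger) {
        SortedMap<Integer, Double> byId = new TreeMap<>();
        for (Map.Entry<String, Double> f : features) {
            Integer id = idMap.get(f.getKey());
            if (id == null) {
                continue; // unknown feature, removed
            }
            byId.merge(id, f.getValue(), (a, b) -> merger.applyAsDouble(a, b));
        }
        return byId;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> features = List.of(
                Map.entry("b", 2.0), Map.entry("a", 1.0), Map.entry("b", 3.0), Map.entry("x", 9.0));
        // "x" has no id and is removed; the two "b" values are summed.
        System.out.println(canonicalise(features, Map.of("a", 0, "b", 1), Double::sum));
    }
}
```

The choice of merger matters when duplicates are possible: summing, taking the maximum, or keeping the first value each give a different feature value for the same input.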
    • getOutputs

      public Set<T> getOutputs()
      Description copied from class: Dataset
      Gets the set of outputs that occur in the examples in this dataset.
      Specified by:
      getOutputs in class Dataset<T extends Output<T>>
      Returns:
      the set of outputs that occur in the examples in this dataset.
    • getFeatureIDMap

      public ImmutableFeatureMap getFeatureIDMap()
      Description copied from class: Dataset
      Returns or generates an ImmutableFeatureMap.
      Specified by:
      getFeatureIDMap in class Dataset<T extends Output<T>>
      Returns:
      An immutable feature map with id numbers.
    • getFeatureMap

      public ImmutableFeatureMap getFeatureMap()
      Description copied from class: Dataset
      Returns this dataset's FeatureMap.
      Specified by:
      getFeatureMap in class Dataset<T extends Output<T>>
      Returns:
      The feature map from this dataset.
    • getOutputIDInfo

      public ImmutableOutputInfo<T> getOutputIDInfo()
      Description copied from class: Dataset
      Returns or generates an ImmutableOutputInfo.
      Specified by:
      getOutputIDInfo in class Dataset<T extends Output<T>>
      Returns:
      An immutable output info.
    • getOutputInfo

      public ImmutableOutputInfo<T> getOutputInfo()
      Description copied from class: Dataset
      Returns this dataset's OutputInfo.
      Specified by:
      getOutputInfo in class Dataset<T extends Output<T>>
      Returns:
      The output info.
    • getDropInvalidExamples

      public boolean getDropInvalidExamples()
      Returns true if this immutable dataset is configured to drop invalid examples, rather than throw an exception, when one is encountered.
      Returns:
      True if it drops invalid examples.
    • toString

      public String toString()
      Overrides:
      toString in class Dataset<T extends Output<T>>
    • getProvenance

      public DatasetProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>
    • serialize

      public org.tribuo.protos.core.DatasetProto serialize()
      Description copied from interface: ProtoSerializable
      Serializes this object to a protobuf.
      Specified by:
      serialize in interface ProtoSerializable<org.tribuo.protos.core.DatasetProto>
      Returns:
      The protobuf.
    • copyDataset

      public static <T extends Output<T>> ImmutableDataset<T> copyDataset(Dataset<T> dataset)
      Creates an immutable deep copy of the supplied dataset.
      Type Parameters:
      T - The type of output.
      Parameters:
      dataset - The dataset to copy.
      Returns:
      An immutable copy of the dataset.
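What "immutable deep copy" implies can be shown with a minimal sketch, which also illustrates the class-level caveat that the dataset is immutable unless the examples are modified. The class DeepCopySketch and the method deepCopy are hypothetical, and examples are modelled as plain maps rather than Tribuo Example objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an immutable deep copy of a list of examples.
public class DeepCopySketch {

    /**
     * Copies every example, then wraps the list so that examples cannot be
     * added or removed. The copied examples themselves remain mutable.
     */
    public static List<Map<String, Double>> deepCopy(List<Map<String, Double>> examples) {
        List<Map<String, Double>> copy = new ArrayList<>(examples.size());
        for (Map<String, Double> ex : examples) {
            copy.add(new LinkedHashMap<>(ex)); // new object per example, not a shared reference
        }
        return Collections.unmodifiableList(copy);
    }
}
```

Because each example is copied, later changes to the source dataset or its examples do not leak into the copy. A shallow copy would instead share the example objects with the source.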
    • copyDataset

      public static <T extends Output<T>> ImmutableDataset<T> copyDataset(Dataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
      Creates an immutable deep copy of the supplied dataset, using a different feature and output map.
      Type Parameters:
      T - The type of output.
      Parameters:
      dataset - The dataset to copy.
      featureIDMap - The new feature map to use. Removes features which are not found in this map.
      outputIDInfo - The new output info to use.
      Returns:
      An immutable copy of the dataset.
    • copyDataset

      public static <T extends Output<T>> ImmutableDataset<T> copyDataset(Dataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, Merger merger)
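      Creates an immutable deep copy of the supplied dataset, using a different feature and output map and the supplied merger to combine features that map to the same id.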
      Type Parameters:
      T - The type of output.
      Parameters:
      dataset - The dataset to copy.
      featureIDMap - The new feature map to use. Removes features which are not found in this map.
      outputIDInfo - The new output info to use.
      merger - The merge function to use to reduce features given new ids.
      Returns:
      An immutable copy of the dataset.
    • hashFeatureMap

      public static <T extends Output<T>> ImmutableDataset<T> hashFeatureMap(Dataset<T> dataset, Hasher hasher)
      Creates an immutable shallow copy of the supplied dataset, using the hasher to generate a HashedFeatureMap which transparently maps from the feature name to the hashed variant.
      Type Parameters:
      T - The type of output.
      Parameters:
      dataset - The dataset to copy.
      hasher - The hashing function to use.
      Returns:
      An immutable copy of the dataset.
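The idea of a feature map that transparently maps from the feature name to a hashed variant can be sketched as below. This is a simplified stand-in, not Tribuo's HashedFeatureMap or Hasher: the class HashedNameMapSketch, its methods, the use of a UnaryOperator as the hashing function, and the choice of SHA-256 are all assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical model of a feature map keyed by hashed feature names.
public class HashedNameMapSketch {
    private final UnaryOperator<String> hasher;
    private final Map<String, Integer> idsByHashedName = new HashMap<>();

    public HashedNameMapSketch(Iterable<String> featureNames, UnaryOperator<String> hasher) {
        this.hasher = hasher;
        for (String name : featureNames) {
            // Only the hashed name is stored; ids are assigned in insertion order.
            idsByHashedName.putIfAbsent(hasher.apply(name), idsByHashedName.size());
        }
    }

    /** Looks up a feature by its original name; the hashing happens inside the map. */
    public int getId(String name) {
        return idsByHashedName.getOrDefault(hasher.apply(name), -1);
    }

    /** True only if the plain, unhashed name is itself a stored key. */
    public boolean storesPlainName(String name) {
        return idsByHashedName.containsKey(name);
    }

    /** Example hashing function: lowercase hex SHA-256 of the UTF-8 bytes. */
    public static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(input.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

Callers keep using the original feature names, while the stored map contains only hashed names, so the map can be shared without exposing the feature vocabulary.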