Class ImmutableSequenceDataset<T extends Output<T>>

java.lang.Object
org.tribuo.sequence.SequenceDataset<T>
org.tribuo.sequence.ImmutableSequenceDataset<T>
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>
Direct Known Subclasses:
MinimumCardinalitySequenceDataset

public class ImmutableSequenceDataset<T extends Output<T>> extends SequenceDataset<T> implements Serializable
This is a SequenceDataset that stores its feature information in an ImmutableFeatureMap. Whenever an example is added to this dataset, any of its features that do not exist in the feature map are removed. The dataset is immutable after construction (unless the examples themselves are modified).
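The filtering behaviour described above can be illustrated with a minimal sketch in plain Java. It does not use the Tribuo API: the class FeatureFilterSketch, the method filterKnown, and the Map<String, Integer> standing in for an ImmutableFeatureMap are hypothetical names chosen for illustration only.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not part of Tribuo.
final class FeatureFilterSketch {
    private FeatureFilterSketch() {}

    /**
     * Keeps only the features whose names appear in the fixed domain,
     * and returns them as an unmodifiable map.
     * The domain (name -> id) plays the role of an immutable feature map.
     */
    static Map<String, Double> filterKnown(Map<String, Double> example,
                                           Map<String, Integer> domain) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : example.entrySet()) {
            if (domain.containsKey(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return Collections.unmodifiableMap(kept);
    }

    public static void main(String[] args) {
        Map<String, Integer> domain = Map.of("word=cat", 0, "word=dog", 1);
        Map<String, Double> example = new LinkedHashMap<>();
        example.put("word=cat", 1.0);
        example.put("word=fish", 1.0); // not in the domain, so it is dropped
        System.out.println(filterKnown(example, domain));
    }
}
```

The caveat in the description still applies to the sketch: the returned map is unmodifiable, but only because its values are themselves immutable. Mutable example objects could still be changed after construction.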
  • Field Details

    • CURRENT_VERSION

      public static final int CURRENT_VERSION
      Protobuf serialization version.
    • outputIDInfo

      protected ImmutableOutputInfo<T> outputIDInfo
      A map from labels to IDs for the labels found in this dataset.
    • featureIDMap

      protected ImmutableFeatureMap featureIDMap
      A map from feature names to IDs for the features found in this dataset.
  • Constructor Details

    • ImmutableSequenceDataset

      protected ImmutableSequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
      If you call this constructor, it is your job to set up outputIDInfo and featureIDMap.
      Parameters:
      sourceProvenance - A description of the dataset including preprocessing steps.
      outputFactory - The output factory.
    • ImmutableSequenceDataset

      public ImmutableSequenceDataset(SequenceDataSource<T> dataSource, SequenceModel<T> model)
      Creates a dataset from a data source, taking the output and feature domains from the supplied model.
      Parameters:
      dataSource - The input data.
      model - The model to use for the feature and output domains.
    • ImmutableSequenceDataset

      public ImmutableSequenceDataset(SequenceDataSource<T> dataSource, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo)
      Creates a dataset from a data source, using the specified output and feature domains.
      Parameters:
      dataSource - The input data.
      featureIDMap - The feature domain.
      outputIDInfo - The output domain.
    • ImmutableSequenceDataset

      public ImmutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, OutputFactory<T> outputFactory)
      Creates a dataset from a data source. This constructor creates the output and feature ID maps that are needed for training and evaluating classifiers.
      Parameters:
      dataSource - The input data.
      sourceProvenance - A description of the data.
      featureIDMap - The feature map, used to remove unknown features.
      outputIDInfo - The output map.
      outputFactory - The output factory.
    • ImmutableSequenceDataset

      public ImmutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, OutputFactory<T> outputFactory)
      Creates a dataset from a data source.
      Parameters:
      dataSource - The input data.
      sourceProvenance - A description of the data.
      featureIDMap - The feature id map, used to remove unknown features.
      outputIDInfo - The output id map.
      outputFactory - The output factory.
    • ImmutableSequenceDataset

      protected ImmutableSequenceDataset(DataProvenance sourceProvenance, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
      This is dangerous, and should not be used unless you've overridden everything in ImmutableSequenceDataset.
      Parameters:
      sourceProvenance - A description of the data, including all preprocessing.
      featureIDMap - The feature id map, used to remove unknown features.
      outputIDInfo - The output id map.
    • ImmutableSequenceDataset

      protected ImmutableSequenceDataset(DataProvenance provenance, OutputFactory<T> factory, String tribuoVersion, ImmutableFeatureMap fmap, ImmutableOutputInfo<T> outputInfo, List<SequenceExample<T>> examples)
      Deserialization constructor.
      Parameters:
      provenance - The source provenance.
      factory - The output factory.
      tribuoVersion - The Tribuo version.
      fmap - The feature id map.
      outputInfo - The output id info.
      examples - The examples.
  • Method Details

    • deserializeFromProto

      public static ImmutableSequenceDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
      Deserialization factory.
      Parameters:
      version - The serialized object version.
      className - The class name.
      message - The serialized data.
      Returns:
      The deserialized object.
      Throws:
      com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
    • add

      protected void add(SequenceExample<T> ex)
      Adds a SequenceExample to the dataset. This inserts feature ids, removes unknown features and sorts the examples by feature id.
      Parameters:
      ex - The example to add.
    • add

      protected void add(SequenceExample<T> ex, Merger merger)
      Adds a SequenceExample to the dataset. This inserts feature ids, removes unknown features, uses the supplied merger to combine duplicate features, and sorts the examples by feature id.
      Parameters:
      ex - The example to add.
      merger - The merger to use to remove duplicate features.
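The steps named for add (assign ids, drop unknown features, merge duplicates, order by id) can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the Tribuo implementation: AddExampleSketch and index are hypothetical names, a Map<String, Integer> stands in for the feature id map, and a java.util.function.DoubleBinaryOperator stands in for the Merger.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

// Hypothetical stand-in, not part of Tribuo.
final class AddExampleSketch {
    private AddExampleSketch() {}

    /**
     * Maps named features to ids, drops names missing from the domain,
     * combines values that share an id with the merger, and returns the
     * result ordered by id.
     */
    static SortedMap<Integer, Double> index(List<Map.Entry<String, Double>> features,
                                            Map<String, Integer> domain,
                                            DoubleBinaryOperator merger) {
        // TreeMap keeps its keys (the feature ids) in ascending order.
        TreeMap<Integer, Double> byId = new TreeMap<>();
        for (Map.Entry<String, Double> f : features) {
            Integer id = domain.get(f.getKey());
            if (id == null) {
                continue; // unknown feature: removed
            }
            // First occurrence is stored as-is; later ones are merged in.
            byId.merge(id, f.getValue(), merger::applyAsDouble);
        }
        return byId;
    }

    public static void main(String[] args) {
        Map<String, Integer> domain = Map.of("a", 0, "b", 1);
        List<Map.Entry<String, Double>> features = List.of(
                Map.entry("b", 2.0),
                Map.entry("a", 1.0),
                Map.entry("b", 3.0),      // duplicate of "b", merged by summing
                Map.entry("unseen", 9.0)  // not in the domain, dropped
        );
        System.out.println(index(features, domain, Double::sum));
    }
}
```

Summing is only one possible merge policy; taking the maximum or keeping the first value would be passed in the same way.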
    • getOutputs

      public Set<T> getOutputs()
      Description copied from class: SequenceDataset
      Gets the set of labels that occur in the examples in this dataset.
      Specified by:
      getOutputs in class SequenceDataset<T extends Output<T>>
      Returns:
      the set of labels that occur in the examples in this dataset.
    • getFeatureIDMap

      public ImmutableFeatureMap getFeatureIDMap()
      Description copied from class: SequenceDataset
      An immutable view on the feature map.
      Specified by:
      getFeatureIDMap in class SequenceDataset<T extends Output<T>>
      Returns:
      The feature map.
    • getFeatureMap

      public ImmutableFeatureMap getFeatureMap()
      Description copied from class: SequenceDataset
      The feature map.
      Specified by:
      getFeatureMap in class SequenceDataset<T extends Output<T>>
      Returns:
      The feature map.
    • getOutputIDInfo

      public ImmutableOutputInfo<T> getOutputIDInfo()
      Description copied from class: SequenceDataset
      An immutable view on the output info in this dataset.
      Specified by:
      getOutputIDInfo in class SequenceDataset<T extends Output<T>>
      Returns:
      The output info.
    • getOutputInfo

      public ImmutableOutputInfo<T> getOutputInfo()
      Description copied from class: SequenceDataset
      The output info in this dataset.
      Specified by:
      getOutputInfo in class SequenceDataset<T extends Output<T>>
      Returns:
      The output info.
    • toString

      public String toString()
      Overrides:
      toString in class SequenceDataset<T extends Output<T>>
    • serialize

      public org.tribuo.protos.core.SequenceDatasetProto serialize()
      Description copied from interface: ProtoSerializable
      Serializes this object to a protobuf.
      Specified by:
      serialize in interface ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>
      Returns:
      The protobuf.
    • getProvenance

      public DatasetProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>
    • copyDataset

      public static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset)
      Creates an immutable deep copy of the supplied dataset.
      Type Parameters:
      T - The type of output.
      Parameters:
      dataset - The dataset to copy.
      Returns:
      An immutable copy of the dataset.
    • copyDataset

      public static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
      Creates an immutable deep copy of the supplied dataset, using a different feature and output map.
      Type Parameters:
      T - The type of output.
      Parameters:
      dataset - The dataset to copy.
      featureIDMap - The new feature map to use. Removes features which are not found in this map.
      outputIDInfo - The new output info to use.
      Returns:
      An immutable copy of the dataset.
    • copyDataset

      public static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, Merger merger)
      Creates an immutable deep copy of the supplied dataset, using a different feature and output map and the supplied merger to combine features.
      Type Parameters:
      T - The type of output.
      Parameters:
      dataset - The dataset to copy.
      featureIDMap - The new feature map to use. Removes features which are not found in this map.
      outputIDInfo - The new output info to use.
      merger - The merge function to use to reduce features given new ids.
      Returns:
      An immutable copy of the dataset.
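What "immutable deep copy" means for the copyDataset methods can be shown with a small plain-Java sketch. It is an analogy rather than Tribuo code: DeepCopySketch and deepCopy are hypothetical names, and a list of maps stands in for a dataset of examples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of Tribuo.
final class DeepCopySketch {
    private DeepCopySketch() {}

    /**
     * Copies every example into a fresh map, then wraps both the examples
     * and the outer list so that neither can be modified through the copy.
     */
    static List<Map<String, Double>> deepCopy(List<Map<String, Double>> examples) {
        List<Map<String, Double>> copy = new ArrayList<>();
        for (Map<String, Double> ex : examples) {
            copy.add(Collections.unmodifiableMap(new LinkedHashMap<>(ex)));
        }
        return Collections.unmodifiableList(copy);
    }

    public static void main(String[] args) {
        Map<String, Double> first = new LinkedHashMap<>();
        first.put("a", 1.0);
        List<Map<String, Double>> original = new ArrayList<>();
        original.add(first);

        List<Map<String, Double>> copy = deepCopy(original);

        // Later changes to the original are not visible through the copy.
        first.put("b", 2.0);
        original.clear();
        System.out.println(copy);
    }
}
```

A shallow copy would share the inner maps with the original, so the later put on the original example would leak into it. Copying each example avoids that.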