Class MutableSequenceDataset<T extends Output<T>>

java.lang.Object
org.tribuo.sequence.SequenceDataset<T>
org.tribuo.sequence.MutableSequenceDataset<T>
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>

public class MutableSequenceDataset<T extends Output<T>> extends SequenceDataset<T>
A MutableSequenceDataset is a SequenceDataset with a MutableFeatureMap which grows over time, whenever a SequenceExample is added to the dataset.
  • Field Details

    • CURRENT_VERSION

      public static final int CURRENT_VERSION
      Protobuf serialization version.
    • outputInfo

      protected final MutableOutputInfo<T> outputInfo
      A map from labels to IDs for the labels found in this dataset.
    • featureMap

      protected final MutableFeatureMap featureMap
      A map from feature names to IDs for the features found in this dataset.
    • dense

      protected boolean dense
      Whether this dataset has a dense feature space.
  • Constructor Details

    • MutableSequenceDataset

      public MutableSequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
      Creates an empty sequence dataset.
      Parameters:
      sourceProvenance - A description of the input data, including preprocessing steps.
      outputFactory - The output factory.
    • MutableSequenceDataset

      public MutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
      Creates a dataset from a data source. This constructor creates the output and feature ID maps that are needed for training and evaluating classifiers.
      Parameters:
      dataSource - The input data.
      sourceProvenance - A description of the data, including preprocessing steps.
      outputFactory - The output factory.
    • MutableSequenceDataset

      public MutableSequenceDataset(SequenceDataSource<T> dataSource)
      Builds a dataset from the supplied data source.
      Parameters:
      dataSource - The data source.
    • MutableSequenceDataset

      public MutableSequenceDataset(ImmutableSequenceDataset<T> dataset)
      Copies the immutable dataset into a mutable dataset.

      This should be infrequently used and mostly exists for the ViterbiTrainer.

      Parameters:
      dataset - The dataset to copy.
  • Method Details

    • deserializeFromProto

      public static MutableSequenceDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
      Deserialization factory.
      Parameters:
      version - The serialized object version.
      className - The class name.
      message - The serialized data.
      Returns:
      The deserialized object.
      Throws:
      com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
    • clear

      public void clear()
      Clears all the examples out of this dataset, and flushes the FeatureMap, OutputInfo, and transform provenances.
    • add

      public void add(SequenceExample<T> ex)
      Adds a SequenceExample to this dataset.

      Adding an example also canonicalises the reference to each feature's name (i.e., it replaces the reference to a feature's name with the canonical one stored in this dataset's VariableInfo), which greatly reduces the memory footprint.

      Parameters:
      ex - The example to add.
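
      The canonicalisation described for add can be illustrated with a minimal, self-contained sketch. It does not use Tribuo classes: CanonicalisingStore, its methods, and the use of plain String arrays as examples are all hypothetical stand-ins, assumed only for illustration. The idea shown is that the first String instance seen for a feature name is kept, and every later equal name is rewritten to point at that same instance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a dataset; this is not Tribuo code. */
final class CanonicalisingStore {
    // Maps each feature name to the single canonical String instance kept by the store.
    private final Map<String, String> canonicalNames = new HashMap<>();
    // The stored examples, each reduced to an array of feature names.
    private final List<String[]> examples = new ArrayList<>();

    /** Adds an example, rewriting each name reference in place to the canonical instance. */
    void add(String[] featureNames) {
        for (int i = 0; i < featureNames.length; i++) {
            String name = featureNames[i];
            // putIfAbsent returns the previously stored instance, or null if the name is new.
            String existing = canonicalNames.putIfAbsent(name, name);
            featureNames[i] = (existing == null) ? name : existing;
        }
        examples.add(featureNames);
    }

    /** The number of distinct feature names seen so far; it grows as examples are added. */
    int numFeatures() {
        return canonicalNames.size();
    }

    /** Returns the feature names of the i-th stored example. */
    String[] get(int i) {
        return examples.get(i);
    }
}
```

      In this sketch, two examples that each carry their own copy of an equal feature name end up sharing one String object after both are added, so the duplicate copies become eligible for garbage collection.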
    • addAll

      public void addAll(Collection<SequenceExample<T>> collection)
      Adds all the SequenceExamples in the supplied collection to this dataset.
      Parameters:
      collection - The collection of SequenceExamples.
    • getOutputs

      public Set<T> getOutputs()
      Description copied from class: SequenceDataset
      Gets the set of labels that occur in the examples in this dataset.
      Specified by:
      getOutputs in class SequenceDataset<T extends Output<T>>
      Returns:
      the set of labels that occur in the examples in this dataset.
    • getFeatureIDMap

      public ImmutableFeatureMap getFeatureIDMap()
      Description copied from class: SequenceDataset
      An immutable view on the feature map.
      Specified by:
      getFeatureIDMap in class SequenceDataset<T extends Output<T>>
      Returns:
      The feature map.
    • getFeatureMap

      public MutableFeatureMap getFeatureMap()
      Description copied from class: SequenceDataset
      The feature map.
      Specified by:
      getFeatureMap in class SequenceDataset<T extends Output<T>>
      Returns:
      The feature map.
    • getOutputIDInfo

      public ImmutableOutputInfo<T> getOutputIDInfo()
      Description copied from class: SequenceDataset
      An immutable view on the output info in this dataset.
      Specified by:
      getOutputIDInfo in class SequenceDataset<T extends Output<T>>
      Returns:
      The output info.
    • getOutputInfo

      public OutputInfo<T> getOutputInfo()
      Description copied from class: SequenceDataset
      The output info in this dataset.
      Specified by:
      getOutputInfo in class SequenceDataset<T extends Output<T>>
      Returns:
      The output info.
    • isDense

      public boolean isDense()
      Returns whether the dataset is dense (i.e., whether every feature in the domain has a value in each example).
      Returns:
      True if the dataset is dense.
    • densify

      public void densify()
      Iterates through the examples, converting implicit zeros into explicit zeros.
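
      As a rough illustration of what converting implicit zeros into explicit zeros means, the sketch below models sparse examples as maps from feature name to value. It is not the Tribuo implementation: DensifySketch and its methods are hypothetical, and it stores flat examples rather than sequences of examples to keep the idea short. A feature that is absent from an example is an implicit zero, and densifying writes a 0.0 entry for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of densification; this is not Tribuo code. */
final class DensifySketch {
    // The feature domain: every feature name seen in any example.
    private final Set<String> domain = new TreeSet<>();
    // Sparse examples: a missing key is an implicit zero.
    private final List<Map<String, Double>> examples = new ArrayList<>();

    /** Adds a copy of the example and grows the feature domain. */
    void add(Map<String, Double> example) {
        domain.addAll(example.keySet());
        examples.add(new TreeMap<>(example));
    }

    /** True if every example has an explicit value for every feature in the domain. */
    boolean isDense() {
        for (Map<String, Double> ex : examples) {
            // Example keys are always a subset of the domain, so equal sizes imply equal sets.
            if (ex.size() != domain.size()) {
                return false;
            }
        }
        return true;
    }

    /** Writes an explicit 0.0 for each feature an example does not mention. */
    void densify() {
        for (Map<String, Double> ex : examples) {
            for (String name : domain) {
                ex.putIfAbsent(name, 0.0);
            }
        }
    }

    /** Returns the i-th stored example. */
    Map<String, Double> get(int i) {
        return examples.get(i);
    }
}
```

      In this sketch the existing values are left untouched, and only the missing entries are filled in, so the density check goes from false to true without changing what any example represents numerically.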
    • toString

      public String toString()
      Overrides:
      toString in class SequenceDataset<T extends Output<T>>
    • getProvenance

      public DatasetProvenance getProvenance()
    • serialize

      public org.tribuo.protos.core.SequenceDatasetProto serialize()
      Description copied from interface: ProtoSerializable
      Serializes this object to a protobuf.
      Returns:
      The protobuf.