Class SequenceDataset<T extends Output<T>>

java.lang.Object
org.tribuo.sequence.SequenceDataset<T>
Type Parameters:
T - the type of the outputs in the data set.
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>
Direct Known Subclasses:
ImmutableSequenceDataset, MutableSequenceDataset

public abstract class SequenceDataset<T extends Output<T>> extends Object implements Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable
A class for sets of sequence data, which are used to train and evaluate sequence models.

Subclass either MutableSequenceDataset or ImmutableSequenceDataset rather than this class.

  • Field Details

    • outputFactory

      protected final OutputFactory<T> outputFactory
      A factory for making OutputInfo and Output of the appropriate type.
    • data

      protected final List<SequenceExample<T>> data
      The data in this data set.
    • tribuoVersion

      protected final String tribuoVersion
      The version of Tribuo which created this dataset.
    • sourceProvenance

      protected final DataProvenance sourceProvenance
      The provenance of the data source, extracted on construction.
  • Constructor Details

    • SequenceDataset

      protected SequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
      Constructs a sequence dataset using the current Tribuo version.
      Parameters:
      sourceProvenance - The provenance.
      outputFactory - The output factory.
    • SequenceDataset

      protected SequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory, String tribuoVersion)
      Constructs a sequence dataset.
      Parameters:
      sourceProvenance - The provenance.
      outputFactory - The output factory.
      tribuoVersion - The Tribuo version string.
  • Method Details

    • getSourceDescription

      public String getSourceDescription()
      Returns the description of the source provenance.
      Returns:
      The source provenance in text form.
    • getData

      public List<SequenceExample<T>> getData()
      Returns an unmodifiable view on the data.
      Returns:
      The data.
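
      As a rough illustration of what an unmodifiable view means for callers, the sketch below uses a hypothetical stand-in class (UnmodifiableViewSketch, holding plain strings) rather than Tribuo's own types, and assumes the view behaves like one produced by Collections.unmodifiableList: it reflects later changes to the backing list but rejects writes made through the view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a dataset; this is not Tribuo's implementation.
final class UnmodifiableViewSketch {
    private final List<String> data = new ArrayList<>();

    // Mutation goes through the owning class only.
    void add(String example) {
        data.add(example);
    }

    // Returns a read-only view backed by the internal list (no copy is made).
    List<String> getData() {
        return Collections.unmodifiableList(data);
    }

    // Returns true if the supplied view refuses a write from the caller.
    static boolean rejectsWrites(List<String> view) {
        try {
            view.add("rejected");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        UnmodifiableViewSketch dataset = new UnmodifiableViewSketch();
        dataset.add("first");
        List<String> view = dataset.getData();
        dataset.add("second");
        System.out.println(view.size());         // 2: the view tracks the backing list
        System.out.println(rejectsWrites(view)); // true: writes through the view fail
    }
}
```

      Callers that need a stable snapshot would copy the returned list instead of holding on to the view.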
    • getSourceProvenance

      public DataProvenance getSourceProvenance()
      Returns the source provenance.
      Returns:
      The source provenance.
    • getOutputs

      public abstract Set<T> getOutputs()
      Gets the set of outputs that occur in the examples in this dataset.
      Returns:
      the set of outputs that occur in the examples in this dataset.
    • getExample

      public SequenceExample<T> getExample(int index)
      Gets the example at the specified index, or throws IllegalArgumentException if the index is out of bounds.
      Parameters:
      index - The index.
      Returns:
      The example at that index.
    • getFlatDataset

      public Dataset<T> getFlatDataset()
      Returns a view on this SequenceDataset which aggregates all the examples and ignores the sequence structure.
      Returns:
      A flattened view on this dataset.
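
      Ignoring the sequence structure amounts to treating every element of every sequence as an independent example. The sketch below shows that idea on plain lists; FlattenSketch and flatten are hypothetical names, and unlike the view described above this version copies the elements into a new list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of flattening; not how Tribuo builds its flat view.
final class FlattenSketch {
    // Concatenates the sequences in order, discarding sequence boundaries.
    static <E> List<E> flatten(List<List<E>> sequences) {
        List<E> flat = new ArrayList<>();
        for (List<E> sequence : sequences) {
            flat.addAll(sequence);
        }
        return flat;
    }

    public static void main(String[] args) {
        List<List<String>> sequences = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"));
        System.out.println(flatten(sequences)); // [a, b, c]
    }
}
```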
    • size

      public int size()
      Gets the size of the data set.
      Returns:
      the size of the data set.
    • getOutputIDInfo

      public abstract ImmutableOutputInfo<T> getOutputIDInfo()
      An immutable view on the output info in this dataset.
      Returns:
      The output info.
    • getOutputInfo

      public abstract OutputInfo<T> getOutputInfo()
      The output info in this dataset.
      Returns:
      The output info.
    • getFeatureIDMap

      public abstract ImmutableFeatureMap getFeatureIDMap()
      An immutable view on the feature map.
      Returns:
      The feature map.
    • getFeatureMap

      public abstract FeatureMap getFeatureMap()
      The feature map.
      Returns:
      The feature map.
    • getOutputFactory

      public OutputFactory<T> getOutputFactory()
      Gets the output factory.
      Returns:
      The output factory.
    • iterator

      public Iterator<SequenceExample<T>> iterator()
      Specified by:
      iterator in interface Iterable<SequenceExample<T>>
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • validate

      public boolean validate(Class<? extends Output<?>> clazz)
      Validates that this SequenceDataset does in fact contain the supplied output type.

      As the output type is erased at runtime, deserialising a SequenceDataset is an unchecked operation. This method allows the user to check that the deserialised dataset is of the appropriate type, rather than waiting to see whether it throws a ClassCastException when used.

      Parameters:
      clazz - The class object to verify the output type against.
      Returns:
      True if the output type is assignable to the class object type, false otherwise.
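
      The erasure problem described above can be reproduced with plain Java generics. In the sketch below, Box is a hypothetical stand-in container (not a Tribuo class) and its validate method simply checks each stored element with Class.isInstance; the real method may inspect the dataset differently, but the point is the same: the type parameter is unavailable at runtime, so the check must be made against a Class object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a runtime type check on an erased generic container.
final class ErasureCheckSketch {
    // Stand-in container; T is erased at runtime, so it cannot be queried directly.
    static final class Box<T> {
        final List<T> outputs = new ArrayList<>();

        // True if every stored output is an instance of the supplied class.
        boolean validate(Class<?> clazz) {
            for (T output : outputs) {
                if (!clazz.isInstance(output)) {
                    return false;
                }
            }
            return true;
        }
    }

    // Mimics deserialisation: the caller receives a wildcard type with no static guarantee.
    static Box<?> load() {
        Box<Object> box = new Box<>();
        box.outputs.add("a string output");
        return box;
    }

    public static void main(String[] args) {
        Box<?> loaded = load();
        System.out.println(loaded.validate(String.class));  // true
        System.out.println(loaded.validate(Integer.class)); // false
    }
}
```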
    • castDataset

      public static <T extends Output<T>> SequenceDataset<T> castDataset(SequenceDataset<?> inputDataset, Class<T> outputType)
      Casts the dataset to the specified output type, assuming it is valid.

      If it's not valid, throws ClassCastException.

      Type Parameters:
      T - The output type.
      Parameters:
      inputDataset - The dataset to cast.
      outputType - The output type to cast to.
      Returns:
      The dataset cast to the specified output type.
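
      A check-then-cast helper of this shape can be sketched in plain Java. Holder and castHolder below are hypothetical stand-ins, not Tribuo classes: the helper verifies the runtime type first, performs the unchecked generic cast only when the check passes, and otherwise throws ClassCastException at the call site instead of at some later use.

```java
// Hypothetical illustration of a checked generic cast; not Tribuo's implementation.
final class CheckedCastSketch {
    // Stand-in for a generic container whose type parameter is erased at runtime.
    static final class Holder<T> {
        final T value;

        Holder(T value) {
            this.value = value;
        }
    }

    // Casts the holder to Holder<T> if its content is an instance of type, else throws.
    static <T> Holder<T> castHolder(Holder<?> input, Class<T> type) {
        if (type.isInstance(input.value)) {
            @SuppressWarnings("unchecked") // safe: guarded by the isInstance check above
            Holder<T> typed = (Holder<T>) input;
            return typed;
        }
        throw new ClassCastException("Holder does not contain " + type.getName());
    }

    public static void main(String[] args) {
        Holder<?> raw = new Holder<Object>("text");
        Holder<String> typed = castHolder(raw, String.class);
        System.out.println(typed.value.length()); // 4
        try {
            castHolder(raw, Integer.class);
        } catch (ClassCastException e) {
            System.out.println("cast rejected");
        }
    }
}
```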
    • deserialize

      public static SequenceDataset<?> deserialize(org.tribuo.protos.core.SequenceDatasetProto sequenceProto)
      Deserializes a sequence dataset proto into a sequence dataset.
      Parameters:
      sequenceProto - The proto to deserialize.
      Returns:
      The sequence dataset.
    • deserializeFromFile

      public static SequenceDataset<?> deserializeFromFile(Path path) throws IOException
      Reads an instance of SequenceDatasetProto from the supplied path and deserializes it.
      Parameters:
      path - The path to read.
      Returns:
      The deserialized sequence dataset.
      Throws:
      IOException - If the path could not be read from, or the parsing failed.
    • deserializeFromStream

      public static SequenceDataset<?> deserializeFromStream(InputStream is) throws IOException
      Reads an instance of SequenceDatasetProto from the supplied input stream and deserializes it.
      Parameters:
      is - The input stream to read.
      Returns:
      The deserialized sequence dataset.
      Throws:
      IOException - If the stream could not be read from, or the parsing failed.
    • serializeToFile

      public void serializeToFile(Path path) throws IOException
      Serializes this sequence dataset to a SequenceDatasetProto and writes it to the supplied path.
      Parameters:
      path - The path to write to.
      Throws:
      IOException - If the path could not be written to.
    • serializeToStream

      public void serializeToStream(OutputStream stream) throws IOException
      Serializes this sequence dataset to a SequenceDatasetProto and writes it to the supplied output stream.

      Does not close the stream.

      Parameters:
      stream - The output stream to write to.
      Throws:
      IOException - If the stream could not be written to.
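
      Because the stream is not closed, the caller keeps ownership of it and is responsible for closing it, typically with try-with-resources. The sketch below illustrates that contract with hypothetical stand-ins: writePayload plays the role of a serialize method that writes and flushes only, and TrackingStream is an in-memory stream that records when close() is called.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a caller-owned stream; not Tribuo's implementation.
final class CallerOwnedStreamSketch {
    // Records whether close() has been called, so ownership is observable.
    static final class TrackingStream extends ByteArrayOutputStream {
        boolean closed = false;

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Stand-in for a serialize method: writes and flushes, never closes.
    static void writePayload(OutputStream stream, String payload) throws IOException {
        stream.write(payload.getBytes(StandardCharsets.UTF_8));
        stream.flush();
    }

    // Returns {closed after the write, closed after the caller's try block}.
    static boolean[] closedBeforeAndAfter() {
        TrackingStream out = new TrackingStream();
        boolean closedAfterWrite;
        try (TrackingStream stream = out) {
            writePayload(stream, "payload");
            closedAfterWrite = stream.closed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new boolean[] {closedAfterWrite, out.closed};
    }

    public static void main(String[] args) {
        boolean[] result = closedBeforeAndAfter();
        System.out.println(result[0]); // false: the writer left the stream open
        System.out.println(result[1]); // true: the caller's try-with-resources closed it
    }
}
```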
    • createDataCarrier

      protected DatasetDataCarrier<T> createDataCarrier(FeatureMap featureMap, OutputInfo<T> outputInfo)
      Constructs the data carrier for serialization.
      Parameters:
      featureMap - The feature domain.
      outputInfo - The output domain.
      Returns:
      The serialization data carrier.
    • deserializeExamples

      protected static List<SequenceExample<?>> deserializeExamples(List<org.tribuo.protos.core.SequenceExampleProto> examplesList, Class<?> outputClass, FeatureMap fmap)
      Deserializes a list of sequence example protos into a list of sequence examples.
      Parameters:
      examplesList - The protos.
      outputClass - The output class.
      fmap - The feature domain.
      Returns:
      The list of deserialized sequence examples.