Package org.tribuo.sequence
Class SequenceDataset<T extends Output<T>>
java.lang.Object
org.tribuo.sequence.SequenceDataset<T>
- Type Parameters:
T - the type of the outputs in the data set.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>
- Direct Known Subclasses:
ImmutableSequenceDataset, MutableSequenceDataset
public abstract class SequenceDataset<T extends Output<T>>
extends Object
implements Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable
A class for sets of sequence data, which are used to train and evaluate sequence models.
Subclass either MutableSequenceDataset or ImmutableSequenceDataset rather than this class.
-
Field Summary
Modifier and Type Field - Description
protected final List<SequenceExample<T>> data - The data in this data set.
protected final OutputFactory<T> outputFactory - A factory for making OutputInfo and Output of the appropriate type.
protected final DataProvenance sourceProvenance - The provenance of the data source, extracted on construction.
protected final String tribuoVersion - The version of Tribuo which created this dataset.
Fields inherited from interface org.tribuo.protos.ProtoSerializable
DESERIALIZATION_METHOD_NAME, PROVENANCE_SERIALIZER
-
Constructor Summary
Modifier Constructor - Description
protected SequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory) - Constructs a sequence dataset using the current Tribuo version.
protected SequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory, String tribuoVersion) - Constructs a sequence dataset.
-
Method Summary
Modifier and Type Method - Description
static <T extends Output<T>> SequenceDataset<T> castDataset(SequenceDataset<?> inputDataset, Class<T> outputType) - Casts the dataset to the specified output type, assuming it is valid.
protected DatasetDataCarrier<T> createDataCarrier(FeatureMap featureMap, OutputInfo<T> outputInfo) - Constructs the data carrier for serialization.
static SequenceDataset<?> deserialize(org.tribuo.protos.core.SequenceDatasetProto sequenceProto) - Deserializes a sequence dataset proto into a sequence dataset.
protected static List<SequenceExample<?>> deserializeExamples(List<org.tribuo.protos.core.SequenceExampleProto> examplesList, Class<?> outputClass, FeatureMap fmap) - Deserializes a list of sequence example protos into a list of sequence examples.
static SequenceDataset<?> deserializeFromFile(Path path) - Reads an instance of SequenceDatasetProto from the supplied path and deserializes it.
static SequenceDataset<?> deserializeFromStream(InputStream is) - Reads an instance of SequenceDatasetProto from the supplied input stream and deserializes it.
List<SequenceExample<T>> getData() - Returns an unmodifiable view on the data.
SequenceExample<T> getExample(int index) - Gets the example at the specified index, or throws IllegalArgumentException if the index is out of bounds.
abstract ImmutableFeatureMap getFeatureIDMap() - An immutable view on the feature map.
abstract FeatureMap getFeatureMap() - The feature map.
Dataset<T> getFlatDataset() - Returns a view on this SequenceDataset which aggregates all the examples and ignores the sequence structure.
OutputFactory<T> getOutputFactory() - Gets the output factory.
abstract ImmutableOutputInfo<T> getOutputIDInfo() - An immutable view on the output info in this dataset.
abstract OutputInfo<T> getOutputInfo() - The output info in this dataset.
abstract Set<T> getOutputs() - Gets the set of labels that occur in the examples in this dataset.
String getSourceDescription() - Returns the description of the source provenance.
DataProvenance getSourceProvenance() - Returns the source provenance.
Iterator<SequenceExample<T>> iterator()
void serializeToFile(Path path) - Serializes this sequence dataset to a SequenceDatasetProto and writes it to the supplied path.
void serializeToStream(OutputStream stream) - Serializes this sequence dataset to a SequenceDatasetProto and writes it to the supplied output stream.
int size() - Gets the size of the data set.
String toString()
boolean validate(Class<? extends Output<?>> clazz) - Validates that this SequenceDataset does in fact contain the supplied output type.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
Methods inherited from interface org.tribuo.protos.ProtoSerializable
serialize
Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance
-
Field Details
-
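outputFactory
protected final OutputFactory<T> outputFactory
A factory for making OutputInfo and Output of the appropriate type.
-
data
protected final List<SequenceExample<T>> data
The data in this data set.
-
tribuoVersion
protected final String tribuoVersion
The version of Tribuo which created this dataset.
-
sourceProvenance
protected final DataProvenance sourceProvenance
The provenance of the data source, extracted on construction.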
-
-
Constructor Details
-
SequenceDataset
protected SequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
Constructs a sequence dataset using the current Tribuo version.
- Parameters:
sourceProvenance - The provenance.
outputFactory - The output factory.
-
SequenceDataset
protected SequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory, String tribuoVersion)
Constructs a sequence dataset.
- Parameters:
sourceProvenance - The provenance.
outputFactory - The output factory.
tribuoVersion - The Tribuo version string.
-
-
Method Details
-
getSourceDescription
Returns the description of the source provenance.
- Returns:
The source provenance in text form.
-
getData
Returns an unmodifiable view on the data.
- Returns:
The data.
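The "unmodifiable view" contract can be illustrated with the standard library alone. The sketch below does not use Tribuo. It assumes the returned list behaves like a Collections.unmodifiableList wrapper around the internal list, which this page does not confirm. UnmodifiableViewSketch, add, and view are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a dataset that stores its examples in a private list.
final class UnmodifiableViewSketch {
    private final List<String> data = new ArrayList<>();

    // Owner-side mutation, standing in for whatever a mutable subclass exposes.
    void add(String example) {
        data.add(example);
    }

    // Returns a read-only view: callers cannot mutate it, but it reflects later changes.
    List<String> view() {
        return Collections.unmodifiableList(data);
    }

    public static void main(String[] args) {
        UnmodifiableViewSketch ds = new UnmodifiableViewSketch();
        ds.add("first");
        List<String> view = ds.view();
        ds.add("second");
        System.out.println(view.size()); // prints 2, the view tracks the backing list
        try {
            view.add("third");
        } catch (UnsupportedOperationException e) {
            System.out.println("view is read-only");
        }
    }
}
```

Under that assumption, callers who need a stable snapshot should copy the returned list, since a view can change when the underlying dataset does.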
-
getSourceProvenance
Returns the source provenance.
- Returns:
The source provenance.
-
getOutputs
Gets the set of labels that occur in the examples in this dataset.
- Returns:
The set of labels that occur in the examples in this dataset.
-
getExample
Gets the example at the specified index, or throws IllegalArgumentException if the index is out of bounds.
- Parameters:
index - The index.
- Returns:
The example at that index.
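Note that the documented failure mode is IllegalArgumentException, not the IndexOutOfBoundsException that List.get throws. A minimal sketch of that kind of up-front check follows. It uses a plain list of strings, and BoundsCheckSketch and its getExample helper are hypothetical stand-ins rather than Tribuo code.

```java
import java.util.Arrays;
import java.util.List;

final class BoundsCheckSketch {
    // Hypothetical helper: validates the index up front and reports a bad index
    // as IllegalArgumentException instead of List.get's IndexOutOfBoundsException.
    static <E> E getExample(List<E> data, int index) {
        if (index < 0 || index >= data.size()) {
            throw new IllegalArgumentException(
                "Index " + index + " is out of bounds for a dataset of size " + data.size());
        }
        return data.get(index);
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("seq-0", "seq-1");
        System.out.println(getExample(data, 1)); // prints seq-1
        try {
            getExample(data, 2);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Code that catches exceptions around index access should therefore catch IllegalArgumentException when calling getExample.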
-
getFlatDataset
Returns a view on this SequenceDataset which aggregates all the examples and ignores the sequence structure.
- Returns:
A flattened view on this dataset.
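To make "ignores the sequence structure" concrete, the sketch below concatenates the elements of several sequences into one list, in order. Plain strings stand in for examples, and FlattenSketch and flatten are hypothetical names, not Tribuo API. Unlike the documented method, which returns a view, this sketch builds a copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FlattenSketch {
    // Concatenates the elements of every sequence in order, discarding sequence boundaries.
    static <E> List<E> flatten(List<List<E>> sequences) {
        List<E> flat = new ArrayList<>();
        for (List<E> sequence : sequences) {
            flat.addAll(sequence);
        }
        return flat;
    }

    public static void main(String[] args) {
        List<List<String>> sequences = Arrays.asList(
            Arrays.asList("the", "cat"),
            Arrays.asList("sat"));
        System.out.println(flatten(sequences)); // prints [the, cat, sat]
    }
}
```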
-
size
public int size()
Gets the size of the data set.
- Returns:
The size of the data set.
-
getOutputIDInfo
An immutable view on the output info in this dataset.
- Returns:
The output info.
-
getOutputInfo
The output info in this dataset.
- Returns:
The output info.
-
getFeatureIDMap
An immutable view on the feature map.
- Returns:
The feature map.
-
getFeatureMap
The feature map.
- Returns:
The feature map.
-
getOutputFactory
Gets the output factory.
- Returns:
The output factory.
-
iterator
-
toString
-
validate
Validates that this SequenceDataset does in fact contain the supplied output type.
As the output type is erased at runtime, deserializing a SequenceDataset is an unchecked operation. This method allows the user to check that the deserialized dataset is of the appropriate type, rather than waiting to see whether the dataset throws a ClassCastException when used.
- Parameters:
clazz - The class object to verify the output type against.
- Returns:
True if the output type is assignable to the class object type, false otherwise.
-
castDataset
public static <T extends Output<T>> SequenceDataset<T> castDataset(SequenceDataset<?> inputDataset, Class<T> outputType)
Casts the dataset to the specified output type, assuming it is valid.
If it's not valid, throws ClassCastException.
- Type Parameters:
T - The output type.
- Parameters:
inputDataset - The dataset to cast.
outputType - The output type to cast to.
- Returns:
The dataset cast to the correct type.
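The check-then-cast pattern behind validate and castDataset can be sketched without Tribuo. The example below is an illustration under assumptions, not the library's implementation. ErasedBox, its runtimeType field, and its cast method are hypothetical stand-ins for a generic container that records its element class so the type can be verified after erasure.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a container whose element type is erased at runtime.
final class ErasedBox<T> {
    private final List<T> items;
    private final Class<?> runtimeType; // recorded so the type can be checked after erasure

    ErasedBox(List<T> items, Class<?> runtimeType) {
        this.items = items;
        this.runtimeType = runtimeType;
    }

    // True if the stored element type is assignable to clazz.
    boolean validate(Class<?> clazz) {
        return clazz.isAssignableFrom(runtimeType);
    }

    // Checks first, then performs the unchecked cast; throws ClassCastException if invalid.
    @SuppressWarnings("unchecked")
    static <U> ErasedBox<U> cast(ErasedBox<?> input, Class<U> type) {
        if (!input.validate(type)) {
            throw new ClassCastException(
                "Expected " + type.getName() + ", found " + input.runtimeType.getName());
        }
        return (ErasedBox<U>) input;
    }

    T first() {
        return items.get(0);
    }

    public static void main(String[] args) {
        ErasedBox<?> loaded = new ErasedBox<>(Arrays.asList("a", "b"), String.class);
        ErasedBox<String> typed = ErasedBox.cast(loaded, String.class);
        System.out.println(typed.first());                  // prints a
        System.out.println(loaded.validate(Integer.class)); // prints false
    }
}
```

Failing at the cast site, rather than at some later element access, keeps the error close to the point where the wrong type was assumed.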
-
deserialize
public static SequenceDataset<?> deserialize(org.tribuo.protos.core.SequenceDatasetProto sequenceProto)
Deserializes a sequence dataset proto into a sequence dataset.
- Parameters:
sequenceProto - The proto to deserialize.
- Returns:
The sequence dataset.
-
deserializeFromFile
Reads an instance of SequenceDatasetProto from the supplied path and deserializes it.
- Parameters:
path - The path to read.
- Returns:
The deserialized sequence dataset.
- Throws:
IOException - If the path could not be read from, or the parsing failed.
-
deserializeFromStream
Reads an instance of SequenceDatasetProto from the supplied input stream and deserializes it.
- Parameters:
is - The input stream to read.
- Returns:
The deserialized sequence dataset.
- Throws:
IOException - If the stream could not be read from, or the parsing failed.
-
serializeToFile
Serializes this sequence dataset to a SequenceDatasetProto and writes it to the supplied path.
- Parameters:
path - The path to write to.
- Throws:
IOException - If the path could not be written to.
-
serializeToStream
Serializes this sequence dataset to a SequenceDatasetProto and writes it to the supplied output stream.
Does not close the stream.
- Parameters:
stream - The output stream to write to.
- Throws:
IOException - If the stream could not be written to.
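Because serializeToStream does not close the stream, the caller owns its lifecycle. The sketch below shows that ownership pattern with standard library streams only. StreamOwnershipSketch and writeTo are hypothetical stand-ins for a method that writes to a stream and leaves it open, and the flush call is an assumption of the sketch rather than documented behaviour.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class StreamOwnershipSketch {
    // Hypothetical writer: emits the payload but leaves the stream open,
    // so the caller stays responsible for closing it.
    static void writeTo(OutputStream stream, String payload) throws IOException {
        stream.write(payload.getBytes(StandardCharsets.UTF_8));
        stream.flush();
    }

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the stream once all writes are finished.
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            writeTo(out, "first;");
            writeTo(out, "second;"); // the stream is still open after the first call
            System.out.println(out.toString("UTF-8")); // prints first;second;
        }
    }
}
```

Leaving the stream open lets a caller write several objects to the same stream, at the cost of having to close it explicitly.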
-
createDataCarrier
Constructs the data carrier for serialization.
- Parameters:
featureMap - The feature domain.
outputInfo - The output domain.
- Returns:
The serialization data carrier.
-
deserializeExamples
protected static List<SequenceExample<?>> deserializeExamples(List<org.tribuo.protos.core.SequenceExampleProto> examplesList, Class<?> outputClass, FeatureMap fmap)
Deserializes a list of sequence example protos into a list of sequence examples.
- Parameters:
examplesList - The protos.
outputClass - The output class.
fmap - The feature domain.
- Returns:
The list of deserialized sequence examples.
-