Package org.tribuo.sequence
Class MutableSequenceDataset<T extends Output<T>>
java.lang.Object
org.tribuo.sequence.SequenceDataset<T>
org.tribuo.sequence.MutableSequenceDataset<T>
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>
A MutableSequenceDataset is a SequenceDataset with a MutableFeatureMap which grows over time, whenever a SequenceExample is added to the dataset.
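The growth behaviour described above can be sketched in plain Java. This is a toy model only, not Tribuo's implementation: the class `GrowingFeatureMapSketch`, its methods, and the representation of a sequence as a list of sparse `Map<String, Double>` steps are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy class, not part of Tribuo: a feature map that grows as sequences are added. */
final class GrowingFeatureMapSketch {
    // Feature name -> number of times the feature has been observed.
    private final Map<String, Integer> featureCounts = new LinkedHashMap<>();
    // Each sequence is a list of steps; each step is a sparse map of feature name -> value.
    private final List<List<Map<String, Double>>> sequences = new ArrayList<>();

    /** Stores the sequence and records every feature it contains. */
    void add(List<Map<String, Double>> sequence) {
        for (Map<String, Double> step : sequence) {
            for (String name : step.keySet()) {
                featureCounts.merge(name, 1, Integer::sum);
            }
        }
        sequences.add(sequence);
    }

    /** Number of distinct feature names seen so far. */
    int numFeatures() { return featureCounts.size(); }

    /** Number of times the named feature has been seen, or 0 if it is unknown. */
    int count(String name) { return featureCounts.getOrDefault(name, 0); }

    /** Number of sequences stored. */
    int size() { return sequences.size(); }

    public static void main(String[] args) {
        GrowingFeatureMapSketch data = new GrowingFeatureMapSketch();
        data.add(List.of(Map.of("word=the", 1.0), Map.of("word=cat", 1.0)));
        System.out.println(data.numFeatures()); // 2
        data.add(List.of(Map.of("word=the", 1.0, "capitalised", 1.0)));
        System.out.println(data.numFeatures()); // 3
    }
}
```

The feature domain is only known after every sequence has been added, which is why the mutable and immutable views of the feature map are kept separate in the sketch's real counterpart.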
-
Field Summary
Modifier and Type | Field | Description
static final int | CURRENT_VERSION | Protobuf serialization version.
protected boolean | dense | Does this dataset have a dense feature space.
protected final MutableFeatureMap | featureMap | A map from feature names to IDs for the features found in this dataset.
protected final MutableOutputInfo<T> | outputInfo | A map from labels to IDs for the labels found in this dataset.
Fields inherited from class org.tribuo.sequence.SequenceDataset
data, outputFactory, sourceProvenance, tribuoVersion
Fields inherited from interface org.tribuo.protos.ProtoSerializable
DESERIALIZATION_METHOD_NAME, PROVENANCE_SERIALIZER
-
Constructor Summary
Constructor | Description
MutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, OutputFactory<T> outputFactory) | Creates a dataset from a data source.
MutableSequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory) | Creates an empty sequence dataset.
MutableSequenceDataset(ImmutableSequenceDataset<T> dataset) | Copies the immutable dataset into a mutable dataset.
MutableSequenceDataset(SequenceDataSource<T> dataSource) | Builds a dataset from the supplied data source.
-
Method Summary
Modifier and Type | Method | Description
void | add(SequenceExample<T> ex) | Adds a SequenceExample to this dataset.
void | addAll(Collection<SequenceExample<T>> collection) | Adds all the SequenceExamples in the supplied collection to this dataset.
void | clear() | Clears all the examples out of this dataset, and flushes the FeatureMap, OutputInfo, and transform provenances.
void | densify() | Iterates through the examples, converting implicit zeros into explicit zeros.
static MutableSequenceDataset<?> | deserializeFromProto(int version, String className, com.google.protobuf.Any message) | Deserialization factory.
ImmutableFeatureMap | getFeatureIDMap() | An immutable view on the feature map.
FeatureMap | getFeatureMap() | The feature map.
ImmutableOutputInfo<T> | getOutputIDInfo() | An immutable view on the output info in this dataset.
OutputInfo<T> | getOutputInfo() | The output info in this dataset.
Set<T> | getOutputs() | Gets the set of labels that occur in the examples in this dataset.
DatasetProvenance | getProvenance() |
boolean | isDense() | Is the dataset dense (i.e., do all features in the domain have a value in each example).
org.tribuo.protos.core.SequenceDatasetProto | serialize() | Serializes this object to a protobuf.
String | toString() |
Methods inherited from class org.tribuo.sequence.SequenceDataset
castDataset, createDataCarrier, deserialize, deserializeExamples, deserializeFromFile, deserializeFromStream, getData, getExample, getFlatDataset, getOutputFactory, getSourceDescription, getSourceProvenance, iterator, serializeToFile, serializeToStream, size, validate
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
CURRENT_VERSION
public static final int CURRENT_VERSION
Protobuf serialization version.
-
outputInfo
protected final MutableOutputInfo<T> outputInfo
A map from labels to IDs for the labels found in this dataset.
-
featureMap
protected final MutableFeatureMap featureMap
A map from feature names to IDs for the features found in this dataset.
-
dense
protected boolean dense
Does this dataset have a dense feature space.
-
-
Constructor Details
-
MutableSequenceDataset
public MutableSequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
Creates an empty sequence dataset.
- Parameters:
sourceProvenance - A description of the input data, including preprocessing steps.
outputFactory - The output factory.
-
MutableSequenceDataset
public MutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
Creates a dataset from a data source. This constructor creates the output and feature ID maps that are needed for training and evaluating classifiers.
- Parameters:
dataSource - The input data.
sourceProvenance - A description of the data, including preprocessing steps.
outputFactory - The output factory.
-
MutableSequenceDataset
public MutableSequenceDataset(SequenceDataSource<T> dataSource)
Builds a dataset from the supplied data source.
- Parameters:
dataSource - The data source.
-
MutableSequenceDataset
public MutableSequenceDataset(ImmutableSequenceDataset<T> dataset)
Copies the immutable dataset into a mutable dataset.
This should be infrequently used and mostly exists for the ViterbiTrainer.
- Parameters:
dataset - The dataset to copy.
-
-
Method Details
-
deserializeFromProto
public static MutableSequenceDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
Deserialization factory.
- Parameters:
version - The serialized object version.
className - The class name.
message - The serialized data.
- Returns:
The deserialized object.
- Throws:
com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
-
clear
public void clear()
Clears all the examples out of this dataset, and flushes the FeatureMap, OutputInfo, and transform provenances.
-
add
public void add(SequenceExample<T> ex)
Adds a SequenceExample to this dataset.
It also canonicalises the reference to each feature's name (i.e., replacing the reference to a feature's name with the canonical one stored in this Dataset's VariableInfo). This greatly reduces the memory footprint.
- Parameters:
ex - The example to add.
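The memory saving from canonicalising feature names comes from many examples sharing one String instance per name instead of each holding its own copy. A minimal sketch of that idea in plain Java follows; it is not Tribuo's implementation, and the class `NameCanonicaliserSketch` and its `canonicalise` method are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration, not Tribuo's code: keeps one shared String instance per feature name. */
final class NameCanonicaliserSketch {
    // Maps each name to the first instance of that name that was seen.
    private final Map<String, String> canonicalNames = new HashMap<>();

    /** Returns the stored instance equal to name, storing name itself the first time it is seen. */
    String canonicalise(String name) {
        String existing = canonicalNames.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    public static void main(String[] args) {
        NameCanonicaliserSketch pool = new NameCanonicaliserSketch();
        // new String(...) forces two distinct objects with equal contents.
        String first = new String("word=the");
        String second = new String("word=the");
        String a = pool.canonicalise(first);
        String b = pool.canonicalise(second);
        System.out.println(first == second); // false: two separate objects
        System.out.println(a == b);          // true: both refer to the canonical instance
    }
}
```

Once every example refers to the canonical instance, the duplicate strings become unreachable and can be garbage collected.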
-
addAll
public void addAll(Collection<SequenceExample<T>> collection)
Adds all the SequenceExamples in the supplied collection to this dataset.
- Parameters:
collection - The collection of SequenceExamples.
-
getOutputs
public Set<T> getOutputs()
Description copied from class: SequenceDataset
Gets the set of labels that occur in the examples in this dataset.
- Specified by:
getOutputs in class SequenceDataset<T extends Output<T>>
- Returns:
The set of labels that occur in the examples in this dataset.
-
getFeatureIDMap
public ImmutableFeatureMap getFeatureIDMap()
Description copied from class: SequenceDataset
An immutable view on the feature map.
- Specified by:
getFeatureIDMap in class SequenceDataset<T extends Output<T>>
- Returns:
The feature map.
-
getFeatureMap
public FeatureMap getFeatureMap()
Description copied from class: SequenceDataset
The feature map.
- Specified by:
getFeatureMap in class SequenceDataset<T extends Output<T>>
- Returns:
The feature map.
-
getOutputIDInfo
public ImmutableOutputInfo<T> getOutputIDInfo()
Description copied from class: SequenceDataset
An immutable view on the output info in this dataset.
- Specified by:
getOutputIDInfo in class SequenceDataset<T extends Output<T>>
- Returns:
The output info.
-
getOutputInfo
public OutputInfo<T> getOutputInfo()
Description copied from class: SequenceDataset
The output info in this dataset.
- Specified by:
getOutputInfo in class SequenceDataset<T extends Output<T>>
- Returns:
The output info.
-
isDense
public boolean isDense()
Is the dataset dense (i.e., do all features in the domain have a value in each example).
- Returns:
True if the dataset is dense.
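To make the notion of density concrete, the sketch below models examples as sparse maps from feature name to value and shows both a density check and the conversion of implicit zeros into explicit ones, which is what densify() is documented to do. It is an illustration under stated assumptions rather than Tribuo's code: the class `DensifySketch`, the map-based example representation, and returning copies instead of modifying the examples in place are all choices made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration, not Tribuo's code: density check and densification on sparse maps. */
final class DensifySketch {
    /** True if every example has a value for every feature in the domain. */
    static boolean isDense(List<Map<String, Double>> examples, Set<String> domain) {
        for (Map<String, Double> example : examples) {
            if (!example.keySet().containsAll(domain)) {
                return false;
            }
        }
        return true;
    }

    /** Returns copies of the examples with an explicit 0.0 for each missing domain feature. */
    static List<Map<String, Double>> densify(List<Map<String, Double>> examples, Set<String> domain) {
        List<Map<String, Double>> output = new ArrayList<>();
        for (Map<String, Double> example : examples) {
            Map<String, Double> dense = new TreeMap<>(example);
            for (String name : domain) {
                dense.putIfAbsent(name, 0.0); // implicit zero becomes explicit
            }
            output.add(dense);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> examples = List.of(
            Map.of("a", 1.0),
            Map.of("a", 2.0, "b", 3.0));
        // The domain is the union of all feature names seen in the examples.
        Set<String> domain = new TreeSet<>();
        for (Map<String, Double> example : examples) {
            domain.addAll(example.keySet());
        }
        System.out.println(isDense(examples, domain));      // false: first example lacks "b"
        List<Map<String, Double>> dense = densify(examples, domain);
        System.out.println(isDense(dense, domain));         // true
        System.out.println(dense.get(0));                   // {a=1.0, b=0.0}
    }
}
```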
-
densify
public void densify()
Iterates through the examples, converting implicit zeros into explicit zeros.
-
toString
public String toString()
- Overrides:
toString in class SequenceDataset<T extends Output<T>>
-
getProvenance
public DatasetProvenance getProvenance()
-
serialize
public org.tribuo.protos.core.SequenceDatasetProto serialize()
Description copied from interface: ProtoSerializable
Serializes this object to a protobuf.
- Returns:
The protobuf.
-