Package org.tribuo.sequence
Class ImmutableSequenceDataset<T extends Output<T>>
java.lang.Object
org.tribuo.sequence.SequenceDataset<T>
org.tribuo.sequence.ImmutableSequenceDataset<T>
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>
- Direct Known Subclasses:
MinimumCardinalitySequenceDataset
public class ImmutableSequenceDataset<T extends Output<T>>
extends SequenceDataset<T>
implements Serializable
This is a SequenceDataset which has an ImmutableFeatureMap to store the feature information.
Whenever an example is added to this dataset, any of its features that do not exist in the FeatureMap are removed.
The dataset is immutable after construction (unless the examples are modified).
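The caveat in parentheses can be illustrated with a minimal sketch in plain Java. This is not Tribuo code: `ShallowImmutabilitySketch` and `MutableExample` are hypothetical stand-ins, and the sketch assumes only that the collection of examples is guarded while the example objects themselves remain mutable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShallowImmutabilitySketch {
    /** Hypothetical stand-in for an example whose contents can still change. */
    public static final class MutableExample {
        public double value;
        public MutableExample(double value) { this.value = value; }
    }

    /** Returns a read-only view over a defensive copy of the list; the elements are shared. */
    public static List<MutableExample> freeze(List<MutableExample> examples) {
        return Collections.unmodifiableList(new ArrayList<>(examples));
    }

    /** Returns true if the frozen list refuses a structural change. */
    public static boolean rejectsAdd(List<MutableExample> frozen) {
        try {
            frozen.add(new MutableExample(0.0));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<MutableExample> frozen = freeze(List.of(new MutableExample(1.0)));
        System.out.println("add rejected: " + rejectsAdd(frozen));
        // Still allowed: the list is read-only, but the element it holds is not.
        frozen.get(0).value = 5.0;
        System.out.println("value now: " + frozen.get(0).value);
    }
}
```

Under that assumption, code that needs a stable snapshot should avoid mutating examples it has already handed to a dataset.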
-
Field Summary
static final int CURRENT_VERSION
- Protobuf serialization version.
protected ImmutableFeatureMap featureIDMap
- A map from feature names to IDs for the features found in this dataset.
protected ImmutableOutputInfo<T> outputIDInfo
- A map from labels to IDs for the labels found in this dataset.
Fields inherited from class org.tribuo.sequence.SequenceDataset
data, outputFactory, sourceProvenance, tribuoVersion
Fields inherited from interface org.tribuo.protos.ProtoSerializable
DESERIALIZATION_METHOD_NAME, PROVENANCE_SERIALIZER
-
Constructor Summary
ImmutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, OutputFactory<T> outputFactory)
- Creates a dataset from a data source.
ImmutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, OutputFactory<T> outputFactory)
- Creates a dataset from a data source.
protected ImmutableSequenceDataset(DataProvenance sourceProvenance, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
- This is dangerous, and should not be used unless you've overridden everything in ImmutableSequenceDataset.
protected ImmutableSequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
- If you call this it's your job to set up outputIDInfo and featureIDMap.
protected ImmutableSequenceDataset(DataProvenance provenance, OutputFactory<T> factory, String tribuoVersion, ImmutableFeatureMap fmap, ImmutableOutputInfo<T> outputInfo, List<SequenceExample<T>> examples)
- Deserialization constructor.
ImmutableSequenceDataset(SequenceDataSource<T> dataSource, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo)
- Creates a dataset from a data source, using the specified output and feature domains.
ImmutableSequenceDataset(SequenceDataSource<T> dataSource, SequenceModel<T> model)
- Creates a dataset from a data source, taking the output and feature domains from the supplied model.
-
Method Summary
protected void add(SequenceExample<T> ex)
- Adds a SequenceExample to the dataset, which will insert feature ids, remove unknown features and sort the examples by the feature ids.
protected void add(SequenceExample<T> ex, Merger merger)
- Adds a SequenceExample to the dataset, which will insert feature ids, remove unknown features and sort the examples by the feature ids.
static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset)
- Creates an immutable deep copy of the supplied dataset.
static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
- Creates an immutable deep copy of the supplied dataset, using a different feature and output map.
static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, Merger merger)
- Creates an immutable deep copy of the supplied dataset.
static ImmutableSequenceDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message)
- Deserialization factory.
getFeatureIDMap()
- An immutable view on the feature map.
getFeatureMap()
- The feature map.
getOutputIDInfo()
- An immutable view on the output info in this dataset.
getOutputInfo()
- The output info in this dataset.
getOutputs()
- Gets the set of labels that occur in the examples in this dataset.
getProvenance()
org.tribuo.protos.core.SequenceDatasetProto serialize()
- Serializes this object to a protobuf.
toString()
Methods inherited from class org.tribuo.sequence.SequenceDataset
castDataset, createDataCarrier, deserialize, deserializeExamples, deserializeFromFile, deserializeFromStream, getData, getExample, getFlatDataset, getOutputFactory, getSourceDescription, getSourceProvenance, iterator, serializeToFile, serializeToStream, size, validate
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
CURRENT_VERSION
public static final int CURRENT_VERSION
Protobuf serialization version.
-
outputIDInfo
A map from labels to IDs for the labels found in this dataset.
-
featureIDMap
A map from feature names to IDs for the features found in this dataset.
-
-
Constructor Details
-
ImmutableSequenceDataset
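protected ImmutableSequenceDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
If you call this it's your job to set up outputIDInfo and featureIDMap.
- Parameters:
sourceProvenance
- A description of the dataset including preprocessing steps.
outputFactory
- The output factory.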
-
ImmutableSequenceDataset
public ImmutableSequenceDataset(SequenceDataSource<T> dataSource, SequenceModel<T> model)
Creates a dataset from a data source, taking the output and feature domains from the supplied model.
- Parameters:
dataSource
- The input data.
model
- The model to use for the feature and output domains.
-
ImmutableSequenceDataset
public ImmutableSequenceDataset(SequenceDataSource<T> dataSource, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo)
Creates a dataset from a data source, using the specified output and feature domains.
- Parameters:
dataSource
- The input data.
featureIDMap
- The feature domain.
outputIDInfo
- The output domain.
-
ImmutableSequenceDataset
public ImmutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, OutputFactory<T> outputFactory)
Creates a dataset from a data source. This method will create the output and feature ID maps that are needed for training and evaluating classifiers.
- Parameters:
dataSource
- The input data.
sourceProvenance
- A description of the data.
featureIDMap
- The feature map, used to remove unknown features.
outputIDInfo
- The output map.
outputFactory
- The output factory.
-
ImmutableSequenceDataset
public ImmutableSequenceDataset(Iterable<SequenceExample<T>> dataSource, DataProvenance sourceProvenance, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, OutputFactory<T> outputFactory)
Creates a dataset from a data source.
- Parameters:
dataSource
- The input data.
sourceProvenance
- A description of the data.
featureIDMap
- The feature id map, used to remove unknown features.
outputIDInfo
- The output id map.
outputFactory
- The output factory.
-
ImmutableSequenceDataset
protected ImmutableSequenceDataset(DataProvenance sourceProvenance, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
This is dangerous, and should not be used unless you've overridden everything in ImmutableSequenceDataset.
- Parameters:
sourceProvenance
- A description of the data, including all preprocessing.
featureIDMap
- The feature id map, used to remove unknown features.
outputIDInfo
- The output id map.
-
ImmutableSequenceDataset
protected ImmutableSequenceDataset(DataProvenance provenance, OutputFactory<T> factory, String tribuoVersion, ImmutableFeatureMap fmap, ImmutableOutputInfo<T> outputInfo, List<SequenceExample<T>> examples)
Deserialization constructor.
- Parameters:
provenance
- The source provenance.
factory
- The output factory.
tribuoVersion
- The Tribuo version.
fmap
- The feature id map.
outputInfo
- The output id info.
examples
- The examples.
-
-
Method Details
-
deserializeFromProto
public static ImmutableSequenceDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
Deserialization factory.
- Parameters:
version
- The serialized object version.
className
- The class name.
message
- The serialized data.
- Returns:
- The deserialized object.
- Throws:
com.google.protobuf.InvalidProtocolBufferException
- If the protobuf could not be parsed from the message.
-
add
protected void add(SequenceExample<T> ex)
Adds a SequenceExample to the dataset, which will insert feature ids, remove unknown features and sort the examples by the feature ids.
- Parameters:
ex
- The example to add.
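The three steps named in this description (insert ids, remove unknown features, sort by id) can be sketched in plain Java. This is an illustration of the idea, not Tribuo's implementation: `AddExampleSketch`, `toIdFeatures`, and the `Map`-based feature domain are hypothetical stand-ins, and the sketch reads "sort by the feature ids" as ordering the features within an example, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class AddExampleSketch {
    /**
     * Maps named features to (id, value) pairs using a fixed domain.
     * Features whose names are not in the domain are dropped, and the
     * surviving features are ordered by ascending id.
     */
    public static List<Map.Entry<Integer, Double>> toIdFeatures(
            List<Map.Entry<String, Double>> features, Map<String, Integer> domain) {
        List<Map.Entry<Integer, Double>> out = new ArrayList<>();
        for (Map.Entry<String, Double> f : features) {
            Integer id = domain.get(f.getKey());
            if (id != null) { // unknown feature names have no id and are skipped
                out.add(Map.entry(id, f.getValue()));
            }
        }
        out.sort(Map.Entry.comparingByKey()); // order by feature id
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical fixed feature domain: name -> id.
        Map<String, Integer> domain = Map.of("a", 0, "b", 1, "c", 2);
        List<Map.Entry<String, Double>> features = List.of(
                Map.entry("c", 3.0), Map.entry("zzz", 9.0), Map.entry("a", 1.0));
        System.out.println(toIdFeatures(features, domain));
    }
}
```

The domain is only read, never extended, which mirrors the role the description gives to the feature map.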
-
add
protected void add(SequenceExample<T> ex, Merger merger)
Adds a SequenceExample to the dataset, which will insert feature ids, remove unknown features and sort the examples by the feature ids.
- Parameters:
ex
- The example to add.
merger
- The merger to use to remove duplicate features.
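To show what "remove duplicate features" can mean once features carry ids, here is a minimal plain-Java sketch. It is not Tribuo code: `MergeDuplicatesSketch` and `mergeById` are hypothetical names, and a `DoubleBinaryOperator` is used as a stand-in for the merger, on the assumption that a merger combines two values for the same feature into one.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

public class MergeDuplicatesSketch {
    /**
     * Collapses features that share an id into a single value by applying
     * the merger pairwise, in encounter order. The result is keyed and
     * ordered by feature id.
     */
    public static SortedMap<Integer, Double> mergeById(
            List<Map.Entry<Integer, Double>> idFeatures, DoubleBinaryOperator merger) {
        TreeMap<Integer, Double> merged = new TreeMap<>();
        for (Map.Entry<Integer, Double> f : idFeatures) {
            // First occurrence is stored as-is; later ones are combined with it.
            merged.merge(f.getKey(), f.getValue(), (a, b) -> merger.applyAsDouble(a, b));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Double>> features = List.of(
                Map.entry(2, 1.0), Map.entry(0, 4.0), Map.entry(2, 2.5));
        System.out.println("sum: " + mergeById(features, Double::sum));
        System.out.println("max: " + mergeById(features, Math::max));
    }
}
```

Which operator is appropriate (sum, max, keep-first, and so on) depends on what the feature values represent.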
-
getOutputs
Description copied from class: SequenceDataset
Gets the set of labels that occur in the examples in this dataset.
- Specified by:
getOutputs in class SequenceDataset<T extends Output<T>>
- Returns:
- The set of labels that occur in the examples in this dataset.
-
getFeatureIDMap
Description copied from class: SequenceDataset
An immutable view on the feature map.
- Specified by:
getFeatureIDMap in class SequenceDataset<T extends Output<T>>
- Returns:
- The feature map.
-
getFeatureMap
Description copied from class: SequenceDataset
The feature map.
- Specified by:
getFeatureMap in class SequenceDataset<T extends Output<T>>
- Returns:
- The feature map.
-
getOutputIDInfo
Description copied from class: SequenceDataset
An immutable view on the output info in this dataset.
- Specified by:
getOutputIDInfo in class SequenceDataset<T extends Output<T>>
- Returns:
- The output info.
-
getOutputInfo
Description copied from class: SequenceDataset
The output info in this dataset.
- Specified by:
getOutputInfo in class SequenceDataset<T extends Output<T>>
- Returns:
- The output info.
-
toString
- Overrides:
toString in class SequenceDataset<T extends Output<T>>
-
serialize
public org.tribuo.protos.core.SequenceDatasetProto serialize()
Description copied from interface: ProtoSerializable
Serializes this object to a protobuf.
- Specified by:
serialize in interface ProtoSerializable<T extends Output<T>>
- Returns:
- The protobuf.
-
getProvenance
-
copyDataset
public static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset)
Creates an immutable deep copy of the supplied dataset.
- Type Parameters:
T
- The type of output.
- Parameters:
dataset
- The dataset to copy.
- Returns:
- An immutable copy of the dataset.
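What "deep copy" buys over a shallow one can be shown with a minimal plain-Java sketch. It is not Tribuo code: `DeepCopySketch`, `Example`, and `deepCopy` are hypothetical stand-ins, assuming only that a deep copy duplicates each example rather than sharing references with the source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DeepCopySketch {
    /** Hypothetical stand-in for an example; real examples hold features and an output. */
    public static final class Example {
        public double value;
        public Example(double value) { this.value = value; }
        public Example copy() { return new Example(value); }
    }

    /** Copies every element, then wraps the new list in a read-only view. */
    public static List<Example> deepCopy(List<Example> source) {
        List<Example> copies = new ArrayList<>(source.size());
        for (Example e : source) {
            copies.add(e.copy()); // a new object per element, nothing is shared
        }
        return Collections.unmodifiableList(copies);
    }

    public static void main(String[] args) {
        List<Example> original = new ArrayList<>();
        original.add(new Example(1.0));
        List<Example> copy = deepCopy(original);

        // Changes to the source after copying do not reach the copy.
        original.get(0).value = 42.0;
        original.add(new Example(2.0));
        System.out.println("copy size: " + copy.size());
        System.out.println("copy value: " + copy.get(0).value);
    }
}
```

A shallow copy would share the `Example` objects, so the later mutation of the original would be visible through the copy as well.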
-
copyDataset
public static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
Creates an immutable deep copy of the supplied dataset, using a different feature and output map.
- Type Parameters:
T
- The type of output.
- Parameters:
dataset
- The dataset to copy.
featureIDMap
- The new feature map to use. Removes features which are not found in this map.
outputIDInfo
- The new output info to use.
- Returns:
- An immutable copy of the dataset.
-
copyDataset
public static <T extends Output<T>> ImmutableSequenceDataset<T> copyDataset(SequenceDataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, Merger merger)
Creates an immutable deep copy of the supplied dataset.
- Type Parameters:
T
- The type of output.
- Parameters:
dataset
- The dataset to copy.
featureIDMap
- The new feature map to use. Removes features which are not found in this map.
outputIDInfo
- The new output info to use.
merger
- The merge function to use to reduce features given new ids.
- Returns:
- An immutable copy of the dataset.
-