Package org.tribuo
Class ImmutableDataset<T extends Output<T>>
java.lang.Object
org.tribuo.Dataset<T>
org.tribuo.ImmutableDataset<T>
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>, ProtoSerializable<org.tribuo.protos.core.DatasetProto>
- Direct Known Subclasses:
DatasetView, MinimumCardinalityDataset, SelectedFeatureDataset
This is a Dataset which has an ImmutableFeatureMap to store the feature information.
Whenever an example is added to this dataset, any of its features that do not exist in the FeatureMap are removed.
The dataset is immutable after construction (unless the examples are modified).
This class is mostly for performance optimisations inside the framework, and should not generally be used by external code.
-
Field Summary
Modifier and Type | Field | Description
static final int | CURRENT_VERSION | Protobuf serialization version.
protected final boolean | dropInvalidExamples | If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
protected ImmutableFeatureMap | featureIDMap | A map from feature names to IDs for the features found in this dataset.
protected ImmutableOutputInfo<T> | outputIDInfo | Output information, and id numbers for outputs found in this dataset.
Fields inherited from class org.tribuo.Dataset
data, indices, outputFactory, sourceProvenance, tribuoVersion
Fields inherited from interface org.tribuo.protos.ProtoSerializable
DESERIALIZATION_METHOD_NAME, PROVENANCE_SERIALIZER
-
Constructor Summary
Modifier | Constructor | Description
public | ImmutableDataset(Iterable<Example<T>> dataSource, DataProvenance description, OutputFactory<T> outputFactory, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, boolean dropInvalidExamples) | Creates a dataset from a data source.
public | ImmutableDataset(Iterable<Example<T>> dataSource, DataProvenance description, OutputFactory<T> outputFactory, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, boolean dropInvalidExamples) | Creates a dataset from a data source.
public | ImmutableDataset(DataSource<T> dataSource, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, boolean dropInvalidExamples) | Creates a dataset from a data source.
public | ImmutableDataset(DataSource<T> dataSource, Model<T> model, boolean dropInvalidExamples) | Creates a dataset from a data source.
protected | ImmutableDataset(DataProvenance description, OutputFactory<T> outputFactory) | If you call this it's your job to set up outputMap, featureIDMap and fill it with examples.
protected | ImmutableDataset(DataProvenance provenance, OutputFactory<T> factory, String tribuoVersion, ImmutableFeatureMap fmap, ImmutableOutputInfo<T> outputInfo, List<Example<T>> examples, boolean dropInvalidExamples) | Deserialization constructor.
protected | ImmutableDataset(DataProvenance description, OutputFactory<T> outputFactory, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo) | This is dangerous, and should not be used unless you've overridden everything in ImmutableDataset.
-
Method Summary
Modifier and Type | Method | Description
protected void | add(Example<T> ex) | Adds an Example to the dataset, which will remove features with unknown names.
protected void | add(Example<T> ex, Merger merger) | Adds an Example to the dataset, which will insert feature ids, remove unknown features and sort the examples by the feature ids (merging duplicate ids).
static <T extends Output<T>> ImmutableDataset<T> | copyDataset(Dataset<T> dataset) | Creates an immutable deep copy of the supplied dataset.
static <T extends Output<T>> ImmutableDataset<T> | copyDataset(Dataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo) | Creates an immutable deep copy of the supplied dataset, using a different feature and output map.
static <T extends Output<T>> ImmutableDataset<T> | copyDataset(Dataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, Merger merger) | Creates an immutable deep copy of the supplied dataset.
static ImmutableDataset<?> | deserializeFromProto(int version, String className, com.google.protobuf.Any message) | Deserialization factory.
boolean | getDropInvalidExamples() | Returns true if this immutable dataset dropped any invalid examples on construction.
ImmutableFeatureMap | getFeatureIDMap() | Returns or generates an ImmutableFeatureMap.
FeatureMap | getFeatureMap() | Returns this dataset's FeatureMap.
ImmutableOutputInfo<T> | getOutputIDInfo() | Returns or generates an ImmutableOutputInfo.
OutputInfo<T> | getOutputInfo() | Returns this dataset's OutputInfo.
Set<T> | getOutputs() | Gets the set of outputs that occur in the examples in this dataset.
DatasetProvenance | getProvenance() |
static <T extends Output<T>> ImmutableDataset<T> | hashFeatureMap(Dataset<T> dataset, Hasher hasher) | Creates an immutable shallow copy of the supplied dataset, using the hasher to generate a HashedFeatureMap which transparently maps from the feature name to the hashed variant.
org.tribuo.protos.core.DatasetProto | serialize() | Serializes this object to a protobuf.
String | toString() |
Methods inherited from class org.tribuo.Dataset
castDataset, createDataCarrier, createDataCarrier, createTransformers, createTransformers, deserialize, deserializeExamples, deserializeFromFile, deserializeFromStream, getData, getExample, getOutputFactory, getSourceDescription, getSourceProvenance, iterator, serializeToFile, serializeToStream, shuffle, size, validate
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
CURRENT_VERSION
public static final int CURRENT_VERSION
Protobuf serialization version.
-
outputIDInfo
protected ImmutableOutputInfo<T> outputIDInfo
Output information, and id numbers for outputs found in this dataset.
-
featureIDMap
protected ImmutableFeatureMap featureIDMap
A map from feature names to IDs for the features found in this dataset.
-
dropInvalidExamples
protected final boolean dropInvalidExamples
If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
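The effect of this flag can be sketched with plain Java collections. This is a simplified model and not Tribuo code: the class DropInvalidSketch, the method build, and the rule that an example counts as invalid once no known features remain are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

public class DropInvalidSketch {
    private static final Logger logger = Logger.getLogger(DropInvalidSketch.class.getName());

    /**
     * Filters each example (modelled as a list of feature names) down to the known
     * feature names. ASSUMPTION: an example with no features left is "invalid".
     */
    public static List<List<String>> build(List<List<String>> examples,
                                           Set<String> knownFeatures,
                                           boolean dropInvalidExamples) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> example : examples) {
            List<String> filtered = new ArrayList<>(example);
            filtered.retainAll(knownFeatures); // remove unknown feature names
            if (!filtered.isEmpty()) {
                kept.add(filtered);
            } else if (dropInvalidExamples) {
                // Flag set: log a warning and skip the example.
                logger.warning("Dropping invalid example: " + example);
            } else {
                // Flag unset: fail fast instead.
                throw new IllegalArgumentException("Invalid example: " + example);
            }
        }
        return kept;
    }
}
```

With the flag set to true the invalid example is skipped and construction continues; with false the same input raises an exception.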
-
-
Constructor Details
-
ImmutableDataset
protected ImmutableDataset(DataProvenance description, OutputFactory<T> outputFactory)
If you call this it's your job to set up outputMap, featureIDMap and fill it with examples.
Note: Sets dropInvalidExamples to false.
- Parameters:
description - A description of the input data (including preprocessing steps).
outputFactory - The factory for this output type.
-
ImmutableDataset
public ImmutableDataset(DataSource<T> dataSource, Model<T> model, boolean dropInvalidExamples)
Creates a dataset from a data source. It copies the feature and output maps from the supplied model.
- Parameters:
dataSource - The examples.
model - A model to extract feature and output maps from.
dropInvalidExamples - If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
-
ImmutableDataset
public ImmutableDataset(DataSource<T> dataSource, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, boolean dropInvalidExamples)
Creates a dataset from a data source. Creates immutable feature and output maps from the supplied ones.
- Parameters:
dataSource - The examples.
featureIDMap - The feature map.
outputIDInfo - The output map.
dropInvalidExamples - If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
-
ImmutableDataset
public ImmutableDataset(Iterable<Example<T>> dataSource, DataProvenance description, OutputFactory<T> outputFactory, FeatureMap featureIDMap, OutputInfo<T> outputIDInfo, boolean dropInvalidExamples)
Creates a dataset from a data source. Creates immutable feature and output maps from the supplied ones.
- Parameters:
dataSource - The examples.
description - A description of the input data (including preprocessing steps).
outputFactory - The output factory.
featureIDMap - The feature id map, used to remove unknown features.
outputIDInfo - The output id map.
dropInvalidExamples - If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
-
ImmutableDataset
public ImmutableDataset(Iterable<Example<T>> dataSource, DataProvenance description, OutputFactory<T> outputFactory, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, boolean dropInvalidExamples)
Creates a dataset from a data source.
- Parameters:
dataSource - The examples.
description - A description of the input data (including preprocessing steps).
outputFactory - The factory for this output type.
featureIDMap - The feature id map, used to remove unknown features.
outputIDInfo - The output id map.
dropInvalidExamples - If true, instead of throwing an exception when an invalid Example is encountered, this Dataset will log a warning and drop it.
-
ImmutableDataset
protected ImmutableDataset(DataProvenance description, OutputFactory<T> outputFactory, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
This is dangerous, and should not be used unless you've overridden everything in ImmutableDataset.
Note: Sets dropInvalidExamples to false.
- Parameters:
description - A description of the data you're going to add to this dataset.
outputFactory - The factory for this output type.
featureIDMap - The feature id map, used to remove unknown features.
outputIDInfo - The output id map.
-
ImmutableDataset
protected ImmutableDataset(DataProvenance provenance, OutputFactory<T> factory, String tribuoVersion, ImmutableFeatureMap fmap, ImmutableOutputInfo<T> outputInfo, List<Example<T>> examples, boolean dropInvalidExamples)
Deserialization constructor.
- Parameters:
provenance - The source provenance.
factory - The output factory.
tribuoVersion - The Tribuo version.
fmap - The feature id map.
outputInfo - The output id info.
examples - The examples.
dropInvalidExamples - Should invalid examples be dropped when added?
-
-
Method Details
-
deserializeFromProto
public static ImmutableDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
Deserialization factory.
- Parameters:
version - The serialized object version.
className - The class name.
message - The serialized data.
- Returns:
The deserialized object.
- Throws:
com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
-
add
protected void add(Example<T> ex)
Adds an Example to the dataset, which will remove features with unknown names.
- Parameters:
ex - An Example to add to the dataset.
-
add
protected void add(Example<T> ex, Merger merger)
Adds an Example to the dataset, which will insert feature ids, remove unknown features and sort the examples by the feature ids (merging duplicate ids).
- Parameters:
ex - The example to add.
merger - The Merger to use.
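The id insertion, unknown-feature removal, sorting and merging described above can be modelled with standard Java collections. The sketch below is an illustration only and does not use Tribuo types: the class AddWithMergerSketch, the method convert, and the use of a DoubleBinaryOperator as a stand-in for a Merger are hypothetical names chosen for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

public class AddWithMergerSketch {
    /**
     * Converts named features into id-indexed values.
     * Names missing from featureIds are removed, the result is ordered by id,
     * and values that land on the same id are combined with the merger.
     */
    public static TreeMap<Integer, Double> convert(List<Map.Entry<String, Double>> features,
                                                   Map<String, Integer> featureIds,
                                                   DoubleBinaryOperator merger) {
        TreeMap<Integer, Double> byId = new TreeMap<>(); // TreeMap keeps ids sorted
        for (Map.Entry<String, Double> feature : features) {
            Integer id = featureIds.get(feature.getKey());
            if (id == null) {
                continue; // unknown feature name: dropped
            }
            // Merge with any value already stored under this id.
            byId.merge(id, feature.getValue(), (a, b) -> merger.applyAsDouble(a, b));
        }
        return byId;
    }

    public static void main(String[] args) {
        // "weight" and "mass" deliberately share id 1 to show a duplicate id.
        Map<String, Integer> ids = Map.of("height", 0, "weight", 1, "mass", 1);
        List<Map.Entry<String, Double>> features = List.of(
                Map.entry("weight", 2.0), Map.entry("colour", 9.0),
                Map.entry("height", 1.0), Map.entry("mass", 3.0));
        System.out.println(convert(features, ids, Double::sum)); // prints {0=1.0, 1=5.0}
    }
}
```

Here "colour" is removed because it has no id, and the two values that map to id 1 are summed by the merger.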
-
getOutputs
public Set<T> getOutputs()
Description copied from class: Dataset
Gets the set of outputs that occur in the examples in this dataset.
- Specified by:
getOutputs in class Dataset<T extends Output<T>>
- Returns:
the set of outputs that occur in the examples in this dataset.
-
getFeatureIDMap
public ImmutableFeatureMap getFeatureIDMap()
Description copied from class: Dataset
Returns or generates an ImmutableFeatureMap.
- Specified by:
getFeatureIDMap in class Dataset<T extends Output<T>>
- Returns:
An immutable feature map with id numbers.
-
getFeatureMap
public FeatureMap getFeatureMap()
Description copied from class: Dataset
Returns this dataset's FeatureMap.
- Specified by:
getFeatureMap in class Dataset<T extends Output<T>>
- Returns:
The feature map from this dataset.
-
getOutputIDInfo
public ImmutableOutputInfo<T> getOutputIDInfo()
Description copied from class: Dataset
Returns or generates an ImmutableOutputInfo.
- Specified by:
getOutputIDInfo in class Dataset<T extends Output<T>>
- Returns:
An immutable output info.
-
getOutputInfo
public OutputInfo<T> getOutputInfo()
Description copied from class: Dataset
Returns this dataset's OutputInfo.
- Specified by:
getOutputInfo in class Dataset<T extends Output<T>>
- Returns:
The output info.
-
getDropInvalidExamples
public boolean getDropInvalidExamples()
Returns true if this immutable dataset dropped any invalid examples on construction.
- Returns:
True if it drops invalid examples.
-
toString
-
getProvenance
-
serialize
public org.tribuo.protos.core.DatasetProto serialize()
Description copied from interface: ProtoSerializable
Serializes this object to a protobuf.
- Specified by:
serialize in interface ProtoSerializable<T extends Output<T>>
- Returns:
The protobuf.
-
copyDataset
public static <T extends Output<T>> ImmutableDataset<T> copyDataset(Dataset<T> dataset)
Creates an immutable deep copy of the supplied dataset.
- Type Parameters:
T - The type of output.
- Parameters:
dataset - The dataset to copy.
- Returns:
An immutable copy of the dataset.
-
copyDataset
public static <T extends Output<T>> ImmutableDataset<T> copyDataset(Dataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo)
Creates an immutable deep copy of the supplied dataset, using a different feature and output map.
- Type Parameters:
T - The type of output.
- Parameters:
dataset - The dataset to copy.
featureIDMap - The new feature map to use. Removes features which are not found in this map.
outputIDInfo - The new output info to use.
- Returns:
An immutable copy of the dataset.
-
copyDataset
public static <T extends Output<T>> ImmutableDataset<T> copyDataset(Dataset<T> dataset, ImmutableFeatureMap featureIDMap, ImmutableOutputInfo<T> outputIDInfo, Merger merger)
Creates an immutable deep copy of the supplied dataset.
- Type Parameters:
T - The type of output.
- Parameters:
dataset - The dataset to copy.
featureIDMap - The new feature map to use. Removes features which are not found in this map.
outputIDInfo - The new output info to use.
merger - The merge function to use to reduce features given new ids.
- Returns:
An immutable copy of the dataset.
-
hashFeatureMap
public static <T extends Output<T>> ImmutableDataset<T> hashFeatureMap(Dataset<T> dataset, Hasher hasher)
Creates an immutable shallow copy of the supplied dataset, using the hasher to generate a HashedFeatureMap which transparently maps from the feature name to the hashed variant.
- Type Parameters:
T - The type of output.
- Parameters:
dataset - The dataset to copy.
hasher - The hashing function to use.
- Returns:
An immutable copy of the dataset.
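The idea of a feature map that transparently hashes names on lookup can be illustrated with a small stand-alone model. Nothing below is Tribuo code: HashedLookupSketch, getID and storesName are hypothetical names, the UnaryOperator stands in for a Hasher, and assigning ids in sorted order of the hashed names is an assumption of this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

public class HashedLookupSketch {
    private final UnaryOperator<String> hasher;
    // Only hashed names are stored; the original names are not kept.
    private final TreeMap<String, Integer> idsByHashedName = new TreeMap<>();

    public HashedLookupSketch(Iterable<String> featureNames, UnaryOperator<String> hasher) {
        this.hasher = hasher;
        for (String name : featureNames) {
            idsByHashedName.putIfAbsent(hasher.apply(name), 0);
        }
        // ASSUMPTION: ids follow the sorted order of the hashed names.
        int nextId = 0;
        for (Map.Entry<String, Integer> entry : idsByHashedName.entrySet()) {
            entry.setValue(nextId++);
        }
    }

    /** Hashes the plain name before the lookup, so callers never see the hash. Returns -1 if unknown. */
    public int getID(String name) {
        Integer id = idsByHashedName.get(hasher.apply(name));
        return id == null ? -1 : id;
    }

    /** True only if the given string is stored as a key, i.e. it is already a hashed name. */
    public boolean storesName(String key) {
        return idsByHashedName.containsKey(key);
    }

    public static void main(String[] args) {
        // Toy, collision-free stand-in for a real hash function.
        UnaryOperator<String> toyHasher = s -> "#" + new StringBuilder(s).reverse();
        HashedLookupSketch map = new HashedLookupSketch(List.of("weight", "height"), toyHasher);
        System.out.println(map.getID("height") + " " + map.getID("colour") + " " + map.storesName("height"));
    }
}
```

Callers look features up by their plain names, while the map itself holds only the hashed variants.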
-