Package org.tribuo
Class MutableDataset<T extends Output<T>>
java.lang.Object
org.tribuo.Dataset<T>
org.tribuo.MutableDataset<T>
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>, ProtoSerializable<org.tribuo.protos.core.DatasetProto>
A MutableDataset is a Dataset with a MutableFeatureMap which grows over time.
Whenever an Example is added to the dataset, it observes each feature and output, keeping appropriate statistics in the FeatureMap and OutputInfo.
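The "observe on add" behaviour described above can be pictured with a small stand-alone model. This is only a sketch using JDK collections: the class ObservingDatasetSketch, its Stats type, and the choice of statistics (count and mean) are hypothetical illustrations, not Tribuo's FeatureMap or OutputInfo implementations.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-in (not a Tribuo class) for a dataset that updates statistics as examples arrive. */
final class ObservingDatasetSketch {
    /** Running count and sum for one feature; a hypothetical simplification of per-feature info. */
    static final class Stats {
        int count;
        double sum;

        double mean() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    private final Map<String, Stats> featureStats = new LinkedHashMap<>();
    private final Map<String, Integer> outputCounts = new LinkedHashMap<>();
    private int size = 0;

    /** Records the example's output and every feature value it carries. */
    void add(String output, Map<String, Double> features) {
        outputCounts.merge(output, 1, Integer::sum);
        for (Map.Entry<String, Double> e : features.entrySet()) {
            // A feature seen for the first time grows the feature map.
            Stats s = featureStats.computeIfAbsent(e.getKey(), k -> new Stats());
            s.count++;
            s.sum += e.getValue();
        }
        size++;
    }

    int size() {
        return size;
    }

    int featureCount(String name) {
        Stats s = featureStats.get(name);
        return s == null ? 0 : s.count;
    }

    double featureMean(String name) {
        Stats s = featureStats.get(name);
        return s == null ? 0.0 : s.mean();
    }

    int outputCount(String output) {
        return outputCounts.getOrDefault(output, 0);
    }
}
```

In this model the set of known features and outputs is never fixed up front; it is whatever has been observed so far, which is the sense in which the feature map "grows over time".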
-
Field Summary
Modifier and Type | Field | Description
static final int | CURRENT_VERSION | Protobuf serialization version.
protected boolean | dense | Denotes if this dataset contains implicit zeros or not.
protected final MutableFeatureMap | featureMap | A map from feature names to feature info objects.
protected final MutableOutputInfo<T> | outputMap | Information about the outputs in this dataset.
protected final List<com.oracle.labs.mlrg.olcut.provenance.ObjectProvenance> | transformProvenances | The provenances of the transformations applied to this dataset.
Fields inherited from class org.tribuo.Dataset
data, indices, outputFactory, sourceProvenance, tribuoVersion
Fields inherited from interface org.tribuo.protos.ProtoSerializable
DESERIALIZATION_METHOD_NAME, PROVENANCE_SERIALIZER
-
Constructor Summary
Constructor | Description
MutableDataset(Iterable<Example<T>> dataSource, DataProvenance provenance, OutputFactory<T> outputFactory) | Creates a dataset from a data source.
MutableDataset(DataSource<T> dataSource) | Creates a dataset from a data source.
MutableDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory) | Creates an empty dataset.
-
Method Summary
Modifier and Type | Method | Description
void | add(Example<T> ex) | Adds an example to the dataset, which observes the output and each feature value.
void | addAll(Collection<? extends Example<T>> collection) | Adds all the Examples in the supplied collection to this dataset.
void | clear() | Clears all the examples out of this dataset, and flushes the FeatureMap, OutputInfo, and transform provenances.
static <T extends Output<T>> MutableDataset<T> | createDeepCopy(Dataset<T> other) | Creates a deep copy of the supplied Dataset which is mutable.
void | densify() | Iterates through the examples, converting implicit zeros into explicit zeros.
static MutableDataset<?> | deserializeFromProto(int version, String className, com.google.protobuf.Any message) | Deserialization factory.
ImmutableFeatureMap | getFeatureIDMap() | Returns or generates an ImmutableFeatureMap.
FeatureMap | getFeatureMap() | Returns this dataset's FeatureMap.
ImmutableOutputInfo<T> | getOutputIDInfo() | Returns or generates an ImmutableOutputInfo.
OutputInfo<T> | getOutputInfo() | Returns this dataset's OutputInfo.
Set<T> | getOutputs() | Gets the set of possible outputs in this dataset.
DatasetProvenance | getProvenance() |
boolean | isDense() | Is the dataset dense (i.e., do all features in the domain have a value in each example).
void | regenerateFeatureInfo() | Rebuilds the feature info by inspecting each example.
void | regenerateOutputInfo() | Rebuilds the output info by inspecting each example.
org.tribuo.protos.core.DatasetProto | serialize() | Serializes this object to a protobuf.
void | setWeights(Map<T, Float> weights) | Sets the weights in each example according to their output.
String | toString() |
void | transform(TransformerMap transformerMap) | Applies all the transformations from the TransformerMap to this dataset.
Methods inherited from class org.tribuo.Dataset
castDataset, createDataCarrier, createDataCarrier, createTransformers, createTransformers, deserialize, deserializeExamples, deserializeFromFile, deserializeFromStream, getData, getExample, getOutputFactory, getSourceDescription, getSourceProvenance, iterator, serializeToFile, serializeToStream, shuffle, size, validate
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
CURRENT_VERSION
public static final int CURRENT_VERSION
Protobuf serialization version.
-
outputMap
protected final MutableOutputInfo<T> outputMap
Information about the outputs in this dataset.
-
featureMap
protected final MutableFeatureMap featureMap
A map from feature names to feature info objects.
-
transformProvenances
protected final List<com.oracle.labs.mlrg.olcut.provenance.ObjectProvenance> transformProvenances
The provenances of the transformations applied to this dataset.
-
dense
protected boolean dense
Denotes if this dataset contains implicit zeros or not.
-
-
Constructor Details
-
MutableDataset
public MutableDataset(DataProvenance sourceProvenance, OutputFactory<T> outputFactory)
Creates an empty dataset.
- Parameters:
sourceProvenance - A description of the input data, including preprocessing steps.
outputFactory - The output factory.
-
MutableDataset
public MutableDataset(Iterable<Example<T>> dataSource, DataProvenance provenance, OutputFactory<T> outputFactory)
Creates a dataset from a data source. This method creates the output and feature maps needed for training and evaluating classifiers.
- Parameters:
dataSource - The examples.
provenance - A description of the input data, including preprocessing steps.
outputFactory - The output factory.
-
MutableDataset
public MutableDataset(DataSource<T> dataSource)
Creates a dataset from a data source. This method creates the output and feature maps needed for training and evaluating classifiers.
- Parameters:
dataSource - The examples.
-
-
Method Details
-
deserializeFromProto
public static MutableDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
Deserialization factory.
- Parameters:
version - The serialized object version.
className - The class name.
message - The serialized data.
- Returns:
- The deserialized object.
- Throws:
com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
-
add
public void add(Example<T> ex)
Adds an example to the dataset, which observes the output and each feature value.
It also canonicalises the reference to each feature's name (i.e., replacing the reference to a feature's name with the canonical one stored in this Dataset's VariableInfo). This greatly reduces the memory footprint, because examples that share a feature then share one name object instead of each holding its own copy.
- Parameters:
ex - The example to add.
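The canonicalisation step can be illustrated with plain JDK strings. The sketch below is a hypothetical stand-alone model (NameCanonicaliserSketch is not a Tribuo class, and the real lookup goes through the dataset's feature info rather than a bare map); it only shows why swapping in a shared reference saves memory.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not a Tribuo class) of replacing equal feature names with one shared instance. */
final class NameCanonicaliserSketch {
    // Maps each distinct name to the first String instance seen with that value.
    private final Map<String, String> canonical = new HashMap<>();

    /** Returns the canonical instance equal to name, registering name itself if it is new. */
    String canonicalise(String name) {
        String existing = canonical.putIfAbsent(name, name);
        return existing == null ? name : existing;
    }

    int distinctNames() {
        return canonical.size();
    }
}
```

After canonicalisation, two examples that both carry a feature called "height" point at the same String object, so the duplicate instance created while loading each example becomes eligible for garbage collection.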
-
addAll
public void addAll(Collection<? extends Example<T>> collection)
Adds all the Examples in the supplied collection to this dataset.
- Parameters:
collection - The collection of Examples.
-
setWeights
public void setWeights(Map<T, Float> weights)
Sets the weights in each example according to their output.
- Parameters:
weights - A map of Outputs to float weights.
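As a rough picture of weighting by output, the sketch below looks up each example's output in the map and overwrites its weight. The Ex type, the default weight of 1.0, and the decision to leave examples with unmapped outputs unchanged are all assumptions made for illustration; this page does not state how unmapped outputs are handled.

```java
import java.util.List;
import java.util.Map;

/** Toy model (not a Tribuo class) of assigning example weights from a per-output map. */
final class WeightByOutputSketch {
    /** Hypothetical minimal example: an output label plus a mutable weight. */
    static final class Ex {
        final String output;
        float weight = 1.0f; // assumed default weight

        Ex(String output) {
            this.output = output;
        }
    }

    /** Sets each example's weight from the map; assumption: unmapped outputs keep their weight. */
    static void setWeights(List<Ex> examples, Map<String, Float> weights) {
        for (Ex e : examples) {
            Float w = weights.get(e.output);
            if (w != null) {
                e.weight = w;
            }
        }
    }
}
```

A typical use of this kind of weighting is to give examples of a rare output a larger weight than examples of a common one.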
-
getOutputs
public Set<T> getOutputs()
Gets the set of possible outputs in this dataset.
In the case of regression, returns a Set containing dimension names.
- Specified by:
getOutputs in class Dataset<T extends Output<T>>
- Returns:
- The set of possible outputs.
-
getFeatureIDMap
public ImmutableFeatureMap getFeatureIDMap()
Description copied from class: Dataset
Returns or generates an ImmutableFeatureMap.
- Specified by:
getFeatureIDMap in class Dataset<T extends Output<T>>
- Returns:
- An immutable feature map with id numbers.
-
getFeatureMap
public FeatureMap getFeatureMap()
Description copied from class: Dataset
Returns this dataset's FeatureMap.
- Specified by:
getFeatureMap in class Dataset<T extends Output<T>>
- Returns:
- The feature map from this dataset.
-
getOutputIDInfo
public ImmutableOutputInfo<T> getOutputIDInfo()
Description copied from class: Dataset
Returns or generates an ImmutableOutputInfo.
- Specified by:
getOutputIDInfo in class Dataset<T extends Output<T>>
- Returns:
- An immutable output info.
-
getOutputInfo
public OutputInfo<T> getOutputInfo()
Description copied from class: Dataset
Returns this dataset's OutputInfo.
- Specified by:
getOutputInfo in class Dataset<T extends Output<T>>
- Returns:
- The output info.
-
toString
-
isDense
public boolean isDense()
Is the dataset dense (i.e., do all features in the domain have a value in each example).
- Returns:
- True if the dataset is dense.
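To make "dense" and "implicit zero" concrete, here is a small model in which each example is a map from feature name to value and the domain is the union of all names seen. It is a sketch under those assumptions only: DensifySketch and its methods are hypothetical, and Tribuo's own isDense and densify operate on Example objects and the dataset's feature map rather than on plain maps.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not a Tribuo class) of sparse examples, density checks, and densification. */
final class DensifySketch {
    /** The feature domain: every feature name that appears in at least one example. */
    static Set<String> domain(List<Map<String, Double>> examples) {
        Set<String> names = new TreeSet<>();
        for (Map<String, Double> ex : examples) {
            names.addAll(ex.keySet());
        }
        return names;
    }

    /** True if every example stores an explicit value for every feature in the domain. */
    static boolean isDense(List<Map<String, Double>> examples) {
        Set<String> names = domain(examples);
        for (Map<String, Double> ex : examples) {
            if (!ex.keySet().containsAll(names)) {
                return false; // at least one feature is an implicit zero here
            }
        }
        return true;
    }

    /** Turns implicit zeros into explicit ones by storing 0.0 for each missing feature. */
    static void densify(List<Map<String, Double>> examples) {
        Set<String> names = domain(examples);
        for (Map<String, Double> ex : examples) {
            for (String name : names) {
                ex.putIfAbsent(name, 0.0);
            }
        }
    }
}
```

In this model an example {a=3.0} in a dataset whose domain is {a, b} has an implicit zero for b; after densify it becomes {a=3.0, b=0.0}, at the cost of storing one more value.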
-
transform
public void transform(TransformerMap transformerMap)
Applies all the transformations from the TransformerMap to this dataset.
- Parameters:
transformerMap - The transformations to apply.
-
densify
public void densify()
Iterates through the examples, converting implicit zeros into explicit zeros.
-
clear
public void clear()
Clears all the examples out of this dataset, and flushes the FeatureMap, OutputInfo, and transform provenances.
-
regenerateOutputInfo
public void regenerateOutputInfo()
Rebuilds the output info by inspecting each example.
-
regenerateFeatureInfo
public void regenerateFeatureInfo()
Rebuilds the feature info by inspecting each example.
-
getProvenance
-
serialize
public org.tribuo.protos.core.DatasetProto serialize()
Description copied from interface: ProtoSerializable
Serializes this object to a protobuf.
- Returns:
- The protobuf.
-
createDeepCopy
public static <T extends Output<T>> MutableDataset<T> createDeepCopy(Dataset<T> other)
Creates a deep copy of the supplied Dataset which is mutable.
Copies the individual examples using their copy method.
- Type Parameters:
T - The output type.
- Parameters:
other - The dataset to copy.
- Returns:
- A mutable deep copy of the dataset.
-