Class Dataset<T extends Output<T>>
- Type Parameters:
T
- the type of the outputs in the data set.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>, ProtoSerializable<org.tribuo.protos.core.DatasetProto>
- Direct Known Subclasses:
ImmutableDataset, MutableDataset
Subclass MutableDataset rather than this class.
-
Field Summary
| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected final List<Example<T>> | data | The data in this data set. |
| protected int[] | indices | The indices of the shuffled order. |
| protected final OutputFactory<T> | outputFactory | A factory for making OutputInfo and Output of the appropriate type. |
| protected final DataProvenance | sourceProvenance | The provenance of the data source, extracted on construction. |
| protected final String | tribuoVersion | The Tribuo version which originally created this dataset. |
Fields inherited from interface org.tribuo.protos.ProtoSerializable
DESERIALIZATION_METHOD_NAME, PROVENANCE_SERIALIZER
-
Constructor Summary
| Modifier | Constructor | Description |
| --- | --- | --- |
| protected | Dataset(DataSource<T> dataSource) | Creates a dataset. |
| protected | Dataset(DataProvenance provenance, OutputFactory<T> outputFactory) | Creates a dataset. |
| protected | Dataset(DataProvenance provenance, OutputFactory<T> outputFactory, String tribuoVersion) | Creates a dataset. |
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| static <T extends Output<T>> Dataset<T> | castDataset(Dataset<?> inputDataset, Class<T> outputType) | Casts the dataset to the specified output type, assuming it is valid. |
| protected DatasetDataCarrier<T> | createDataCarrier(FeatureMap featureMap, OutputInfo<T> outputInfo) | Constructs the data carrier for serialization. |
| protected DatasetDataCarrier<T> | createDataCarrier(FeatureMap featureMap, OutputInfo<T> outputInfo, List<com.oracle.labs.mlrg.olcut.provenance.ObjectProvenance> transformationProvenances) | Constructs the data carrier for serialization. |
| TransformerMap | createTransformers(TransformationMap transformations) | Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset. |
| TransformerMap | createTransformers(TransformationMap transformations, boolean includeImplicitZeroFeatures) | Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset. |
| static Dataset<?> | deserialize(org.tribuo.protos.core.DatasetProto datasetProto) | Deserializes a dataset proto into a dataset. |
| protected static List<Example<?>> | deserializeExamples(List<org.tribuo.protos.core.ExampleProto> examplesList, Class<?> outputClass, FeatureMap fmap) | Deserializes a list of example protos into a list of examples. |
| static Dataset<?> | deserializeFromFile(Path path) | Reads an instance of DatasetProto from the supplied path and deserializes it. |
| static Dataset<?> | deserializeFromStream(InputStream is) | Reads an instance of DatasetProto from the supplied input stream and deserializes it. |
| List<Example<T>> | getData() | Gets the examples as an unmodifiable list. |
| Example<T> | getExample(int index) | Gets the example at the supplied index. |
| abstract ImmutableFeatureMap | getFeatureIDMap() | Returns or generates an ImmutableFeatureMap. |
| abstract FeatureMap | getFeatureMap() | Returns this dataset's FeatureMap. |
| OutputFactory<T> | getOutputFactory() | Gets the output factory this dataset contains. |
| abstract ImmutableOutputInfo<T> | getOutputIDInfo() | Returns or generates an ImmutableOutputInfo. |
| abstract OutputInfo<T> | getOutputInfo() | Returns this dataset's OutputInfo. |
| Set<T> | getOutputs() | Gets the set of outputs that occur in the examples in this dataset. |
| String | getSourceDescription() | A String description of this dataset. |
| DataProvenance | getSourceProvenance() | The provenance of the data this Dataset contains. |
| Iterator<Example<T>> | iterator() | |
| void | serializeToFile(Path path) | Serializes this dataset to a DatasetProto and writes it to the supplied path. |
| void | serializeToStream(OutputStream stream) | Serializes this dataset to a DatasetProto and writes it to the supplied output stream. |
| void | shuffle(boolean shuffle) | Shuffles the indices, or stops shuffling them. |
| int | size() | Gets the size of the data set. |
| String | toString() | |
| boolean | validate(Class<? extends Output<?>> clazz) | Validates that this Dataset does in fact contain the supplied output type. |
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
Methods inherited from interface org.tribuo.protos.ProtoSerializable
serialize
Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance
-
Field Details
-
data
The data in this data set.
-
sourceProvenance
The provenance of the data source, extracted on construction.
-
outputFactory
A factory for making OutputInfo and Output of the appropriate type.
-
tribuoVersion
The Tribuo version which originally created this dataset.
-
indices
protected int[] indices
The indices of the shuffled order.
-
-
Constructor Details
-
Dataset
Creates a dataset.
- Parameters:
provenance
- A description of the data, including preprocessing steps.
outputFactory
- The output factory.
-
Dataset
Creates a dataset.
- Parameters:
provenance
- A description of the data, including preprocessing steps.
outputFactory
- The output factory.
tribuoVersion
- The Tribuo version.
-
Dataset
Creates a dataset.
- Parameters:
dataSource
- the DataSource to use.
-
-
Method Details
-
deserialize
Deserializes a dataset proto into a dataset.
- Parameters:
datasetProto
- The proto to deserialize.
- Returns:
- The dataset.
-
deserializeFromFile
Reads an instance of DatasetProto from the supplied path and deserializes it.
- Parameters:
path
- The path to read.
- Returns:
- The deserialized dataset.
- Throws:
IOException
- If the path could not be read from, or the parsing failed.
-
deserializeFromStream
Reads an instance of DatasetProto from the supplied input stream and deserializes it.
- Parameters:
is
- The input stream to read.
- Returns:
- The deserialized dataset.
- Throws:
IOException
- If the stream could not be read from, or the parsing failed.
-
serializeToFile
Serializes this dataset to a DatasetProto and writes it to the supplied path.
- Parameters:
path
- The path to write to.
- Throws:
IOException
- If the path could not be written to.
-
serializeToStream
Serializes this dataset to a DatasetProto and writes it to the supplied output stream.
Does not close the stream.
- Parameters:
stream
- The output stream to write to.
- Throws:
IOException
- If the stream could not be written to.
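Because the stream is left open, the caller stays responsible for closing it and may keep writing to it afterwards. The following is a minimal JDK-only sketch of that contract. `StreamRoundTripSketch` and its two static methods are hypothetical stand-ins that borrow the method names for readability; they write plain strings rather than a DatasetProto and are not Tribuo's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for stream serialization; it does not use DatasetProto.
final class StreamRoundTripSketch {
    // Writes the payload and flushes, but leaves the stream open for the caller.
    static void serializeToStream(List<String> examples, OutputStream stream) throws IOException {
        DataOutputStream out = new DataOutputStream(stream);
        out.writeInt(examples.size());
        for (String e : examples) {
            out.writeUTF(e);
        }
        out.flush(); // deliberately no close()
    }

    // Reads back one payload written by serializeToStream.
    static List<String> deserializeFromStream(InputStream is) throws IOException {
        DataInputStream in = new DataInputStream(is);
        int n = in.readInt();
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            examples.add(in.readUTF());
        }
        return examples;
    }

    public static void main(String[] args) throws IOException {
        // The caller owns the stream, so try-with-resources closes it.
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream()) {
            serializeToStream(Arrays.asList("a", "b"), bytes);
            serializeToStream(Arrays.asList("c"), bytes); // still open: second payload appended
            InputStream is = new ByteArrayInputStream(bytes.toByteArray());
            System.out.println(deserializeFromStream(is)); // [a, b]
            System.out.println(deserializeFromStream(is)); // [c]
        }
    }
}
```

Leaving the close to the caller is what makes it possible to write several payloads to one stream, as the second call in `main` does.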
-
getSourceDescription
A String description of this dataset.
- Returns:
- The description.
-
getSourceProvenance
The provenance of the data this Dataset contains.
- Returns:
- The data provenance.
-
getData
Gets the examples as an unmodifiable list. This list will throw an UnsupportedOperationException if any elements are added to it.
In other words, using the following to add additional examples to this dataset will throw an exception:
dataset.getData().add(example)
Instead, use MutableDataset.add(Example).
- Returns:
- The unmodifiable example list.
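The read-only behaviour can be reproduced with the JDK alone. This sketch assumes the returned list behaves like a `Collections.unmodifiableList` view; `ReadOnlyViewSketch` is a hypothetical stand-in class that stores strings, not a Tribuo type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a dataset; not a Tribuo class.
final class ReadOnlyViewSketch {
    private final List<String> data = new ArrayList<>();

    // Plays the role of MutableDataset.add(Example): the supported way to add.
    void add(String example) {
        data.add(example);
    }

    // Plays the role of getData(): a read-only view over the backing list.
    List<String> getData() {
        return Collections.unmodifiableList(data);
    }

    public static void main(String[] args) {
        ReadOnlyViewSketch dataset = new ReadOnlyViewSketch();
        dataset.add("first");
        List<String> view = dataset.getData();
        try {
            view.add("second"); // rejected by the read-only view
        } catch (UnsupportedOperationException e) {
            System.out.println("add through the view was rejected");
        }
        dataset.add("second");           // accepted by the owning object
        System.out.println(view.size()); // 2: the view reflects the backing list
    }
}
```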
-
getOutputFactory
Gets the output factory this dataset contains.
- Returns:
- The output factory.
-
getOutputs
Gets the set of outputs that occur in the examples in this dataset.
- Returns:
- the set of outputs that occur in the examples in this dataset.
-
getExample
Gets the example at the supplied index.
Throws IllegalArgumentException if the index is invalid or outside the bounds.
- Parameters:
index
- The index of the example.
- Returns:
- The example.
-
size
public int size()
Gets the size of the data set.
- Returns:
- the size of the data set.
-
shuffle
public void shuffle(boolean shuffle)
Shuffles the indices, or stops shuffling them.
The shuffle only affects the iterator; it does not affect getExample(int).
Multiple calls with the argument true will shuffle the dataset multiple times. The RNG is shared across all Dataset instances, so methods which access it are synchronized.
Using this method will prevent the provenance system from tracking the exact state of the dataset, which may be important for trainers which depend on the example order, like those using stochastic gradient descent.
- Parameters:
shuffle
- If true shuffle the data.
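One way to get these semantics is to shuffle an index array and leave the stored examples untouched. The sketch below is a JDK-only illustration of that idea under stated assumptions: `ShuffleSketch`, its fixed seed, and the Fisher-Yates loop are all hypothetical and do not reproduce Tribuo's internals.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

// Hypothetical stand-in for a dataset; not a Tribuo class.
final class ShuffleSketch implements Iterable<String> {
    // One RNG shared by every instance, so access to it is synchronized.
    private static final Random RNG = new Random(1L);

    private final List<String> data;
    private int[] indices; // null means "iterate in insertion order"

    ShuffleSketch(List<String> examples) {
        this.data = new ArrayList<>(examples);
    }

    void shuffle(boolean shuffle) {
        if (!shuffle) {
            indices = null;
            return;
        }
        int[] perm = new int[data.size()];
        for (int i = 0; i < perm.length; i++) {
            perm[i] = i;
        }
        synchronized (RNG) {
            // Fisher-Yates shuffle of the index array; the data list is untouched.
            for (int i = perm.length - 1; i > 0; i--) {
                int j = RNG.nextInt(i + 1);
                int tmp = perm[i];
                perm[i] = perm[j];
                perm[j] = tmp;
            }
        }
        indices = perm;
    }

    // Positional access ignores the shuffled order.
    String getExample(int index) {
        if (index < 0 || index >= data.size()) {
            throw new IllegalArgumentException("Invalid index " + index);
        }
        return data.get(index);
    }

    @Override
    public Iterator<String> iterator() {
        if (indices == null) {
            return data.iterator();
        }
        final int[] order = indices;
        return new Iterator<String>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < order.length;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return data.get(order[pos++]);
            }
        };
    }
}
```

In this sketch, calling `shuffle(true)` twice draws two permutations from the shared RNG, and `shuffle(false)` restores insertion order for iteration.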
-
getOutputIDInfo
Returns or generates an ImmutableOutputInfo.
- Returns:
- An immutable output info.
-
getOutputInfo
Returns this dataset's OutputInfo.
- Returns:
- The output info.
-
getFeatureIDMap
Returns or generates an ImmutableFeatureMap.
- Returns:
- An immutable feature map with id numbers.
-
getFeatureMap
Returns this dataset's FeatureMap.
- Returns:
- The feature map from this dataset.
-
iterator
-
toString
-
createTransformers
Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset.
Does not mutate the dataset; if you wish to apply the TransformerMap, use MutableDataset.transform(org.tribuo.transform.TransformerMap) or TransformerMap.transformDataset(org.tribuo.Dataset<T>).
TransformerMaps operate on feature values which are present; sparse values are ignored and not transformed. If the zeros should be transformed, call MutableDataset.densify() on the datasets before applying a transformer.
This method calls createTransformers(TransformationMap, boolean) with includeImplicitZeroFeatures set to false, thus ignoring implicitly zero features when fitting the transformations. This is the default behaviour in Tribuo 4.0, but causes erroneous behaviour in IDFTransformation so should be avoided with that transformation. See org.tribuo.transform for a more detailed discussion of densify and includeImplicitZeroFeatures.
Throws IllegalArgumentException if the TransformationMap object has regexes which apply to multiple features.
- Parameters:
transformations
- The transformations to fit.
- Returns:
- A TransformerMap which can apply the transformations to a dataset.
-
createTransformers
public TransformerMap createTransformers(TransformationMap transformations, boolean includeImplicitZeroFeatures)
Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset.
Does not mutate the dataset; if you wish to apply the TransformerMap, use MutableDataset.transform(org.tribuo.transform.TransformerMap) or TransformerMap.transformDataset(org.tribuo.Dataset<T>).
TransformerMaps operate on feature values which are present; sparse values are ignored and not transformed. If the zeros should be transformed, call MutableDataset.densify() on the datasets before applying a transformer. See org.tribuo.transform for a more detailed discussion of densify and includeImplicitZeroFeatures.
Throws IllegalArgumentException if the TransformationMap object has regexes which apply to multiple features.
- Parameters:
transformations
- The transformations to fit.
includeImplicitZeroFeatures
- Use the implicit zero feature values to construct the transformations.
- Returns:
- A TransformerMap which can apply the transformations to a dataset.
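To see why the flag matters, consider a feature that is present in only some examples. The sketch below fits a mean and a minimum for one such feature, with and without the implicit zeros. `ImplicitZeroSketch`, `fitMean` and `fitMin` are hypothetical helpers written for this illustration; they are not Tribuo transformations.

```java
// Hypothetical stand-in for fitting a statistic on one sparse feature; not Tribuo code.
final class ImplicitZeroSketch {
    // observed: values of one feature in the examples where it is present.
    // numExamples: total number of examples in the dataset.
    static double fitMean(double[] observed, int numExamples, boolean includeImplicitZeroFeatures) {
        double sum = 0.0;
        for (double v : observed) {
            sum += v;
        }
        // Implicit zeros add nothing to the sum but do enlarge the count.
        int count = includeImplicitZeroFeatures ? numExamples : observed.length;
        return sum / count;
    }

    static double fitMin(double[] observed, int numExamples, boolean includeImplicitZeroFeatures) {
        double min = Double.POSITIVE_INFINITY;
        for (double v : observed) {
            min = Math.min(min, v);
        }
        // An implicit zero exists whenever some example lacks the feature.
        boolean hasImplicitZero = observed.length < numExamples;
        return (includeImplicitZeroFeatures && hasImplicitZero) ? Math.min(min, 0.0) : min;
    }

    public static void main(String[] args) {
        double[] observed = {2.0, 4.0}; // feature present in 2 of 4 examples
        System.out.println(fitMean(observed, 4, false)); // 3.0
        System.out.println(fitMean(observed, 4, true));  // 1.5
        System.out.println(fitMin(observed, 4, false));  // 2.0
        System.out.println(fitMin(observed, 4, true));   // 0.0
    }
}
```

The fitted statistics differ as soon as any example lacks the feature, so a transformation fitted one way and applied to data prepared the other way will not behave as intended.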
-
createDataCarrier
Constructs the data carrier for serialization.
- Parameters:
featureMap
- The feature domain.
outputInfo
- The output domain.
- Returns:
- The serialization data carrier.
-
createDataCarrier
protected DatasetDataCarrier<T> createDataCarrier(FeatureMap featureMap, OutputInfo<T> outputInfo, List<com.oracle.labs.mlrg.olcut.provenance.ObjectProvenance> transformationProvenances)
Constructs the data carrier for serialization.
- Parameters:
featureMap
- The feature domain.
outputInfo
- The output domain.
transformationProvenances
- The transformation provenances, must be non-null, but can be empty.
- Returns:
- The serialization data carrier.
-
validate
Validates that this Dataset does in fact contain the supplied output type.
As the output type is erased at runtime, deserialising a Dataset is an unchecked operation. This method allows the user to check that the deserialised dataset is of the appropriate type, rather than seeing if the Dataset throws a ClassCastException when used.
- Parameters:
clazz
- The class object to verify the output type against.
- Returns:
- True if the output type is assignable to the class object type, false otherwise.
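The erasure problem described here can be shown with ordinary collections. In this JDK-only sketch an unchecked cast succeeds silently and only fails at first use, while an explicit runtime check reports the mismatch up front. `ErasureCheckSketch`, `load` and this `validate` helper are hypothetical stand-ins, not Tribuo code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in showing why a runtime check is useful; not Tribuo code.
final class ErasureCheckSketch {
    // Inspects the stored values, since the generic type parameter
    // is not available at runtime.
    static boolean validate(List<?> outputs, Class<?> clazz) {
        for (Object o : outputs) {
            if (!clazz.isInstance(o)) {
                return false;
            }
        }
        return true;
    }

    // Simulates deserialization: the element type is unknown to the compiler.
    static List<?> load() {
        List<Object> raw = new ArrayList<>();
        raw.add("label-a");
        raw.add("label-b");
        return raw;
    }

    public static void main(String[] args) {
        List<?> loaded = load();

        @SuppressWarnings("unchecked")
        List<Integer> wrong = (List<Integer>) loaded; // compiles and runs: the cast is erased

        System.out.println(validate(loaded, String.class));  // true
        System.out.println(validate(loaded, Integer.class)); // false

        try {
            Integer first = wrong.get(0); // the failure only surfaces here
            System.out.println(first);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException at first use");
        }
    }
}
```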
-
castDataset
public static <T extends Output<T>> Dataset<T> castDataset(Dataset<?> inputDataset, Class<T> outputType)
Casts the dataset to the specified output type, assuming it is valid.
If it's not valid, throws ClassCastException.
- Type Parameters:
T
- The output type.
- Parameters:
inputDataset
- The dataset to cast.
outputType
- The output type to cast to.
- Returns:
- The dataset cast to the correct type.
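A validated cast of this kind can be sketched with a plain list: check every element against the requested class, then perform the unchecked cast once the check has passed. `CheckedCastSketch.castList` is a hypothetical helper for illustration and does not reproduce how castDataset is implemented.

```java
import java.util.List;

// Hypothetical stand-in for a checked generic cast; not Tribuo code.
final class CheckedCastSketch {
    // Validates every element first, then performs the unchecked cast once.
    static <T> List<T> castList(List<?> input, Class<T> type) {
        for (Object o : input) {
            if (!type.isInstance(o)) {
                String found = (o == null) ? "null" : o.getClass().getName();
                throw new ClassCastException("Expected " + type.getName() + " but found " + found);
            }
        }
        @SuppressWarnings("unchecked")
        List<T> typed = (List<T>) input;
        return typed;
    }
}
```

With this shape the failure happens at the cast site, with a message naming both types, rather than at some later point where an element is first read.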
-
deserializeExamples
protected static List<Example<?>> deserializeExamples(List<org.tribuo.protos.core.ExampleProto> examplesList, Class<?> outputClass, FeatureMap fmap)
Deserializes a list of example protos into a list of examples.
- Parameters:
examplesList
- The protos.
outputClass
- The output class.
fmap
- The feature domain.
- Returns:
- The list of deserialized examples.
-