Class Dataset<T extends Output<T>>
- Type Parameters:
T
- the type of the outputs in the data set.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>
- Direct Known Subclasses:
ImmutableDataset, MutableDataset
Subclass MutableDataset rather than this class.
Field Summary
Fields (modifier and type, field, description):
- data: The data in this data set.
- protected int[] indices: The indices of the shuffled order.
- protected final OutputFactory<T> outputFactory: A factory for making OutputInfo and Output of the appropriate type.
- protected final DataProvenance sourceProvenance: The provenance of the data source, extracted on construction.
-
Constructor Summary
Constructors (modifier, constructor, description):
- protected Dataset(DataSource<T> dataSource): Creates a dataset.
- protected Dataset(DataProvenance provenance, OutputFactory<T> outputFactory): Creates a dataset.
-
Method Summary
Methods (modifier and type, method, description):
- static <T extends Output<T>> Dataset<T> castDataset(Dataset<?> inputDataset, Class<T> outputType): Casts the dataset to the specified output type, assuming it is valid.
- TransformerMap createTransformers(TransformationMap transformations): Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset.
- TransformerMap createTransformers(TransformationMap transformations, boolean includeImplicitZeroFeatures): Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset.
- getData(): Gets the examples as an unmodifiable list.
- getExample(int index): Gets the example at the supplied index.
- abstract ImmutableFeatureMap getFeatureIDMap(): Returns or generates an ImmutableFeatureMap.
- abstract FeatureMap getFeatureMap(): Returns this dataset's FeatureMap.
- getOutputFactory(): Gets the output factory this dataset contains.
- abstract ImmutableOutputInfo<T> getOutputIDInfo(): Returns or generates an ImmutableOutputInfo.
- abstract OutputInfo<T> getOutputInfo(): Returns this dataset's OutputInfo.
- getOutputs(): Gets the set of outputs that occur in the examples in this dataset.
- getSourceDescription(): A String description of this dataset.
- getSourceProvenance(): The provenance of the data this Dataset contains.
- iterator()
- void shuffle(boolean shuffle): Shuffles the indices, or stops shuffling them.
- int size(): Gets the size of the data set.
- toString()
- boolean validate: Validates that this Dataset does in fact contain the supplied output type.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance
-
Field Details
-
data
The data in this data set.
-
sourceProvenance
The provenance of the data source, extracted on construction.
-
outputFactory
A factory for making OutputInfo and Output of the appropriate type.
-
indices
protected int[] indices
The indices of the shuffled order.
-
-
Constructor Details
-
Dataset
Creates a dataset.
- Parameters:
provenance
- A description of the data, including preprocessing steps.
outputFactory
- The output factory.
-
Dataset
Creates a dataset.
- Parameters:
dataSource
- the DataSource to use.
-
-
Method Details
-
getSourceDescription
A String description of this dataset.
- Returns:
- The description.
-
getSourceProvenance
The provenance of the data this Dataset contains.
- Returns:
- The data provenance.
-
getData
Gets the examples as an unmodifiable list. This list will throw an UnsupportedOperationException if any elements are added to it.
In other words, using the following to add additional examples to this dataset will throw an exception:
dataset.getData().add(example)
Instead, use MutableDataset.add(Example).
- Returns:
- The unmodifiable example list.
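The behaviour described above matches what the JDK's Collections.unmodifiableList provides. The following is a minimal sketch with plain strings standing in for Example objects. Whether Tribuo uses this exact wrapper internally is an assumption, and UnmodifiableViewSketch is a hypothetical class, not part of the library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a dataset; strings take the place of examples.
final class UnmodifiableViewSketch {
    private final List<String> data = new ArrayList<>();

    // The supported way to grow the data, analogous to a mutable dataset's add.
    void add(String example) {
        data.add(example);
    }

    // Returns a read-only view over the backing list.
    List<String> getData() {
        return Collections.unmodifiableList(data);
    }

    // Returns true if adding through the view is rejected.
    static boolean addThroughViewFails() {
        UnmodifiableViewSketch ds = new UnmodifiableViewSketch();
        ds.add("first");
        try {
            ds.getData().add("second");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("add through view rejected: " + addThroughViewFails());
    }
}
```

The view reflects later changes to the backing list, so callers that hold it still see examples added through the supported method.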
-
getOutputFactory
Gets the output factory this dataset contains.
- Returns:
- The output factory.
-
getOutputs
Gets the set of outputs that occur in the examples in this dataset.
- Returns:
- the set of outputs that occur in the examples in this dataset.
-
getExample
Gets the example at the supplied index.
Throws IllegalArgumentException if the index is invalid or outside the bounds.
- Parameters:
index
- The index of the example.
- Returns:
- The example.
-
size
public int size()
Gets the size of the data set.
- Returns:
- the size of the data set.
-
shuffle
public void shuffle(boolean shuffle)
Shuffles the indices, or stops shuffling them.
The shuffle only affects the iterator; it does not affect getExample(int).
Multiple calls with the argument true will shuffle the dataset multiple times. The RNG is shared across all Dataset instances, so methods which access it are synchronized.
Using this method will prevent the provenance system from tracking the exact state of the dataset, which may be important for trainers which depend on the example order, like those using stochastic gradient descent.
- Parameters:
shuffle
- If true, shuffle the data.
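One way to get the behaviour described above is to leave the stored examples untouched and permute a separate index array that only the iterator consults. The sketch below illustrates that idea with strings in place of examples. ShuffleSketch, its fixed seed, and the Fisher-Yates loop are assumptions made for illustration, not Tribuo's implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

// Hypothetical stand-in for a dataset with an index-based shuffle.
final class ShuffleSketch implements Iterable<String> {
    // One RNG shared by every instance, so access to it is synchronized.
    private static final Random RNG = new Random(42L);

    private final List<String> data;
    private int[] indices; // null means "iterate in insertion order"

    ShuffleSketch(List<String> examples) {
        this.data = new ArrayList<>(examples);
    }

    void shuffle(boolean shuffle) {
        if (!shuffle) {
            indices = null;
            return;
        }
        int[] order = new int[data.size()];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        synchronized (RNG) {
            // Fisher-Yates shuffle of the index array only.
            for (int i = order.length - 1; i > 0; i--) {
                int j = RNG.nextInt(i + 1);
                int tmp = order[i];
                order[i] = order[j];
                order[j] = tmp;
            }
        }
        indices = order;
    }

    // Direct access ignores the shuffled order.
    String getExample(int index) {
        if (index < 0 || index >= data.size()) {
            throw new IllegalArgumentException("Invalid index " + index);
        }
        return data.get(index);
    }

    @Override
    public Iterator<String> iterator() {
        final int[] order = indices;
        return new Iterator<String>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < data.size();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int i = pos++;
                return data.get(order == null ? i : order[i]);
            }
        };
    }
}
```

In this sketch, calling shuffle(true) twice draws two permutations from the shared RNG, and shuffle(false) returns the iterator to insertion order.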
-
getOutputIDInfo
Returns or generates an ImmutableOutputInfo.
- Returns:
- An immutable output info.
-
getOutputInfo
Returns this dataset's OutputInfo.
- Returns:
- The output info.
-
getFeatureIDMap
Returns or generates an ImmutableFeatureMap.
- Returns:
- An immutable feature map with id numbers.
-
getFeatureMap
Returns this dataset's FeatureMap.
- Returns:
- The feature map from this dataset.
-
iterator
-
toString
-
createTransformers
Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset.
Does not mutate the dataset. If you wish to apply the TransformerMap, use MutableDataset.transform(org.tribuo.transform.TransformerMap) or TransformerMap.transformDataset(org.tribuo.Dataset<T>).
TransformerMaps operate on feature values which are present; sparse values are ignored and not transformed. If the zeros should be transformed, call MutableDataset.densify() on the datasets before applying a transformer.
This method calls createTransformers(TransformationMap, boolean) with includeImplicitZeroFeatures set to false, thus ignoring implicitly zero features when fitting the transformations. This is the default behaviour in Tribuo 4.0, but it causes erroneous behaviour in IDFTransformation, so it should be avoided with that transformation. See org.tribuo.transform for a more detailed discussion of densify and includeImplicitZeroFeatures.
Throws IllegalArgumentException if the TransformationMap object has regexes which apply to multiple features.
- Parameters:
transformations
- The transformations to fit.
- Returns:
- A TransformerMap which can apply the transformations to a dataset.
-
createTransformers
public TransformerMap createTransformers(TransformationMap transformations, boolean includeImplicitZeroFeatures)
Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset.
Does not mutate the dataset. If you wish to apply the TransformerMap, use MutableDataset.transform(org.tribuo.transform.TransformerMap) or TransformerMap.transformDataset(org.tribuo.Dataset<T>).
TransformerMaps operate on feature values which are present; sparse values are ignored and not transformed. If the zeros should be transformed, call MutableDataset.densify() on the datasets before applying a transformer. See org.tribuo.transform for a more detailed discussion of densify and includeImplicitZeroFeatures.
Throws IllegalArgumentException if the TransformationMap object has regexes which apply to multiple features.
- Parameters:
transformations
- The transformations to fit.
includeImplicitZeroFeatures
- Use the implicit zero feature values to construct the transformations.
- Returns:
- A TransformerMap which can apply the transformations to a dataset.
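To see why the includeImplicitZeroFeatures flag matters, consider fitting a simple statistic such as a mean over a sparse feature. The sketch below uses plain maps in place of examples and does not call Tribuo. ImplicitZeroSketch, fitMean and the toy data are hypothetical and only illustrate how counting implicit zeros changes the fitted value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of fitting a statistic with or without implicit zeros.
final class ImplicitZeroSketch {

    // Fits the mean of one feature. A missing entry is an implicit zero.
    static double fitMean(List<Map<String, Double>> examples, String feature,
                          boolean includeImplicitZeroFeatures) {
        double sum = 0.0;
        int observed = 0;
        for (Map<String, Double> example : examples) {
            Double value = example.get(feature);
            if (value != null) {
                sum += value;
                observed++;
            }
        }
        // With implicit zeros, every example counts towards the denominator.
        int count = includeImplicitZeroFeatures ? examples.size() : observed;
        return count == 0 ? 0.0 : sum / count;
    }

    // Four examples; feature "f" is present in only two of them.
    static List<Map<String, Double>> toyData() {
        List<Map<String, Double>> examples = new ArrayList<>();
        Map<String, Double> first = new HashMap<String, Double>();
        first.put("f", 2.0);
        Map<String, Double> third = new HashMap<String, Double>();
        third.put("f", 4.0);
        examples.add(first);
        examples.add(new HashMap<String, Double>());
        examples.add(third);
        examples.add(new HashMap<String, Double>());
        return examples;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> data = toyData();
        System.out.println("present values only: " + fitMean(data, "f", false));
        System.out.println("with implicit zeros: " + fitMean(data, "f", true));
    }
}
```

On the toy data the mean over present values is 3.0, while counting the two implicit zeros gives 1.5, so the same transformation fitted both ways would rescale the feature differently.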
-
validate
Validates that this Dataset does in fact contain the supplied output type.
As the output type is erased at runtime, deserialising a Dataset is an unchecked operation. This method allows the user to check that the deserialised dataset is of the appropriate type, rather than seeing if the Dataset throws a ClassCastException when used.
- Parameters:
clazz
- The class object to verify the output type against.
- Returns:
- True if the output type is assignable to the class object type, false otherwise.
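The erasure problem described above can be reproduced with ordinary JDK collections. In the sketch below an unchecked cast of a list succeeds even though the element type is wrong, and the ClassCastException only appears when an element is read. ErasureSketch and its validate helper are hypothetical; the helper inspects elements directly, which is an assumption and not necessarily how Tribuo performs its check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why an explicit validation step is useful.
final class ErasureSketch {

    // Mimics deserialisation: the cast cannot be checked at runtime.
    @SuppressWarnings("unchecked")
    static List<Integer> uncheckedCast(Object deserialised) {
        return (List<Integer>) deserialised;
    }

    // Explicit check against a class object, done before the data is used.
    static boolean validate(List<?> list, Class<?> clazz) {
        for (Object item : list) {
            if (!clazz.isInstance(item)) {
                return false;
            }
        }
        return true;
    }

    // Returns true if the bad cast is only detected when an element is read.
    static boolean failsOnlyOnUse() {
        Object raw = new ArrayList<Object>(Arrays.asList("not", "integers"));
        List<Integer> ints = uncheckedCast(raw); // succeeds despite the wrong type
        try {
            int first = ints.get(0); // the compiler-inserted cast fails here
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("fails only on use: " + failsOnlyOnUse());
    }
}
```

Calling a check like validate immediately after deserialisation moves the failure to a known point instead of wherever the first element happens to be read.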
-
castDataset
public static <T extends Output<T>> Dataset<T> castDataset(Dataset<?> inputDataset, Class<T> outputType)
Casts the dataset to the specified output type, assuming it is valid.
If it's not valid, throws ClassCastException.
- Type Parameters:
T
- The output type.
- Parameters:
inputDataset
- The dataset to cast.
outputType
- The output type to cast to.
- Returns:
- The dataset cast to the correct type.
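A validate-then-cast helper of this shape can be sketched with a small generic container. The code below is an illustration under stated assumptions: CheckedCastSketch, Holder and castHolder are hypothetical names, and the element-by-element check stands in for whatever validation the real method performs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a checked cast for a generic container.
final class CheckedCastSketch {

    // Minimal generic container whose type parameter is erased at runtime.
    static final class Holder<T> {
        private final List<T> items;

        Holder(List<T> items) {
            this.items = new ArrayList<>(items);
        }

        T get(int index) {
            return items.get(index);
        }

        // True if every stored item is an instance of the supplied class.
        boolean validate(Class<?> clazz) {
            for (T item : items) {
                if (!clazz.isInstance(item)) {
                    return false;
                }
            }
            return true;
        }
    }

    // Validates first, then performs the unchecked cast in one place.
    static <T> Holder<T> castHolder(Holder<?> input, Class<T> type) {
        if (input.validate(type)) {
            @SuppressWarnings("unchecked")
            Holder<T> typed = (Holder<T>) input;
            return typed;
        }
        throw new ClassCastException("Holder does not contain " + type.getName());
    }
}
```

Keeping the unchecked cast inside one helper means callers get either a correctly typed reference or an immediate ClassCastException.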
-