Class Dataset<T extends Output<T>>

java.lang.Object
org.tribuo.Dataset<T>
Type Parameters:
T - the type of the outputs in the data set.
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>
Direct Known Subclasses:
ImmutableDataset, MutableDataset

public abstract class Dataset<T extends Output<T>> extends Object implements Iterable<Example<T>>, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable
A class for sets of data, which are used to train and evaluate classifiers.

Subclass MutableDataset rather than this class.

  • Field Details

  • Constructor Details

    • Dataset

      protected Dataset(DataProvenance provenance, OutputFactory<T> outputFactory)
      Creates a dataset.
      Parameters:
      provenance - A description of the data, including preprocessing steps.
      outputFactory - The output factory.
    • Dataset

      protected Dataset(DataSource<T> dataSource)
      Creates a dataset.
      Parameters:
      dataSource - the DataSource to use.
  • Method Details

    • getSourceDescription

      A String description of this dataset.
      Returns:
      The description.
    • getSourceProvenance

      The provenance of the data this Dataset contains.
      Returns:
      The data provenance.
    • getData

      public List<Example<T>> getData()
      Gets the examples as an unmodifiable list. This list will throw an UnsupportedOperationException if any elements are added to it.

      In other words, using the following to add additional examples to this dataset will throw an exception: dataset.getData().add(example). Instead, use MutableDataset.add(Example).

      Returns:
      The unmodifiable example list.
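
The read-only behaviour described for getData() can be illustrated with a minimal JDK-only sketch. ReadOnlyViewSketch, its String examples, and the addThroughViewFails helper are hypothetical stand-ins, not Tribuo classes, and the sketch assumes the returned list behaves like one produced by Collections.unmodifiableList.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a dataset; this is not Tribuo's implementation.
final class ReadOnlyViewSketch {
    private final List<String> data = new ArrayList<>();

    // Plays the role of MutableDataset.add(Example): the supported way to add.
    void add(String example) {
        data.add(example);
    }

    // Plays the role of getData(): a read-only view over the backing list.
    List<String> getData() {
        return Collections.unmodifiableList(data);
    }

    // Returns true if adding through the view is rejected.
    static boolean addThroughViewFails(ReadOnlyViewSketch ds) {
        try {
            ds.getData().add("rejected");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ReadOnlyViewSketch ds = new ReadOnlyViewSketch();
        ds.add("first");
        System.out.println("add through view rejected: " + addThroughViewFails(ds));
        System.out.println("size after supported add: " + ds.getData().size());
    }
}
```

The view reflects additions made through the owning object's add method, while rejecting direct mutation by callers.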
    • getOutputFactory

      Gets the output factory this dataset contains.
      Returns:
      The output factory.
    • getOutputs

      public abstract Set<T> getOutputs()
      Gets the set of outputs that occur in the examples in this dataset.
      Returns:
      the set of outputs that occur in the examples in this dataset.
    • getExample

      public Example<T> getExample(int index)
      Gets the example at the supplied index.

      Throws IllegalArgumentException if the index is invalid or outside the bounds.

      Parameters:
      index - The index of the example.
      Returns:
      The example.
    • size

      public int size()
      Gets the size of the data set.
      Returns:
      the size of the data set.
    • shuffle

      public void shuffle(boolean shuffle)
      Shuffles the indices, or stops shuffling them.

      The shuffle only affects the iterator; it does not affect getExample(int).

      Multiple calls with the argument true will shuffle the dataset multiple times. The RNG is shared across all Dataset instances, so methods which access it are synchronized.

      Using this method will prevent the provenance system from tracking the exact state of the dataset, which may be important for trainers which depend on the example order, like those using stochastic gradient descent.

      Parameters:
      shuffle - If true, shuffle the data.
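
One way to picture a shuffle that changes iteration order while leaving index lookups alone is to shuffle a separate list of indices. The sketch below is only an illustration under that assumption: ShuffleSketch and its fields are hypothetical, it stores String examples, and it uses a per-instance seeded Random for simplicity, whereas the text above states that the real RNG is shared across Dataset instances.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of index-based shuffling; this is not Tribuo's implementation.
final class ShuffleSketch implements Iterable<String> {
    private final List<String> data;
    private final Random rng;          // assumption: per-instance RNG, for simplicity
    private List<Integer> order;       // null means iterate in storage order

    ShuffleSketch(List<String> data, long seed) {
        this.data = new ArrayList<>(data);
        this.rng = new Random(seed);
    }

    // Passing true builds a freshly shuffled index order; false restores storage order.
    void shuffle(boolean shuffle) {
        if (shuffle) {
            List<Integer> indices = new ArrayList<>();
            for (int i = 0; i < data.size(); i++) {
                indices.add(i);
            }
            Collections.shuffle(indices, rng);
            order = indices;
        } else {
            order = null;
        }
    }

    // Index lookup reads the backing list directly, so it ignores the shuffle.
    String getExample(int index) {
        if (index < 0 || index >= data.size()) {
            throw new IllegalArgumentException("Invalid index: " + index);
        }
        return data.get(index);
    }

    // Iteration follows the shuffled index order when one is set.
    @Override
    public Iterator<String> iterator() {
        List<String> view = new ArrayList<>();
        if (order == null) {
            view.addAll(data);
        } else {
            for (int i : order) {
                view.add(data.get(i));
            }
        }
        return view.iterator();
    }

    public static void main(String[] args) {
        List<String> examples = new ArrayList<>();
        Collections.addAll(examples, "a", "b", "c", "d");
        ShuffleSketch ds = new ShuffleSketch(examples, 42L);
        ds.shuffle(true);
        System.out.println("getExample(0) = " + ds.getExample(0));
        for (String s : ds) {
            System.out.println("iterated: " + s);
        }
    }
}
```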
    • getOutputIDInfo

      Returns or generates an ImmutableOutputInfo.
      Returns:
      An immutable output info.
    • getOutputInfo

      public abstract OutputInfo<T> getOutputInfo()
      Returns this dataset's OutputInfo.
      Returns:
      The output info.
    • getFeatureIDMap

      Returns or generates an ImmutableFeatureMap.
      Returns:
      An immutable feature map with id numbers.
    • getFeatureMap

      public abstract FeatureMap getFeatureMap()
      Returns this dataset's FeatureMap.
      Returns:
      The feature map from this dataset.
    • iterator

      public Iterator<Example<T>> iterator()
      Specified by:
      iterator in interface Iterable<Example<T>>
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • createTransformers

      Takes a TransformationMap and converts it into a TransformerMap by observing all the values in this dataset.

      Does not mutate the dataset. To apply the TransformerMap, use MutableDataset.transform(org.tribuo.transform.TransformerMap) or TransformerMap.transformDataset(org.tribuo.Dataset<T>).

      Currently TransformationMaps and TransformerMaps only operate on feature values which are present; sparse (implicit zero) values are ignored and not transformed. If the zeros should be transformed, call MutableDataset.densify() on the datasets.

      Throws IllegalArgumentException if the TransformationMap object has regexes which apply to multiple features.

      Parameters:
      transformations - The transformations to fit.
      Returns:
      A TransformerMap which can apply the transformations to a dataset.
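
The fit-then-apply pattern described for createTransformers, including the effect of ignoring absent sparse values, can be sketched with plain JDK types. Everything named here (FitTransformSketch, MinMax, fit, apply, and the map-based sparse examples) is hypothetical and chosen for illustration; it is not the TransformationMap or TransformerMap API, and min-max rescaling is just one example of a transformation that must observe the data first.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical fit-then-apply sketch; the names here are not Tribuo APIs.
final class FitTransformSketch {
    // A fitted min-max rescaler for a single feature.
    static final class MinMax {
        final double min;
        final double max;

        MinMax(double min, double max) {
            this.min = min;
            this.max = max;
        }

        double apply(double v) {
            return max == min ? 0.0 : (v - min) / (max - min);
        }
    }

    // Fits one rescaler per feature by observing only the values that are present.
    // A feature missing from an example (an implicit zero) contributes nothing.
    static Map<String, MinMax> fit(List<Map<String, Double>> examples) {
        Map<String, double[]> ranges = new HashMap<>();
        for (Map<String, Double> example : examples) {
            for (Map.Entry<String, Double> e : example.entrySet()) {
                double v = e.getValue();
                double[] r = ranges.computeIfAbsent(e.getKey(), k -> new double[] {v, v});
                r[0] = Math.min(r[0], v);
                r[1] = Math.max(r[1], v);
            }
        }
        Map<String, MinMax> fitted = new HashMap<>();
        for (Map.Entry<String, double[]> e : ranges.entrySet()) {
            fitted.put(e.getKey(), new MinMax(e.getValue()[0], e.getValue()[1]));
        }
        return fitted;
    }

    // Applies the fitted rescalers to a copy, leaving the input example unchanged.
    static Map<String, Double> apply(Map<String, MinMax> fitted, Map<String, Double> example) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : example.entrySet()) {
            MinMax m = fitted.get(e.getKey());
            out.put(e.getKey(), m == null ? e.getValue() : m.apply(e.getValue()));
        }
        return out;
    }
}
```

In this sketch, fitting on examples where a feature takes the values 2.0 and 6.0 and is absent from a third example yields a range of 2.0 to 6.0 rather than 0.0 to 6.0, which is the kind of difference densifying first would remove.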