Package org.tribuo.dataset
Class MinimumCardinalityDataset<T extends Output<T>>
java.lang.Object
org.tribuo.Dataset<T>
org.tribuo.ImmutableDataset<T>
org.tribuo.dataset.MinimumCardinalityDataset<T>
- Type Parameters:
T
- The type of the outputs in this Dataset.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>
This class creates a pruned dataset from which low-frequency features, those
occurring fewer times than the provided minimum cardinality, have been removed.
This can be useful when the dataset is very large due to many low-frequency
features. For example, this class can be used to remove low-frequency words
from a bag-of-words (BoW) dataset. A new dataset is created so that the feature
counts are recalculated and the original, passed-in dataset is not modified.
The returned dataset may have fewer examples than the original, because any
example left with no features after the minimum cardinality has been applied
is not added to the constructed dataset.
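
The pruning behaviour described above can be sketched in plain Java. This is an illustrative stand-in, not Tribuo's implementation. The class name CardinalityPruningSketch, the use of Map<String, Double> to represent an example, and the size() helper are assumptions made for the sketch. Only the three accessor names mirror the methods documented on this page. The sketch also assumes each feature name appears at most once per example, so a feature's frequency is the number of examples that contain it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the pruning described above; not Tribuo code. */
final class CardinalityPruningSketch {
    private final int minCardinality;
    private final Set<String> removed = new TreeSet<>();
    private final List<Map<String, Double>> kept = new ArrayList<>();
    private int numExamplesRemoved = 0;

    CardinalityPruningSketch(List<Map<String, Double>> examples, int minCardinality) {
        this.minCardinality = minCardinality;

        // Pass 1: count how many examples each feature name appears in.
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, Double> example : examples) {
            for (String name : example.keySet()) {
                counts.merge(name, 1, Integer::sum);
            }
        }

        // Features with a frequency less than minCardinality are marked for removal.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < minCardinality) {
                removed.add(entry.getKey());
            }
        }

        // Pass 2: copy each example without the removed features.
        // The input list and its maps are never mutated.
        for (Map<String, Double> example : examples) {
            Map<String, Double> pruned = new LinkedHashMap<>(example);
            pruned.keySet().removeAll(removed);
            if (pruned.isEmpty()) {
                numExamplesRemoved++; // no features left, so the example is dropped
            } else {
                kept.add(pruned);
            }
        }
    }

    Set<String> getRemoved() { return removed; }

    int getNumExamplesRemoved() { return numExamplesRemoved; }

    int getMinCardinality() { return minCardinality; }

    /** Number of examples that survived pruning (helper for this sketch). */
    int size() { return kept.size(); }

    public static void main(String[] args) {
        // A tiny bag-of-words style input: three examples, four distinct words.
        List<Map<String, Double>> bagOfWords = List.of(
                Map.of("the", 1.0, "cat", 1.0),
                Map.of("the", 1.0, "dog", 1.0),
                Map.of("aardvark", 1.0));

        CardinalityPruningSketch pruned = new CardinalityPruningSketch(bagOfWords, 2);
        System.out.println("removed features: " + pruned.getRemoved());
        System.out.println("examples removed: " + pruned.getNumExamplesRemoved());
        System.out.println("examples kept: " + pruned.size());
    }
}
```

With a minimum cardinality of 2, only "the" survives in this input. The words "cat", "dog", and "aardvark" are removed, and the third example is dropped because it has no features left. The input list is left unchanged.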
Nested Class Summary
-
Field Summary
Fields inherited from class org.tribuo.ImmutableDataset
dropInvalidExamples, featureIDMap, outputIDInfo
Fields inherited from class org.tribuo.Dataset
data, indices, outputFactory, sourceProvenance
-
Constructor Summary
-
Method Summary
Modifier and Type, Method, Description
int getMinCardinality()
- The minimum cardinality threshold for the features.
int getNumExamplesRemoved()
- The number of examples removed due to a lack of features.
getRemoved()
- The feature names that were removed.
Methods inherited from class org.tribuo.ImmutableDataset
add, add, copyDataset, copyDataset, copyDataset, getDropInvalidExamples, getFeatureIDMap, getFeatureMap, getOutputIDInfo, getOutputInfo, getOutputs, hashFeatureMap, toString
Methods inherited from class org.tribuo.Dataset
castDataset, createTransformers, createTransformers, getData, getExample, getOutputFactory, getSourceDescription, getSourceProvenance, iterator, shuffle, size, validate
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Constructor Details
-
MinimumCardinalityDataset
- Parameters:
dataset
- this dataset is left untouched and is used to populate the constructed dataset.
minCardinality
- features with a frequency less than minCardinality will be removed.
-
-
Method Details
-
getRemoved
The feature names that were removed.
- Returns:
- The feature names.
-
getNumExamplesRemoved
public int getNumExamplesRemoved()
The number of examples removed due to a lack of features.
- Returns:
- The number of removed examples.
-
getMinCardinality
public int getMinCardinality()
The minimum cardinality threshold for the features.
- Returns:
- The cardinality threshold.
-
getProvenance
- Specified by:
getProvenance
in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<T extends Output<T>>
- Overrides:
getProvenance
in class ImmutableDataset<T extends Output<T>>
-