Class MinimumCardinalityDataset<T extends Output<T>>
java.lang.Object
org.tribuo.Dataset<T>
org.tribuo.ImmutableDataset<T>
org.tribuo.dataset.MinimumCardinalityDataset<T>
- Type Parameters:
T - The type of the outputs in this Dataset.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<Example<T>>, ProtoSerializable<org.tribuo.protos.core.DatasetProto>
This class creates a pruned dataset from which every feature that occurs
fewer times than the provided minimum cardinality has been removed. This can
be useful when a dataset is very large because it contains many low-frequency
features. For example, this class can be used to remove low-frequency words
from a bag-of-words (BoW) formatted dataset. A new dataset is created so that
the feature counts are recalculated and the original, passed-in dataset is
not modified. The returned dataset may contain fewer examples than the
original: any example left with no features after the minimum cardinality has
been applied is not added to the constructed dataset.
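The pruning behaviour described above can be illustrated with a small, self-contained sketch. It is not Tribuo's implementation and does not use the Tribuo API. Examples are modelled as plain sets of feature names, a feature's cardinality is assumed to be the number of examples that contain it, and the names MinCardinalitySketch, Pruned, prune, and sampleData are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; this is NOT Tribuo code.
// Each example is modelled as a set of feature names (e.g. words in a BoW dataset).
final class MinCardinalitySketch {

    // Hypothetical result holder, loosely mirroring what getRemoved() and
    // getNumExamplesRemoved() are documented to report.
    static final class Pruned {
        final List<Set<String>> examples = new ArrayList<>();
        final Set<String> removedFeatures = new TreeSet<>();
        int numExamplesRemoved = 0;
    }

    static Pruned prune(List<Set<String>> examples, int minCardinality) {
        // Assumption: cardinality = number of examples in which the feature occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> example : examples) {
            for (String feature : example) {
                counts.merge(feature, 1, Integer::sum);
            }
        }

        // Features that occur fewer times than the threshold are marked for removal.
        Pruned result = new Pruned();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < minCardinality) {
                result.removedFeatures.add(entry.getKey());
            }
        }

        // Build new examples from copies, so the input list and its sets are untouched.
        for (Set<String> example : examples) {
            Set<String> kept = new LinkedHashSet<>(example);
            kept.removeAll(result.removedFeatures);
            if (kept.isEmpty()) {
                result.numExamplesRemoved++; // no features left: drop the example
            } else {
                result.examples.add(kept);
            }
        }
        return result;
    }

    // Tiny made-up dataset used only for the demonstration below.
    static List<Set<String>> sampleData() {
        List<Set<String>> data = new ArrayList<>();
        data.add(new LinkedHashSet<>(Arrays.asList("the", "cat")));
        data.add(new LinkedHashSet<>(Arrays.asList("the", "dog")));
        data.add(new LinkedHashSet<>(Arrays.asList("the", "cat", "zebra")));
        data.add(new LinkedHashSet<>(Arrays.asList("yak")));
        return data;
    }

    public static void main(String[] args) {
        Pruned pruned = prune(sampleData(), 2);
        System.out.println("Kept examples: " + pruned.examples);
        System.out.println("Removed features: " + pruned.removedFeatures);
        System.out.println("Examples removed: " + pruned.numExamplesRemoved);
    }
}
```

With a threshold of 2 on this sample data, "dog", "zebra", and "yak" each occur only once and are removed, and the example that contained only "yak" is dropped because it has no features left.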
Nested Class Summary
Nested Classes -
Field Summary
Fields
static final int CURRENT_VERSION
    Protobuf serialization version.
Fields inherited from class org.tribuo.ImmutableDataset
dropInvalidExamples, featureIDMap, outputIDInfo
Fields inherited from class org.tribuo.Dataset
data, indices, outputFactory, rng, sourceProvenance, tribuoVersion
Fields inherited from interface org.tribuo.protos.ProtoSerializable
DESERIALIZATION_METHOD_NAME, PROVENANCE_SERIALIZER
Constructor Summary
Constructors -
Method Summary
static MinimumCardinalityDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message)
    Deserialization factory.
int getMinCardinality()
    The minimum cardinality threshold for the features.
int getNumExamplesRemoved()
    The number of examples removed due to a lack of features.
getRemoved()
    The feature names that were removed.
org.tribuo.protos.core.DatasetProto serialize()
    Serializes this object to a protobuf.
Methods inherited from class org.tribuo.ImmutableDataset
add, add, copyDataset, copyDataset, copyDataset, getDropInvalidExamples, getFeatureIDMap, getFeatureMap, getOutputIDInfo, getOutputInfo, getOutputs, hashFeatureMap, toString
Methods inherited from class org.tribuo.Dataset
castDataset, createDataCarrier, createDataCarrier, createTransformers, createTransformers, deserialize, deserializeExamples, deserializeFromFile, deserializeFromStream, getData, getExample, getOutputFactory, getSourceDescription, getSourceProvenance, iterator, serializeToFile, serializeToStream, shuffle, size, validate
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
CURRENT_VERSION
public static final int CURRENT_VERSION
Protobuf serialization version.
Constructor Details
-
MinimumCardinalityDataset
-
-
Method Details
-
deserializeFromProto
public static MinimumCardinalityDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
Deserialization factory.
- Parameters:
version - The serialized object version.
className - The class name.
message - The serialized data.
- Returns:
- The deserialized object.
- Throws:
com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
-
getRemoved
-
getNumExamplesRemoved
public int getNumExamplesRemoved()
The number of examples removed due to a lack of features.
- Returns:
- The number of removed examples.
-
getMinCardinality
public int getMinCardinality()
The minimum cardinality threshold for the features.
- Returns:
- The cardinality threshold.
-
getProvenance
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<T extends Output<T>>
- Overrides:
getProvenance in class ImmutableDataset<T extends Output<T>>
-
serialize
public org.tribuo.protos.core.DatasetProto serialize()
Description copied from interface: ProtoSerializable
Serializes this object to a protobuf.
- Specified by:
serialize in interface ProtoSerializable<T extends Output<T>>
- Overrides:
serialize in class ImmutableDataset<T extends Output<T>>
- Returns:
- The protobuf.
-