Package org.tribuo.sequence
Class MinimumCardinalitySequenceDataset<T extends Output<T>>
java.lang.Object
org.tribuo.sequence.SequenceDataset<T>
org.tribuo.sequence.ImmutableSequenceDataset<T>
org.tribuo.sequence.MinimumCardinalitySequenceDataset<T>
- Type Parameters:
T - The type of the outputs in this SequenceDataset.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>
public class MinimumCardinalitySequenceDataset<T extends Output<T>>
extends ImmutableSequenceDataset<T>
This class creates a pruned dataset from which features that occur fewer
times than the provided minimum cardinality have been removed. This can
be useful when the dataset is very large due to many low-frequency features.
A new dataset is created so that the feature counts are recalculated
and the original, passed-in dataset is not modified. The returned
dataset may contain fewer sequence examples than the original: if any
example within a sequence example has no features left after the minimum
cardinality has been applied, that sequence example is not added to the
constructed dataset.
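The pruning rule described above can be illustrated with a small, self-contained sketch. It does not use any Tribuo classes: a sequence is modeled as a list of examples, and each example as the set of its feature names. The names MinCardinalitySketch, Result, prune and numSequencesRemoved are hypothetical and exist only for this illustration. The sketch also assumes that a feature's frequency is the number of examples containing it, and that the removal counter counts dropped sequences; the actual class may count differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Conceptual sketch of minimum-cardinality pruning; not the Tribuo implementation. */
public class MinCardinalitySketch {

    /** Hypothetical holder for the pruned data and its bookkeeping values. */
    public static final class Result {
        public final List<List<Set<String>>> kept = new ArrayList<>();
        public final Set<String> removedFeatures = new TreeSet<>();
        public int numSequencesRemoved = 0;
    }

    /**
     * Builds a pruned copy of the input; the input itself is never modified.
     * Features seen in fewer than minCardinality examples are removed, and any
     * sequence containing an example left with no features is dropped.
     */
    public static Result prune(List<List<Set<String>>> sequences, int minCardinality) {
        // Pass 1: count how many examples each feature appears in.
        Map<String, Integer> counts = new HashMap<>();
        for (List<Set<String>> sequence : sequences) {
            for (Set<String> example : sequence) {
                for (String feature : example) {
                    counts.merge(feature, 1, Integer::sum);
                }
            }
        }

        // Features strictly below the threshold are marked for removal.
        Result result = new Result();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < minCardinality) {
                result.removedFeatures.add(entry.getKey());
            }
        }

        // Pass 2: copy each sequence without the removed features.
        for (List<Set<String>> sequence : sequences) {
            List<Set<String>> prunedSequence = new ArrayList<>();
            boolean hasEmptyExample = false;
            for (Set<String> example : sequence) {
                Set<String> pruned = new TreeSet<>(example);
                pruned.removeAll(result.removedFeatures);
                if (pruned.isEmpty()) {
                    hasEmptyExample = true;
                    break;
                }
                prunedSequence.add(pruned);
            }
            if (hasEmptyExample) {
                result.numSequencesRemoved++;
            } else {
                result.kept.add(prunedSequence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Set<String>>> sequences = List.of(
                List.of(Set.of("a", "b"), Set.of("a")),
                List.of(Set.of("a", "c"), Set.of("d")));
        Result result = prune(sequences, 2);
        System.out.println("kept sequences:    " + result.kept);
        System.out.println("removed features:  " + result.removedFeatures);
        System.out.println("sequences removed: " + result.numSequencesRemoved);
    }
}
```

In this example only the feature "a" reaches the threshold of 2, so "b", "c" and "d" are removed. The second sequence is dropped because its second example, which held only "d", is left with no features.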
-
Field Summary
Modifier and Type, Field - Description
static final int CURRENT_VERSION - Protobuf serialization version.
Fields inherited from class org.tribuo.sequence.ImmutableSequenceDataset
featureIDMap, outputIDInfo
Fields inherited from class org.tribuo.sequence.SequenceDataset
data, outputFactory, sourceProvenance, tribuoVersion
Fields inherited from interface org.tribuo.protos.ProtoSerializable
DESERIALIZATION_METHOD_NAME, PROVENANCE_SERIALIZER
-
Constructor Summary
MinimumCardinalitySequenceDataset(SequenceDataset<T> sequenceDataset, int minCardinality)
-
Method Summary
Modifier and Type, Method - Description
static MinimumCardinalitySequenceDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) - Deserialization factory.
int getMinCardinality() - The minimum cardinality threshold for the features.
int getNumExamplesRemoved() - The number of examples removed due to a lack of features.
getRemoved() - The feature names that were removed.
org.tribuo.protos.core.SequenceDatasetProto serialize() - Serializes this object to a protobuf.
Methods inherited from class org.tribuo.sequence.ImmutableSequenceDataset
add, add, copyDataset, copyDataset, copyDataset, getFeatureIDMap, getFeatureMap, getOutputIDInfo, getOutputInfo, getOutputs, toString
Methods inherited from class org.tribuo.sequence.SequenceDataset
castDataset, createDataCarrier, deserialize, deserializeExamples, deserializeFromFile, deserializeFromStream, getData, getExample, getFlatDataset, getOutputFactory, getSourceDescription, getSourceProvenance, iterator, serializeToFile, serializeToStream, size, validate
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
CURRENT_VERSION
public static final int CURRENT_VERSION
Protobuf serialization version.
-
Constructor Details
-
MinimumCardinalitySequenceDataset
- Parameters:
sequenceDataset - this dataset is left untouched and is used to populate the constructed dataset.
minCardinality - features with a frequency less than minCardinality will be removed.
-
-
Method Details
-
deserializeFromProto
public static MinimumCardinalitySequenceDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
Deserialization factory.
- Parameters:
version - The serialized object version.
className - The class name.
message - The serialized data.
- Returns:
- The deserialized object.
- Throws:
com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
-
getRemoved
The feature names that were removed.
- Returns:
- The feature names.
-
getNumExamplesRemoved
public int getNumExamplesRemoved()
The number of examples removed due to a lack of features.
- Returns:
- The number of removed examples.
-
getMinCardinality
public int getMinCardinality()
The minimum cardinality threshold for the features.
- Returns:
- The cardinality threshold.
-
serialize
public org.tribuo.protos.core.SequenceDatasetProto serialize()
Description copied from interface: ProtoSerializable
Serializes this object to a protobuf.
- Specified by:
serialize in interface ProtoSerializable<T extends Output<T>>
- Overrides:
serialize in class ImmutableSequenceDataset<T extends Output<T>>
- Returns:
- The protobuf.
-
getProvenance
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<T extends Output<T>>
- Overrides:
getProvenance in class ImmutableSequenceDataset<T extends Output<T>>
-