Class MinimumCardinalitySequenceDataset<T extends Output<T>>

java.lang.Object
org.tribuo.sequence.SequenceDataset<T>
org.tribuo.sequence.ImmutableSequenceDataset<T>
org.tribuo.sequence.MinimumCardinalitySequenceDataset<T>
Type Parameters:
T - The type of the outputs in this SequenceDataset.
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>, Serializable, Iterable<SequenceExample<T>>, ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>

public class MinimumCardinalitySequenceDataset<T extends Output<T>> extends ImmutableSequenceDataset<T>
Creates a pruned copy of a sequence dataset in which features that occur fewer than the provided minimum cardinality times have been removed. This is useful when a dataset is very large because of many low-frequency features. A new dataset is constructed so that the feature counts are recalculated and the original, passed-in dataset is not modified. The constructed dataset may contain fewer sequence examples than the original: if any example within a sequence example has no features left after the minimum cardinality is applied, that sequence example is not added to the constructed dataset.
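The pruning rule described above can be illustrated with plain Java collections. This is a minimal sketch and not Tribuo's implementation: it assumes each example is reduced to a set of feature names (feature values and outputs are ignored), and the names MinCardinalitySketch, Result, and prune are hypothetical, introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative model of minimum-cardinality pruning; NOT Tribuo code. */
public final class MinCardinalitySketch {

    /** Hypothetical result holder, loosely analogous to getRemoved() and getNumExamplesRemoved(). */
    public static final class Result {
        public final List<List<Set<String>>> kept = new ArrayList<>();
        public final Set<String> removedFeatures = new TreeSet<>();
        public int numSequencesRemoved = 0;
    }

    /**
     * Each sequence is a list of examples; each example is modelled as a set of feature names.
     * The input is never modified; a pruned copy is built instead.
     */
    public static Result prune(List<List<Set<String>>> sequences, int minCardinality) {
        // Pass 1: count how many examples (across all sequences) contain each feature.
        Map<String, Integer> counts = new HashMap<>();
        for (List<Set<String>> sequence : sequences) {
            for (Set<String> example : sequence) {
                for (String feature : example) {
                    counts.merge(feature, 1, Integer::sum);
                }
            }
        }

        // Features strictly below the threshold are removed; count == minCardinality is kept.
        Result result = new Result();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < minCardinality) {
                result.removedFeatures.add(entry.getKey());
            }
        }

        // Pass 2: copy each sequence, dropping it entirely if any example ends up with no features.
        for (List<Set<String>> sequence : sequences) {
            List<Set<String>> prunedSequence = new ArrayList<>();
            boolean hasEmptyExample = false;
            for (Set<String> example : sequence) {
                Set<String> prunedExample = new HashSet<>(example);
                prunedExample.removeAll(result.removedFeatures);
                if (prunedExample.isEmpty()) {
                    hasEmptyExample = true;
                    break;
                }
                prunedSequence.add(prunedExample);
            }
            if (hasEmptyExample) {
                result.numSequencesRemoved++;
            } else {
                result.kept.add(prunedSequence);
            }
        }
        return result;
    }
}
```

With the real class, the corresponding step is the constructor call, and the removed feature names, the removal count, and the threshold are then available from getRemoved(), getNumExamplesRemoved(), and getMinCardinality().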
  • Field Details

    • CURRENT_VERSION

      public static final int CURRENT_VERSION
      Protobuf serialization version.
  • Constructor Details

    • MinimumCardinalitySequenceDataset

      public MinimumCardinalitySequenceDataset(SequenceDataset<T> sequenceDataset, int minCardinality)
      Parameters:
      sequenceDataset - this dataset is left untouched and is used to populate the constructed dataset.
      minCardinality - features with a frequency less than minCardinality will be removed.
  • Method Details

    • deserializeFromProto

      public static MinimumCardinalitySequenceDataset<?> deserializeFromProto(int version, String className, com.google.protobuf.Any message) throws com.google.protobuf.InvalidProtocolBufferException
      Deserialization factory.
      Parameters:
      version - The serialized object version.
      className - The class name.
      message - The serialized data.
      Returns:
      The deserialized object.
      Throws:
      com.google.protobuf.InvalidProtocolBufferException - If the protobuf could not be parsed from the message.
    • getRemoved

      public Set<String> getRemoved()
      The feature names that were removed.
      Returns:
      The feature names.
    • getNumExamplesRemoved

      public int getNumExamplesRemoved()
      The number of examples removed due to a lack of features.
      Returns:
      The number of removed examples.
    • getMinCardinality

      public int getMinCardinality()
      The minimum cardinality threshold for the features.
      Returns:
      The cardinality threshold.
    • serialize

      public org.tribuo.protos.core.SequenceDatasetProto serialize()
      Description copied from interface: ProtoSerializable
      Serializes this object to a protobuf.
      Specified by:
      serialize in interface ProtoSerializable<org.tribuo.protos.core.SequenceDatasetProto>
      Overrides:
      serialize in class ImmutableSequenceDataset<T extends Output<T>>
      Returns:
      The protobuf.
    • getProvenance

      public DatasetProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<DatasetProvenance>
      Overrides:
      getProvenance in class ImmutableSequenceDataset<T extends Output<T>>