Class TokenPipeline

java.lang.Object
org.tribuo.data.text.impl.TokenPipeline
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, TextPipeline

public class TokenPipeline extends Object implements TextPipeline
A pipeline for generating ngram features.
  • Constructor Details

    • TokenPipeline

      public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting)
      Creates a new token pipeline.
      Parameters:
      tokenizer - The tokenizer used to split the text into words (i.e., features).
      ngram - The maximum size of ngram features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, although they are sometimes necessary.
      termCounting - If true, multiple occurrences of terms in the document will be counted and the count will be used as the value of the features that are produced.
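The interaction between ngram and termCounting can be illustrated with a small standalone sketch. It does not use Tribuo and is not the library's implementation: the class NgramSketch, its method names, the space-joined ngram strings, and the assumption that an uncounted feature takes the value 1.0 are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Tribuo.
final class NgramSketch {

    /** Joins every run of 1..maxN adjacent tokens into one ngram string. */
    static List<String> ngrams(List<String> tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int start = 0; start + n <= tokens.size(); start++) {
                out.add(String.join(" ", tokens.subList(start, start + n)));
            }
        }
        return out;
    }

    /**
     * Maps each distinct ngram to a feature value: its occurrence count when
     * termCounting is true, otherwise 1.0 (an assumption of this sketch).
     */
    static Map<String, Double> featureValues(List<String> tokens, int maxN, boolean termCounting) {
        Map<String, Double> values = new LinkedHashMap<>();
        for (String ngram : ngrams(tokens, maxN)) {
            if (termCounting) {
                values.merge(ngram, 1.0, Double::sum);
            } else {
                values.put(ngram, 1.0);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("to", "be", "or", "not", "to", "be");
        System.out.println(featureValues(tokens, 2, true));
        System.out.println(featureValues(tokens, 2, false));
    }
}
```

With ngram set to 2 and term counting enabled, the repeated bigram "to be" in this input ends up with the value 2.0; with term counting disabled, the sketch assigns it 1.0.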
    • TokenPipeline

      public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension)
      Creates a new token pipeline.
      Parameters:
      tokenizer - The tokenizer used to split the text into words (i.e., features).
      ngram - The maximum size of ngram features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, although they are sometimes necessary.
      termCounting - If true, multiple occurrences of terms in the document will be counted and the count will be used as the value of the features that are produced.
      dimension - The maximum dimension for the feature space. If this value is greater than 0, then at most dimension features will be generated, through the use of a hashing function that collapses the feature space. This TokenPipeline will preserve the feature values when hashing.
    • TokenPipeline

      public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension, boolean hashPreserveValue)
      Creates a new token pipeline.
      Parameters:
      tokenizer - The tokenizer used to split the text into words (i.e., features).
      ngram - The maximum size of ngram features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, although they are sometimes necessary.
      termCounting - If true, multiple occurrences of terms in the document will be counted and the count will be used as the value of the features that are produced.
      dimension - The maximum dimension for the feature space. If this value is greater than 0, then at most dimension features will be generated, through the use of a hashing function that collapses the feature space.
      hashPreserveValue - If true, the hash function preserves the feature value; if false, it hashes the value into {-1, 1}.
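The effect of dimension and hashPreserveValue can be sketched with a generic feature-hashing routine. This is a standalone illustration under stated assumptions, not Tribuo's code: the class HashingSketch and its method are hypothetical, String.hashCode stands in for whatever hash function the library actually uses, and the way the sign is derived from the hash is an arbitrary choice for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Tribuo.
final class HashingSketch {

    /**
     * Maps each named feature to a (bucket, value) pair with bucket in [0, dimension).
     * String.hashCode is a stand-in hash function chosen for this sketch.
     */
    static List<Map.Entry<Integer, Double>> hashFeatures(
            Map<String, Double> features, int dimension, boolean hashPreserveValue) {
        if (dimension <= 0) {
            throw new IllegalArgumentException("dimension must be greater than 0 to hash");
        }
        List<Map.Entry<Integer, Double>> hashed = new ArrayList<>();
        for (Map.Entry<String, Double> feature : features.entrySet()) {
            int hash = feature.getKey().hashCode();
            // floorMod keeps the bucket non-negative even for negative hash codes.
            int bucket = Math.floorMod(hash, dimension);
            // Either keep the original value or replace it with a sign taken
            // from one bit of the hash (an arbitrary choice in this sketch).
            double value = hashPreserveValue
                    ? feature.getValue()
                    : ((hash & 0x10000) == 0 ? 1.0 : -1.0);
            hashed.add(Map.entry(bucket, value));
        }
        return hashed;
    }

    public static void main(String[] args) {
        Map<String, Double> features = new LinkedHashMap<>();
        features.put("to be", 2.0);
        features.put("or", 1.0);
        features.put("not", 1.0);
        System.out.println(hashFeatures(features, 4, true));
        System.out.println(hashFeatures(features, 4, false));
    }
}
```

Because every feature name lands in one of dimension buckets, distinct ngrams may collide; the sketch leaves collisions visible as repeated bucket indices rather than deciding how to merge them.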
  • Method Details

    • postConfig

      public void postConfig()
      Used by the OLCUT configuration system, and should not be called by external code.
      Specified by:
      postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • process

      public List<Feature> process(String tag, String data)
      Description copied from interface: TextPipeline
      Extracts a list of features from the supplied text, using the tag to prepend the feature names.
      Specified by:
      process in interface TextPipeline
      Parameters:
      tag - The feature name tag.
      data - The text to extract features from.
      Returns:
      The extracted features.
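The role of the tag argument can be shown with a minimal standalone sketch. It does not call Tribuo: the class TagPrefixSketch, the whitespace tokenization, and the "tag=token" name format are hypothetical and chosen only to illustrate prepending a tag to feature names; the names the library actually produces may look different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Tribuo.
final class TagPrefixSketch {

    /** Splits data on whitespace and prefixes each token with the tag. */
    static List<String> featureNames(String tag, String data) {
        List<String> names = new ArrayList<>();
        for (String token : data.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // The "=" separator is an assumption of this sketch.
                names.add(tag + "=" + token);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(featureNames("title", "hello world"));
        System.out.println(featureNames("body", "hello world"));
    }
}
```

One consequence of prefixing is that the same word extracted under two different tags yields two distinct feature names, so text from separate fields does not share features.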
    • getProvenance

      public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>