java.lang.Object

org.tribuo.data.text.impl.TokenPipeline

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, TextPipeline

public class TokenPipeline extends Object implements TextPipeline

A pipeline for generating ngram features.
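To make "ngram features" concrete, the sketch below shows one way the ngram and termCounting constructor parameters could interact. It is an illustration in plain Java, not Tribuo's implementation: the class NgramSketch, its methods, the "/" separator between tokens, and the choice of 1.0 as the value when termCounting is false are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Tribuo.
final class NgramSketch {

    /** Returns every ngram of size 1..maxN, joining tokens with '/' (assumed separator). */
    static List<String> ngrams(List<String> tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join("/", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    /**
     * Maps each distinct ngram to a value: its occurrence count when
     * termCounting is true, otherwise 1.0 (an assumption for this sketch).
     */
    static Map<String, Double> featureValues(List<String> tokens, int maxN, boolean termCounting) {
        Map<String, Double> values = new LinkedHashMap<>();
        for (String gram : ngrams(tokens, maxN)) {
            if (termCounting) {
                values.merge(gram, 1.0, Double::sum);
            } else {
                values.put(gram, 1.0);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "saw", "the", "dog");
        System.out.println(featureValues(tokens, 2, true));
        System.out.println(featureValues(tokens, 2, false));
    }
}
```

With ngram set to 2, the five tokens above yield five unigrams and four bigrams; the repeated token "the" is the only feature whose value differs between the two termCounting settings.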

Constructor Summary

Constructors

Constructor

Description

TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting)

Creates a new token pipeline.

TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension)

Creates a new token pipeline.

TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension, boolean hashPreserveValue)

Creates a new token pipeline.

Method Summary

Modifier and Type

Method

Description

com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance

getProvenance()

void

postConfig()

Used by the OLCUT configuration system, and should not be called by external code.

List<Feature>

process(String tag, String data)

Extracts a list of features from the supplied text, using the tag to prepend the feature names.

String

toString()

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Constructor Details
- TokenPipeline
  
  public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting)
  
  Creates a new token pipeline.
  
  Parameters:
  
  tokenizer - The tokenizer used to split the text into words (i.e., features).
  
  ngram - The maximum size of ngram features to add to the features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, although they are occasionally necessary.
  
  termCounting - If true, multiple occurrences of a term in the document are counted, and the count is used as the value of the resulting feature.
- TokenPipeline
  
  public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension)
  
  Creates a new token pipeline.
  
  Parameters:
  
  tokenizer - The tokenizer used to split the text into words (i.e., features).
  
  ngram - The maximum size of ngram features to add to the features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, although they are occasionally necessary.
  
  termCounting - If true, multiple occurrences of a term in the document are counted, and the count is used as the value of the resulting feature.
  
  dimension - The maximum dimension for the feature space. If this value is greater than 0, then at most dimension features will be produced, through the use of a hashing function that collapses the feature space. This TokenPipeline will preserve the feature values when hashing.
- TokenPipeline
  
  public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension, boolean hashPreserveValue)
  
  Creates a new token pipeline.
  
  Parameters:
  
  tokenizer - The tokenizer used to split the text into words (i.e., features).
  
  ngram - The maximum size of ngram features to add to the features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, although they are occasionally necessary.
  
  termCounting - If true, multiple occurrences of a term in the document are counted, and the count is used as the value of the resulting feature.
  
  dimension - The maximum dimension for the feature space. If this value is greater than 0, then at most dimension features will be produced, through the use of a hashing function that collapses the feature space.
  
  hashPreserveValue - If true, the hash function preserves the feature value; if false, it hashes the value into {-1, 1}.
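
The dimension and hashPreserveValue parameters describe feature hashing. The sketch below illustrates the general technique in plain Java, under stated assumptions: the class HashingSketch, the use of String.hashCode() for bucketing, the bit-parity sign function, and the summing of colliding features are all choices made for this example and are not taken from Tribuo's source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Tribuo.
final class HashingSketch {

    /** Maps a feature name to a bucket in [0, dimension). */
    static int bucket(String name, int dimension) {
        return Math.floorMod(name.hashCode(), dimension);
    }

    /** Derives a value in {-1, 1} from the feature name (assumed sign hash). */
    static double sign(String name) {
        return (Integer.bitCount(name.hashCode()) & 1) == 0 ? 1.0 : -1.0;
    }

    /**
     * Collapses the features into at most `dimension` buckets.
     * A dimension of 0 or less leaves the features unhashed.
     * Colliding features are summed (an assumption for this sketch).
     */
    static Map<String, Double> hashFeatures(Map<String, Double> features,
                                            int dimension,
                                            boolean hashPreserveValue) {
        if (dimension <= 0) {
            return new TreeMap<>(features);
        }
        Map<String, Double> hashed = new TreeMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            String key = Integer.toString(bucket(e.getKey(), dimension));
            double value = hashPreserveValue ? e.getValue() : sign(e.getKey());
            hashed.merge(key, value, Double::sum);
        }
        return hashed;
    }

    public static void main(String[] args) {
        Map<String, Double> features = new HashMap<>();
        features.put("the", 2.0);
        features.put("cat", 1.0);
        features.put("the/cat", 1.0);
        System.out.println(hashFeatures(features, 2, true));
        System.out.println(hashFeatures(features, 2, false));
    }
}
```

A small dimension bounds memory use at the cost of collisions between unrelated features; replacing values with a sign in {-1, 1} is a common way to let colliding features partially cancel rather than accumulate.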

Method Details
- postConfig
  
  public void postConfig()
  
  Used by the OLCUT configuration system, and should not be called by external code.
  
  Specified by:
  
  postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
- process
  
  public List<Feature> process(String tag, String data)
  
  Description copied from interface: TextPipeline
  
  Extracts a list of features from the supplied text, using the tag to prepend the feature names.
  
  Specified by:
  
  process in interface TextPipeline
  
  Parameters:
  
  tag - The feature name tag.
  
  data - The text to extract features from.
  
  Returns:
  
  The extracted features.
- getProvenance
  
  public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
  
  Specified by:
  
  getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
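
The process method is described as using the tag to prepend the feature names. The sketch below shows what such prefixing could look like in plain Java. It is not Tribuo's implementation: the class TagPrefixSketch, the whitespace tokenization, and the "=" separator between tag and token are assumptions for this example, and it returns only names rather than Feature objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Tribuo.
final class TagPrefixSketch {

    /** Splits the text on whitespace and prefixes each token with the tag. */
    static List<String> process(String tag, String data) {
        List<String> names = new ArrayList<>();
        for (String token : data.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                names.add(tag + "=" + token); // "=" is an assumed separator
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(process("body", "hello world"));
    }
}
```

Prefixing with a tag keeps features from different text fields (for example a subject and a body) distinct even when they share tokens.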
