Package org.tribuo.data.text.impl
Class TokenPipeline
java.lang.Object
org.tribuo.data.text.impl.TokenPipeline
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, TextPipeline
A pipeline for generating ngram features.
-
Constructor Summary
Constructor: Description
TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting): Creates a new token pipeline.
TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension): Creates a new token pipeline.
TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension, boolean hashPreserveValue): Creates a new token pipeline.
-
Method Summary
Modifier and Type, Method: Description
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
void postConfig(): Used by the OLCUT configuration system, and should not be called by external code.
List<Feature> process(String tag, String data): Extracts a list of features from the supplied text, prepending the tag to each feature name.
String toString()
-
Constructor Details
-
TokenPipeline
public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting)
Creates a new token pipeline.
- Parameters:
tokenizer
- The tokenizer to use to split up the text into words (i.e., features).
ngram
- The maximum size of ngram features to add to the features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, but there will be times when they are necessary.
termCounting
- If true, multiple occurrences of a term in the document are counted, and the count is used as the value of the feature that is produced.
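The interaction between ngram and termCounting can be illustrated with a small, self-contained sketch. It is not Tribuo's implementation: the class NgramSketch, the method ngramFeatures, the "/" separator between tokens, and the value 1.0 used when termCounting is false are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the library's code.
final class NgramSketch {

    // Builds every ngram of size 1..maxN from the tokens and maps it to a value.
    // With termCounting the value is the number of occurrences; otherwise each
    // ngram that is present gets 1.0 (an assumption of this sketch).
    static Map<String, Double> ngramFeatures(List<String> tokens, int maxN, boolean termCounting) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int start = 0; start + n <= tokens.size(); start++) {
                // The "/" separator is hypothetical.
                String name = String.join("/", tokens.subList(start, start + n));
                if (termCounting) {
                    features.merge(name, 1.0, Double::sum);
                } else {
                    features.put(name, 1.0);
                }
            }
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("to", "be", "or", "not", "to", "be");
        System.out.println(ngramFeatures(tokens, 2, true));
        System.out.println(ngramFeatures(tokens, 2, false));
    }
}
```

With ngram set to 2, the six tokens above yield four distinct unigrams and four distinct bigrams; with term counting enabled, the repeated unigram "to" and the repeated bigram "to/be" each carry the value 2.0.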
-
TokenPipeline
public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension)
Creates a new token pipeline.
- Parameters:
tokenizer
- The tokenizer to use to split up the text into words (i.e., features).
ngram
- The maximum size of ngram features to add to the features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, but there will be times when they are necessary.
termCounting
- If true, multiple occurrences of a term in the document are counted, and the count is used as the value of the feature that is produced.
dimension
- The maximum dimension for the feature space. If this value is greater than 0, then at most dimension features will be produced, through the use of a hashing function that collapses the feature space. This TokenPipeline will preserve the feature values when hashing, w.
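To make the effect of dimension concrete, here is a minimal sketch of feature hashing. It is an illustration under stated assumptions, not the library's code: the class HashingSketch and its methods are hypothetical, String.hashCode stands in for whatever hash function is actually used, and summing the values of colliding features is a choice made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the library's code.
final class HashingSketch {

    // Maps a feature name to a bucket in [0, dimension).
    // String.hashCode is a stand-in hash function (assumption).
    static int bucket(String name, int dimension) {
        return Math.floorMod(name.hashCode(), dimension);
    }

    // Collapses named features into at most `dimension` buckets.
    // Colliding features have their values summed (assumption).
    // A dimension of 0 or less leaves the features unhashed.
    static Map<String, Double> hashFeatures(Map<String, Double> features, int dimension) {
        if (dimension <= 0) {
            return new LinkedHashMap<>(features);
        }
        Map<String, Double> hashed = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            String hashedName = Integer.toString(bucket(e.getKey(), dimension));
            hashed.merge(hashedName, e.getValue(), Double::sum);
        }
        return hashed;
    }

    public static void main(String[] args) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (int i = 0; i < 50; i++) {
            features.put("f" + i, 1.0);
        }
        // However many features go in, no more than 8 distinct names come out.
        System.out.println(hashFeatures(features, 8).size());
    }
}
```

The trade-off is the usual one for hashing: the feature space is bounded regardless of vocabulary size, at the cost of unrelated features sharing a bucket.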
-
TokenPipeline
public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension, boolean hashPreserveValue)
Creates a new token pipeline.
- Parameters:
tokenizer
- The tokenizer to use to split up the text into words (i.e., features).
ngram
- The maximum size of ngram features to add to the features generated by the pipeline. A value of n means that ngram features of sizes 1 through n will be generated. A good standard value is 2, which generates unigram and bigram features. Larger values of n will very likely show diminishing returns, but there will be times when they are necessary.
termCounting
- If true, multiple occurrences of a term in the document are counted, and the count is used as the value of the feature that is produced.
dimension
- The maximum dimension for the feature space. If this value is greater than 0, then at most dimension features will be produced, through the use of a hashing function that collapses the feature space.
hashPreserveValue
- If true, the hash function preserves the feature value; if false, it hashes the value into {-1, 1}.
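The two behaviours selected by hashPreserveValue can be sketched as follows. The description above says only that the value is either preserved or hashed into {-1, 1}; how the sign is chosen is not stated, so deriving it from a mixed hash of the feature name is an assumption of this sketch, as are the names HashValueSketch and hashedValue.

```java
// Illustrative sketch only; not the library's code.
final class HashValueSketch {

    // Returns the value stored for a hashed feature.
    // preserve == true keeps the original value.
    // preserve == false replaces it with +1.0 or -1.0.
    static double hashedValue(String name, double value, boolean preserve) {
        if (preserve) {
            return value;
        }
        // Assumption: the sign comes from the top bit of a mixed hash of the name.
        int mixed = name.hashCode() * 0x9E3779B1;
        return (mixed >>> 31) == 0 ? 1.0 : -1.0;
    }

    public static void main(String[] args) {
        System.out.println(hashedValue("to/be", 2.0, true));
        System.out.println(hashedValue("to/be", 2.0, false));
    }
}
```

Signed values are a common choice in feature hashing because collisions between unrelated features then tend to cancel rather than accumulate, whereas preserving the value keeps term counts intact.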
-
-
Method Details
-
postConfig
public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
- Specified by:
postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
-
toString
-
process
public List<Feature> process(String tag, String data)
Description copied from interface: TextPipeline
Extracts a list of features from the supplied text, prepending the tag to each feature name.
- Specified by:
process in interface TextPipeline
- Parameters:
tag
- The feature name tag.
data
- The text to extract.
- Returns:
- The extracted features.
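The role of the tag can be shown with a small sketch that splits text on whitespace and prefixes each token with the tag. This only illustrates the idea of tag prepending: the class TagSketch, the whitespace splitting, and the "-" separator are assumptions, and the actual feature name format is not given on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the library's code.
final class TagSketch {

    // Splits the text on whitespace and prefixes each token with the tag.
    // The "-" separator between tag and token is hypothetical.
    static List<String> process(String tag, String data) {
        List<String> names = new ArrayList<>();
        for (String token : data.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                names.add(tag + "-" + token);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(process("subject", "hello world"));
        System.out.println(process("body", "hello world"));
    }
}
```

Prefixing with a tag keeps features from different text fields apart, so the same word in a subject line and in a body produces two distinct feature names.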
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-