public class TokenPipeline extends Object implements TextPipeline
Constructor | Description |
---|---|
TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting) | Creates a new token pipeline. |
TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension) | Creates a new token pipeline. |

Modifier and Type | Method | Description |
---|---|---|
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() | |
void | postConfig() | Used by the OLCUT configuration system, and should not be called by external code. |
List<Feature> | process(String tag, String data) | Extracts a list of features from the supplied text, using the tag as a prefix for the feature names. |
String | toString() | |

public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting)
Creates a new token pipeline.
tokenizer - The tokenizer used to split the text into words (i.e., features).
ngram - The maximum size of the ngram features generated by the pipeline. A value of n means that ngram features of sizes 1 to n are generated. A good standard value is 2, which produces unigram and bigram features. Larger values of n will very likely give diminishing returns, but there will be times when they are necessary.
termCounting - If true, multiple occurrences of a term in the document are counted and the count is used as the value of the feature that is produced.

public TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension)
Creates a new token pipeline.
tokenizer - The tokenizer used to split the text into words (i.e., features).
ngram - The maximum size of the ngram features generated by the pipeline. A value of n means that ngram features of sizes 1 to n are generated. A good standard value is 2, which produces unigram and bigram features. Larger values of n will very likely give diminishing returns, but there will be times when they are necessary.
termCounting - If true, multiple occurrences of a term in the document are counted and the count is used as the value of the feature that is produced.
dimension - The maximum dimension of the feature space. If this value is greater than 0, at most dimension features are generated, using a hashing function that collapses the feature space.

public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
Specified by: postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
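
As a rough illustration of how the ngram and termCounting constructor parameters interact with the tag argument of process, the following self-contained sketch builds tag-prefixed ngram features from text that is assumed to be tokenized already. It does not use or reproduce the TokenPipeline implementation: the class NgramSketch, the method features, and the feature-name format (tag-Ngram=token/token) are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Tribuo implementation.
class NgramSketch {

    /**
     * Builds tag-prefixed ngram features of sizes 1 to ngram.
     * With termCounting the value is the number of occurrences,
     * otherwise every feature that occurs has the value 1.0.
     * The name format "tag-Ngram=a/b" is an assumption of this sketch.
     */
    static Map<String, Double> features(String tag, List<String> tokens,
                                        int ngram, boolean termCounting) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (int n = 1; n <= ngram; n++) {
            // Slide a window of n tokens across the token list.
            for (int i = 0; i + n <= tokens.size(); i++) {
                String name = tag + "-" + n + "gram="
                        + String.join("/", tokens.subList(i, i + n));
                if (termCounting) {
                    out.merge(name, 1.0, Double::sum); // accumulate the count
                } else {
                    out.put(name, 1.0);                // presence only
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "cat", "saw", "the", "dog");
        System.out.println(features("body", tokens, 2, true));
        System.out.println(features("body", tokens, 2, false));
    }
}
```

With ngram set to 2 the sketch emits unigrams and bigrams; the repeated token "the" receives a count of 2 only when termCounting is true.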
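public List<Feature> process(String tag, String data)
Description copied from interface: TextPipeline
Extracts a list of features from the supplied text, using the tag as a prefix for the feature names.
Specified by: process in interface TextPipeline
tag - The feature name tag.
data - The text to extract features from.

public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()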
Specified by: getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
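
The dimension parameter of the four-argument constructor caps the feature space by hashing. A minimal sketch of that idea, under stated assumptions, is given here: String.hashCode stands in for whatever hash function the library actually uses, colliding features are summed, and a dimension of 0 or less is taken to mean no hashing. The class HashingSketch and its methods are hypothetical and are not part of the TokenPipeline API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of feature hashing; not the Tribuo implementation.
class HashingSketch {

    /** Maps a feature name to a bucket in [0, dimension). String.hashCode is a stand-in hash. */
    static int bucket(String name, int dimension) {
        return Math.floorMod(name.hashCode(), dimension);
    }

    /**
     * Collapses named features into at most dimension hashed features.
     * Assumption: values of colliding features are summed, and a
     * dimension of 0 or less leaves the features unchanged.
     */
    static Map<String, Double> hash(Map<String, Double> features, int dimension) {
        if (dimension <= 0) {
            return new LinkedHashMap<>(features);
        }
        Map<String, Double> hashed = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            String name = "hash=" + bucket(e.getKey(), dimension);
            hashed.merge(name, e.getValue(), Double::sum);
        }
        return hashed;
    }

    public static void main(String[] args) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (String word : List.of("alpha", "beta", "gamma", "delta", "epsilon", "zeta")) {
            features.put(word, 1.0);
        }
        Map<String, Double> hashed = hash(features, 4);
        System.out.println(hashed.size() + " hashed features: " + hashed);
    }
}
```

However many distinct feature names go in, at most dimension names come out, at the cost of unrelated features occasionally sharing a bucket.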
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.