Class SplitPatternTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitPatternTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This implementation of Tokenizer is instantiated with a regular
expression pattern which determines how to split a string into tokens. That
is, the pattern defines the "splits", not the tokens. For example, to
tokenize on whitespace, provide the pattern "\s+".
- Author:
- Philip Ogren
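To illustrate the difference between splits and tokens, here is a minimal sketch that uses only java.util.regex rather than Tribuo itself. The class name SplitDemo and the helper splitOn are hypothetical, and the tokenizer's own handling of edge cases (for example, leading separators) may differ from Pattern.split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitDemo {
    // Hypothetical helper: cuts the text wherever the regex matches.
    static List<String> splitOn(String regex, String text) {
        return Arrays.asList(Pattern.compile(regex).split(text));
    }

    public static void main(String[] args) {
        // "\\s+" matches the gaps between tokens, not the tokens themselves.
        System.out.println(splitOn("\\s+", "one  two\tthree")); // prints [one, two, three]
    }
}
```

Note that the regex \s+ is written as "\\s+" in a Java string literal.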
-
Field Summary
Fields
static final String SIMPLE_DEFAULT_PATTERN: The default split pattern, which is [\.,]?\s+.
-
Constructor Summary
Constructors
SplitPatternTokenizer(): Initializes a case-insensitive tokenizer with the pattern [\.,]?\s+.
SplitPatternTokenizer(String splitPatternRegex): Constructs a splitting tokenizer using the supplied regex.
-
Method Summary
Methods
boolean advance(): Advances the tokenizer to the next token.
clone(): Clones a tokenizer with its configuration.
int getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
getSplitPatternRegex(): Gets the String form of the regex in use.
int getStart(): Gets the starting character offset of the current token in the character sequence.
getText(): Gets the text of the current token, as a string.
getType(): Gets the type of the current token.
void postConfig(): Used by the OLCUT configuration system, and should not be called by external code.
void reset(CharSequence cs): Resets the tokenizer so that it operates on a new sequence of characters.
-
Field Details
-
SIMPLE_DEFAULT_PATTERN
The default split pattern, which is [\.,]?\s+.
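As a rough illustration of what the default pattern does, the sketch below applies the same regex with java.util.regex. The class name DefaultPatternDemo and its local constant are hypothetical and stand in for the real field; Tribuo's tokenizer may treat edge cases differently from Pattern.split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DefaultPatternDemo {
    // Same regex as the documented default; this constant is local to the sketch.
    static final String DEFAULT = "[\\.,]?\\s+";

    static List<String> split(String text) {
        return Arrays.asList(Pattern.compile(DEFAULT).split(text));
    }

    public static void main(String[] args) {
        // A period or comma is consumed only when whitespace follows it,
        // so the final "." stays attached to the last token.
        System.out.println(split("Hello, world. Goodbye.")); // prints [Hello, world, Goodbye.]
    }
}
```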
-
-
Constructor Details
-
SplitPatternTokenizer
public SplitPatternTokenizer()
Initializes a case-insensitive tokenizer with the pattern [\.,]?\s+.
-
SplitPatternTokenizer
public SplitPatternTokenizer(String splitPatternRegex)
Constructs a splitting tokenizer using the supplied regex.
- Parameters:
splitPatternRegex - The regex to use.
-
-
Method Details
-
postConfig
public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
- Specified by:
postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-
getSplitPatternRegex
-
reset
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
-
advance
-
getText
-
getStart
-
getEnd
-
getType
Description copied from interface: Tokenizer
Gets the type of the current token.
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
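The reset/advance cycle described in the method details above can be sketched with a simplified stand-in. MiniSplitTokenizer is a hypothetical class that only mirrors the documented method names (reset, advance, getText, getStart, getEnd) on top of java.util.regex. It is not Tribuo's implementation, and it assumes that empty tokens are skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mirrors the documented method names.
// It is NOT Tribuo's implementation.
public class MiniSplitTokenizer {
    private final Pattern splitPattern;
    private CharSequence cs;
    private Matcher matcher;
    private int searchFrom; // offset where the search for the next token begins
    private int start;      // start offset of the current token
    private int end;        // end offset (exclusive) of the current token

    public MiniSplitTokenizer(String splitPatternRegex) {
        this.splitPattern = Pattern.compile(splitPatternRegex);
        reset("");
    }

    // Points the tokenizer at a new character sequence and clears its state.
    public void reset(CharSequence newCs) {
        cs = newCs;
        matcher = splitPattern.matcher(newCs);
        searchFrom = 0;
        start = 0;
        end = 0;
    }

    // Moves to the next non-empty token; returns false when none remain.
    public boolean advance() {
        while (searchFrom < cs.length()) {
            int tokenStart = searchFrom;
            int tokenEnd;
            // An empty match is treated as "no further split" so the loop always makes progress.
            if (matcher.find(searchFrom) && matcher.end() > matcher.start()) {
                tokenEnd = matcher.start();
                searchFrom = matcher.end();
            } else {
                tokenEnd = cs.length();
                searchFrom = cs.length();
            }
            if (tokenEnd > tokenStart) {
                start = tokenStart;
                end = tokenEnd;
                return true;
            }
        }
        return false;
    }

    public String getText() { return cs.subSequence(start, end).toString(); }
    public int getStart() { return start; }
    public int getEnd() { return end; }

    public static void main(String[] args) {
        MiniSplitTokenizer t = new MiniSplitTokenizer("\\s+");
        t.reset("split on  spaces");
        while (t.advance()) {
            System.out.println(t.getText() + " [" + t.getStart() + "," + t.getEnd() + ")");
        }
        // The same instance can be pointed at new text with another reset call.
        t.reset("more text");
        while (t.advance()) {
            System.out.println(t.getText());
        }
    }
}
```

The calling pattern is the point here: reset supplies the text, advance is called until it returns false, and the getters are only meaningful after a successful advance. Per the clone description, a cloned tokenizer would likewise need its own reset call before use.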
-