Package org.tribuo.util.tokens.impl
Class SplitPatternTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitPatternTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens. That is, the pattern defines the "splits", not the tokens. For example, to tokenize on white space provide the pattern "\s+".
- Author:
- Philip Ogren
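The distinction drawn in the class description, that the pattern marks the splits rather than the tokens, can be illustrated with plain java.util.regex. The following is a minimal sketch and not the Tribuo implementation; the class name SplitVsMatch and its two helper methods are hypothetical. Note that the pattern written as "\s+" in this documentation becomes "\\s+" inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses only java.util.regex, not the Tribuo tokenizer.
class SplitVsMatch {

    // Treats the regex as a separator: returns the text BETWEEN matches.
    static List<String> betweenMatches(String regex, String text) {
        return Arrays.asList(Pattern.compile(regex).split(text));
    }

    // Treats the regex as a token description: returns the matches themselves.
    static List<String> matches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "to be or not";
        System.out.println(betweenMatches("\\s+", text)); // [to, be, or, not]
        System.out.println(matches("\\s+", text));        // three single-space strings
        System.out.println(matches("\\S+", text));        // [to, be, or, not]
    }
}
```

A split-style tokenizer behaves like betweenMatches: passing "\s+" yields the words, whereas a match-style tokenizer would need the complementary pattern "\S+" to produce the same result.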
-
Field Summary
Modifier and Type | Field | Description
static final String | SIMPLE_DEFAULT_PATTERN | The default split pattern, which is [\.,]?\s+.
-
Constructor Summary
Constructor | Description
SplitPatternTokenizer() | Initializes a case insensitive tokenizer with the pattern [\.,]?\s+
SplitPatternTokenizer(String splitPatternRegex) | Constructs a splitting tokenizer using the supplied regex.
-
Method Summary
Modifier and Type | Method | Description
boolean | advance() | Advances the tokenizer to the next token.
SplitPatternTokenizer | clone() | Clones a tokenizer with its configuration.
int | getEnd() | Gets the ending offset (exclusive) of the current token in the character sequence.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
String | getSplitPatternRegex() | Gets the String form of the regex in use.
int | getStart() | Gets the starting character offset of the current token in the character sequence.
String | getText() | Gets the text of the current token, as a string.
Token.TokenType | getType() | Gets the type of the current token.
void | postConfig() | Used by the OLCUT configuration system, and should not be called by external code.
void | reset(CharSequence cs) | Resets the tokenizer so that it operates on a new sequence of characters.
-
Field Details
-
SIMPLE_DEFAULT_PATTERN
The default split pattern, which is [\.,]?\s+.
- See Also:
-
-
Constructor Details
-
SplitPatternTokenizer
public SplitPatternTokenizer()
Initializes a case insensitive tokenizer with the pattern [\.,]?\s+
-
SplitPatternTokenizer
public SplitPatternTokenizer(String splitPatternRegex)
Constructs a splitting tokenizer using the supplied regex.
- Parameters:
splitPatternRegex - The regex to use.
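To make the default pattern [\.,]?\s+ concrete, the sketch below applies the same regex text with java.util.regex.Pattern.split. This only approximates what the no-argument constructor configures and does not call Tribuo; the class name DefaultPatternDemo and its split helper are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; approximates the default split pattern with java.util.regex only.
class DefaultPatternDemo {

    // Same regex text as the documented default: an optional '.' or ',' followed by whitespace.
    static final Pattern DEFAULT_SPLIT = Pattern.compile("[\\.,]?\\s+");

    static List<String> split(String text) {
        return Arrays.asList(DEFAULT_SPLIT.split(text));
    }

    public static void main(String[] args) {
        // prints [Pi, is, 3.14, roughly, Done.]
        System.out.println(split("Pi is 3.14, roughly. Done."));
    }
}
```

A period or comma is consumed only when whitespace follows it, so the decimal point in "3.14" and the final period in "Done." stay attached to their tokens. If that is not the desired behaviour, a different regex can be passed to the single-argument constructor.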
-
-
Method Details
-
postConfig
public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
- Specified by:
postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-
getSplitPatternRegex
public String getSplitPatternRegex()
Gets the String form of the regex in use.
- Returns:
The regex.
-
reset
public void reset(CharSequence cs)
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
-
advance
public boolean advance()
Description copied from interface: Tokenizer
Advances the tokenizer to the next token.
-
getText
public String getText()
Description copied from interface: Tokenizer
Gets the text of the current token, as a string.
-
getStart
public int getStart()
Description copied from interface: Tokenizer
Gets the starting character offset of the current token in the character sequence.
-
getEnd
public int getEnd()
Description copied from interface: Tokenizer
Gets the ending offset (exclusive) of the current token in the character sequence.
-
getType
public Token.TokenType getType()
Description copied from interface: Tokenizer
Gets the type of the current token.
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
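The methods above form a cursor-style protocol: call reset with the text, call advance until it returns false, and read getText, getStart and getEnd after each successful advance. The sketch below is a hypothetical stand-in (SplitCursorSketch) built on java.util.regex so that it runs without Tribuo on the classpath. It mirrors the documented method names but is not the library's implementation, and it omits getType, clone and the OLCUT configuration hooks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, NOT the Tribuo class: mimics the reset/advance/get* protocol.
class SplitCursorSketch {
    private final Pattern splitPattern;
    private CharSequence cs;
    private Matcher matcher;
    private int searchFrom; // offset where the next token may begin
    private int start;      // inclusive start of the current token
    private int end;        // exclusive end of the current token

    SplitCursorSketch(String splitRegex) {
        this.splitPattern = Pattern.compile(splitRegex);
    }

    // Points the cursor at a new character sequence and clears the current token.
    void reset(CharSequence newCs) {
        cs = newCs;
        matcher = splitPattern.matcher(newCs);
        searchFrom = 0;
        start = -1;
        end = -1;
    }

    // Moves to the next non-empty stretch of text between split matches.
    boolean advance() {
        while (searchFrom < cs.length()) {
            int tokenStart = searchFrom;
            int tokenEnd;
            if (matcher.find()) {
                tokenEnd = matcher.start();
                searchFrom = matcher.end();
            } else {
                tokenEnd = cs.length();
                searchFrom = cs.length();
            }
            if (tokenEnd > tokenStart) {
                start = tokenStart;
                end = tokenEnd;
                return true;
            }
        }
        return false;
    }

    String getText() { return cs.subSequence(start, end).toString(); }
    int getStart() { return start; }
    int getEnd() { return end; }

    public static void main(String[] args) {
        SplitCursorSketch tokenizer = new SplitCursorSketch("\\s+");
        tokenizer.reset("hello  big world");
        while (tokenizer.advance()) {
            // prints: hello [0, 5) then big [7, 10) then world [11, 16)
            System.out.println(tokenizer.getText() + " [" + tokenizer.getStart() + ", " + tokenizer.getEnd() + ")");
        }
    }
}
```

Because the end offset is exclusive, getEnd() minus getStart() equals the token length, and subSequence(getStart(), getEnd()) on the original text reproduces getText(). With the real class the same loop would start from new SplitPatternTokenizer("\\s+"); a tokenizer obtained from clone() needs its own reset call before the first advance, as the clone description states.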
-