public class SplitPatternTokenizer extends Object implements Tokenizer
This Tokenizer is instantiated with a regular expression pattern that
determines how to split a string into tokens. That is, the pattern defines
the "splits", not the tokens. For example, to tokenize on white space,
provide the pattern "\s+".

Modifier and Type | Field and Description |
---|---|
static String |
SIMPLE_DEFAULT_PATTERN
The default split pattern, which is [\.,]?\s+.
|
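
As a rough illustration of what the default pattern matches, the sketch below applies it with plain `java.util.regex` rather than with the tokenizer itself. The class name `DefaultPatternDemo` and the helper `splitOnDefault` are made up for this example; only the pattern string is taken from the field description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class DefaultPatternDemo {
    // Pattern text copied from the field description; backslashes are
    // doubled because this is a Java string literal.
    static final String SIMPLE_DEFAULT_PATTERN = "[\\.,]?\\s+";

    /** Splits text on the pattern using java.util.regex only (not the Tribuo class). */
    static List<String> splitOnDefault(String text) {
        return Arrays.asList(Pattern.compile(SIMPLE_DEFAULT_PATTERN).split(text));
    }

    public static void main(String[] args) {
        // An optional '.' or ',' followed by whitespace is consumed as a separator.
        System.out.println(splitOnDefault("Hello, world. Goodbye now"));
        // prints [Hello, world, Goodbye, now]
    }
}
```

In this sketch, a full stop with no whitespace after it is not matched by the pattern, so it stays attached to the preceding token.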
Constructor and Description |
---|
SplitPatternTokenizer()
Initializes a case-insensitive tokenizer with the pattern [\.,]?\s+.
|
SplitPatternTokenizer(String splitPatternRegex)
Constructs a splitting tokenizer using the supplied regex.
|
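
Because the regex passed to the constructor describes the separators rather than the tokens, it can help to look at both sides of a split. The following is a minimal sketch using only `java.util.regex`; the class `SplitsVersusTokens` and its two helpers are hypothetical names for illustration and do not reflect how the tokenizer is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitsVersusTokens {
    /** The text between matches of the regex: what a splitting tokenizer would keep. */
    static List<String> tokens(String regex, String text) {
        return Arrays.asList(Pattern.compile(regex).split(text));
    }

    /** The matches of the regex themselves: the separators that are discarded. */
    static List<String> separators(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "one  two\tthree";
        System.out.println(tokens("\\s+", text));            // prints [one, two, three]
        System.out.println(separators("\\s+", text).size()); // prints 2
    }
}
```

With the whitespace pattern `\s+`, the double space and the tab are the matches, and the three words are what remains between them.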
Modifier and Type | Method and Description |
---|---|
boolean |
advance()
Advances the tokenizer to the next token.
|
SplitPatternTokenizer |
clone()
Clones a tokenizer with its configuration.
|
int |
getEnd()
Gets the ending offset (exclusive) of the current token in the character sequence.
|
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance |
getProvenance() |
String |
getSplitPatternRegex()
Gets the String form of the regex in use.
|
int |
getStart()
Gets the starting character offset of the current token in the character sequence.
|
String |
getText()
Gets the text of the current token, as a string.
|
Token.TokenType |
getType()
Gets the type of the current token.
|
void |
postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
|
void |
reset(CharSequence cs)
Resets the tokenizer so that it operates on a new sequence of characters.
|
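
The methods above follow a cursor style: `reset` supplies the text, each successful `advance` moves to the next token, and the getters describe the current token. The sketch below mimics that flow with a small stand-in class so that it runs without any dependencies. `MiniSplitTokenizer`, `CursorSketch`, and `describe` are hypothetical names, and the splitting logic is an assumption for illustration, not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CursorSketch {
    /** Hypothetical stand-in that mirrors the method names above; not the Tribuo class. */
    static final class MiniSplitTokenizer {
        private final Pattern splitPattern;
        private CharSequence cs = "";
        private Matcher matcher;
        private int searchFrom; // index where the next token may begin
        private int start;      // inclusive start of the current token
        private int end;        // exclusive end of the current token

        MiniSplitTokenizer(String splitPatternRegex) {
            this.splitPattern = Pattern.compile(splitPatternRegex);
            reset("");
        }

        void reset(CharSequence newCs) {
            cs = newCs;
            matcher = splitPattern.matcher(cs);
            searchFrom = 0;
            start = 0;
            end = 0;
        }

        boolean advance() {
            while (searchFrom < cs.length()) {
                start = searchFrom;
                // Zero-width matches are treated as "no match" in this sketch.
                if (matcher.find(searchFrom) && matcher.end() > matcher.start()) {
                    end = matcher.start();     // token stops where the split begins
                    searchFrom = matcher.end(); // next token starts after the split
                } else {
                    end = cs.length();          // no more splits: rest is one token
                    searchFrom = cs.length();
                }
                if (end > start) {
                    return true;                // skip empty tokens
                }
            }
            return false;
        }

        String getText() { return cs.subSequence(start, end).toString(); }
        int getStart() { return start; }
        int getEnd() { return end; }
    }

    /** Walks the cursor over the text and records each token with its offsets. */
    static List<String> describe(String regex, String text) {
        MiniSplitTokenizer t = new MiniSplitTokenizer(regex);
        t.reset(text);
        List<String> out = new ArrayList<>();
        while (t.advance()) {
            out.add(t.getText() + "[" + t.getStart() + "," + t.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("[\\.,]?\\s+", "Hello, world. Bye"));
        // prints [Hello[0,5), world[7,12), Bye[14,17)]
    }
}
```

The offsets index into the original character sequence, with the end offset exclusive, so `text.substring(start, end)` recovers each token in this sketch.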
Methods inherited from class Object: equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Tokenizer: createSupplier, createThreadLocal, getToken, split, tokenize
public static final String SIMPLE_DEFAULT_PATTERN
public SplitPatternTokenizer()
public SplitPatternTokenizer(String splitPatternRegex)
Parameters: splitPatternRegex - The regex to use.
public void postConfig()
Specified by: postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
Specified by: getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
public String getSplitPatternRegex()
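public void reset(CharSequence cs)
Specified by: reset in interface Tokenizer
public boolean advance()
Specified by: advance in interface Tokenizer
public String getText()
Specified by: getText in interface Tokenizer
public int getStart()
Specified by: getStart in interface Tokenizer
public int getEnd()
Specified by: getEnd in interface Tokenizer
public Token.TokenType getType()
Specified by: getType in interface Tokenizer
public SplitPatternTokenizer clone()
Specified by: clone in interface Tokenizer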
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.