public interface Tokenizer extends com.oracle.labs.mlrg.olcut.config.Configurable, Cloneable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
Note that tokenizers are not guaranteed to be thread-safe! Using the same tokenizer from multiple threads may result in unpredictable behavior.
Tokenizers which are not ready throw IllegalStateException when advance() or any get method is called.
Most Tokenizers are Cloneable, and implement the Cloneable interface.
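The reset/advance/get cycle described above can be sketched with a small stand-in class. WhitespaceSketch is hypothetical and is not one of the library's tokenizers: it only mirrors the documented method names, splits on whitespace, and treats "not ready" as "reset() has not been called" (for advance()) or "no current token" (for the get methods), which is an assumption about how the IllegalStateException rule applies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in, NOT a library class: a whitespace splitter with the documented method names. */
final class WhitespaceSketch {
    private CharSequence cs;        // sequence being tokenized, null until reset()
    private int start;              // start offset of the current token
    private int end;                // end offset of the current token (exclusive)
    private boolean ready = false;  // true only after a successful advance()

    /** Points the tokenizer at a new character sequence. */
    public void reset(CharSequence cs) {
        this.cs = cs;
        this.start = 0;
        this.end = 0;
        this.ready = false;
    }

    /** Moves to the next token; returns false when none remain. */
    public boolean advance() {
        if (cs == null) {
            throw new IllegalStateException("reset() has not been called");
        }
        int i = end;
        // Skip whitespace that precedes the next token.
        while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) {
            i++;
        }
        if (i == cs.length()) {
            ready = false;
            return false;
        }
        start = i;
        // Consume the token up to the next whitespace character.
        while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) {
            i++;
        }
        end = i;
        ready = true;
        return true;
    }

    public String getText() { check(); return cs.subSequence(start, end).toString(); }
    public int getStart() { check(); return start; }
    public int getEnd() { check(); return end; }

    private void check() {
        if (!ready) {
            throw new IllegalStateException("no current token");
        }
    }

    /** Drives the loop and records each token as text[start,end). */
    public static List<String> describe(CharSequence input) {
        WhitespaceSketch t = new WhitespaceSketch();
        t.reset(input);
        List<String> out = new ArrayList<>();
        while (t.advance()) {
            out.add(t.getText() + "[" + t.getStart() + "," + t.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("hello big world"));
    }
}
```

Calling a get method before the first successful advance() throws in this sketch, so callers are expected to check the boolean returned by advance() before reading the token.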
Modifier and Type | Method and Description |
---|---|
boolean | advance() Advances the tokenizer to the next token. |
Tokenizer | clone() Clones a tokenizer with its configuration. |
static Supplier<Tokenizer> | createSupplier(Tokenizer tokenizer) |
static ThreadLocal<Tokenizer> | createThreadLocal(Tokenizer tokenizer) |
int | getEnd() Gets the ending offset (exclusive) of the current token in the character sequence. |
int | getStart() Gets the starting character offset of the current token in the character sequence. |
String | getText() Gets the text of the current token, as a string. |
default Token | getToken() Generates a Token object from the current state of the tokenizer. |
Token.TokenType | getType() Gets the type of the current token. |
void | reset(CharSequence cs) Resets the tokenizer so that it operates on a new sequence of characters. |
default List<String> | split(CharSequence cs) Uses this tokenizer to split a string into its component substrings. |
default List<Token> | tokenize(CharSequence cs) Uses this tokenizer to tokenize a string and return the list of tokens that were generated. |
static ThreadLocal<Tokenizer> createThreadLocal(Tokenizer tokenizer)
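Because a tokenizer is stateful and not guaranteed to be thread-safe, one common pattern is to give each thread its own clone of a configured prototype. The sketch below illustrates that general pattern with the standard ThreadLocal.withInitial call. CloneableSketch, supplierOf, and threadLocalOf are hypothetical names, and nothing here is claimed to be how createSupplier or createThreadLocal are actually implemented. As a simplification, the stand-in's clone() does not declare CloneNotSupportedException.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a configured, stateful, cloneable tokenizer. */
final class CloneableSketch implements Cloneable {
    final String splitChars;   // configuration, carried over by clone()
    CharSequence current;      // per-instance mutable state, unsafe to share across threads

    CloneableSketch(String splitChars) {
        this.splitChars = splitChars;
    }

    void reset(CharSequence cs) {
        this.current = cs;
    }

    @Override
    public CloneableSketch clone() {
        try {
            CloneableSketch copy = (CloneableSketch) super.clone();
            copy.current = null;   // a clone starts with no input attached
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError("cannot happen: class implements Cloneable", e);
        }
    }

    /** Supplier that hands out a fresh clone of the prototype on each call. */
    static Supplier<CloneableSketch> supplierOf(CloneableSketch prototype) {
        return prototype::clone;
    }

    /** Thread-local whose per-thread initial value is a clone of the prototype. */
    static ThreadLocal<CloneableSketch> threadLocalOf(CloneableSketch prototype) {
        return ThreadLocal.withInitial(supplierOf(prototype));
    }

    public static void main(String[] args) throws InterruptedException {
        CloneableSketch prototype = new CloneableSketch(" ,.");
        ThreadLocal<CloneableSketch> local = threadLocalOf(prototype);
        CloneableSketch[] seen = new CloneableSketch[2];
        Thread a = new Thread(() -> seen[0] = local.get());
        Thread b = new Thread(() -> seen[1] = local.get());
        a.start();
        b.start();
        a.join();
        b.join();
        // Each thread received its own instance with the same configuration.
        System.out.println(seen[0] != seen[1]);
        System.out.println(seen[0].splitChars.equals(seen[1].splitChars));
    }
}
```

Within a single thread, repeated calls to get() return the same instance, so a worker can call reset() on it for each new input without affecting other threads.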
void reset(CharSequence cs)
Parameters: cs - a character sequence to tokenize
boolean advance()
Returns: true if there is a next token, false otherwise.
String getText()
int getStart()
int getEnd()
Token.TokenType getType()
Tokenizer clone() throws CloneNotSupportedException
Throws: CloneNotSupportedException - if the tokenizer isn't cloneable.
default Token getToken()
default List<Token> tokenize(CharSequence cs)
Here is the contract of the tokenize function:
Parameters: cs - a sequence of characters to tokenize
default List<String> split(CharSequence cs)
Parameters: cs - the character sequence to tokenize
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.