public class SplitCharactersTokenizer extends Object implements Tokenizer
This implementation of Tokenizer is instantiated with an array of
characters that are considered split characters; that is, the split
characters define where the input text is split. It is a simple tokenizer
that handles one exceptional case: split characters that appear between
digits (e.g., 3/5 and 3.1415). It is not general purpose, but may suffice
for some use cases. In addition to the specified split characters, it also
splits on anything that Character.isWhitespace(char) considers whitespace.
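The splitting rule described above can be sketched in a few lines of plain Java. This is a minimal illustration, not the class's actual implementation: `SplitRuleSketch`, `split`, and `contains` are hypothetical names, and the split characters used in `main` are chosen for the example rather than taken from the class defaults.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of the documented API.
class SplitRuleSketch {

    // True if c appears in chars.
    static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    // Splits text on whitespace, on every character in splitChars, and on
    // characters in splitXDigits unless both neighbours are digits.
    static List<String> split(String text, char[] splitChars, char[] splitXDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean betweenDigits = i > 0 && i < text.length() - 1
                    && Character.isDigit(text.charAt(i - 1))
                    && Character.isDigit(text.charAt(i + 1));
            boolean isSplit = Character.isWhitespace(c)
                    || contains(splitChars, c)
                    || (contains(splitXDigits, c) && !betweenDigits);
            if (isSplit) {
                // Close the token in progress, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] always = {'|'};              // assumed for the example
        char[] exceptInDigits = {'.', '/'}; // assumed for the example
        System.out.println(split("abc|def", always, exceptInDigits));
        System.out.println(split("abc.def 3.1415 3/5", always, exceptInDigits));
    }
}
```

Checking both neighbours is what keeps "3.1415" and "3/5" whole while "abc.def" is still split.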
Modifier and Type | Field and Description |
---|---|
static char[] | DEFAULT_SPLIT_CHARACTERS |
static char[] | DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS |
Constructor and Description |
---|
SplitCharactersTokenizer() |
SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters) |
Modifier and Type | Method and Description |
---|---|
boolean | advance(): Advances the tokenizer to the next token. |
SplitCharactersTokenizer | clone(): Clones a tokenizer with its configuration. |
static SplitCharactersTokenizer | createWhitespaceTokenizer(): Creates a tokenizer that splits on whitespace. |
int | getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence. |
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
char[] | getSplitCharacters(): Returns a copy of the split characters. |
char[] | getSplitXDigitsCharacters(): Returns a copy of the characters that split except between digits. |
int | getStart(): Gets the starting character offset of the current token in the character sequence. |
String | getText(): Gets the text of the current token, as a string. |
Token.TokenType | getType(): Gets the type of the current token. |
boolean | isSplitCharacter(char c): Returns whether this character is a split character for this tokenizer instance. |
boolean | isSplitXDigitCharacter(char c): Returns whether this character is a split character, except between digits, for this tokenizer instance. |
void | reset(CharSequence cs): Resets the tokenizer so that it operates on a new sequence of characters. |
Methods inherited from class Object: equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Tokenizer: createSupplier, createThreadLocal, getToken, split, tokenize
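The reset/advance/get pattern in the method summary is a cursor-style protocol: `reset` supplies the text, each successful `advance` moves to the next token, and the getters describe the current token. The sketch below imitates that protocol with a hypothetical, whitespace-only stand-in (`CursorSketch` is not part of the library) so that it runs without any dependencies; the real tokenizer's behaviour may differ in details.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the documented method names.
class CursorSketch {
    private CharSequence cs = "";
    private int start;
    private int end;
    private int pos;

    // Points the cursor at a new character sequence.
    void reset(CharSequence cs) {
        this.cs = cs;
        start = end = pos = 0;
    }

    // Moves to the next token; returns false when the input is exhausted.
    boolean advance() {
        while (pos < cs.length() && Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        if (pos >= cs.length()) {
            return false;
        }
        start = pos;
        while (pos < cs.length() && !Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        end = pos; // exclusive, as in getEnd()
        return true;
    }

    String getText() { return cs.subSequence(start, end).toString(); }
    int getStart() { return start; }
    int getEnd() { return end; }

    // Drives the cursor over the input and records text plus offsets.
    static List<String> describe(CharSequence input) {
        CursorSketch t = new CursorSketch();
        t.reset(input);
        List<String> out = new ArrayList<>();
        while (t.advance()) {
            out.add(t.getText() + "[" + t.getStart() + "," + t.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("hello  world"));
    }
}
```

Because the end offset is exclusive, `cs.subSequence(getStart(), getEnd())` yields exactly the token text.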
public static final char[] DEFAULT_SPLIT_CHARACTERS
public static final char[] DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
public SplitCharactersTokenizer()
public SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters)
splitCharacters - characters to be replaced with a space in the input
text (e.g., "abc|def" becomes "abc def").
splitXDigitsCharacters - characters to be replaced with a space in the
input text, except where the characters immediately to the left and right
are both digits (e.g., "abc.def" becomes "abc def" but "3.1415" remains
"3.1415").
public static SplitCharactersTokenizer createWhitespaceTokenizer()
Creates a tokenizer that splits on whitespace.
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
Specified by: getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
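public void reset(CharSequence cs)
Specified by: reset in interface Tokenizer
public boolean advance()
Specified by: advance in interface Tokenizer
public String getText()
Specified by: getText in interface Tokenizer
public int getStart()
Specified by: getStart in interface Tokenizer
public int getEnd()
Specified by: getEnd in interface Tokenizer
public Token.TokenType getType()
Specified by: getType in interface Tokenizer
public SplitCharactersTokenizer clone()
Specified by: clone in interface Tokenizer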
public boolean isSplitCharacter(char c)
c - The character to check.
public boolean isSplitXDigitCharacter(char c)
c - The character to check.
public char[] getSplitCharacters()
public char[] getSplitXDigitsCharacters()
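Both getters are documented as returning a copy, which means callers cannot alter the tokenizer's configuration through the returned array. A minimal sketch of that defensive-copy pattern follows; `DefensiveCopySketch` is a hypothetical class, and copying in the constructor is an assumption of the sketch rather than something this page states.

```java
import java.util.Arrays;

// Hypothetical illustration of the defensive-copy pattern.
class DefensiveCopySketch {
    private final char[] splitCharacters;

    DefensiveCopySketch(char[] splitCharacters) {
        // Assumption: copy on the way in so later caller edits have no effect.
        this.splitCharacters = Arrays.copyOf(splitCharacters, splitCharacters.length);
    }

    // Returns a fresh copy; the internal array is never exposed.
    char[] getSplitCharacters() {
        return Arrays.copyOf(splitCharacters, splitCharacters.length);
    }

    boolean isSplitCharacter(char c) {
        for (char s : splitCharacters) {
            if (s == c) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DefensiveCopySketch sketch = new DefensiveCopySketch(new char[] {'|', ','});
        char[] copy = sketch.getSplitCharacters();
        copy[0] = 'x'; // mutates only the caller's copy
        System.out.println(sketch.isSplitCharacter('|'));
    }
}
```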
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.