public class SplitCharactersTokenizer extends SplitFunctionTokenizer
This Tokenizer implementation is instantiated with an array of split
characters, which define where the input text is split. It is a deliberately
simple tokenizer that handles one special case: a second set of split
characters is left in place when the characters immediately to its left and
right are both digits (e.g., 3/5 and 3.1415). It is not general purpose, but
it may suffice for some use cases. In addition to the specified split
characters, it also splits on anything that Character.isWhitespace(char)
considers whitespace.
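The splitting rule described above can be sketched in plain Java. This is a standalone illustration under stated assumptions, not the library's implementation: the class name SplitRuleSketch, its split helper, and the character sets used in main are hypothetical, digits are detected with Character.isDigit, and empty tokens are assumed to be dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical standalone sketch of the splitting rule; not Tribuo code. */
class SplitRuleSketch {

    /** Returns true if c occurs in chars. */
    private static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    /**
     * Splits text on whitespace, on every character in splitChars, and on
     * characters in splitExceptInDigits unless both neighbours are digits.
     * Empty tokens are dropped (an assumption of this sketch).
     */
    static List<String> split(String text, char[] splitChars, char[] splitExceptInDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isSplit = Character.isWhitespace(c) || contains(splitChars, c);
            if (!isSplit && contains(splitExceptInDigits, c)) {
                boolean digitLeft = i > 0 && Character.isDigit(text.charAt(i - 1));
                boolean digitRight = i + 1 < text.length()
                        && Character.isDigit(text.charAt(i + 1));
                // Keep the character only when it sits between two digits.
                isSplit = !(digitLeft && digitRight);
            }
            if (isSplit) {
                // Close the current token, if any, and skip the split character.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Illustrative character sets; the library's defaults may differ.
        char[] always = {'|', ';'};
        char[] exceptInDigits = {'.', '/', ','};
        // prints [abc, def, 3.1415, end]
        System.out.println(split("abc|def 3.1415 end.", always, exceptInDigits));
    }
}
```

The trailing period in "end." is split off because its left neighbour is not a digit and it has no right neighbour, whereas the period in "3.1415" is kept because it sits between two digits.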
Modifier and Type | Class and Description |
---|---|
static class | SplitCharactersTokenizer.SplitCharactersSplitterFunction - Splits tokens at the supplied characters. |
Nested classes inherited from class SplitFunctionTokenizer: SplitFunctionTokenizer.SplitFunction, SplitFunctionTokenizer.SplitResult, SplitFunctionTokenizer.SplitType
Modifier and Type | Field and Description |
---|---|
static char[] | DEFAULT_SPLIT_CHARACTERS |
static char[] | DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS |
Fields inherited from class SplitFunctionTokenizer: splitFunction
Constructor and Description |
---|
SplitCharactersTokenizer() - Creates a default split characters tokenizer using DEFAULT_SPLIT_CHARACTERS and DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS. |
SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters) |
Modifier and Type | Method and Description |
---|---|
SplitCharactersTokenizer | clone() - Clones a tokenizer with its configuration. |
static SplitCharactersTokenizer | createWhitespaceTokenizer() - Creates a tokenizer that splits on whitespace. |
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
char[] | getSplitCharacters() - Deprecated. |
char[] | getSplitXDigitsCharacters() - Deprecated. |
boolean | isSplitCharacter(char c) - Deprecated. |
boolean | isSplitXDigitCharacter(char c) - Deprecated. |
void | postConfig() |
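Methods inherited from class SplitFunctionTokenizer: advance, getEnd, getStart, getText, getType, reset
Methods inherited from class java.lang.Object: equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Tokenizer: createSupplier, createThreadLocal, getToken, split, tokenize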
public static final char[] DEFAULT_SPLIT_CHARACTERS
public static final char[] DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
public SplitCharactersTokenizer()
Creates a default split characters tokenizer using DEFAULT_SPLIT_CHARACTERS and DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS.
public SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters)
splitCharacters - characters to be replaced with a space in the input text (e.g., "abc|def" becomes "abc def")
splitXDigitsCharacters - characters to be replaced with a space in the input text, except when the characters immediately to the left and right are both digits (e.g., "abc.def" becomes "abc def" but "3.1415" remains "3.1415")
public void postConfig()
public static SplitCharactersTokenizer createWhitespaceTokenizer()
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
@Deprecated public boolean isSplitCharacter(char c)
c - The character to check.
@Deprecated public boolean isSplitXDigitCharacter(char c)
c - The character to check.
@Deprecated public char[] getSplitCharacters()
@Deprecated public char[] getSplitXDigitsCharacters()
public SplitCharactersTokenizer clone()
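Description copied from interface: Tokenizer
Clones a tokenizer with its configuration.
Specified by: clone in interface Tokenizer
Overrides: clone in class SplitFunctionTokenizer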
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.