Class SplitCharactersTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitCharactersTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This implementation of Tokenizer is instantiated with an array of split
characters, which define where to split the input text. It is a simple
tokenizer that handles one exceptional case: characters from a second array
are not split on when they appear between digits (e.g., 3/5 and 3.1415).
It is not general purpose, but may suffice for some use cases.
In addition to the specified split characters, it also splits on anything
that Character.isWhitespace(char) considers whitespace.
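The splitting rule described above can be sketched in plain Java. This is a minimal illustration, not Tribuo's implementation: the class SplitRuleSketch and its helpers contains, isBoundary, and tokenize are hypothetical names, and the sketch assumes that a digit-sensitive character splits unless both of its immediate neighbours are digits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; this is not Tribuo's implementation.
class SplitRuleSketch {

    // Hypothetical helper: true if c occurs in chars.
    static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    // True if the character at index i separates tokens under the documented rule.
    static boolean isBoundary(CharSequence text, int i, char[] split, char[] splitXDigits) {
        char c = text.charAt(i);
        if (Character.isWhitespace(c) || contains(split, c)) {
            return true;
        }
        if (contains(splitXDigits, c)) {
            // Assumption: the exception applies only when both neighbours are digits.
            boolean digitLeft = i > 0 && Character.isDigit(text.charAt(i - 1));
            boolean digitRight = i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
            return !(digitLeft && digitRight);
        }
        return false;
    }

    // Collects the non-empty runs of characters between boundaries.
    static List<String> tokenize(CharSequence text, char[] split, char[] splitXDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (isBoundary(text, i, split, splitXDigits)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(text.charAt(i));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] split = {'|', ';'};
        char[] splitXDigits = {'.', '/'};
        // prints [abc, def, pi=3.1415, 3/5, a, b]
        System.out.println(tokenize("abc|def pi=3.1415; 3/5 a.b", split, splitXDigits));
    }
}
```

In this sketch "3.1415" and "3/5" survive as single tokens, while "a.b" is split because its neighbours are letters.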
- Author:
- Philip Ogren
-
Field Summary
Fields
Modifier and Type | Field | Description
static final char[] | DEFAULT_SPLIT_CHARACTERS |
static final char[] | DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS |
-
Constructor Summary
Constructors
Constructor | Description
SplitCharactersTokenizer() |
SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters) |
-
Method Summary
Modifier and Type | Method | Description
boolean | advance() | Advances the tokenizer to the next token.
 | clone() | Clones a tokenizer with its configuration.
static SplitCharactersTokenizer | createWhitespaceTokenizer() | Creates a tokenizer that splits on whitespace.
int | getEnd() | Gets the ending offset (exclusive) of the current token in the character sequence.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
char[] | getSplitCharacters() | Returns a copy of the split characters.
char[] | getSplitXDigitsCharacters() | Returns a copy of the split characters except inside digits.
int | getStart() | Gets the starting character offset of the current token in the character sequence.
String | getText() | Gets the text of the current token, as a string.
 | getType() | Gets the type of the current token.
boolean | isSplitCharacter(char c) | Is this character a split character for this tokenizer instance.
boolean | isSplitXDigitCharacter(char c) | Is this character a split character except inside a digit for this tokenizer instance.
void | reset(CharSequence cs) | Resets the tokenizer so that it operates on a new sequence of characters.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
-
Field Details
-
DEFAULT_SPLIT_CHARACTERS
-
DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
-
-
Constructor Details
-
SplitCharactersTokenizer
public SplitCharactersTokenizer()
-
SplitCharactersTokenizer
- Parameters:
splitCharacters
- characters to be replaced with a space in the input text (e.g., "abc|def" becomes "abc def")
splitXDigitsCharacters
- characters to be replaced with a space in the input text, except where the characters immediately to the left and right are both digits (e.g., "abc.def" becomes "abc def" but "3.1415" remains "3.1415")
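The two parameter descriptions can be read as a replace-with-space rule. The sketch below applies that reading to the examples given for each parameter. It is an illustration only: ReplaceWithSpaceSketch, contains, and normalize are hypothetical names, and the code is not taken from the library.

```java
// Illustrative sketch only; this is not Tribuo's implementation.
class ReplaceWithSpaceSketch {

    // Hypothetical helper: true if c occurs in chars.
    static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    // Replaces split characters with a space, keeping digit-sensitive
    // characters that sit directly between two digits.
    static String normalize(String text, char[] splitCharacters, char[] splitXDigitsCharacters) {
        StringBuilder out = new StringBuilder(text);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (contains(splitCharacters, c)) {
                out.setCharAt(i, ' ');
            } else if (contains(splitXDigitsCharacters, c)) {
                boolean betweenDigits = i > 0 && i + 1 < text.length()
                        && Character.isDigit(text.charAt(i - 1))
                        && Character.isDigit(text.charAt(i + 1));
                if (!betweenDigits) {
                    out.setCharAt(i, ' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] split = {'|'};
        char[] splitXDigits = {'.'};
        System.out.println(normalize("abc|def", split, splitXDigits)); // prints abc def
        System.out.println(normalize("abc.def", split, splitXDigits)); // prints abc def
        System.out.println(normalize("3.1415", split, splitXDigits));  // prints 3.1415
    }
}
```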
-
-
Method Details
-
createWhitespaceTokenizer
Creates a tokenizer that splits on whitespace.
- Returns:
- A whitespace tokenizer.
-
getProvenance
- Specified by:
getProvenance
in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-
reset
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
-
advance
-
getText
-
getStart
-
getEnd
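The methods reset, advance, getText, getStart, and getEnd together behave like a cursor over the input. The sketch below shows one way such a cursor can work, using whitespace as the only separator. It is a stand-alone illustration, not the library class: WhitespaceCursorSketch is a hypothetical name, and it assumes that advance returns false once no tokens remain and that calling advance before reset is an error.

```java
// Illustrative sketch only; this is not Tribuo's implementation.
class WhitespaceCursorSketch {
    private CharSequence cs;
    private int pos;
    private int start = -1;
    private int end = -1;

    // Points the cursor at a new character sequence.
    void reset(CharSequence newCs) {
        cs = newCs;
        pos = 0;
        start = -1;
        end = -1;
    }

    // Moves to the next token; returns false when the input is exhausted (assumption).
    boolean advance() {
        if (cs == null) {
            throw new IllegalStateException("reset must be called before advance");
        }
        while (pos < cs.length() && Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        if (pos >= cs.length()) {
            return false;
        }
        start = pos;
        while (pos < cs.length() && !Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        end = pos; // exclusive end offset
        return true;
    }

    int getStart() {
        return start;
    }

    int getEnd() {
        return end;
    }

    String getText() {
        return cs.subSequence(start, end).toString();
    }

    public static void main(String[] args) {
        WhitespaceCursorSketch t = new WhitespaceCursorSketch();
        t.reset("one  two");
        while (t.advance()) {
            // prints "one [0, 3)" then "two [5, 8)"
            System.out.println(t.getText() + " [" + t.getStart() + ", " + t.getEnd() + ")");
        }
    }
}
```

Because the end offset is exclusive, getEnd() minus getStart() equals the token length in this sketch.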
-
getType
Description copied from interface: Tokenizer
Gets the type of the current token.
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. A cloned tokenizer does not process the same text as the original tokenizer and needs to be reset with a fresh CharSequence.
-
isSplitCharacter
Is this character a split character for this tokenizer instance.
- Parameters:
c
- The character to check.
- Returns:
- True if it's a split character.
-
isSplitXDigitCharacter
Is this character a split character except inside a digit for this tokenizer instance.
- Parameters:
c
- The character to check.
- Returns:
- True if it's a split character except inside a digit.
-
getSplitCharacters
Returns a copy of the split characters.
- Returns:
- A copy of the split characters.
-
getSplitXDigitsCharacters
Returns a copy of the split characters except inside digits.
- Returns:
- A copy of the split characters except inside digits.
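Both getters are documented as returning a copy. A short sketch of why that matters: if a getter handed out its internal array, a caller could change the configuration by writing into it. DefensiveCopySketch is a hypothetical class written for this illustration, and copying the array in the constructor is a choice of this sketch rather than something the documentation above states.

```java
import java.util.Arrays;

// Illustrative sketch only; this is not Tribuo's implementation.
class DefensiveCopySketch {
    private final char[] splitCharacters;

    DefensiveCopySketch(char[] splitCharacters) {
        // Sketch choice: copy on the way in so the caller's array stays independent.
        this.splitCharacters = Arrays.copyOf(splitCharacters, splitCharacters.length);
    }

    // Returns a fresh copy so callers cannot modify the internal array.
    char[] getSplitCharacters() {
        return Arrays.copyOf(splitCharacters, splitCharacters.length);
    }

    boolean isSplitCharacter(char c) {
        for (char x : splitCharacters) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DefensiveCopySketch t = new DefensiveCopySketch(new char[] {'|'});
        char[] copy = t.getSplitCharacters();
        copy[0] = 'x'; // modifies only the copy
        System.out.println(t.isSplitCharacter('|')); // prints true
    }
}
```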
-