Package org.tribuo.util.tokens.impl
Class SplitCharactersTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
org.tribuo.util.tokens.impl.SplitCharactersTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This implementation of Tokenizer is instantiated with an array of characters that are treated as split characters; that is, the split characters define where to split the input text. It is a simple tokenizer that handles one exceptional case: a second array of characters splits the text except when the character appears between two digits (e.g., 3/5 and 3.1415 stay intact). It is not general purpose, but may suffice for some use cases. In addition to the specified split characters, it also splits on anything that Character.isWhitespace(char) considers whitespace.
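The splitting rule described above can be sketched in plain Java. This is a minimal illustration only, not Tribuo's implementation: the class SplitSketch and its methods contains and split are hypothetical names, and the sketch assumes that split characters are discarded and that "between digits" means Character.isDigit is true for both neighbouring characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use or reproduce any Tribuo class.
class SplitSketch {

    // Returns true if c occurs in chars.
    static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    // Splits text on whitespace and on splitChars; characters in
    // splitXDigitsChars split too, unless both neighbours are digits.
    static List<String> split(String text, char[] splitChars, char[] splitXDigitsChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isSplit;
            if (Character.isWhitespace(c) || contains(splitChars, c)) {
                isSplit = true;
            } else if (contains(splitXDigitsChars, c)) {
                boolean digitLeft = i > 0 && Character.isDigit(text.charAt(i - 1));
                boolean digitRight = i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
                isSplit = !(digitLeft && digitRight);
            } else {
                isSplit = false;
            }
            if (isSplit) {
                // Assumption: the split character itself is dropped.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] splitChars = {'|', ';'};
        char[] exceptInDigits = {'.', '/'};
        // prints [abc, def, 3.1415, x, y, 3/5]
        System.out.println(split("abc|def 3.1415 x.y 3/5", splitChars, exceptInDigits));
    }
}
```

Here "3.1415" and "3/5" survive because '.' and '/' sit between digits, while "x.y" is split because its neighbours are letters.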
- Author:
- Philip Ogren
-
Nested Class Summary
Modifier and Type | Class | Description
static class | | Splits tokens at the supplied characters.

Nested classes/interfaces inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
SplitFunctionTokenizer.SplitFunction, SplitFunctionTokenizer.SplitResult, SplitFunctionTokenizer.SplitType
-
Field Summary
Modifier and Type | Field | Description
static final char[] | DEFAULT_SPLIT_CHARACTERS | The default split characters.
static final char[] | DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS | The default characters which don't cause splits inside digits.

Fields inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
splitFunction
-
Constructor Summary
Constructor | Description
SplitCharactersTokenizer() | Creates a default split characters tokenizer using DEFAULT_SPLIT_CHARACTERS and DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS.
SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters) |
-
Method Summary
Modifier and Type | Method | Description
 | clone() | Clones a tokenizer with its configuration.
static SplitCharactersTokenizer | createWhitespaceTokenizer() | Creates a tokenizer that splits on whitespace.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
char[] | getSplitCharacters() | Deprecated.
char[] | getSplitXDigitsCharacters() | Deprecated.
boolean | isSplitCharacter(char c) | Deprecated.
boolean | isSplitXDigitCharacter(char c) | Deprecated.
void | postConfig() |

Methods inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
advance, getEnd, getStart, getText, getType, reset
-
Field Details
-
DEFAULT_SPLIT_CHARACTERS
public static final char[] DEFAULT_SPLIT_CHARACTERS
The default split characters.
-
DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
public static final char[] DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
The default characters which don't cause splits inside digits.
-
-
Constructor Details
-
SplitCharactersTokenizer
public SplitCharactersTokenizer()
Creates a default split characters tokenizer using DEFAULT_SPLIT_CHARACTERS and DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS.
-
SplitCharactersTokenizer
public SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters)
- Parameters:
splitCharacters - characters to be replaced with a space in the input text (e.g., "abc|def" becomes "abc def")
splitXDigitsCharacters - characters to be replaced with a space in the input text, except when the characters immediately adjacent to the left and right are both digits (e.g., "abc.def" becomes "abc def" but "3.1415" remains "3.1415").
-
-
Method Details
-
postConfig
public void postConfig()
-
createWhitespaceTokenizer
Creates a tokenizer that splits on whitespace.
- Returns:
- A whitespace tokenizer.
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
-
isSplitCharacter
Deprecated.
Checks whether this character is a split character for this tokenizer instance.
- Parameters:
c - The character to check.
- Returns:
- True if it's a split character.
-
isSplitXDigitCharacter
Deprecated.
Checks whether this character is a split character, except inside digits, for this tokenizer instance.
- Parameters:
c - The character to check.
- Returns:
- True if it's a split character except inside digits.
-
getSplitCharacters
Deprecated.
Returns a copy of the split characters.
- Returns:
- A copy of the split characters.
-
getSplitXDigitsCharacters
Deprecated.
Returns a copy of the split characters except inside digits.
- Returns:
- A copy of the split characters except inside digits.
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
- Specified by:
clone in interface Tokenizer
- Overrides:
clone in class SplitFunctionTokenizer
- Returns:
- A tokenizer with the same configuration, but independent state.
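
The clone contract above (same configuration, independent state, reset required before use) can be illustrated with a toy class. This is a hedged sketch, not Tribuo code: ToyTokenizer is a hypothetical class, and its reset, advance and getText methods only mimic the names of the inherited methods listed in the method summary.

```java
// Hypothetical toy tokenizer; it only illustrates clone semantics.
class ToyTokenizer implements Cloneable {
    private final char[] splitChars;  // configuration, copied into each clone
    private CharSequence text = "";   // per-instance state
    private int pos = 0;              // per-instance state
    private String current = null;    // per-instance state

    ToyTokenizer(char[] splitChars) {
        this.splitChars = splitChars.clone();
    }

    // Points the tokenizer at a new text and clears its position.
    void reset(CharSequence cs) {
        text = cs;
        pos = 0;
        current = null;
    }

    // Moves to the next token; returns false when the text is exhausted.
    boolean advance() {
        while (pos < text.length() && isSplit(text.charAt(pos))) {
            pos++;
        }
        if (pos >= text.length()) {
            current = null;
            return false;
        }
        int start = pos;
        while (pos < text.length() && !isSplit(text.charAt(pos))) {
            pos++;
        }
        current = text.subSequence(start, pos).toString();
        return true;
    }

    String getText() {
        return current;
    }

    private boolean isSplit(char c) {
        for (char x : splitChars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    // Same configuration, fresh state: the clone holds no text until reset.
    @Override
    public ToyTokenizer clone() {
        return new ToyTokenizer(splitChars);
    }

    public static void main(String[] args) {
        ToyTokenizer original = new ToyTokenizer(new char[]{'|'});
        original.reset("a|b|c");
        original.advance();
        ToyTokenizer copy = original.clone();
        copy.reset("x|y");
        copy.advance();
        // prints "a x": each instance tracks its own text and position
        System.out.println(original.getText() + " " + copy.getText());
        original.advance();
        // prints "b": the original was unaffected by the clone
        System.out.println(original.getText());
    }
}
```

Because each instance carries its own position, one plausible use of cloning is to give each thread its own tokenizer that shares only configuration.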
-