Class SplitFunctionTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
- Direct Known Subclasses:
SplitCharactersTokenizer, WhitespaceTokenizer, WordpieceBasicTokenizer
This class supports character-by-character (that is, codepoint-by-codepoint)
iteration over input text to create tokens. Extensions of this class are
initialized with a SplitFunctionTokenizer.SplitFunction, which is called for
each character and returns a SplitFunctionTokenizer.SplitResult consisting of a
SplitFunctionTokenizer.SplitType and a Token.TokenType. Tokenization is driven
by the SplitFunctionTokenizer.SplitResult returned for each character. See the
notes on SplitFunctionTokenizer.SplitType and SplitFunctionTokenizer.SplitResult
for details.
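The per-codepoint idea described above can be illustrated with a minimal, self-contained sketch. It does not use Tribuo: the class `SplitSketch`, the enum `Decision`, and the method `tokenize` are hypothetical names invented for this illustration, and the two-value enum is a deliberate simplification that stands in for SplitFunctionTokenizer.SplitType without reproducing its actual values. Token types are omitted entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical illustration only; not Tribuo's implementation.
final class SplitSketch {

    /** Hypothetical stand-in for a split type: keep the codepoint, or split here and drop it. */
    enum Decision { KEEP, SPLIT_AND_DROP }

    /**
     * Walks the text one codepoint at a time and asks the split function
     * what to do with each codepoint, cutting a token at every split.
     */
    static List<String> tokenize(CharSequence text, IntFunction<Decision> splitFunction) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Read a full codepoint so surrogate pairs are never cut in half.
            int codepoint = Character.codePointAt(text, i);
            if (splitFunction.apply(codepoint) == Decision.SPLIT_AND_DROP) {
                // Close the token in progress, if any; the separator is discarded.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codepoint);
            }
            i += Character.charCount(codepoint);
        }
        // Flush the final token when the text does not end with a separator.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        IntFunction<Decision> onWhitespace =
            cp -> Character.isWhitespace(cp) ? Decision.SPLIT_AND_DROP : Decision.KEEP;
        System.out.println(tokenize("hello  split function", onWhitespace));
    }
}
```

Passing the split decision in as a function keeps the iteration loop generic, which is presumably why subclasses such as WhitespaceTokenizer only need to supply their own function.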
Nested Class Summary
Nested Classes
- static interface SplitFunctionTokenizer.SplitFunction: An interface for checking if the text should be split at the supplied codepoint.
- static enum SplitFunctionTokenizer.SplitResult: A combination of a SplitFunctionTokenizer.SplitType and a Token.TokenType.
- static enum SplitFunctionTokenizer.SplitType: Defines different ways that a tokenizer can split the input text at a given character.
-
Field Summary
Fields -
Constructor Summary
Constructors
- protected SplitFunctionTokenizer(): Constructs a tokenizer, used by OLCUT.
- SplitFunctionTokenizer(SplitFunctionTokenizer.SplitFunction splitFunction): Creates a new tokenizer using the supplied split function.
-
Method Summary
- boolean advance(): Advances the tokenizer to the next token.
- clone(): Clones a tokenizer with its configuration.
- int getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence.
- int getStart(): Gets the starting character offset of the current token in the character sequence.
- getText(): Gets the text of the current token, as a string.
- getType(): Gets the type of the current token.
- void reset(CharSequence cs): Resets the tokenizer so that it operates on a new sequence of characters.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance
-
Field Details
-
splitFunction
-
-
Constructor Details
-
SplitFunctionTokenizer
protected SplitFunctionTokenizer()
Constructs a tokenizer, used by OLCUT.
-
SplitFunctionTokenizer
Creates a new tokenizer using the supplied split function.
- Parameters:
splitFunction - The split function.
-
-
Method Details
-
reset
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
-
advance
-
getText
-
getStart
-
getEnd
-
getType
Description copied from interface: Tokenizer
Gets the type of the current token.
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. A cloned tokenizer does not process the same text as the original tokenizer and must be reset with a fresh CharSequence.
- Specified by:
clone in interface Tokenizer
- Overrides:
clone in class Object
- Returns:
A tokenizer with the same configuration, but independent state.
- Throws:
CloneNotSupportedException - if the tokenizer isn't cloneable.
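The reset/advance/accessor cycle and the clone contract described above can be sketched without the library. The class `CursorSketch` below is a hypothetical, whitespace-only stand-in written for illustration. It mirrors the method names listed on this page but is not Tribuo's code, it works on chars rather than codepoints for brevity, and its `clone()` does not declare CloneNotSupportedException.

```java
// Hypothetical illustration of the cursor-style contract; not Tribuo's implementation.
final class CursorSketch implements Cloneable {
    private CharSequence text = "";
    private int position = 0; // next index to examine
    private int start = 0;    // inclusive start of the current token
    private int end = 0;      // exclusive end of the current token

    /** Points the tokenizer at new text and discards all previous state. */
    void reset(CharSequence cs) {
        text = cs;
        position = 0;
        start = 0;
        end = 0;
    }

    /** Moves to the next whitespace-delimited token; returns false when none remain. */
    boolean advance() {
        while (position < text.length() && Character.isWhitespace(text.charAt(position))) {
            position++;
        }
        if (position >= text.length()) {
            return false;
        }
        start = position;
        while (position < text.length() && !Character.isWhitespace(text.charAt(position))) {
            position++;
        }
        end = position;
        return true;
    }

    String getText() { return text.subSequence(start, end).toString(); }
    int getStart() { return start; }
    int getEnd() { return end; }

    /** Copies the configuration only; the copy holds no text until reset is called. */
    @Override
    public CursorSketch clone() {
        try {
            CursorSketch copy = (CursorSketch) super.clone();
            copy.reset("");
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }

    public static void main(String[] args) {
        CursorSketch tokenizer = new CursorSketch();
        tokenizer.reset("split on spaces");
        while (tokenizer.advance()) {
            System.out.println(tokenizer.getText()
                + " [" + tokenizer.getStart() + "," + tokenizer.getEnd() + ")");
        }
        CursorSketch copy = tokenizer.clone();
        copy.reset("fresh text"); // the clone needs its own input
        while (copy.advance()) {
            System.out.println(copy.getText());
        }
    }
}
```

Clearing the copy's state inside `clone()` is one way to make the "independent state" requirement hard to violate: a caller who forgets to reset the clone gets no tokens rather than tokens from the original's text.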
-