Package org.tribuo.util.tokens.impl
Class SplitFunctionTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
- Direct Known Subclasses:
SplitCharactersTokenizer, WhitespaceTokenizer, WordpieceBasicTokenizer
This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens. Extensions of this class are initialized with a SplitFunctionTokenizer.SplitFunction, which is called for each character and returns a SplitFunctionTokenizer.SplitResult consisting of a SplitFunctionTokenizer.SplitType and a Token.TokenType. Tokenization is performed based on the SplitFunctionTokenizer.SplitResult returned for each character. Please see the notes below for each SplitFunctionTokenizer.SplitType and SplitFunctionTokenizer.SplitResult.
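As an illustration, the following is a minimal sketch of how an extension might supply a split function. The shape of the SplitFunctionTokenizer.SplitFunction lambda (codepoint, index, full text), the SplitFunctionTokenizer.SplitResult constants SPLIT_AT_WORD and NO_SPLIT_WORD, and the use of OLCUT's ConfiguredObjectProvenanceImpl are assumptions for illustration; check the nested class documentation for the exact signature and available values.

import com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance;
import com.oracle.labs.mlrg.olcut.provenance.impl.ConfiguredObjectProvenanceImpl;
import org.tribuo.util.tokens.impl.SplitFunctionTokenizer;

// Sketch of a tokenizer that splits the input at commas.
public class CommaTokenizer extends SplitFunctionTokenizer {

    public CommaTokenizer() {
        // Assumed SplitFunction shape: (codepoint, index, text) -> SplitResult.
        // SPLIT_AT_WORD (assumed constant) ends the current token and drops the comma;
        // NO_SPLIT_WORD (assumed constant) keeps accumulating the current token.
        super((codepoint, index, text) ->
                codepoint == ',' ? SplitResult.SPLIT_AT_WORD
                                 : SplitResult.NO_SPLIT_WORD);
    }

    @Override
    public CommaTokenizer clone() {
        // The split function is stateless here, so a fresh instance suffices.
        return new CommaTokenizer();
    }

    // A typical OLCUT provenance implementation, needed if the parent class
    // leaves getProvenance() to its subclasses.
    @Override
    public ConfiguredObjectProvenance getProvenance() {
        return new ConfiguredObjectProvenanceImpl(this, "Tokenizer");
    }
}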
Nested Class Summary
- static interface SplitFunctionTokenizer.SplitFunction: An interface for checking if the text should be split at the supplied codepoint.
- static enum SplitFunctionTokenizer.SplitResult: A combination of a SplitFunctionTokenizer.SplitType and a Token.TokenType.
- static enum SplitFunctionTokenizer.SplitType: Defines different ways that a tokenizer can split the input text at a given character.
Field Summary
- protected SplitFunctionTokenizer.SplitFunction splitFunction: The splitting function.
Constructor Summary
- protected SplitFunctionTokenizer(): Constructs a tokenizer, used by OLCUT.
- SplitFunctionTokenizer(SplitFunctionTokenizer.SplitFunction splitFunction): Creates a new tokenizer using the supplied split function.
Method Summary
- boolean advance(): Advances the tokenizer to the next token.
- clone(): Clones a tokenizer with its configuration.
- int getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence.
- int getStart(): Gets the starting character offset of the current token in the character sequence.
- String getText(): Gets the text of the current token, as a string.
- Token.TokenType getType(): Gets the type of the current token.
- void reset(CharSequence cs): Resets the tokenizer so that it operates on a new sequence of characters.

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance
Field Details
- splitFunction
protected SplitFunctionTokenizer.SplitFunction splitFunction
The splitting function.
Constructor Details
- SplitFunctionTokenizer
protected SplitFunctionTokenizer()
Constructs a tokenizer, used by OLCUT.
- SplitFunctionTokenizer
public SplitFunctionTokenizer(SplitFunctionTokenizer.SplitFunction splitFunction)
Creates a new tokenizer using the supplied split function.
- Parameters:
splitFunction - The split function.
Method Details
- reset
public void reset(CharSequence cs)
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
- advance
public boolean advance()
Description copied from interface: Tokenizer
Advances the tokenizer to the next token.
- getText
public String getText()
Description copied from interface: Tokenizer
Gets the text of the current token, as a string.
- getStart
public int getStart()
Description copied from interface: Tokenizer
Gets the starting character offset of the current token in the character sequence.
- getEnd
public int getEnd()
Description copied from interface: Tokenizer
Gets the ending offset (exclusive) of the current token in the character sequence.
- getType
public Token.TokenType getType()
Description copied from interface: Tokenizer
Gets the type of the current token.
- clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers do not process the same text as the original tokenizer and need to be reset with a fresh CharSequence.
- Specified by:
clone in interface Tokenizer
- Overrides:
clone in class Object
- Returns:
A tokenizer with the same configuration, but independent state.
- Throws:
CloneNotSupportedException - if the tokenizer isn't cloneable.
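Putting the methods above together, the typical call sequence for any Tokenizer, including subclasses of this class, is to reset it with a CharSequence and then loop over advance(). The sketch below uses WhitespaceTokenizer, one of the direct known subclasses; its no-argument constructor is an assumption for illustration.

import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.impl.WhitespaceTokenizer;

public class TokenizerWalkthrough {
    public static void main(String[] args) {
        // Assumed no-argument constructor for the WhitespaceTokenizer subclass.
        Tokenizer tokenizer = new WhitespaceTokenizer();

        // reset(CharSequence) points the tokenizer at new input text.
        tokenizer.reset("Hello Tribuo tokenizers");

        // advance() moves to the next token and returns false when the input is exhausted.
        while (tokenizer.advance()) {
            System.out.printf("%s [%d,%d) %s%n",
                    tokenizer.getText(),   // token text as a String
                    tokenizer.getStart(),  // starting character offset (inclusive)
                    tokenizer.getEnd(),    // ending offset (exclusive)
                    tokenizer.getType());  // the Token.TokenType of the current token
        }
    }
}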