Package org.tribuo.util.tokens
Interface Tokenizer
- All Superinterfaces:
Cloneable, com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
- All Known Implementing Classes:
BreakIteratorTokenizer, NonTokenizer, ShapeTokenizer, SplitCharactersTokenizer, SplitFunctionTokenizer, SplitPatternTokenizer, UniversalTokenizer, WhitespaceTokenizer, WordpieceBasicTokenizer, WordpieceTokenizer
public interface Tokenizer
extends com.oracle.labs.mlrg.olcut.config.Configurable, Cloneable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
An interface for things that tokenize text: breaking it into words according
to some set of rules.
Note that tokenizers are not guaranteed to be thread safe! Using the same tokenizer from multiple threads may result in strange behavior.
Tokenizers which are not ready throw IllegalStateException when advance() or any get method is called.
Most Tokenizers support cloning; clone() throws CloneNotSupportedException for those that do not.
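The reset/advance/get protocol described above can be illustrated with a small self-contained sketch. MiniWhitespaceTokenizer below is a hypothetical stand-in written for this example, not a Tribuo class. It only mirrors the method names reset, advance, getText, getStart and getEnd and the documented IllegalStateException behaviour, and it assumes that tokens are runs of non-whitespace characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the method names of the Tokenizer interface.
// It is NOT a Tribuo class; it splits on whitespace purely for illustration.
class MiniWhitespaceTokenizer {
    private CharSequence cs;   // text being tokenized, null until reset is called
    private int start = -1;    // inclusive start of the current token
    private int end = -1;      // exclusive end of the current token
    private int pos = 0;       // next character to examine

    // Point the tokenizer at a new character sequence and clear the current token.
    void reset(CharSequence cs) {
        this.cs = cs;
        this.start = -1;
        this.end = -1;
        this.pos = 0;
    }

    // Move to the next token; returns false once the text is exhausted.
    boolean advance() {
        if (cs == null) {
            throw new IllegalStateException("reset must be called before advance");
        }
        // Skip any whitespace before the next token.
        while (pos < cs.length() && Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        if (pos >= cs.length()) {
            start = -1;
            end = -1;
            return false;
        }
        start = pos;
        // Consume the run of non-whitespace characters.
        while (pos < cs.length() && !Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        end = pos;
        return true;
    }

    String getText() { checkReady(); return cs.subSequence(start, end).toString(); }

    int getStart() { checkReady(); return start; }

    int getEnd() { checkReady(); return end; }

    // The get methods are only valid while there is a current token.
    private void checkReady() {
        if (cs == null || start < 0) {
            throw new IllegalStateException("no current token");
        }
    }
}

class TokenizerLoopDemo {
    // Collects "text[start,end)" descriptions using the reset/advance/get loop.
    static List<String> describe(String text) {
        MiniWhitespaceTokenizer tokenizer = new MiniWhitespaceTokenizer();
        tokenizer.reset(text);
        List<String> out = new ArrayList<>();
        while (tokenizer.advance()) {
            out.add(tokenizer.getText() + "[" + tokenizer.getStart() + "," + tokenizer.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("hello  big world"));
    }
}
```

The same loop shape should apply to any implementation of this interface: call reset once per input, then read the get methods only after advance has returned true.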
-
Method Summary
Modifier and Type | Method | Description
boolean | advance() | Advances the tokenizer to the next token.
Tokenizer | clone() | Clones a tokenizer with its configuration.
static Supplier<Tokenizer> | createSupplier(Tokenizer tokenizer) | Creates a supplier from the specified tokenizer by cloning it.
static ThreadLocal<Tokenizer> | createThreadLocal(Tokenizer tokenizer) | Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).
int | getEnd() | Gets the ending offset (exclusive) of the current token in the character sequence.
int | getStart() | Gets the starting character offset of the current token in the character sequence.
String | getText() | Gets the text of the current token, as a string.
default Token | getToken() | Generates a Token object from the current state of the tokenizer.
Token.TokenType | getType() | Gets the type of the current token.
void | reset(CharSequence cs) | Resets the tokenizer so that it operates on a new sequence of characters.
default List<String> | split(CharSequence cs) | Uses this tokenizer to split a string into its component substrings.
default List<Token> | tokenize(CharSequence cs) | Uses this tokenizer to tokenize a string and return the list of tokens that were generated.
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance
-
Method Details
-
createSupplier
static Supplier<Tokenizer> createSupplier(Tokenizer tokenizer)
Creates a supplier from the specified tokenizer by cloning it.
- Parameters:
tokenizer - The tokenizer to copy.
- Returns:
A supplier of tokenizers.
-
createThreadLocal
static ThreadLocal<Tokenizer> createThreadLocal(Tokenizer tokenizer)
Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).
- Parameters:
tokenizer - The tokenizer to copy.
- Returns:
A thread local for tokenizers.
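Because tokenizers are not guaranteed to be thread safe, giving each thread its own clone is the usual way to share one configuration across threads. The sketch below shows one way such a pattern can be built with the standard ThreadLocal.withInitial; it is not taken from the Tribuo source. CloneableSplitter is a hypothetical stand-in for a tokenizer, and the choice to wrap CloneNotSupportedException in an IllegalStateException is an assumption made for this example.

```java
import java.util.function.Supplier;

// Hypothetical cloneable stand-in for a tokenizer; NOT a Tribuo class.
class CloneableSplitter implements Cloneable {
    private final char separator;   // configuration, copied by clone
    private CharSequence current;   // per-instance state, cleared on clone

    CloneableSplitter(char separator) {
        this.separator = separator;
    }

    void reset(CharSequence cs) {
        this.current = cs;
    }

    boolean isReady() {
        return current != null;
    }

    char getSeparator() {
        return separator;
    }

    @Override
    public CloneableSplitter clone() throws CloneNotSupportedException {
        CloneableSplitter copy = (CloneableSplitter) super.clone();
        copy.current = null;   // same configuration, fresh state
        return copy;
    }
}

class ThreadLocalTokenizerDemo {
    // Wraps clone() in a Supplier, converting the checked exception (an assumption of this sketch).
    static Supplier<CloneableSplitter> createSupplier(CloneableSplitter prototype) {
        return () -> {
            try {
                return prototype.clone();
            } catch (CloneNotSupportedException e) {
                throw new IllegalStateException("prototype is not cloneable", e);
            }
        };
    }

    // Each thread lazily receives its own clone of the prototype on first get().
    static ThreadLocal<CloneableSplitter> createThreadLocal(CloneableSplitter prototype) {
        return ThreadLocal.withInitial(createSupplier(prototype));
    }

    public static void main(String[] args) throws InterruptedException {
        CloneableSplitter prototype = new CloneableSplitter(',');
        ThreadLocal<CloneableSplitter> local = createThreadLocal(prototype);
        CloneableSplitter[] seen = new CloneableSplitter[2];
        Thread a = new Thread(() -> seen[0] = local.get());
        Thread b = new Thread(() -> seen[1] = local.get());
        a.start();
        b.start();
        a.join();
        b.join();
        // The two threads hold distinct instances with the same configuration.
        System.out.println(seen[0] != seen[1]);
    }
}
```

Note that a clone obtained this way starts without any text, so each thread still has to call reset before advancing.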
-
reset
void reset(CharSequence cs)
Resets the tokenizer so that it operates on a new sequence of characters.
- Parameters:
cs - a character sequence to tokenize
-
advance
boolean advance()
Advances the tokenizer to the next token.
- Returns:
true if there is such a token, false otherwise.
-
getText
String getText()
Gets the text of the current token, as a string.
- Returns:
the text of the current token
-
getStart
int getStart()
Gets the starting character offset of the current token in the character sequence.
- Returns:
the starting character offset of the token
-
getEnd
int getEnd()
Gets the ending offset (exclusive) of the current token in the character sequence.
- Returns:
the exclusive ending character offset for the current token.
-
getType
Token.TokenType getType()
Gets the type of the current token.
- Returns:
the type of the current token.
-
clone
Tokenizer clone() throws CloneNotSupportedException
Clones a tokenizer with its configuration. A cloned tokenizer does not share the text being processed by the original and must be reset with a fresh CharSequence before use.
- Returns:
A tokenizer with the same configuration, but independent state.
- Throws:
CloneNotSupportedException - if the tokenizer isn't cloneable.
-
getToken
default Token getToken()
Generates a Token object from the current state of the tokenizer.
- Returns:
The token object from the current state.
-
tokenize
default List<Token> tokenize(CharSequence cs)
Uses this tokenizer to tokenize a string and return the list of tokens that were generated. Many applications will simply want to take a character sequence and get a list of tokens, so this will do that for them.
Here is the contract of the tokenize function:
- all returned tokens correspond to substrings of the input text
- the tokens do not overlap
- the tokens are returned in the order that they appear in the text
- the value of Token.text should be the same as calling text.substring(token.start, token.end)
- Parameters:
cs - a sequence of characters to tokenize
- Returns:
the tokens discovered in the character sequence, in order.
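The four contract points above translate directly into a check that could be used when testing a custom tokenizer. The following is a minimal sketch under stated assumptions: Span is a hypothetical value class with text, start and end fields standing in for a token (it is not the Tribuo Token class), and satisfiesContract is a helper name invented for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value class standing in for a token; NOT the Tribuo Token class.
final class Span {
    final String text;
    final int start;   // inclusive start offset
    final int end;     // exclusive end offset

    Span(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

class TokenizeContractCheck {
    // Returns true if the spans satisfy the four contract points for the given text.
    static boolean satisfiesContract(String text, List<Span> tokens) {
        int previousEnd = 0;
        for (Span t : tokens) {
            // Each token must correspond to a substring of the input text.
            if (t.start < 0 || t.end > text.length() || t.start > t.end) {
                return false;
            }
            // Tokens must appear in text order and must not overlap.
            if (t.start < previousEnd) {
                return false;
            }
            // The token text must equal text.substring(start, end).
            if (!text.substring(t.start, t.end).equals(t.text)) {
                return false;
            }
            previousEnd = t.end;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Span> good = Arrays.asList(new Span("to", 0, 2), new Span("be", 3, 5));
        List<Span> overlapping = Arrays.asList(new Span("to", 0, 2), new Span("o b", 1, 4));
        System.out.println(satisfiesContract("to be", good));
        System.out.println(satisfiesContract("to be", overlapping));
    }
}
```

Tracking only the end of the previous token is enough here, because a token that starts before that point is either out of order or overlapping.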
-
split
default List<String> split(CharSequence cs)
Uses this tokenizer to split a string into its component substrings. Many applications will simply want the component strings making up a larger character sequence.
- Parameters:
cs - the character sequence to tokenize
- Returns:
a list of strings making up the character sequence.
-