Interface | Description |
---|---|
SplitFunctionTokenizer.SplitFunction | An interface for checking if the text should be split at the supplied codepoint. |

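
The idea of asking, for each codepoint, whether the text should be split there can be sketched with the standard library alone. The class name `CodepointSplitSketch`, the method `tokenize`, and the use of `IntPredicate` are assumptions made for illustration; they are not this package's API, and the real interface returns a richer result than a boolean.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative sketch only; hypothetical names, not the library's implementation.
final class CodepointSplitSketch {

    /** Splits text wherever shouldSplit accepts a codepoint; those codepoints are dropped. */
    static List<String> tokenize(CharSequence text, IntPredicate shouldSplit) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Read a full codepoint so surrogate pairs are never cut in half.
            int cp = Character.codePointAt(text, i);
            if (shouldSplit.test(cp)) {
                // End the current token, skipping empty tokens between adjacent split points.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace splitting and split-character splitting use the same loop.
        System.out.println(tokenize("hello  big world", Character::isWhitespace)); // [hello, big, world]
        System.out.println(tokenize("a,b;c", cp -> cp == ',' || cp == ';'));      // [a, b, c]
    }
}
```

Only the predicate changes between the two calls, which is why a whitespace tokenizer and a split-characters tokenizer can share one iteration loop.
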
Class | Description |
---|---|
BreakIteratorTokenizer | A tokenizer wrapping a BreakIterator instance. |
NonTokenizer | A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens. |
ShapeTokenizer | This tokenizer is loosely based on the notion of word shape, which is a common feature used in NLP. |
SplitCharactersTokenizer | This implementation of Tokenizer is instantiated with an array of characters that are considered split characters. |
SplitCharactersTokenizer.SplitCharactersSplitterFunction | Splits tokens at the supplied characters. |
SplitFunctionTokenizer | This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens. |
SplitPatternTokenizer | This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens. |
WhitespaceTokenizer | A simple tokenizer that splits on whitespace. |

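
As a rough illustration of pattern-based splitting, the sketch below uses `java.util.regex` to cut a string at every match of a split pattern while recording each token's character offsets. `PatternSplitSketch` and its `Span` holder are hypothetical names introduced here; the sketch does not reproduce SplitPatternTokenizer's actual behaviour or token types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; hypothetical names, not the library's implementation.
final class PatternSplitSketch {

    /** Hypothetical token holder: the text plus its [start, end) character offsets. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Returns the non-empty stretches of text that lie between matches of splitPattern. */
    static List<Span> tokenize(CharSequence text, Pattern splitPattern) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = splitPattern.matcher(text);
        int tokenStart = 0;
        while (matcher.find()) {
            if (matcher.start() > tokenStart) {
                spans.add(new Span(text.subSequence(tokenStart, matcher.start()).toString(),
                        tokenStart, matcher.start()));
            }
            // The next token begins where the separator match ends.
            tokenStart = matcher.end();
        }
        if (tokenStart < text.length()) {
            spans.add(new Span(text.subSequence(tokenStart, text.length()).toString(),
                    tokenStart, text.length()));
        }
        return spans;
    }

    public static void main(String[] args) {
        // Split on runs of whitespace and commas.
        System.out.println(tokenize("one, two  three", Pattern.compile("[\\s,]+")));
        // [one[0,3), two[5,8), three[10,15)]
    }
}
```

Keeping the offsets alongside the text lets downstream code map each token back to its position in the original string.
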
Enum | Description |
---|---|
SplitFunctionTokenizer.SplitResult | A combination of a SplitFunctionTokenizer.SplitType and a Token.TokenType. |
SplitFunctionTokenizer.SplitType | Defines different ways that a tokenizer can split the input text at a given character. |

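
To make "different ways to split at a given character" concrete, here is a minimal sketch in which a split decision either drops the character, starts a token before it, ends a token after it, or isolates it as its own token. The `Split` enum and its constant names are assumptions for illustration and may not match the constants of SplitFunctionTokenizer.SplitType; token types are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Illustrative sketch only; hypothetical names, not the library's implementation.
final class SplitTypeSketch {

    /** Hypothetical split decisions, named for illustration. */
    enum Split { NO_SPLIT, SPLIT_AT, SPLIT_BEFORE, SPLIT_AFTER, SPLIT_BEFORE_AND_AFTER }

    static List<String> tokenize(CharSequence text, IntFunction<Split> decide) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            Split split = decide.apply(cp);
            // Close the running token before this codepoint when the decision requires it.
            if (split == Split.SPLIT_AT || split == Split.SPLIT_BEFORE
                    || split == Split.SPLIT_BEFORE_AND_AFTER) {
                flush(current, tokens);
            }
            // SPLIT_AT discards the codepoint; every other decision keeps it.
            if (split != Split.SPLIT_AT) {
                current.appendCodePoint(cp);
            }
            // Close the token right after this codepoint when the decision requires it.
            if (split == Split.SPLIT_AFTER || split == Split.SPLIT_BEFORE_AND_AFTER) {
                flush(current, tokens);
            }
            i += Character.charCount(cp);
        }
        flush(current, tokens);
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    /** Example decision function: drop whitespace, isolate commas and full stops. */
    static Split decide(int cp) {
        if (Character.isWhitespace(cp)) {
            return Split.SPLIT_AT;
        }
        if (cp == ',' || cp == '.') {
            return Split.SPLIT_BEFORE_AND_AFTER;
        }
        return Split.NO_SPLIT;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world.", SplitTypeSketch::decide));
    }
}
```

With this decision function, `"Hello, world."` yields four tokens: `Hello`, `,`, `world` and `.`. The whitespace is discarded, while each punctuation mark survives as a separate token.
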
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.