Simple fixed rule tokenizers.
Class descriptions:
A tokenizer wrapping a BreakIterator instance.
A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens.
This tokenizer is loosely based on the notion of word shape which is a common feature used in NLP.
This implementation of Tokenizer is instantiated with an array of characters that are considered split characters.
Splits tokens at the supplied characters.
This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens.
An interface for checking if the text should be split at the supplied codepoint.
A combination of a Token.TokenType.
Defines different ways that a tokenizer can split the input text at a given character.
This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens.
A simple tokenizer that splits on whitespace.
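
To make two of the ideas above concrete (wrapping a BreakIterator, and deciding codepoint by codepoint whether to split), here is a minimal sketch that uses only the Java standard library. It does not use this package's own classes: TokenizerSketch, breakIteratorTokens and splitAtCodepoints are hypothetical names chosen for illustration, and details such as dropping whitespace-only segments are assumptions rather than documented behaviour of the tokenizers listed here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.IntPredicate;

// Hypothetical illustration class; not part of the documented package.
final class TokenizerSketch {

    // Collects the segments found by a word BreakIterator.
    // Assumption: segments consisting only of whitespace are not tokens.
    static List<String> breakIteratorTokens(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (!segment.trim().isEmpty()) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    // Walks the text one codepoint at a time and splits wherever the
    // predicate accepts the codepoint. Split codepoints are discarded and
    // empty tokens are never emitted.
    static List<String> splitAtCodepoints(String text, IntPredicate isSplit) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isSplit.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            // Advance by one or two chars so surrogate pairs stay intact.
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(breakIteratorTokens("Hello, world!", Locale.US));
        // Split on a fixed set of characters.
        System.out.println(splitAtCodepoints("a,b;;c", cp -> cp == ',' || cp == ';'));
        // Split on whitespace.
        System.out.println(splitAtCodepoints("the quick  fox", Character::isWhitespace));
    }
}
```

Passing the split decision as an IntPredicate keeps the iteration loop independent of the rule, so a fixed character set and a whitespace check can share the same code. That is one plausible reading of why the package separates the split-checking interface from the tokenizer that uses it.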