Package | Description |
---|---|
org.tribuo.util.tokens.impl | Simple fixed rule tokenizers. |
org.tribuo.util.tokens.impl.wordpiece | Provides a Wordpiece tokenizer that implements the Tribuo Tokenizer API. |
Class | Description |
---|---|
BreakIteratorTokenizer | A tokenizer wrapping a BreakIterator instance. |
NonTokenizer | A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens. |
ShapeTokenizer | This tokenizer is loosely based on the notion of word shape which is a common feature used in NLP. |
SplitCharactersTokenizer | This implementation of Tokenizer is instantiated with an array of characters that are considered split characters. |
SplitFunctionTokenizer | This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens. |
SplitFunctionTokenizer.SplitFunction | An interface for checking if the text should be split at the supplied codepoint. |
SplitFunctionTokenizer.SplitResult | A combination of a SplitFunctionTokenizer.SplitType and a Token.TokenType. |
SplitFunctionTokenizer.SplitType | Defines different ways that a tokenizer can split the input text at a given character. |
SplitPatternTokenizer | This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens. |
WhitespaceTokenizer | A simple tokenizer that splits on whitespace. |
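
The idea of wrapping a BreakIterator can be illustrated with the JDK alone. The sketch below is not Tribuo's BreakIteratorTokenizer: the class name `BreakIteratorSketch` and its `tokenize` helper are hypothetical, and dropping whitespace-only spans is an assumption made here to keep the output readable.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is not part of the Tribuo API.
final class BreakIteratorSketch {

    /**
     * Splits text at the word boundaries reported by a java.text.BreakIterator.
     * Whitespace-only spans are skipped (an assumption of this sketch).
     */
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        // Walk consecutive boundary pairs until the iterator is exhausted.
        for (int end = boundaries.next(); end != BreakIterator.DONE; end = boundaries.next()) {
            String span = text.substring(start, end);
            if (!span.trim().isEmpty()) {
                tokens.add(span);
            }
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world", Locale.US));
    }
}
```

Because the boundaries come from a locale-aware iterator, the same loop applies to any locale the JDK supports without changes to the splitting logic.
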
Class | Description |
---|---|
SplitFunctionTokenizer | This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens. |
SplitFunctionTokenizer.SplitFunction | An interface for checking if the text should be split at the supplied codepoint. |
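
Codepoint-by-codepoint splitting driven by a split function can be sketched in plain Java. Everything named here (`SplitFunctionSketch`, `Decision`, `CodepointSplitFunction`) is hypothetical and deliberately simpler than the Tribuo types listed above; in particular, this sketch assumes only two outcomes per codepoint and discards the codepoint it splits at.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these types are not the Tribuo API.
final class SplitFunctionSketch {

    /** Simplified outcome of inspecting a single codepoint. */
    enum Decision { NO_SPLIT, SPLIT_AT }

    /** Decides, for one codepoint, whether the text should be split there. */
    @FunctionalInterface
    interface CodepointSplitFunction {
        Decision apply(int codepoint);
    }

    /** Iterates over codepoints (not chars) and cuts a token at each SPLIT_AT. */
    static List<String> tokenize(CharSequence text, CodepointSplitFunction fn) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = Character.codePointAt(text, i);
            if (fn.apply(cp) == Decision.SPLIT_AT) {
                // The split codepoint is consumed and not emitted.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            // Advance by one or two chars so surrogate pairs stay intact.
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        CodepointSplitFunction onWhitespace =
            cp -> Character.isWhitespace(cp) ? Decision.SPLIT_AT : Decision.NO_SPLIT;
        System.out.println(tokenize("one  two\tthree", onWhitespace));
    }
}
```

Working on codepoints rather than `char` values means a supplementary character such as an emoji is passed to the split function once, as a single unit.
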
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.