org.tribuo.util.tokens.impl (Tribuo 4.1.1 API)

package org.tribuo.util.tokens.impl

Simple fixed rule tokenizers.

Related Packages

Package

Description

org.tribuo.util.tokens

Core definitions for tokenization.

org.tribuo.util.tokens.impl.wordpiece

Provides an implementation of a Wordpiece tokenizer which implements the Tribuo Tokenizer API.

org.tribuo.util.tokens.options

OLCUT Options implementations which can construct Tokenizers of various types.

org.tribuo.util.tokens.universal

An implementation of a "universal" tokenizer which will split on word boundaries or character boundaries for languages where word boundaries are contextual.

Class

Description

BreakIteratorTokenizer

A tokenizer wrapping a BreakIterator instance.

NonTokenizer

A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens.

ShapeTokenizer

This tokenizer is loosely based on the notion of word shape, which is a common feature used in NLP.

SplitCharactersTokenizer

This implementation of Tokenizer is instantiated with an array of characters that are considered split characters.

SplitCharactersTokenizer.SplitCharactersSplitterFunction

Splits tokens at the supplied characters.

SplitFunctionTokenizer

This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens.

SplitFunctionTokenizer.SplitFunction

An interface for checking if the text should be split at the supplied codepoint.

SplitFunctionTokenizer.SplitResult

A combination of a SplitFunctionTokenizer.SplitType and a Token.TokenType.

SplitFunctionTokenizer.SplitType

Defines different ways that a tokenizer can split the input text at a given character.

SplitPatternTokenizer

This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens.

WhitespaceTokenizer

A simple tokenizer that splits on whitespace.
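
The codepoint-by-codepoint splitting described for SplitFunctionTokenizer, SplitCharactersTokenizer and WhitespaceTokenizer can be illustrated with a minimal sketch that uses only the JDK. This is not Tribuo's implementation: the class SplitSketch and its methods tokenize and splitOn are hypothetical names chosen for illustration, and the sketch assumes the simplest behaviour, where split codepoints are discarded and never emitted as tokens. The real classes additionally track token offsets and types, and SplitFunctionTokenizer.SplitType defines other ways of splitting that are not modelled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical illustration only; not part of the Tribuo API. */
public final class SplitSketch {
    private SplitSketch() {}

    /**
     * Walks the text one codepoint at a time and starts a new token
     * whenever shouldSplit accepts the current codepoint.
     * Split codepoints are dropped; empty tokens are never emitted.
     */
    public static List<String> tokenize(CharSequence text, IntPredicate shouldSplit) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Read a full codepoint so surrogate pairs are not cut in half.
            int cp = Character.codePointAt(text, i);
            if (shouldSplit.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Builds a split function from a fixed set of split characters. */
    public static IntPredicate splitOn(char... splitChars) {
        return cp -> {
            for (char c : splitChars) {
                if (c == cp) {
                    return true;
                }
            }
            return false;
        };
    }

    public static void main(String[] args) {
        // Whitespace splitting: prints [Hello, tokenizer, world]
        System.out.println(tokenize("Hello  tokenizer world", Character::isWhitespace));
        // Split-character splitting: prints [a, b, c]
        System.out.println(tokenize("a,b;c", splitOn(',', ';')));
    }
}
```

Passing the split decision in as a function keeps the iteration loop identical for every rule, so a whitespace rule and a fixed-character rule differ only in the predicate supplied.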
