Uses of Interface org.tribuo.util.tokens.Tokenizer

Packages that use Tokenizer:

org.tribuo.classification.explanations.lime
    Provides an implementation of LIME (Locally Interpretable Model Explanations).

org.tribuo.data.text.impl
    Provides implementations of text data processors.

org.tribuo.util.tokens
    Core definitions for tokenization.

org.tribuo.util.tokens.impl
    Simple fixed rule tokenizers.

org.tribuo.util.tokens.impl.wordpiece
    Provides an implementation of a Wordpiece tokenizer which conforms to the Tribuo Tokenizer API.

org.tribuo.util.tokens.options
    OLCUT Options implementations which can construct Tokenizers of various types.

org.tribuo.util.tokens.universal
    An implementation of a "universal" tokenizer which will split on word boundaries or character boundaries for languages where word boundaries are contextual.
Uses of Tokenizer in org.tribuo.classification.explanations.lime

Constructors in org.tribuo.classification.explanations.lime with parameters of type Tokenizer:

LIMEColumnar(SplittableRandom rng, Model<Label> innerModel, SparseTrainer<Regressor> explanationTrainer, int numSamples, RowProcessor<Label> exampleGenerator, Tokenizer tokenizer)
    Constructs a LIME explainer for a model which uses the columnar data processing system.

LIMEText(SplittableRandom rng, Model<Label> innerModel, SparseTrainer<Regressor> explanationTrainer, int numSamples, TextFeatureExtractor<Label> extractor, Tokenizer tokenizer)
    Constructs a LIME explainer for a model which uses text data.
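Both constructors take a Tokenizer because LIME explains a text prediction by perturbing the input: the text is tokenized, random subsets of tokens are dropped to produce neighbouring samples, and a sparse surrogate model is then trained on those samples. A minimal sketch of that perturbation step in plain Java (no Tribuo on the classpath; the class and method names here are illustrative, not Tribuo's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

/** Illustrative sketch of LIME-style text perturbation; not Tribuo's implementation. */
public final class LimePerturbation {

    /** Generates numSamples variants of the text, each keeping a random subset of tokens. */
    public static List<String> perturb(String text, int numSamples, SplittableRandom rng) {
        String[] tokens = text.split("\\s+"); // stand-in for a Tribuo Tokenizer
        List<String> samples = new ArrayList<>();
        for (int i = 0; i < numSamples; i++) {
            StringBuilder sb = new StringBuilder();
            for (String token : tokens) {
                if (rng.nextBoolean()) { // drop each token with probability 0.5
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(token);
                }
            }
            samples.add(sb.toString());
        }
        return samples;
    }

    public static void main(String[] args) {
        List<String> samples = perturb("the cat sat on the mat", 5, new SplittableRandom(42));
        samples.forEach(System.out::println);
    }
}
```

Because sample quality depends directly on the token boundaries, swapping the tokenizer passed to LIMEText changes which perturbations the surrogate model sees.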
Uses of Tokenizer in org.tribuo.data.text.impl

Constructors in org.tribuo.data.text.impl with parameters of type Tokenizer:

BasicPipeline(Tokenizer tokenizer, int ngram)
    Constructs a basic text pipeline which tokenizes the input and generates word n-gram features in the range 1 to ngram.

NgramProcessor(Tokenizer tokenizer, int n, double value)
    Creates a processor that will generate token ngrams of size n.

TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting)
    Creates a new token pipeline.

TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension)
    Creates a new token pipeline.

TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension, boolean hashPreserveValue)
    Creates a new token pipeline.
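These pipelines turn a token stream into n-gram features: every contiguous window of n tokens becomes one feature. The windowing itself can be sketched in plain Java (the class and joining convention are illustrative, not Tribuo's NgramProcessor):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative n-gram windowing over a token list; not Tribuo's NgramProcessor. */
public final class NgramSketch {

    /** Returns all n-grams of size n, joined with '/', in input order. */
    public static List<String> ngrams(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join("/", tokens.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "cat", "sat");
        System.out.println(ngrams(tokens, 2)); // bigrams: [the/cat, cat/sat]
    }
}
```

A pipeline constructed with ngram = 2 would emit both the unigrams and these bigrams, since the range runs from 1 to ngram.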
Uses of Tokenizer in org.tribuo.util.tokens

Methods in org.tribuo.util.tokens that return Tokenizer:

Tokenizer Tokenizer.clone()
    Clones a tokenizer with its configuration.

Methods in org.tribuo.util.tokens that return types with arguments of type Tokenizer:

static Supplier<Tokenizer> Tokenizer.createSupplier(Tokenizer tokenizer)
    Creates a supplier from the specified tokenizer by cloning it.

static ThreadLocal<Tokenizer> Tokenizer.createThreadLocal(Tokenizer tokenizer)
    Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).

Methods in org.tribuo.util.tokens with parameters of type Tokenizer:

static Supplier<Tokenizer> Tokenizer.createSupplier(Tokenizer tokenizer)
    Creates a supplier from the specified tokenizer by cloning it.

static ThreadLocal<Tokenizer> Tokenizer.createThreadLocal(Tokenizer tokenizer)
    Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).
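Tokenizers are stateful (they track a position in the text being processed), so a single instance must not be shared across threads; createSupplier and createThreadLocal exist to hand each caller its own clone of a configured prototype. The pattern behind them can be sketched without Tribuo, using a stand-in tokenizer type (FakeTokenizer is an assumption for illustration, not Tribuo's interface):

```java
import java.util.function.Supplier;

/** Sketch of the clone-per-thread pattern behind createSupplier/createThreadLocal. */
public final class ThreadLocalSketch {

    /** Stand-in for Tribuo's Tokenizer: holds mutable position state, so it must not be shared. */
    static final class FakeTokenizer {
        int position = 0; // per-instance iteration state

        FakeTokenizer cloneTokenizer() {
            return new FakeTokenizer(); // fresh state, same configuration
        }
    }

    /** Mirrors Tokenizer.createSupplier(Tokenizer): every get() clones the prototype. */
    static Supplier<FakeTokenizer> createSupplier(FakeTokenizer prototype) {
        return prototype::cloneTokenizer;
    }

    /** Mirrors Tokenizer.createThreadLocal(Tokenizer): one clone per thread. */
    static ThreadLocal<FakeTokenizer> createThreadLocal(FakeTokenizer prototype) {
        return ThreadLocal.withInitial(createSupplier(prototype));
    }

    public static void main(String[] args) {
        FakeTokenizer prototype = new FakeTokenizer();
        Supplier<FakeTokenizer> supplier = createSupplier(prototype);
        // Each call yields a distinct instance, so threads never share iteration state.
        System.out.println(supplier.get() != supplier.get()); // prints true
    }
}
```

This is why the two static methods are built on clone(): the prototype carries the configuration, and every thread works on an independent copy.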
Uses of Tokenizer in org.tribuo.util.tokens.impl

Classes in org.tribuo.util.tokens.impl that implement Tokenizer:

class BreakIteratorTokenizer
    A tokenizer wrapping a BreakIterator instance.

class NonTokenizer
    A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens.

class ShapeTokenizer
    This tokenizer is loosely based on the notion of word shape which is a common feature used in NLP.

class SplitCharactersTokenizer
    This implementation of Tokenizer is instantiated with an array of characters that are considered split characters.

class SplitFunctionTokenizer
    This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens.

class SplitPatternTokenizer
    This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens.

class WhitespaceTokenizer
    A simple tokenizer that splits on whitespace.

Methods in org.tribuo.util.tokens.impl that return Tokenizer:
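All of these fixed-rule tokenizers reduce to scanning the input and cutting it at rule-defined boundaries while tracking character offsets. A sketch of a SplitCharactersTokenizer-style scan in plain Java (the Span record and the split set are illustrative assumptions, not Tribuo's Token type):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative offset-tracking scan in the style of SplitCharactersTokenizer. */
public final class SplitCharSketch {

    /** A token with its [start, end) character span, analogous to a token with offsets. */
    record Span(String text, int start, int end) {}

    /** Cuts the input at any character in splitChars, recording each token's offsets. */
    public static List<Span> tokenize(String input, String splitChars) {
        List<Span> spans = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= input.length(); i++) {
            boolean atSplit = i == input.length() || splitChars.indexOf(input.charAt(i)) >= 0;
            if (atSplit) {
                if (i > start) { // skip empty tokens between adjacent split characters
                    spans.add(new Span(input.substring(start, i), start, i));
                }
                start = i + 1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Splitting on comma and space, as a split-characters tokenizer might be configured.
        System.out.println(tokenize("one,two three", ", "));
    }
}
```

Keeping the [start, end) offsets alongside the text is what lets downstream consumers map features or explanations back onto the original string.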
Uses of Tokenizer in org.tribuo.util.tokens.impl.wordpiece

Classes in org.tribuo.util.tokens.impl.wordpiece that implement Tokenizer:

class WordpieceBasicTokenizer
    This is a tokenizer that is used "upstream" of WordpieceTokenizer and implements much of the functionality of the 'BasicTokenizer' implementation in huggingface.

class WordpieceTokenizer
    This Tokenizer is meant to be a reasonable approximation of the BertTokenizer defined here.

Constructors in org.tribuo.util.tokens.impl.wordpiece with parameters of type Tokenizer:

WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit)
    Constructs a wordpiece tokenizer.
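The wordpiece step applies greedy longest-match-first against a vocabulary: take the longest vocabulary prefix of the word, emit it, mark the remainder with a "##" continuation prefix, and repeat. A sketch of that core loop in plain Java with a toy vocabulary (the real WordpieceTokenizer is configured through a Wordpiece vocabulary object and an upstream tokenizer; this stand-alone method is an illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative greedy longest-match-first wordpiece split; toy vocabulary, not Tribuo's. */
public final class WordpieceSketch {

    /** Splits one word into wordpieces, or ["[UNK]"] if no prefix matches. */
    public static List<String> wordpiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String piece = null;
            while (end > start) { // longest-match-first: shrink until the prefix is in vocab
                String candidate = (start == 0 ? "" : "##") + word.substring(start, end);
                if (vocab.contains(candidate)) {
                    piece = candidate;
                    break;
                }
                end--;
            }
            if (piece == null) {
                return List.of("[UNK]"); // the whole word is unknown
            }
            pieces.add(piece);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able");
        System.out.println(wordpiece("unaffable", vocab)); // [un, ##aff, ##able]
    }
}
```

The upstream WordpieceBasicTokenizer handles the whitespace/punctuation split, lowercasing, and accent stripping, so this greedy loop only ever sees one pre-cleaned word at a time.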
Uses of Tokenizer in org.tribuo.util.tokens.options

Methods in org.tribuo.util.tokens.options that return Tokenizer:

BreakIteratorTokenizerOptions.getTokenizer()
CoreTokenizerOptions.getTokenizer()
SplitCharactersTokenizerOptions.getTokenizer()
SplitPatternTokenizerOptions.getTokenizer()
TokenizerOptions.getTokenizer()
    Creates the appropriately configured tokenizer.
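Each of these classes is an OLCUT command-line options holder whose getTokenizer() builds the tokenizer it describes, with TokenizerOptions as the common interface. The dispatch pattern can be sketched in plain Java (the enum, fields, and function-typed tokenizer are illustrative assumptions, not OLCUT or Tribuo types):

```java
import java.util.function.Function;

/** Illustrative sketch of an options holder that constructs a configured tokenizer. */
public final class OptionsSketch {

    enum TokenizerType { WHITESPACE, SPLIT_PATTERN }

    /** Stand-in options holder; in Tribuo these fields would be OLCUT @Option fields. */
    static final class FakeTokenizerOptions {
        TokenizerType type = TokenizerType.WHITESPACE;
        String splitPattern = "\\s+"; // only consulted for SPLIT_PATTERN

        /** Mirrors TokenizerOptions.getTokenizer(): builds the configured tokenizer. */
        Function<String, String[]> getTokenizer() {
            switch (type) {
                case SPLIT_PATTERN:
                    return text -> text.split(splitPattern);
                case WHITESPACE:
                default:
                    return text -> text.split("\\s+");
            }
        }
    }

    public static void main(String[] args) {
        FakeTokenizerOptions opts = new FakeTokenizerOptions();
        opts.type = TokenizerType.SPLIT_PATTERN;
        opts.splitPattern = ",";
        String[] tokens = opts.getTokenizer().apply("a,b,c");
        System.out.println(tokens.length); // prints 3
    }
}
```

Keeping construction behind getTokenizer() means command-line parsing and tokenizer wiring stay in one place, and callers only ever see the Tokenizer interface.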
Uses of Tokenizer in org.tribuo.util.tokens.universal

Classes in org.tribuo.util.tokens.universal that implement Tokenizer:

class UniversalTokenizer
    This class was originally written for the purpose of document indexing in an information retrieval context (principally used in Sun Labs' Minion search engine).

Methods in org.tribuo.util.tokens.universal that return Tokenizer:
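"Universal" here means the tokenizer splits on word boundaries for space-delimited scripts but falls back to character boundaries for scripts, such as Han, where word boundaries are contextual. That script-dependent fallback can be sketched in plain Java (the script test below is a deliberate simplification, not what UniversalTokenizer actually does):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative word-vs-character boundary fallback; a simplification, not UniversalTokenizer. */
public final class UniversalSketch {

    /** Letters accumulate into words, but Han codepoints are emitted one per token. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp))); // character boundary
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp); // word boundary: keep accumulating
            } else {
                flush(word, tokens); // whitespace/punctuation ends the current word
            }
            i += Character.charCount(cp);
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello 世界")); // [hello, 世, 界]
    }
}
```

Emitting single-character tokens for such scripts lets a downstream n-gram pipeline recover multi-character units statistically, which is why this behaviour suits document indexing.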