Uses of Interface
org.tribuo.util.tokens.Tokenizer
Packages that use Tokenizer
Package
    Description
org.tribuo.classification.explanations.lime
    Provides an implementation of LIME (Local Interpretable Model-agnostic Explanations).
org.tribuo.data.text.impl
    Provides implementations of text data processors.
org.tribuo.util.tokens
    Core definitions for tokenization.
org.tribuo.util.tokens.impl
    Simple fixed rule tokenizers.
org.tribuo.util.tokens.impl.wordpiece
    Provides an implementation of a Wordpiece tokenizer which implements the Tribuo Tokenizer API.
org.tribuo.util.tokens.options
    OLCUT Options implementations which can construct Tokenizers of various types.
org.tribuo.util.tokens.universal
    An implementation of a "universal" tokenizer which will split on word boundaries, or on character boundaries for languages where word boundaries are contextual.

Uses of Tokenizer in org.tribuo.classification.explanations.lime

Constructors in org.tribuo.classification.explanations.lime with parameters of type Tokenizer

Modifier, Constructor
    Description
LIMEColumnar(SplittableRandom rng, Model<Label> innerModel, SparseTrainer<Regressor> explanationTrainer, int numSamples, RowProcessor<Label> exampleGenerator, Tokenizer tokenizer)
    Constructs a LIME explainer for a model which uses the columnar data processing system.
LIMEText(SplittableRandom rng, Model<Label> innerModel, SparseTrainer<Regressor> explanationTrainer, int numSamples, TextFeatureExtractor<Label> extractor, Tokenizer tokenizer)
    Constructs a LIME explainer for a model which uses text data.

Uses of Tokenizer in org.tribuo.data.text.impl

Constructors in org.tribuo.data.text.impl with parameters of type Tokenizer

Modifier, Constructor
    Description
BasicPipeline(Tokenizer tokenizer, int ngram)
    Constructs a basic text pipeline which tokenizes the input and generates word n-gram features in the range 1 to ngram.
NgramProcessor(Tokenizer tokenizer, int n, double value)
    Creates a processor that will generate token ngrams of size n.
TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting)
    Creates a new token pipeline.
TokenPipeline(Tokenizer tokenizer, int ngram, boolean termCounting, int dimension)
    Creates a new token pipeline.
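
The phrase "word n-gram features in the range 1 to ngram" can be illustrated with a small stand-alone sketch. It does not use or reproduce Tribuo's BasicPipeline or NgramProcessor: the class name NgramSketch, the method ngrams, and the "/" separator are all hypothetical choices made for this example, and the real feature naming in Tribuo may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone illustration only; not Tribuo's BasicPipeline or NgramProcessor. */
class NgramSketch {

    /**
     * Returns every contiguous word n-gram of size 1..maxN.
     * Tokens inside an n-gram are joined with "/" (an arbitrary choice for this sketch).
     */
    static List<String> ngrams(List<String> tokens, int maxN) {
        List<String> features = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            // Slide a window of width n across the token list.
            for (int start = 0; start + n <= tokens.size(); start++) {
                features.add(String.join("/", tokens.subList(start, start + n)));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // With maxN = 2 this yields the three unigrams followed by the two bigrams.
        System.out.println(ngrams(List.of("the", "quick", "fox"), 2));
    }
}
```

With three tokens and maxN = 2 the sketch emits five features: three unigrams and two bigrams.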

Uses of Tokenizer in org.tribuo.util.tokens

Methods in org.tribuo.util.tokens that return Tokenizer

Methods in org.tribuo.util.tokens that return types with arguments of type Tokenizer

Modifier and Type, Method
    Description
Tokenizer.createSupplier(Tokenizer tokenizer)
    Creates a supplier from the specified tokenizer by cloning it.
static ThreadLocal<Tokenizer> Tokenizer.createThreadLocal(Tokenizer tokenizer)
    Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).

Methods in org.tribuo.util.tokens with parameters of type Tokenizer

Modifier and Type, Method
    Description
Tokenizer.createSupplier(Tokenizer tokenizer)
    Creates a supplier from the specified tokenizer by cloning it.
static ThreadLocal<Tokenizer> Tokenizer.createThreadLocal(Tokenizer tokenizer)
    Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).
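
The "supplier by cloning" and "thread local source" descriptions suggest a common pattern for sharing a stateful object across threads, sketched below. This is not the Tribuo implementation: ToyTokenizer is a hypothetical stand-in for a cloneable tokenizer, and the two static methods only mirror the names used in the listing above.

```java
import java.util.function.Supplier;

/** Stand-alone illustration of the clone-based supplier pattern; not Tribuo code. */
class CloneSupplierSketch {

    /** Hypothetical stateful tokenizer; holds the text it is currently splitting. */
    static class ToyTokenizer implements Cloneable {
        private String text = "";

        void reset(String newText) {
            text = newText;
        }

        String[] tokens() {
            return text.trim().split("\\s+");
        }

        @Override
        public ToyTokenizer clone() {
            try {
                return (ToyTokenizer) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    /** Each call to get() on the returned supplier produces a fresh clone of the prototype. */
    static Supplier<ToyTokenizer> createSupplier(ToyTokenizer prototype) {
        return prototype::clone;
    }

    /** Each thread lazily receives its own clone, so mutable state is never shared. */
    static ThreadLocal<ToyTokenizer> createThreadLocal(ToyTokenizer prototype) {
        return ThreadLocal.withInitial(createSupplier(prototype));
    }

    public static void main(String[] args) {
        ThreadLocal<ToyTokenizer> local = createThreadLocal(new ToyTokenizer());
        ToyTokenizer mine = local.get();
        mine.reset("one tokenizer per thread");
        System.out.println(mine.tokens().length);
    }
}
```

The motivation for such a pattern is that a tokenizer holding per-document state cannot safely be called from several threads at once, whereas a per-thread clone can.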

Uses of Tokenizer in org.tribuo.util.tokens.impl

Classes in org.tribuo.util.tokens.impl that implement Tokenizer

Modifier and Type, Class
    Description
class
    A tokenizer wrapping a BreakIterator instance.
class
    A convenience class for when you are required to provide a tokenizer but you don't actually want to split up the text into tokens.
class
    This tokenizer is loosely based on the notion of word shape which is a common feature used in NLP.
class
    This implementation of Tokenizer is instantiated with an array of characters that are considered split characters.
class
    This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens.
class
    This implementation of Tokenizer is instantiated with a regular expression pattern which determines how to split a string into tokens.
class
    A simple tokenizer that splits on whitespace.

Methods in org.tribuo.util.tokens.impl that return Tokenizer
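
To make the idea of "a tokenizer wrapping a BreakIterator instance" concrete, the following stand-alone sketch walks the word boundaries reported by the JDK's java.text.BreakIterator. It is not the Tribuo class: BreakIteratorSketch and its words method are hypothetical, and the decision to drop segments without any letter or digit (whitespace, punctuation) is an assumption made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Stand-alone illustration of BreakIterator-based word splitting; not Tribuo code. */
class BreakIteratorSketch {

    /** Returns the word-like segments of the text for the given locale. */
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Assumption for this sketch: keep only segments containing a letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.US));
    }
}
```

Because the boundaries come from a locale-aware iterator, the same code can be pointed at a different Locale without changing the splitting logic.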

Uses of Tokenizer in org.tribuo.util.tokens.impl.wordpiece

Classes in org.tribuo.util.tokens.impl.wordpiece that implement Tokenizer

Modifier and Type, Class
    Description
class
    This is a tokenizer that is used "upstream" of WordpieceTokenizer and implements much of the functionality of the 'BasicTokenizer' implementation in huggingface.
class
    This Tokenizer is meant to be a reasonable approximation of the BertTokenizer defined here.

Constructors in org.tribuo.util.tokens.impl.wordpiece with parameters of type Tokenizer

Modifier, Constructor
    Description
WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit)
    Constructs a wordpiece tokenizer.
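
For readers unfamiliar with wordpiece splitting, the sketch below shows the greedy longest-match-first scheme commonly associated with BERT-style tokenizers. It is an assumption that the Tribuo classes follow this scheme in detail; the class WordpieceSketch, the method split, the "##" continuation prefix, and the toy vocabulary are all illustrative and not taken from the Tribuo API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Stand-alone illustration of greedy longest-match wordpiece splitting; not Tribuo code. */
class WordpieceSketch {

    /**
     * Splits one word into vocabulary pieces, preferring the longest match at each position.
     * Pieces after the first carry a "##" prefix. If any position has no match,
     * the whole word maps to the unknown token.
     */
    static List<String> split(String word, Set<String> vocab, String unknown) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it appears in the vocabulary.
            while (end > start) {
                String candidate = (start == 0 ? "" : "##") + word.substring(start, end);
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                return List.of(unknown);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able");
        System.out.println(split("unaffable", vocab, "[UNK]"));
    }
}
```

In this scheme an upstream tokenizer first breaks the text into words, and each word is then passed through a routine like split, which matches the role of the Tokenizer constructor parameter listed above.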

Uses of Tokenizer in org.tribuo.util.tokens.options

Methods in org.tribuo.util.tokens.options that return Tokenizer

Modifier and Type, Method
    Description
BreakIteratorTokenizerOptions.getTokenizer()
CoreTokenizerOptions.getTokenizer()
SplitCharactersTokenizerOptions.getTokenizer()
SplitPatternTokenizerOptions.getTokenizer()
TokenizerOptions.getTokenizer()
    Creates the appropriately configured tokenizer.

Uses of Tokenizer in org.tribuo.util.tokens.universal

Classes in org.tribuo.util.tokens.universal that implement Tokenizer

Modifier and Type, Class
    Description
class
    This class was originally written for the purpose of document indexing in an information retrieval context (principally used in Sun Labs' Minion search engine).

Methods in org.tribuo.util.tokens.universal that return Tokenizer