Package org.tribuo.util.tokens.universal
An implementation of a "universal" tokenizer which splits on word boundaries, or on character boundaries for languages where word boundaries are contextual (for example, Chinese and Japanese).
It was originally developed to support information retrieval, and it forms a useful baseline tokenizer for generating features for machine learning.
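A minimal usage sketch, assuming the no-argument UniversalTokenizer constructor and the split/tokenize default methods of Tribuo's Tokenizer interface; the input string is illustrative:

    import java.util.List;

    import org.tribuo.util.tokens.Token;
    import org.tribuo.util.tokens.universal.UniversalTokenizer;

    public class UniversalTokenizerExample {
        public static void main(String[] args) {
            // Assumes the no-argument constructor; a sendPunct flag may also
            // be available to control whether punctuation tokens are emitted.
            UniversalTokenizer tokenizer = new UniversalTokenizer();

            // split returns just the token strings.
            List<String> words = tokenizer.split("Tribuo tokenizes text.");
            System.out.println(words);

            // tokenize returns Token objects carrying the token text together
            // with its start and end character offsets in the input.
            List<Token> tokens = tokenizer.tokenize("Tribuo tokenizes text.");
            for (Token token : tokens) {
                System.out.println(token.text + " (start=" + token.start + ", end=" + token.end + ")");
            }
        }
    }

Note that Tribuo tokenizers are stateful (they expose a streaming reset/advance API), so a single instance should not be shared across threads.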
Class                 Description
Range                 A range currently being segmented.
UniversalTokenizer    This class was originally written for the purpose of document indexing in an information retrieval context (principally used in Sun Labs' Minion search engine).