Package org.tribuo.util.tokens
Enum Class Token.TokenType
- All Implemented Interfaces:
Serializable, Comparable<Token.TokenType>, Constable
- Enclosing class:
- Token
Tokenizers may produce multiple kinds of tokens, depending on the application
to which they're being put. For example, when processing a document for
highlighting during querying, we need to send through whitespace and
punctuation so that the document looks as it did in its original form. In
most other applications, tokenizers will only send word tokens.
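The two use cases described above can be sketched as follows. This is only an illustration: the nested `TokenType` enum is a local stand-in that mirrors the constant names documented on this page, and `Tok`, `rebuild`, `wordsOnly` and `sample` are hypothetical helpers, not part of Tribuo. In real code you would use `org.tribuo.util.tokens.Token` and its `Token.TokenType` instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenTypeFilterSketch {
    // Local stand-in mirroring the constants of Token.TokenType (assumption:
    // used here only so the sketch runs without Tribuo on the classpath).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // Hypothetical minimal token: text plus a type.
    static final class Tok {
        final String text;
        final TokenType type;
        Tok(String text, TokenType type) { this.text = text; this.type = type; }
    }

    // Highlighting use case: every token is kept, so concatenating the
    // token texts reproduces the document in its original form.
    static String rebuild(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            sb.append(t.text);
        }
        return sb.toString();
    }

    // Common use case: only WORD tokens are passed on.
    static List<String> wordsOnly(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.type == TokenType.WORD) {
                out.add(t.text);
            }
        }
        return out;
    }

    // A hand-built token stream for "Hello, world!".
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok("Hello", TokenType.WORD),
            new Tok(",", TokenType.PUNCTUATION),
            new Tok(" ", TokenType.WHITESPACE),
            new Tok("world", TokenType.WORD),
            new Tok("!", TokenType.PUNCTUATION));
    }

    public static void main(String[] args) {
        System.out.println(rebuild(sample()));   // Hello, world!
        System.out.println(wordsOnly(sample())); // [Hello, world]
    }
}
```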
-
Nested Class Summary
Nested classes/interfaces inherited from class java.lang.Enum
Enum.EnumDesc<E extends Enum<E>>
-
Enum Constant Summary
Enum Constant
- Description
INFIX
- Some tokenizers produce "sub-word" tokens.
NGRAM
- An NGRAM corresponds to a token that might correspond to a character ngram.
PREFIX
- Some tokenizers produce "sub-word" tokens.
PUNCTUATION
- A PUNCTUATION corresponds to tokens consisting of punctuation characters.
SUFFIX
- Some tokenizers produce "sub-word" tokens.
UNKNOWN
- Some tokenizers may work in concert with vocabulary data.
WHITESPACE
- Some tokenizers may produce tokens corresponding to whitespace.
WORD
- A WORD corresponds to a token that does not consist of or contain whitespace and may correspond to a regular "word" that could be looked up in a dictionary.
-
Method Summary
Modifier and Type, Method
- Description
static Token.TokenType valueOf(String name)
- Returns the enum constant of this class with the specified name.
static Token.TokenType[] values()
- Returns an array containing the constants of this enum class, in the order they are declared.
-
Enum Constant Details
-
WORD
A WORD corresponds to a token that does not consist of or contain whitespace and may correspond to a regular "word" that could be looked up in a dictionary. Some tokenizers do not distinguish between different kinds of tokens and may use this as a default type for all generated tokens.
-
NGRAM
An NGRAM corresponds to a token that might be a character ngram, i.e. some portion or sub-span of a regular word token.
-
PUNCTUATION
A PUNCTUATION corresponds to tokens consisting of punctuation characters. In some applications, PUNCTUATION tokens may be treated differently because they may have less semantic content than regular word tokens.
-
WHITESPACE
Some tokenizers may produce tokens corresponding to whitespace (e.g. space, tab, newline). It may be important for consumers of tokens generated by a tokenizer to ignore/skip WHITESPACE tokens to avoid unexpected behavior.
-
PREFIX
Some tokenizers produce "sub-word" tokens. A PREFIX corresponds to a sub-word word-prefix token.
-
SUFFIX
Some tokenizers produce "sub-word" tokens. A SUFFIX corresponds to a sub-word word-suffix token.
-
INFIX
Some tokenizers produce "sub-word" tokens. An INFIX corresponds to a sub-word "infix" token (i.e. from the middle).
-
UNKNOWN
Some tokenizers may work in concert with vocabulary data. Some applications may treat out-of-vocabulary tokens differently than other tokens. An UNKNOWN token corresponds to a token that is out-of-vocabulary or has never been seen before.
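As a rough sketch of how a consumer might act on these types, the example below skips WHITESPACE tokens, replaces UNKNOWN tokens with a placeholder, and groups the three sub-word types in an `EnumSet`. The nested `TokenType` enum is again a local stand-in for `Token.TokenType`; the `normalize` helper, the `SUB_WORD` set and the `"<unk>"` placeholder are assumptions made for illustration, not Tribuo behavior.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class TokenTypeDispatchSketch {
    // Local stand-in mirroring the constants of Token.TokenType (assumption).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // The three types documented as "sub-word" tokens.
    static final Set<TokenType> SUB_WORD =
        EnumSet.of(TokenType.PREFIX, TokenType.INFIX, TokenType.SUFFIX);

    // Hypothetical placeholder for out-of-vocabulary tokens.
    static final String UNK = "<unk>";

    // texts[i] is the text of the i-th token and types[i] its type.
    static List<String> normalize(String[] texts, TokenType[] types) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            switch (types[i]) {
                case WHITESPACE:
                    break;                 // skip whitespace tokens entirely
                case UNKNOWN:
                    out.add(UNK);          // treat out-of-vocabulary tokens specially
                    break;
                default:
                    out.add(texts[i]);     // keep everything else unchanged
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] texts = {"un", "believ", "able", " ", "zxqv", "."};
        TokenType[] types = {
            TokenType.PREFIX, TokenType.INFIX, TokenType.SUFFIX,
            TokenType.WHITESPACE, TokenType.UNKNOWN, TokenType.PUNCTUATION};
        System.out.println(normalize(texts, types));           // [un, believ, able, <unk>, .]
        System.out.println(SUB_WORD.contains(TokenType.INFIX)); // true
    }
}
```

Whether PUNCTUATION or NGRAM tokens should also be filtered depends on the application; the `default` branch here simply passes them through.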
-
-
Method Details
-
values
Returns an array containing the constants of this enum class, in the order they are declared.
- Returns:
- an array containing the constants of this enum class, in the order they are declared
-
valueOf
Returns the enum constant of this class with the specified name. The string must match exactly an identifier used to declare an enum constant in this class. (Extraneous whitespace characters are not permitted.)
- Parameters:
name
- the name of the enum constant to be returned.
- Returns:
- the enum constant with the specified name
- Throws:
IllegalArgumentException
- if this enum class has no constant with the specified name
NullPointerException
- if the argument is null
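The sketch below illustrates the behavior of `values()` and `valueOf` described above, using standard `java.lang.Enum` semantics. The nested `TokenType` enum is a local stand-in whose constants follow the order of the constant details on this page (assumed here to be the declaration order), and `parseOrDefault` is a hypothetical helper showing one way to tolerate untrimmed or lower-case input rather than letting `valueOf` throw.

```java
import java.util.Locale;

public class TokenTypeLookupSketch {
    // Local stand-in mirroring the constants of Token.TokenType (assumption).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // Hypothetical lenient lookup: trims and upper-cases the name, and returns
    // the fallback instead of propagating the exceptions valueOf would throw.
    static TokenType parseOrDefault(String name, TokenType fallback) {
        if (name == null) {
            return fallback; // valueOf(null) would throw NullPointerException
        }
        try {
            return TokenType.valueOf(name.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return fallback; // no constant with that name
        }
    }

    public static void main(String[] args) {
        // values() returns the constants in declaration order.
        for (TokenType t : TokenType.values()) {
            System.out.println(t.ordinal() + " " + t);
        }

        // valueOf requires an exact match on the declared identifier.
        System.out.println(TokenType.valueOf("WORD")); // WORD
        try {
            TokenType.valueOf("word");
        } catch (IllegalArgumentException e) {
            System.out.println("no constant named 'word'");
        }

        System.out.println(parseOrDefault(" word ", TokenType.UNKNOWN)); // WORD
        System.out.println(parseOrDefault("emoji", TokenType.UNKNOWN));  // UNKNOWN
    }
}
```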
-