org.tribuo.util.tokens.Token.TokenType

All Implemented Interfaces: Serializable, Comparable<Token.TokenType>, Constable

Enclosing class: Token

public static enum Token.TokenType extends Enum<Token.TokenType>

Tokenizers may produce multiple kinds of tokens, depending on the application to which they're being put. For example, when processing a document for highlighting during querying, we need to send through whitespace and punctuation so that the document looks as it did in its original form. Most tokenizer applications will only send word tokens.
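
The two applications described above can be sketched as a filter over token types. This is a minimal, self-contained illustration rather than Tribuo's own code: the nested `TokenType` enum is a local stand-in that mirrors the constants of `Token.TokenType`, and `Tok`, `keep`, `join`, and `sample` are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class TokenTypeFilterSketch {
    // Local stand-in mirroring the constants of Token.TokenType (assumption:
    // real code would import org.tribuo.util.tokens.Token instead).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // Hypothetical minimal token: just its text and its type.
    static final class Tok {
        final String text;
        final TokenType type;

        Tok(String text, TokenType type) {
            this.text = text;
            this.type = type;
        }
    }

    // Keeps only the tokens whose type is in the allowed set, preserving order.
    static List<Tok> keep(List<Tok> tokens, Set<TokenType> allowed) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : tokens) {
            if (allowed.contains(t.type)) {
                out.add(t);
            }
        }
        return out;
    }

    // Concatenates the token text in order, with no separator.
    static String join(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            sb.append(t.text);
        }
        return sb.toString();
    }

    // Hand-built token stream for the text "Hello, world!".
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok("Hello", TokenType.WORD),
            new Tok(",", TokenType.PUNCTUATION),
            new Tok(" ", TokenType.WHITESPACE),
            new Tok("world", TokenType.WORD),
            new Tok("!", TokenType.PUNCTUATION));
    }

    public static void main(String[] args) {
        List<Tok> tokens = sample();
        // Highlighting: every token type passes through, so the text is reproduced.
        System.out.println(join(keep(tokens, EnumSet.allOf(TokenType.class))));
        // Indexing: only WORD tokens are kept.
        for (Tok t : keep(tokens, EnumSet.of(TokenType.WORD))) {
            System.out.println(t.text);
        }
    }
}
```

Using an `EnumSet` for the allowed types keeps the choice of which kinds to pass through as a single argument, so the same filter can serve both the highlighting and the word-only case.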

Nested Class Summary

Nested classes/interfaces inherited from class java.lang.Enum
Enum.EnumDesc<E extends Enum<E>>
Enum Constant Summary

Enum Constants

Enum Constant

Description

INFIX

Some tokenizers produce "sub-word" tokens.

NGRAM

An NGRAM corresponds to a token that might correspond to a character ngram, i.e. some portion or sub-span of a regular word token.

PREFIX

Some tokenizers produce "sub-word" tokens.

PUNCTUATION

A PUNCTUATION corresponds to tokens consisting of punctuation characters.

SUFFIX

Some tokenizers produce "sub-word" tokens.

UNKNOWN

Some tokenizers may work in concert with vocabulary data.

WHITESPACE

Some tokenizers may produce tokens corresponding to whitespace (e.g. space, tab, newline, etc.).

WORD

A WORD corresponds to a token that does not consist of or contain whitespace and may correspond to a regular "word" that could be looked up in a dictionary.
Method Summary

Modifier and Type

Method

Description

static Token.TokenType

valueOf(String name)

Returns the enum constant of this class with the specified name.

static Token.TokenType[]

values()

Returns an array containing the constants of this enum class, in the order they are declared.

Methods inherited from class java.lang.Enum
clone, compareTo, describeConstable, equals, finalize, getDeclaringClass, hashCode, name, ordinal, toString, valueOf

Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait

Enum Constant Details
- WORD
  
  public static final Token.TokenType WORD
  
  A WORD corresponds to a token that does not consist of or contain whitespace and may correspond to a regular "word" that could be looked up in a dictionary. Some tokenizers do not distinguish between different kinds of tokens and may use this as a default type for all generated tokens.
- NGRAM
  
  public static final Token.TokenType NGRAM
  
  An NGRAM corresponds to a token that might correspond to a character ngram, i.e. some portion or sub-span of a regular word token.
- PUNCTUATION
  
  public static final Token.TokenType PUNCTUATION
  
  A PUNCTUATION corresponds to tokens consisting of punctuation characters. In some applications, PUNCTUATION tokens may be treated differently because they may have less semantic content than regular word tokens.
- WHITESPACE
  
  public static final Token.TokenType WHITESPACE
  
  Some tokenizers may produce tokens corresponding to whitespace (e.g. space, tab, newline, etc.). It may be important for consumers of tokens generated by a tokenizer to ignore or skip WHITESPACE tokens to avoid unexpected behavior.
- PREFIX
  
  public static final Token.TokenType PREFIX
  
  Some tokenizers produce "sub-word" tokens. A PREFIX corresponds to a sub-word word-prefix token.
- SUFFIX
  
  public static final Token.TokenType SUFFIX
  
  Some tokenizers produce "sub-word" tokens. A SUFFIX corresponds to a sub-word word-suffix token.
- INFIX
  
  public static final Token.TokenType INFIX
  
  Some tokenizers produce "sub-word" tokens. An INFIX corresponds to a sub-word "infix" token (i.e. from the middle).
- UNKNOWN
  
  public static final Token.TokenType UNKNOWN
  
  Some tokenizers may work in concert with vocabulary data. Some applications may treat out-of-vocabulary tokens differently than other tokens. An UNKNOWN token corresponds to a token that is out-of-vocabulary or has never been seen before.
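
To make the distinctions above concrete, here is a rough sketch of how a tokenizer might assign these types. The rules are assumptions chosen for illustration and are not taken from any Tribuo tokenizer: `classify` and `labelPieces` are hypothetical helpers, the vocabulary is a plain `Set<String>`, and the nested `TokenType` enum is a local stand-in mirroring `Token.TokenType`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TokenTypeAssignmentSketch {
    // Local stand-in mirroring the constants of Token.TokenType (assumption).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // Assumed rule set: all-whitespace text is WHITESPACE, text with no letters,
    // digits, or whitespace is PUNCTUATION, in-vocabulary text is WORD, and
    // everything else is UNKNOWN (out-of-vocabulary).
    static TokenType classify(String text, Set<String> vocabulary) {
        if (text.isEmpty()) {
            throw new IllegalArgumentException("empty token text");
        }
        if (text.chars().allMatch(Character::isWhitespace)) {
            return TokenType.WHITESPACE;
        }
        if (text.chars().allMatch(c -> !Character.isLetterOrDigit(c) && !Character.isWhitespace(c))) {
            return TokenType.PUNCTUATION;
        }
        return vocabulary.contains(text) ? TokenType.WORD : TokenType.UNKNOWN;
    }

    // Assumed labelling for the sub-word pieces of one word, by position:
    // first piece PREFIX, last piece SUFFIX, anything between INFIX.
    // A word that was not split at all is labelled WORD.
    static List<TokenType> labelPieces(List<String> pieces) {
        List<TokenType> labels = new ArrayList<>();
        int last = pieces.size() - 1;
        for (int i = 0; i <= last; i++) {
            if (last == 0) {
                labels.add(TokenType.WORD);
            } else if (i == 0) {
                labels.add(TokenType.PREFIX);
            } else if (i == last) {
                labels.add(TokenType.SUFFIX);
            } else {
                labels.add(TokenType.INFIX);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        Set<String> vocabulary = new HashSet<>(Arrays.asList("cat", "sat"));
        for (String text : Arrays.asList("cat", " ", "...", "zyzzyva")) {
            System.out.println("[" + text + "] -> " + classify(text, vocabulary));
        }
        System.out.println(labelPieces(Arrays.asList("un", "believ", "able")));
    }
}
```

A real tokenizer may draw these lines differently, for example by treating digits separately or by consulting the vocabulary before splitting a word into pieces.
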
Method Details
- values
  
  public static Token.TokenType[] values()
  
  Returns an array containing the constants of this enum class, in the order they are declared.
  
  Returns:
  
  an array containing the constants of this enum class, in the order they are declared
- valueOf
  
  public static Token.TokenType valueOf(String name)
  
  Returns the enum constant of this class with the specified name. The string must match exactly an identifier used to declare an enum constant in this class. (Extraneous whitespace characters are not permitted.)
  
  Parameters:
  
  name - the name of the enum constant to be returned.
  
  Returns:
  
  the enum constant with the specified name
  
  Throws:
  
  IllegalArgumentException - if this enum class has no constant with the specified name
  
  NullPointerException - if the argument is null
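
The behavior of `values()` and `valueOf(String)` described above can be exercised as follows. This sketch uses a local stand-in enum so that it runs without Tribuo on the classpath. Its constants are declared in the order of the Enum Constant Details section, on the assumption that this matches the declaration order in the source. `parseOrDefault` is a hypothetical helper, not part of the API, showing one way to tolerate input that `valueOf` would reject.

```java
import java.util.Locale;

final class TokenTypeLookupSketch {
    // Local stand-in mirroring Token.TokenType (assumption: same declaration order).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // Hypothetical lenient lookup: trims and upper-cases the input, and returns
    // the caller's fallback where valueOf would throw.
    static TokenType parseOrDefault(String raw, TokenType fallback) {
        if (raw == null) {
            return fallback; // valueOf(null) would throw NullPointerException
        }
        try {
            return TokenType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return fallback; // no constant with that name
        }
    }

    public static void main(String[] args) {
        // values() returns the constants in declaration order.
        for (TokenType t : TokenType.values()) {
            System.out.println(t.ordinal() + " " + t.name());
        }

        // valueOf requires an exact, case-sensitive match.
        System.out.println(TokenType.valueOf("WORD"));
        try {
            TokenType.valueOf("word");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: word");
        }

        System.out.println(parseOrDefault(" punctuation ", TokenType.UNKNOWN));
        System.out.println(parseOrDefault("emoji", TokenType.UNKNOWN));
    }
}
```

Upper-casing with `Locale.ROOT` avoids locale-specific case mappings (such as the Turkish dotless i) changing the constant name that is looked up.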
