Enum Class Token.TokenType

java.lang.Object
java.lang.Enum<Token.TokenType>
org.tribuo.util.tokens.Token.TokenType
All Implemented Interfaces:
Serializable, Comparable<Token.TokenType>, Constable
Enclosing class:
Token

public static enum Token.TokenType extends Enum<Token.TokenType>
Tokenizers may produce multiple kinds of tokens, depending on the application to which they're being put. For example, when processing a document for highlighting during querying, we need to send through whitespace and punctuation so that the document looks as it did in its original form. Most tokenizer applications will only send word tokens.
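As a rough illustration of the two use cases described above, the sketch below contrasts a consumer that keeps only WORD tokens with a highlighting-style consumer that keeps every token so the original text can be rebuilt. To stay runnable without Tribuo on the classpath, it uses a local stand-in enum named TokenType (mirroring the constant names on this page) and a hypothetical Tok holder class; neither is the real org.tribuo.util.tokens.Token API, and the sample tokens are made up.

```java
import java.util.ArrayList;
import java.util.List;

class TokenTypeFilterSketch {
    // Stand-in for org.tribuo.util.tokens.Token.TokenType (assumption: same constant names).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // Hypothetical minimal token holder; not the Tribuo Token class.
    static final class Tok {
        final String text;
        final TokenType type;

        Tok(String text, TokenType type) {
            this.text = text;
            this.type = type;
        }
    }

    // Keeps only the text of WORD tokens, as most tokenizer applications would.
    static List<String> wordsOnly(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.type == TokenType.WORD) {
                out.add(t.text);
            }
        }
        return out;
    }

    // Concatenates every token, including whitespace and punctuation,
    // so the text looks as it did in its original form.
    static String rebuild(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            sb.append(t.text);
        }
        return sb.toString();
    }

    // Made-up sample token stream for "Hello, world!".
    static List<Tok> sample() {
        return List.of(
            new Tok("Hello", TokenType.WORD),
            new Tok(",", TokenType.PUNCTUATION),
            new Tok(" ", TokenType.WHITESPACE),
            new Tok("world", TokenType.WORD),
            new Tok("!", TokenType.PUNCTUATION));
    }

    public static void main(String[] args) {
        System.out.println(wordsOnly(sample())); // [Hello, world]
        System.out.println(rebuild(sample()));   // Hello, world!
    }
}
```

With the real library, the same filtering would presumably be done against the type carried by each Token; check the Token class documentation for the exact field or accessor.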
  • Nested Class Summary

    Nested classes/interfaces inherited from class java.lang.Enum

    Enum.EnumDesc<E extends Enum<E>>
  • Enum Constant Summary

    Enum Constants
    Enum Constant    Description
    INFIX            An INFIX corresponds to a sub-word "infix" token (i.e. from the middle).
    NGRAM            An NGRAM corresponds to a token that might correspond to a character ngram.
    PREFIX           A PREFIX corresponds to a sub-word word-prefix token.
    PUNCTUATION      A PUNCTUATION corresponds to tokens consisting of punctuation characters.
    SUFFIX           A SUFFIX corresponds to a sub-word word-suffix token.
    UNKNOWN          An UNKNOWN token corresponds to a token that is out-of-vocabulary or has never been seen before.
    WHITESPACE       Some tokenizers may produce tokens corresponding to whitespace (e.g. space, tab, newline).
    WORD             A WORD corresponds to a token that does not consist of or contain whitespace and may correspond to a regular "word" that could be looked up in a dictionary.
  • Method Summary

    Modifier and Type           Method                  Description
    static Token.TokenType      valueOf(String name)    Returns the enum constant of this class with the specified name.
    static Token.TokenType[]    values()                Returns an array containing the constants of this enum class, in the order they are declared.

    Methods inherited from class java.lang.Object

    getClass, notify, notifyAll, wait, wait, wait
  • Enum Constant Details

    • WORD

      public static final Token.TokenType WORD
      A WORD corresponds to a token that does not consist of or contain whitespace and may correspond to a regular "word" that could be looked up in a dictionary. Some tokenizers do not distinguish between different kinds of tokens and may use this as a default type for all generated tokens.
    • NGRAM

      public static final Token.TokenType NGRAM
      An NGRAM corresponds to a token that might correspond to a character ngram, i.e. some portion or sub-span of a regular word token.
    • PUNCTUATION

      public static final Token.TokenType PUNCTUATION
      A PUNCTUATION corresponds to tokens consisting of punctuation characters. In some applications, PUNCTUATION tokens may be treated differently because they may have less semantic content than regular word tokens.
    • WHITESPACE

      public static final Token.TokenType WHITESPACE
      Some tokenizers may produce tokens corresponding to whitespace (e.g. space, tab, newline). It may be important for consumers of tokens generated by a tokenizer to ignore or skip WHITESPACE tokens to avoid unexpected behavior.
    • PREFIX

      public static final Token.TokenType PREFIX
      Some tokenizers produce "sub-word" tokens. A PREFIX corresponds to a sub-word word-prefix token.
    • SUFFIX

      public static final Token.TokenType SUFFIX
      Some tokenizers produce "sub-word" tokens. A SUFFIX corresponds to a sub-word word-suffix token.
    • INFIX

      public static final Token.TokenType INFIX
      Some tokenizers produce "sub-word" tokens. An INFIX corresponds to a sub-word "infix" token (i.e. from the middle).
    • UNKNOWN

      public static final Token.TokenType UNKNOWN
      Some tokenizers may work in concert with vocabulary data. Some applications may treat out-of-vocabulary tokens differently than other tokens. An UNKNOWN token corresponds to a token that is out-of-vocabulary or has never been seen before.
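The constant descriptions above suggest that consumers often branch on the token type: skipping WHITESPACE, treating UNKNOWN specially, and grouping PREFIX, SUFFIX and INFIX as sub-word pieces. The following is a minimal sketch of such dispatch, not Tribuo code. The TokenType enum is a local stand-in that mirrors the constants on this page, and the "[UNK]" placeholder, the isSubWord helper and the normalize helper are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class TokenTypeDispatchSketch {
    // Stand-in for org.tribuo.util.tokens.Token.TokenType (assumption: same constant names).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // The three types this page describes as "sub-word" tokens.
    static final Set<TokenType> SUB_WORD =
        EnumSet.of(TokenType.PREFIX, TokenType.SUFFIX, TokenType.INFIX);

    // Hypothetical placeholder emitted for out-of-vocabulary tokens.
    static final String UNK_PLACEHOLDER = "[UNK]";

    static boolean isSubWord(TokenType type) {
        return SUB_WORD.contains(type);
    }

    // Drops WHITESPACE tokens, replaces UNKNOWN tokens with the placeholder,
    // and passes every other token text through unchanged.
    // texts and types are parallel lists of equal length.
    static List<String> normalize(List<String> texts, List<TokenType> types) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            switch (types.get(i)) {
                case WHITESPACE:
                    break; // skipped, as the WHITESPACE description recommends
                case UNKNOWN:
                    out.add(UNK_PLACEHOLDER);
                    break;
                default:
                    out.add(texts.get(i));
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> texts = List.of("un", "believ", "able", " ", "zxqv");
        List<TokenType> types = List.of(
            TokenType.PREFIX, TokenType.INFIX, TokenType.SUFFIX,
            TokenType.WHITESPACE, TokenType.UNKNOWN);
        System.out.println(normalize(texts, types));     // [un, believ, able, [UNK]]
        System.out.println(isSubWord(TokenType.INFIX));  // true
        System.out.println(isSubWord(TokenType.WORD));   // false
    }
}
```

An EnumSet is used for the sub-word group because it is a compact, type-safe way to test membership among enum constants; whether NGRAM should also count as sub-word is an application decision that this page does not settle.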
  • Method Details

    • values

      public static Token.TokenType[] values()
      Returns an array containing the constants of this enum class, in the order they are declared.
      Returns:
      an array containing the constants of this enum class, in the order they are declared
    • valueOf

      public static Token.TokenType valueOf(String name)
      Returns the enum constant of this class with the specified name. The string must match exactly an identifier used to declare an enum constant in this class. (Extraneous whitespace characters are not permitted.)
      Parameters:
      name - the name of the enum constant to be returned.
      Returns:
      the enum constant with the specified name
      Throws:
      IllegalArgumentException - if this enum class has no constant with the specified name
      NullPointerException - if the argument is null
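To make the exact-match rule of valueOf concrete, here is a small sketch using a local stand-in enum that mirrors the constants on this page, declared in the order they appear under Enum Constant Details (an assumption about the real declaration order). The parseLenient helper is hypothetical: it trims and upper-cases its input before calling valueOf, and returns a caller-supplied fallback instead of propagating the exceptions listed above.

```java
import java.util.Locale;

class TokenTypeLookupSketch {
    // Stand-in for org.tribuo.util.tokens.Token.TokenType (assumption: same names and order).
    enum TokenType { WORD, NGRAM, PUNCTUATION, WHITESPACE, PREFIX, SUFFIX, INFIX, UNKNOWN }

    // Hypothetical lenient lookup: tolerates null, surrounding whitespace and
    // lower-case input, none of which valueOf itself accepts.
    static TokenType parseLenient(String name, TokenType fallback) {
        if (name == null) {
            return fallback; // valueOf(null) would throw NullPointerException
        }
        try {
            return TokenType.valueOf(name.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return fallback; // no constant with that name
        }
    }

    // Returns true if a direct valueOf call rejects the given name.
    static boolean strictLookupFails(String name) {
        try {
            TokenType.valueOf(name);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(TokenType.values().length);                     // 8
        System.out.println(TokenType.values()[0]);                         // WORD
        System.out.println(TokenType.valueOf("NGRAM"));                    // NGRAM
        System.out.println(strictLookupFails("ngram"));                    // true
        System.out.println(strictLookupFails(" WORD "));                   // true
        System.out.println(parseLenient(" punctuation ", TokenType.WORD)); // PUNCTUATION
        System.out.println(parseLenient("emoji", TokenType.WORD));         // WORD
    }
}
```

Locale.ROOT is passed to toUpperCase so the conversion does not depend on the default locale of the JVM.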