Class UniversalTokenizer

java.lang.Object
org.tribuo.util.tokens.universal.UniversalTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class UniversalTokenizer extends Object implements Tokenizer
This class was originally written for the purpose of document indexing in an information retrieval context (principally used in Sun Labs' Minion search engine). It was refactored here to implement the Tokenizer interface, taking care that the 'ngram' tokens have correct character offsets. Correct offsets are typically not required in the document indexing context, but are essential in other kinds of text processing / NLP tasks.

This tokenizer has some specific behavior in how it handles "ngram" characters, i.e., those characters for which isNgram(char) returns true (CJK characters and others). For these characters it generates tokens corresponding to character bigrams in addition to tokens corresponding to character unigrams. Most of the other tokenizers generate tokens with non-overlapping spans, but here the character bigram tokens overlap with the character unigram tokens.

This tokenizer uses bigram tokenization whenever it encounters 'ngram' characters in the CJK range, among others (see isNgram(char)). Otherwise it tokenizes using punctuation and whitespace as separators between words. Within runs of 'ngram' characters the tokenizer generates tokens corresponding to each pair of adjacent characters in addition to tokens corresponding to each individual character, so a character bigram token may overlap with the previous and next tokens. An end-of-line between two 'ngram' characters is ignored (i.e., a character bigram token is still created).

For example, a sequence of three Chinese characters, 非常感, tokenizes as three WORD type tokens, 非, 常, and 感, plus two NGRAM type tokens, 非常 and 常感. These tokens carry character offsets that correspond to offsets into the original text. The tokens and their character offsets are listed below, followed by a short usage sketch:

  • 非[0,1]
  • 非常[0,2]
  • 常[1,2]
  • 常感[1,3]
  • 感[2,3]
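
For instance, the token stream above can be produced with the streaming advance() API documented in the method details below. This is a minimal sketch (the wrapper class name is illustrative); it should print the five tokens listed above with their [start,end) offsets and WORD/NGRAM types, though the exact interleaving of the overlapping tokens is up to the tokenizer.

    import org.tribuo.util.tokens.universal.UniversalTokenizer;

    public class UniversalTokenizerExample {
        public static void main(String[] args) {
            // No-arg constructor: punctuation tokens are not emitted.
            UniversalTokenizer tokenizer = new UniversalTokenizer();
            tokenizer.reset("非常感");
            while (tokenizer.advance()) {
                // Each token exposes its text, [start,end) offsets and type (WORD or NGRAM).
                System.out.printf("%s[%d,%d] %s%n",
                        tokenizer.getText(),
                        tokenizer.getStart(),
                        tokenizer.getEnd(),
                        tokenizer.getType());
            }
        }
    }
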
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected int
    maxTokenLength
    The length of the longest token that we will generate.
  • Constructor Summary

    Constructors
    Constructor
    Description
    UniversalTokenizer()
    Constructs a universal tokenizer which doesn't send punctuation.
    UniversalTokenizer(boolean sendPunct)
    Constructs a universal tokenizer.
  • Method Summary

    Modifier and Type
    Method
    Description
    protected void
    addChar()
    Add a character to the buffer that we're building for a token.
    final boolean
    advance()
    Advances the tokenizer to the next token.
    Tokenizer
    clone()
    Clones a tokenizer with its configuration.
    int
    getEnd()
    Gets the ending offset (exclusive) of the current token in the character sequence
    int
    getMaxTokenLength()
    Returns the maximum token length this tokenizer will generate.
    int
    getPos()
    Gets the current position in the input.
    com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance
    getProvenance()
    int
    getStart()
    Gets the starting character offset of the current token in the character sequence
    String
    getText()
    Gets the text of the current token, as a string
    Token.TokenType
    getType()
    Gets the type of the current token.
    protected void
    handleChar()
    Handle a character to add to the token buffer.
    static boolean
    isDigit(char c)
    A quick check for whether a character is a digit.
    boolean
    isGenerateNgrams()
    Does this tokenizer generate ngrams?
    boolean
    isGenerateUnigrams()
    Does this tokenizer generate unigrams?
    static boolean
    isLetterOrDigit(char c)
    A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends.
    static boolean
    isNgram(char c)
    A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai).
    static boolean
    isWhitespace(char c)
    A quick check for whether a character is whitespace.
    protected void
    makeTokens()
    Make one or more tokens from our current collected characters.
    void
    reset(CharSequence cs)
    Resets the state of the tokenizer to a clean slate.
    void
    setGenerateNgrams(boolean generateNgrams)
    Controls if the tokenizer generates ngrams.
    void
    setGenerateUnigrams(boolean generateUnigrams)
    Controls if the tokenizer generates unigrams.
    void
    setMaxTokenLength(int maxTokenLength)
    Sets the maximum token length this tokenizer will generate.

    Methods inherited from class java.lang.Object

    equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable

    postConfig

    Methods inherited from interface org.tribuo.util.tokens.Tokenizer

    getToken, split, tokenize
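
    Where the full token list is wanted in a single call, the inherited convenience methods can be used instead of the streaming advance() API. A small sketch, assuming the Tokenizer interface's default tokenize(CharSequence) returns the Token objects and split(CharSequence) returns just the token strings (the wrapper class name is illustrative):

      import java.util.List;

      import org.tribuo.util.tokens.Token;
      import org.tribuo.util.tokens.universal.UniversalTokenizer;

      public class TokenizeExample {
          public static void main(String[] args) {
              UniversalTokenizer tokenizer = new UniversalTokenizer(true);
              // tokenize returns full Token objects; split returns only the covered strings.
              List<Token> tokens = tokenizer.tokenize("Hello, world! 非常感");
              List<String> strings = tokenizer.split("Hello, world! 非常感");
              System.out.println(tokens.size() + " tokens: " + strings);
          }
      }
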
  • Field Details

    • maxTokenLength

      protected int maxTokenLength
      The length of the longest token that we will generate.
  • Constructor Details

    • UniversalTokenizer

      public UniversalTokenizer(boolean sendPunct)
      Constructs a universal tokenizer.
      Parameters:
      sendPunct - If true, the tokenizer will generate punctuation tokens.
    • UniversalTokenizer

      public UniversalTokenizer()
      Constructs a universal tokenizer which doesn't send punctuation.
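
      A minimal sketch contrasting the two constructors; with sendPunct set to true the comma and exclamation mark should come back as their own tokens, while the no-arg constructor drops them. The wrapper class name is illustrative, and split is the convenience method inherited from Tokenizer.

        import org.tribuo.util.tokens.universal.UniversalTokenizer;

        public class ConstructorExample {
            public static void main(String[] args) {
                // Default constructor: punctuation tokens are not generated.
                UniversalTokenizer dropsPunct = new UniversalTokenizer();
                // sendPunct = true: punctuation tokens are generated as well.
                UniversalTokenizer emitsPunct = new UniversalTokenizer(true);
                System.out.println(dropsPunct.split("Hello, world!"));
                System.out.println(emitsPunct.split("Hello, world!"));
            }
        }
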
  • Method Details

    • isLetterOrDigit

      public static boolean isLetterOrDigit(char c)
      A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends. An approximation of Character.isLetterOrDigit that is faster and, for tokenization purposes, more accurate, since it does not count smart quotes as letters.
      Parameters:
      c - The character to check.
      Returns:
      True if the input is a letter or digit.
    • isDigit

      public static boolean isDigit(char c)
      A quick check for whether a character is a digit.
      Parameters:
      c - The character to check
      Returns:
      True if the input is a digit.
    • isWhitespace

      public static boolean isWhitespace(char c)
      A quick check for whether a character is whitespace.
      Parameters:
      c - The character to check
      Returns:
      True if the input is a whitespace character.
    • isNgram

      public static boolean isNgram(char c)
      A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai). Uses Unicode Standard Version 2.0.
      Parameters:
      c - The character to check
      Returns:
      True if the input character is in a region which is not whitespace separated.
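
      A small sketch exercising the four static checks above; the commented values are what the descriptions imply (for example, smart quotes are not counted as letters and CJK characters are 'ngram' characters). The wrapper class name is illustrative.

        import org.tribuo.util.tokens.universal.UniversalTokenizer;

        public class CharCheckExample {
            public static void main(String[] args) {
                System.out.println(UniversalTokenizer.isLetterOrDigit('a'));      // true
                System.out.println(UniversalTokenizer.isLetterOrDigit('\u201C')); // false, a left smart quote is not a letter
                System.out.println(UniversalTokenizer.isDigit('7'));              // true
                System.out.println(UniversalTokenizer.isWhitespace('\t'));        // true
                System.out.println(UniversalTokenizer.isNgram('感'));             // true, a CJK character
                System.out.println(UniversalTokenizer.isNgram('a'));              // false
            }
        }
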
    • isGenerateUnigrams

      public boolean isGenerateUnigrams()
      Does this tokenizer generate unigrams?
      Returns:
      True if the tokenizer generates unigram tokens.
    • setGenerateUnigrams

      public void setGenerateUnigrams(boolean generateUnigrams)
      Controls if the tokenizer generates unigrams.
      Parameters:
      generateUnigrams - If true, generates unigram tokens.
    • isGenerateNgrams

      public boolean isGenerateNgrams()
      Does this tokenizer generate ngrams?
      Returns:
      True if the tokenizer generates ngram tokens.
    • setGenerateNgrams

      public void setGenerateNgrams(boolean generateNgrams)
      Controls if the tokenizer generates ngrams.
      Parameters:
      generateNgrams - If true, generates ngram tokens.
    • getMaxTokenLength

      public int getMaxTokenLength()
      Returns the maximum token length this tokenizer will generate.
      Returns:
      The maximum token length.
    • setMaxTokenLength

      public void setMaxTokenLength(int maxTokenLength)
      Sets the maximum token length this tokenizer will generate.
      Parameters:
      maxTokenLength - The maximum token length.
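
      A configuration sketch combining the getters and setters above, for example to keep the single-character WORD tokens while suppressing the overlapping character-bigram tokens (the wrapper class name is illustrative):

        import org.tribuo.util.tokens.universal.UniversalTokenizer;

        public class ConfigExample {
            public static void main(String[] args) {
                UniversalTokenizer tokenizer = new UniversalTokenizer(true);
                tokenizer.setGenerateUnigrams(true);  // keep single-character WORD tokens
                tokenizer.setGenerateNgrams(false);   // drop the overlapping character-bigram tokens
                tokenizer.setMaxTokenLength(64);      // cap the length of generated tokens
                System.out.println(tokenizer.isGenerateUnigrams()); // true
                System.out.println(tokenizer.isGenerateNgrams());   // false
                System.out.println(tokenizer.getMaxTokenLength());  // 64
            }
        }
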
    • getProvenance

      public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
    • advance

      public final boolean advance()
      Description copied from interface: Tokenizer
      Advances the tokenizer to the next token.
      Specified by:
      advance in interface Tokenizer
      Returns:
      true if there is such a token, false otherwise.
    • handleChar

      protected void handleChar()
      Handle a character to add to the token buffer.
    • addChar

      protected void addChar()
      Add a character to the buffer that we're building for a token.
    • getStart

      public int getStart()
      Description copied from interface: Tokenizer
      Gets the starting character offset of the current token in the character sequence
      Specified by:
      getStart in interface Tokenizer
      Returns:
      the starting character offset of the token
    • getEnd

      public int getEnd()
      Description copied from interface: Tokenizer
      Gets the ending offset (exclusive) of the current token in the character sequence
      Specified by:
      getEnd in interface Tokenizer
      Returns:
      the exclusive ending character offset for the current token.
    • getText

      public String getText()
      Description copied from interface: Tokenizer
      Gets the text of the current token, as a string
      Specified by:
      getText in interface Tokenizer
      Returns:
      the text of the current token
    • getType

      public Token.TokenType getType()
      Description copied from interface: Tokenizer
      Gets the type of the current token.
      Specified by:
      getType in interface Tokenizer
      Returns:
      the type of the current token.
    • getPos

      public int getPos()
      Gets the current position in the input.
      Returns:
      The current position.
    • clone

      public Tokenizer clone()
      Description copied from interface: Tokenizer
      Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
      Specified by:
      clone in interface Tokenizer
      Overrides:
      clone in class Object
      Returns:
      A tokenizer with the same configuration, but independent state.
    • reset

      public void reset(CharSequence cs)
      Resets the state of the tokenizer to a clean slate.
      Specified by:
      reset in interface Tokenizer
      Parameters:
      cs - a character sequence to tokenize
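
      A brief sketch combining clone() and reset(CharSequence): a cloned tokenizer carries the configuration but not the tokenization state, so it is reset on its own input before use. The wrapper class name and input strings are illustrative.

        import org.tribuo.util.tokens.Tokenizer;
        import org.tribuo.util.tokens.universal.UniversalTokenizer;

        public class CloneExample {
            public static void main(String[] args) {
                UniversalTokenizer original = new UniversalTokenizer();

                // The clone copies the configuration but has independent state.
                Tokenizer copy = original.clone();
                copy.reset("a different document");

                original.reset("the first document");
                while (original.advance()) {
                    System.out.println("original: " + original.getText());
                }
                while (copy.advance()) {
                    System.out.println("copy: " + copy.getText());
                }
            }
        }
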
    • makeTokens

      protected void makeTokens()
      Make one or more tokens from our current collected characters.