Class UniversalTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This tokenizer has some specific behavior in how it handles "ngram"
characters - i.e., those characters for which isNgram(char) returns
true (CJK characters and others). For these characters, it will generate
tokens corresponding to character bigrams in addition to tokens corresponding
to character unigrams. Most of the other tokenizers will generate tokens that
have no overlapping spans but here the character bigram tokens will overlap
with the character unigram tokens.
This tokenizer uses bigram tokenization whenever it encounters 'ngram'
characters in the CJK range (among others, see isNgram(char)). It
otherwise tokenizes using punctuation and whitespace separators to separate
words. Within runs of 'ngram' characters the tokenizer will generate tokens
corresponding to two adjacent characters in addition to tokens corresponding
to each character. The tokens corresponding to character bigrams may overlap
with the previous and next token. An end-of-line between two 'ngram'
characters is ignored (i.e., a character bigram token will still be created).
For example, a sequence of three Chinese characters, 非常感, would tokenize as three WORD type tokens (非, 常, and 感) and two NGRAM type tokens (非常 and 常感). These tokens carry character offsets into the text. The tokens and their character offsets are:
- 非[0,1]
- 非常[0,2]
- 常[1,2]
- 常感[1,3]
- 感[2,3]
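To make these offsets concrete, here is a minimal sketch of driving the tokenizer directly through its reset/advance cycle. The package name org.tribuo.util.tokens.universal and the printed WORD/NGRAM type names are assumptions based on this page; the emission order of the overlapping tokens is not specified above.

```java
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class NgramTokenizationSketch {
    public static void main(String[] args) {
        // The no-argument constructor does not generate punctuation tokens.
        UniversalTokenizer tokenizer = new UniversalTokenizer();

        // Point the tokenizer at the three-character CJK sequence from the example above.
        tokenizer.reset("非常感");

        // advance() moves to the next token and returns false once the input is exhausted.
        while (tokenizer.advance()) {
            // getStart() is inclusive and getEnd() is exclusive, matching the offsets listed above.
            System.out.printf("%s [%d,%d] %s%n",
                    tokenizer.getText(),
                    tokenizer.getStart(),
                    tokenizer.getEnd(),
                    tokenizer.getType());
        }
        // Expected tokens: 非 [0,1] WORD, 非常 [0,2] NGRAM, 常 [1,2] WORD,
        // 常感 [1,3] NGRAM, 感 [2,3] WORD (emission order may differ).
    }
}
```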
-
Field Summary
Fields
- protected int maxTokenLength: The length of the longest token that we will generate.
-
Constructor Summary
Constructors
- UniversalTokenizer(): Constructs a universal tokenizer which doesn't send punctuation.
- UniversalTokenizer(boolean sendPunct)
-
Method Summary
- protected void addChar(): Add a character to the buffer that we're building for a token.
- final boolean advance(): Advances the tokenizer to the next token.
- UniversalTokenizer clone(): Clones a tokenizer with its configuration.
- int getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence.
- int getMaxTokenLength()
- int getPos()
- com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- int getStart(): Gets the starting character offset of the current token in the character sequence.
- String getText(): Gets the text of the current token, as a string.
- Token.TokenType getType(): Gets the type of the current token.
- protected void handleChar(char c): Handle a character to add to the token buffer.
- static boolean isDigit(char c): A quick check for whether a character is a digit.
- boolean isGenerateNgrams()
- boolean isGenerateUnigrams()
- static boolean isLetterOrDigit(char c): A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends.
- static boolean isNgram(char c): A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai).
- static boolean isWhitespace(char c): A quick check for whether a character is whitespace.
- protected void makeTokens(): Make one or more tokens from our current collected characters.
- void reset(CharSequence cs): Reset state of tokenizer to clean slate.
- void setGenerateNgrams(boolean generateNgrams)
- void setGenerateUnigrams(boolean generateUnigrams)
- void setMaxTokenLength(int maxTokenLength)

Methods inherited from class java.lang.Object:
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable:
postConfig
-
Field Details
-
maxTokenLength
protected int maxTokenLength
The length of the longest token that we will generate.
-
-
Constructor Details
-
UniversalTokenizer
public UniversalTokenizer(boolean sendPunct)
- Parameters:
sendPunct - if sendPunct is true, then the tokenizer will generate punctuation tokens.
-
UniversalTokenizer
public UniversalTokenizer()
Constructs a universal tokenizer which doesn't send punctuation.
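A short sketch contrasting the two constructors; it assumes punctuation characters such as ',' and '!' surface as their own tokens when sendPunct is true (the specific token type they carry is not documented above), and the package name is assumed.

```java
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class PunctuationSketch {
    public static void main(String[] args) {
        // sendPunct == true: punctuation tokens are generated alongside the words.
        UniversalTokenizer withPunct = new UniversalTokenizer(true);
        withPunct.reset("Hello, world!");
        while (withPunct.advance()) {
            System.out.println(withPunct.getText() + " -> " + withPunct.getType());
        }
        // With the no-argument constructor the "," and "!" would not be emitted as tokens.
    }
}
```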
-
-
Method Details
-
isLetterOrDigit
public static boolean isLetterOrDigit(char c)
A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends. An approximation of Character.isLetterOrDigit, but it is faster and more correct, since it doesn't count smart quotes as letters.
- Parameters:
c - The character to check.
- Returns:
True if the input is a letter or digit.
-
isDigit
public static boolean isDigit(char c)
A quick check for whether a character is a digit.
- Parameters:
c - The character to check.
- Returns:
True if the input is a digit.
-
isWhitespace
public static boolean isWhitespace(char c)
A quick check for whether a character is whitespace.
- Parameters:
c - The character to check.
- Returns:
True if the input is a whitespace character.
-
isNgram
public static boolean isNgram(char c)
A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai). Uses Unicode Standard Version 2.0.
- Parameters:
c - The character to check.
- Returns:
True if the input character is in a region which is not whitespace separated.
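A quick sketch exercising the four static checks; the expected results in the comments follow the descriptions above (smart quotes are not letters, CJK characters are 'ngram' characters), and the class/package names are assumptions.

```java
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class CharCheckSketch {
    public static void main(String[] args) {
        System.out.println(UniversalTokenizer.isDigit('7'));              // true
        System.out.println(UniversalTokenizer.isWhitespace('\t'));        // true
        System.out.println(UniversalTokenizer.isLetterOrDigit('a'));      // true
        System.out.println(UniversalTokenizer.isLetterOrDigit('\u201C')); // false: a smart quote is not a letter
        System.out.println(UniversalTokenizer.isNgram('非'));             // true: CJK character
        System.out.println(UniversalTokenizer.isNgram('x'));              // false: Latin letter
    }
}
```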
-
isGenerateUnigrams
-
setGenerateUnigrams
-
isGenerateNgrams
-
setGenerateNgrams
-
getMaxTokenLength
-
setMaxTokenLength
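These accessors are undocumented above; the sketch below assumes the setters toggle whether single-character and character-bigram tokens are produced for 'ngram' runs, that the is* methods report the current flags, and that setMaxTokenLength bounds the longest token generated (per the maxTokenLength field description).

```java
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class ConfigSketch {
    public static void main(String[] args) {
        UniversalTokenizer tokenizer = new UniversalTokenizer();

        // Assumed behaviour: keep the overlapping character bigrams but drop the
        // single-character WORD tokens for runs of 'ngram' characters.
        tokenizer.setGenerateUnigrams(false);
        tokenizer.setGenerateNgrams(true);

        // Bounds the length of the longest token that will be generated.
        tokenizer.setMaxTokenLength(32);

        System.out.println(tokenizer.isGenerateUnigrams()); // false
        System.out.println(tokenizer.isGenerateNgrams());   // true
        System.out.println(tokenizer.getMaxTokenLength());  // 32
    }
}
```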
-
getProvenance
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-
advance
-
handleChar
Handle a character to add to the token buffer.
-
addChar
Add a character to the buffer that we're building for a token.
-
getStart
-
getEnd
-
getText
-
getType
Description copied from interface: Tokenizer
Gets the type of the current token.
-
getPos
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers do not process the same text as the original tokenizer and need to be reset with a fresh CharSequence.
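Per the note above, a clone carries the configuration but none of the processing state, so it has to be reset on its own CharSequence before use. A minimal sketch (the clone is declared as the Tokenizer interface type since the exact return type is not shown above, and the package names are assumptions):

```java
import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class CloneSketch {
    public static void main(String[] args) {
        UniversalTokenizer original = new UniversalTokenizer();
        original.reset("text for the original tokenizer");

        // The clone shares the configuration but not the text being processed.
        Tokenizer copy = original.clone();

        // It must be given its own CharSequence before advancing.
        copy.reset("different text for the copy");
        while (copy.advance()) {
            System.out.println(copy.getText());
        }
    }
}
```
-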
reset
Reset state of tokenizer to clean slate.
-
makeTokens
Make one or more tokens from our current collected characters.
-