Class UniversalTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This tokenizer has some specific behavior in how it handles "ngram"
characters, i.e., those characters for which isNgram(char) returns true
(CJK characters and others). For these characters, it generates tokens
corresponding to character bigrams in addition to tokens corresponding to
character unigrams. Most of the other tokenizers generate tokens with
non-overlapping spans, but here the character bigram tokens overlap with
the character unigram tokens.
This tokenizer uses bigram tokenization whenever it encounters 'ngram'
characters in the CJK range (among others; see isNgram(char)). It
otherwise tokenizes using punctuation and whitespace separators to separate
words. Within runs of 'ngram' characters the tokenizer will generate tokens
corresponding to two adjacent characters in addition to tokens corresponding
to each character. The tokens corresponding to character bigrams may overlap
with the previous and next token. An end-of-line between two 'ngram'
characters is ignored (i.e., a character bigram token will be created).
For example, a sequence of three Chinese characters, 非常感, would tokenize as three WORD type tokens (非, 常, and 感) and two NGRAM type tokens (非常 and 常感). The offsets of these tokens are character offsets into the text. Here are the tokens listed with their character offsets:
- 非[0,1]
- 非常[0,2]
- 常[1,2]
- 常感[1,3]
- 感[2,3]
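The overlapping unigram and bigram spans listed above can be reproduced with a small standalone sketch. It is not the UniversalTokenizer implementation: the class NgramRunSketch, its Span type, and tokenizeRun are hypothetical names, and the sketch assumes the input is a single run of 'ngram' characters with no separators. The example run is written with Unicode escapes so the source compiles under any encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NOT the UniversalTokenizer implementation.
class NgramRunSketch {

    /** Hypothetical value type: token text, [start, end) character offsets, and a type label. */
    static final class Span {
        final String text;
        final int start;
        final int end;
        final String type;

        Span(String text, int start, int end, String type) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + "]";
        }
    }

    /**
     * Emits one unigram per character and one bigram per adjacent pair.
     * Assumes the whole input is one run of 'ngram' characters.
     */
    static List<Span> tokenizeRun(String run) {
        List<Span> out = new ArrayList<>();
        for (int i = 0; i < run.length(); i++) {
            // Unigram covering the single character at position i.
            out.add(new Span(run.substring(i, i + 1), i, i + 1, "WORD"));
            // Bigram covering positions i and i + 1, when a next character exists.
            if (i + 1 < run.length()) {
                out.add(new Span(run.substring(i, i + 2), i, i + 2, "NGRAM"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The three-character example from the text, as Unicode escapes.
        for (Span s : tokenizeRun("\u975E\u5E38\u611F")) {
            System.out.println(s.type + " " + s);
        }
    }
}
```

Each bigram shares its start offset with the unigram emitted just before it, which is why the spans overlap rather than partition the text.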
-
Field Summary
- protected int maxTokenLength: The length of the longest token that we will generate.
-
Constructor Summary
- UniversalTokenizer(): Constructs a universal tokenizer which doesn't send punctuation.
- UniversalTokenizer(boolean sendPunct): Constructs a universal tokenizer.
-
Method Summary
- protected void addChar(): Add a character to the buffer that we're building for a token.
- final boolean advance(): Advances the tokenizer to the next token.
- clone(): Clones a tokenizer with its configuration.
- int getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence.
- int getMaxTokenLength(): Returns the maximum token length this tokenizer will generate.
- int getPos(): Gets the current position in the input.
- com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- int getStart(): Gets the starting character offset of the current token in the character sequence.
- getText(): Gets the text of the current token, as a string.
- getType(): Gets the type of the current token.
- protected void handleChar(): Handle a character to add to the token buffer.
- static boolean isDigit(char c): A quick check for whether a character is a digit.
- boolean isGenerateNgrams(): Does this tokenizer generate ngrams?
- boolean isGenerateUnigrams(): Does this tokenizer generate unigrams?
- static boolean isLetterOrDigit(char c): A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends.
- static boolean isNgram(char c): A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai).
- static boolean isWhitespace(char c): A quick check for whether a character is whitespace.
- protected void makeTokens(): Make one or more tokens from our current collected characters.
- void reset(CharSequence cs): Reset state of tokenizer to clean slate.
- void setGenerateNgrams(boolean generateNgrams): Controls if the tokenizer generates ngrams.
- void setGenerateUnigrams(boolean generateUnigrams): Controls if the tokenizer generates unigrams.
- void setMaxTokenLength(int maxTokenLength): Sets the maximum token length this tokenizer will generate.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
-
Field Details
-
maxTokenLength
protected int maxTokenLength
The length of the longest token that we will generate.
-
-
Constructor Details
-
UniversalTokenizer
public UniversalTokenizer(boolean sendPunct)
Constructs a universal tokenizer.
- Parameters:
sendPunct - if sendPunct is true, then the tokenizer will generate punctuation tokens.
-
UniversalTokenizer
public UniversalTokenizer()
Constructs a universal tokenizer which doesn't send punctuation.
-
-
Method Details
-
isLetterOrDigit
public static boolean isLetterOrDigit(char c)
A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends. An approximation of Character.isLetterOrDigit, but faster and more correct, since it doesn't count smart quotes as letters.
- Parameters:
c - The character to check.
- Returns:
True if the input is a letter or digit.
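The "kept in a word or removed if it occurs at one of the ends" idea can be sketched as edge trimming. This is a minimal illustration under assumptions, not the tokenizer's own logic: WordEdgeTrimSketch and trimEdges are hypothetical names, and the JDK's Character.isLetterOrDigit stands in for the UniversalTokenizer.isLetterOrDigit check, whose exact character ranges may differ.

```java
// Illustrative sketch only; Character.isLetterOrDigit is a stand-in for
// UniversalTokenizer.isLetterOrDigit, which may classify some characters differently.
class WordEdgeTrimSketch {

    /** Strips characters that are not letters or digits from both ends of a candidate word. */
    static String trimEdges(String word) {
        int start = 0;
        int end = word.length();
        // Drop leading characters that fail the check.
        while (start < end && !Character.isLetterOrDigit(word.charAt(start))) {
            start++;
        }
        // Drop trailing characters that fail the check.
        while (end > start && !Character.isLetterOrDigit(word.charAt(end - 1))) {
            end--;
        }
        // Characters in the interior (such as an apostrophe) are left untouched.
        return word.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(trimEdges("\u201Chello\u201D")); // curly quotes at both ends
        System.out.println(trimEdges("(don't)"));
    }
}
```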
-
isDigit
public static boolean isDigit(char c)
A quick check for whether a character is a digit.
- Parameters:
c - The character to check.
- Returns:
True if the input is a digit.
-
isWhitespace
public static boolean isWhitespace(char c)
A quick check for whether a character is whitespace.
- Parameters:
c - The character to check.
- Returns:
True if the input is a whitespace character.
-
isNgram
public static boolean isNgram(char c)
A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai). Uses Unicode Standard Version 2.0.
- Parameters:
c - The character to check.
- Returns:
True if the input character is in a region which is not whitespace separated.
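For readers who want a feel for this kind of check, the sketch below approximates it with the JDK's Character.UnicodeScript. It is an assumption-laden stand-in, not the isNgram implementation: the class and method names are hypothetical, the chosen set of scripts is a guess at "Arabic, CJK, and Thai", and the real method relies on its own Unicode 2.0 based ranges, so results may differ for some characters.

```java
// Illustrative sketch only; the script list below is an assumption and does not
// reproduce the character ranges used by UniversalTokenizer.isNgram.
class NgramScriptSketch {

    /** Returns true for characters in scripts that often do not separate words with whitespace. */
    static boolean looksUnsegmented(char c) {
        switch (Character.UnicodeScript.of(c)) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
            case ARABIC:
            case THAI:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksUnsegmented('\u975E')); // a Han character
        System.out.println(looksUnsegmented('a'));      // a Latin letter
    }
}
```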
-
isGenerateUnigrams
public boolean isGenerateUnigrams()
Does this tokenizer generate unigrams?
- Returns:
True if the tokenizer generates unigram tokens.
-
setGenerateUnigrams
public void setGenerateUnigrams(boolean generateUnigrams)
Controls if the tokenizer generates unigrams.
- Parameters:
generateUnigrams - If true generates unigram tokens.
-
isGenerateNgrams
public boolean isGenerateNgrams()
Does this tokenizer generate ngrams?
- Returns:
True if the tokenizer generates ngram tokens.
-
setGenerateNgrams
public void setGenerateNgrams(boolean generateNgrams)
Controls if the tokenizer generates ngrams.
- Parameters:
generateNgrams - If true generates ngram tokens.
-
getMaxTokenLength
public int getMaxTokenLength()
Returns the maximum token length this tokenizer will generate.
- Returns:
The maximum token length.
-
setMaxTokenLength
public void setMaxTokenLength(int maxTokenLength)
Sets the maximum token length this tokenizer will generate.
- Parameters:
maxTokenLength - The maximum token length.
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-
advance
public final boolean advance()
Description copied from interface: Tokenizer
Advances the tokenizer to the next token.
-
handleChar
protected void handleChar()
Handle a character to add to the token buffer.
-
addChar
protected void addChar()
Add a character to the buffer that we're building for a token.
-
getStart
public int getStart()
Description copied from interface: Tokenizer
Gets the starting character offset of the current token in the character sequence.
-
getEnd
public int getEnd()
Description copied from interface: Tokenizer
Gets the ending offset (exclusive) of the current token in the character sequence.
-
getText
Description copied from interface: Tokenizer
Gets the text of the current token, as a string.
-
getType
Description copied from interface: Tokenizer
Gets the type of the current token.
-
getPos
public int getPos()
Gets the current position in the input.
- Returns:
The current position.
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
-
reset
Resets the state of the tokenizer to a clean slate.
-
makeTokens
protected void makeTokens()
Make one or more tokens from our current collected characters.
-
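The reset, advance, getText, getStart, and getEnd methods described above suggest a cursor-style loop: reset onto a CharSequence, then advance until no token remains, reading the current token's accessors each time. The sketch below shows that loop on a hypothetical whitespace-only stand-in class, CursorTokenizerSketch, so it runs without the library. It assumes advance() returns false once the input is exhausted, and it does not reproduce any UniversalTokenizer behavior beyond the shape of the calls.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only; splits on whitespace and is NOT the UniversalTokenizer.
class CursorTokenizerSketch {
    private CharSequence cs = "";
    private int pos;   // current read position in the input
    private int start; // start offset of the current token (inclusive)
    private int end;   // end offset of the current token (exclusive)

    /** Points the tokenizer at new text and clears all cursor state. */
    void reset(CharSequence newCs) {
        cs = newCs;
        pos = 0;
        start = 0;
        end = 0;
    }

    /** Moves to the next token; returns false when the input is exhausted (assumed contract). */
    boolean advance() {
        while (pos < cs.length() && Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        if (pos >= cs.length()) {
            return false;
        }
        start = pos;
        while (pos < cs.length() && !Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        end = pos;
        return true;
    }

    String getText() { return cs.subSequence(start, end).toString(); }
    int getStart() { return start; }
    int getEnd() { return end; }

    /** The consumption loop: reset once, then read each token until advance() returns false. */
    static List<String> describe(CursorTokenizerSketch t, CharSequence text) {
        t.reset(text);
        List<String> out = new ArrayList<>();
        while (t.advance()) {
            out.add(t.getText() + "[" + t.getStart() + "," + t.getEnd() + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe(new CursorTokenizerSketch(), "hello  big world"));
    }
}
```

Because all cursor state lives in the instance, one instance handles one text at a time, which is consistent with the note on clone() that a cloned tokenizer must be reset with a fresh CharSequence before use.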