Class UniversalTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This tokenizer has some specific behavior in how it handles "ngram"
characters - i.e., those characters for which isNgram(char) returns
true (CJK characters and others). For these characters, it will generate
tokens corresponding to character bigrams in addition to tokens corresponding
to character unigrams. Most of the other tokenizers will generate tokens that
have no overlapping spans but here the character bigram tokens will overlap
with the character unigram tokens.
This tokenizer uses bigram tokenization whenever it encounters 'ngram'
characters in the CJK range (among others, see isNgram(char)). It
otherwise tokenizes using punctuation and whitespace separators to separate
words. Within runs of 'ngram' characters the tokenizer will generate tokens
corresponding to two adjacent characters in addition to tokens corresponding
to each character. The tokens corresponding to character bigrams may overlap
with the previous and next token. An end-of-line between two 'ngram'
characters is ignored (i.e., a character bigram token will still be created).
For example, a sequence of three Chinese characters, 非常感, would tokenize as three WORD type tokens (非, 常, and 感) and two NGRAM type tokens (非常 and 常感). These tokens carry character offsets into the text. The tokens and their character offsets are:
- 非[0,1]
- 非常[0,2]
- 常[1,2]
- 常感[1,3]
- 感[2,3]
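To make these offsets concrete, here is a minimal sketch of driving the tokenizer directly through its reset/advance cycle. The package name org.tribuo.util.tokens.universal and the printed WORD/NGRAM type names are assumptions based on this page; the emission order of the overlapping tokens is not specified above.

```java
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class NgramTokenizationSketch {
    public static void main(String[] args) {
        // The no-argument constructor does not generate punctuation tokens.
        UniversalTokenizer tokenizer = new UniversalTokenizer();

        // Point the tokenizer at the three-character CJK sequence from the example above.
        tokenizer.reset("非常感");

        // advance() moves to the next token and returns false once the input is exhausted.
        while (tokenizer.advance()) {
            // getStart() is inclusive and getEnd() is exclusive, matching the offsets listed above.
            System.out.printf("%s [%d,%d] %s%n",
                    tokenizer.getText(),
                    tokenizer.getStart(),
                    tokenizer.getEnd(),
                    tokenizer.getType());
        }
        // Expected tokens: 非 [0,1] WORD, 非常 [0,2] NGRAM, 常 [1,2] WORD,
        // 常感 [1,3] NGRAM, 感 [2,3] WORD (emission order may differ).
    }
}
```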
-
Field Summary
Fields
- protected int maxTokenLength: The length of the longest token that we will generate.
-
Constructor Summary
Constructors
- UniversalTokenizer(): Constructs a universal tokenizer which doesn't send punctuation.
- UniversalTokenizer(boolean sendPunct)
-
Method Summary
- protected void addChar(): Add a character to the buffer that we're building for a token.
- final boolean advance(): Advances the tokenizer to the next token.
- UniversalTokenizer clone(): Clones a tokenizer with its configuration.
- int getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence.
- int getMaxTokenLength()
- int getPos()
- com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- int getStart(): Gets the starting character offset of the current token in the character sequence.
- String getText(): Gets the text of the current token, as a string.
- Token.TokenType getType(): Gets the type of the current token.
- protected void handleChar(char c): Handle a character to add to the token buffer.
- static boolean isDigit(char c): A quick check for whether a character is a digit.
- boolean isGenerateNgrams()
- boolean isGenerateUnigrams()
- static boolean isLetterOrDigit(char c): A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends.
- static boolean isNgram(char c): A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai).
- static boolean isWhitespace(char c): A quick check for whether a character is whitespace.
- protected void makeTokens(): Make one or more tokens from our current collected characters.
- void reset(CharSequence cs): Reset state of tokenizer to clean slate.
- void setGenerateNgrams(boolean generateNgrams)
- void setGenerateUnigrams(boolean generateUnigrams)
- void setMaxTokenLength(int maxTokenLength)

Methods inherited from class java.lang.Object:
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable:
postConfig
-
Field Details
-
maxTokenLength
protected int maxTokenLength
The length of the longest token that we will generate.
-
-
Constructor Details
-
UniversalTokenizer
public UniversalTokenizer(boolean sendPunct)
- Parameters:
sendPunct - if sendPunct is true, then the tokenizer will generate punctuation tokens.
-
UniversalTokenizer
public UniversalTokenizer()
Constructs a universal tokenizer which doesn't send punctuation.
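A short sketch contrasting the two constructors; it assumes punctuation characters such as ',' and '!' surface as their own tokens when sendPunct is true (the specific token type they carry is not documented above), and the package name is assumed.

```java
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class PunctuationSketch {
    public static void main(String[] args) {
        // sendPunct == true: punctuation tokens are generated alongside the words.
        UniversalTokenizer withPunct = new UniversalTokenizer(true);
        withPunct.reset("Hello, world!");
        while (withPunct.advance()) {
            System.out.println(withPunct.getText() + " -> " + withPunct.getType());
        }
        // With the no-argument constructor the "," and "!" would not be emitted as tokens.
    }
}
```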
-
-
Method Details
-
isLetterOrDigit
public static boolean isLetterOrDigit(char c)
A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends. An approximation of Character.isLetterOrDigit, but it is faster and more correct, since it doesn't count smart quotes as letters.
- Parameters:
c - The character to check.
- Returns:
True if the input is a letter or digit.
-
isDigit
public static boolean isDigit(char c)
A quick check for whether a character is a digit.
- Parameters:
c - The character to check.
- Returns:
True if the input is a digit.
-
isWhitespace
public static boolean isWhitespace(char c)
A quick check for whether a character is whitespace.
- Parameters:
c - The character to check.
- Returns:
True if the input is a whitespace character.
-
isNgram
public static boolean isNgram(char c)
A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai). Uses Unicode Standard Version 2.0.
- Parameters:
c - The character to check.
- Returns:
True if the input character is in a region which is not whitespace separated.
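A quick sketch exercising the four static checks; the expected results in the comments follow the descriptions above (smart quotes are not letters, CJK characters are 'ngram' characters), and the class/package names are assumptions.

```java
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class CharCheckSketch {
    public static void main(String[] args) {
        System.out.println(UniversalTokenizer.isDigit('7'));              // true
        System.out.println(UniversalTokenizer.isWhitespace('\t'));        // true
        System.out.println(UniversalTokenizer.isLetterOrDigit('a'));      // true
        System.out.println(UniversalTokenizer.isLetterOrDigit('\u201C')); // false: a smart quote is not a letter
        System.out.println(UniversalTokenizer.isNgram('非'));             // true: CJK character
        System.out.println(UniversalTokenizer.isNgram('x'));              // false: Latin letter
    }
}
```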
-
isGenerateUnigrams
-
setGenerateUnigrams
-
isGenerateNgrams
-
setGenerateNgrams
-
getMaxTokenLength
-
setMaxTokenLength
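These accessors are undocumented above; the sketch below assumes the setters toggle whether single-character and character-bigram tokens are produced for 'ngram' runs, that the is* methods report the current flags, and that setMaxTokenLength bounds the longest token generated (per the maxTokenLength field description).

```java
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class ConfigSketch {
    public static void main(String[] args) {
        UniversalTokenizer tokenizer = new UniversalTokenizer();

        // Assumed behaviour: keep the overlapping character bigrams but drop the
        // single-character WORD tokens for runs of 'ngram' characters.
        tokenizer.setGenerateUnigrams(false);
        tokenizer.setGenerateNgrams(true);

        // Bounds the length of the longest token that will be generated.
        tokenizer.setMaxTokenLength(32);

        System.out.println(tokenizer.isGenerateUnigrams()); // false
        System.out.println(tokenizer.isGenerateNgrams());   // true
        System.out.println(tokenizer.getMaxTokenLength());  // 32
    }
}
```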
-
getProvenance
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-
advance
-
handleChar
Handle a character to add to the token buffer.
-
addChar
Add a character to the buffer that we're building for a token.
-
getStart
-
getEnd
-
getText
-
getType
Description copied from interface: Tokenizer
Gets the type of the current token.
-
getPos
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers do not process the same text as the original tokenizer and need to be reset with a fresh CharSequence.
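Per the note above, a clone carries the configuration but none of the processing state, so it has to be reset on its own CharSequence before use. A minimal sketch (the clone is declared as the Tokenizer interface type since the exact return type is not shown above, and the package names are assumptions):

```java
import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.universal.UniversalTokenizer;

public class CloneSketch {
    public static void main(String[] args) {
        UniversalTokenizer original = new UniversalTokenizer();
        original.reset("text for the original tokenizer");

        // The clone shares the configuration but not the text being processed.
        Tokenizer copy = original.clone();

        // It must be given its own CharSequence before advancing.
        copy.reset("different text for the copy");
        while (copy.advance()) {
            System.out.println(copy.getText());
        }
    }
}
```
-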
reset
Reset state of tokenizer to clean slate.
-
makeTokens
Make one or more tokens from our current collected characters.
-