public class UniversalTokenizer extends Object implements Tokenizer
This tokenizer has some specific behavior in how it handles "ngram"
characters, i.e., those characters for which isNgram(char) returns true
(CJK characters and others). For these characters, it generates tokens
corresponding to character bigrams in addition to tokens corresponding to
character unigrams. Most of the other tokenizers generate tokens with
non-overlapping spans, but here the character bigram tokens overlap with
the character unigram tokens.
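As a rough illustration of what a character check of this kind might look like, the sketch below classifies a character by its Unicode script using java.lang.Character.UnicodeScript. The class name NgramScriptCheck, the method name isNgramLike, and the chosen set of scripts are assumptions made for this example; the actual isNgram(char) may use different character ranges.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

/** Illustrative stand-in only; this is NOT the implementation of isNgram(char). */
final class NgramScriptCheck {
    // Assumed script set: CJK scripts plus Thai and Arabic.
    // The real tokenizer may cover a different set of ranges.
    private static final Set<UnicodeScript> NGRAM_SCRIPTS = EnumSet.of(
            UnicodeScript.HAN,
            UnicodeScript.HIRAGANA,
            UnicodeScript.KATAKANA,
            UnicodeScript.HANGUL,
            UnicodeScript.THAI,
            UnicodeScript.ARABIC);

    /** Returns true if the character belongs to one of the assumed scripts. */
    static boolean isNgramLike(char c) {
        return NGRAM_SCRIPTS.contains(UnicodeScript.of(c));
    }

    public static void main(String[] args) {
        System.out.println(isNgramLike('\u975E')); // a Han character
        System.out.println(isNgramLike('a'));      // a Latin letter
    }
}
```

A script-based test is easy to read but is coarser than hand-tuned code point ranges, so treat it as a starting point rather than a drop-in replacement.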
This tokenizer uses bigram tokenization whenever it encounters 'ngram'
characters in the CJK range (among others; see isNgram(char)). It
otherwise tokenizes using punctuation and whitespace separators to separate
words. Within runs of 'ngram' characters the tokenizer generates tokens
corresponding to two adjacent characters in addition to tokens corresponding
to each character. The tokens corresponding to character bigrams may overlap
with the previous and next token. An end-of-line between two 'ngram'
characters is ignored (i.e., a character bigram token is still created
across the line break).
For example, a sequence of three Chinese characters, 非常感, would tokenize as three WORD type tokens (非, 常, and 感) and two NGRAM type tokens (非常 and 常感). The tokens carry character offsets into the text. Listed as [start, end) with the end offset exclusive, they are: 非 [0,1), 非常 [0,2), 常 [1,2), 常感 [1,3), and 感 [2,3).
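The overlap between unigram and bigram spans can be reproduced with a small self-contained sketch. It is not the UniversalTokenizer implementation: the class BigramSketch, the holder class Tok, the method tokenizeRun, and the emission order (unigram first, then the bigram starting at the same offset) are all assumptions for illustration, and it assumes every character fits in a single Java char.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; NOT the UniversalTokenizer implementation. */
final class BigramSketch {
    /** Hypothetical token holder: start is inclusive, end is exclusive. */
    static final class Tok {
        final String text;
        final String kind; // "WORD" for unigrams, "NGRAM" for bigrams
        final int start;
        final int end;

        Tok(String text, String kind, int start, int end) {
            this.text = text;
            this.kind = kind;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "/" + kind + "[" + start + "," + end + ")";
        }
    }

    /**
     * Emits one unigram per character and one bigram per adjacent pair of
     * characters in the run. baseOffset is the offset of the run in the text.
     */
    static List<Tok> tokenizeRun(String run, int baseOffset) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < run.length(); i++) {
            int start = baseOffset + i;
            out.add(new Tok(run.substring(i, i + 1), "WORD", start, start + 1));
            if (i + 1 < run.length()) {
                out.add(new Tok(run.substring(i, i + 2), "NGRAM", start, start + 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The three-character example from the text, written with escapes.
        for (Tok t : tokenizeRun("\u975E\u5E38\u611F", 0)) {
            System.out.println(t);
        }
    }
}
```

For a run of n characters this yields n unigrams and n - 1 bigrams, so the three-character example produces five tokens, each bigram sharing its start offset with one unigram and its end offset with the next.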
Modifier and Type | Field and Description |
---|---|
protected int | maxTokenLength: The length of the longest token that we will generate. |
Constructor and Description |
---|
UniversalTokenizer(): Constructs a universal tokenizer which doesn't send punctuation. |
UniversalTokenizer(boolean sendPunct) |
Modifier and Type | Method and Description |
---|---|
protected void | addChar(): Add a character to the buffer that we're building for a token. |
boolean | advance(): Advances the tokenizer to the next token. |
Tokenizer | clone(): Clones a tokenizer with its configuration. |
int | getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence. |
int | getMaxTokenLength() |
int | getPos() |
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
int | getStart(): Gets the starting character offset of the current token in the character sequence. |
String | getText(): Gets the text of the current token, as a string. |
Token.TokenType | getType(): Gets the type of the current token. |
protected void | handleChar(): Handle a character to add to the token buffer. |
static boolean | isDigit(char c): A quick check for whether a character is a digit. |
boolean | isGenerateNgrams() |
boolean | isGenerateUnigrams() |
static boolean | isLetterOrDigit(char c): A quick check for whether a character should be kept in a word or should be removed from the word if it occurs at one of the ends. |
static boolean | isNgram(char c): A quick check for a character in a language that may not separate words with whitespace (includes Arabic, CJK, and Thai). |
static boolean | isWhitespace(char c): A quick check for whether a character is whitespace. |
protected void | makeTokens(): Make one or more tokens from our current collected characters. |
void | reset(CharSequence cs): Reset state of tokenizer to clean slate. |
void | setGenerateNgrams(boolean generateNgrams) |
void | setGenerateUnigrams(boolean generateUnigrams) |
void | setMaxTokenLength(int maxTokenLength) |
Methods inherited from class Object: equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Tokenizer: createSupplier, createThreadLocal, getToken, split, tokenize
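The method names in the summary suggest a cursor-style protocol: reset the tokenizer on some text, call advance() until it returns false, and read the current token through the getters after each successful advance. The sketch below imitates that protocol with a hypothetical whitespace-only class, WhitespaceCursor, and a hypothetical helper, drain; it borrows the method names from the summary but is not the library class and does none of the ngram or punctuation handling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that mimics the reset/advance/get* protocol; NOT the library class. */
final class WhitespaceCursor {
    private CharSequence cs = "";
    private int pos = 0;   // next character to examine
    private int start = 0; // start of the current token (inclusive)
    private int end = 0;   // end of the current token (exclusive)

    /** Points the cursor at new text and clears all state. */
    void reset(CharSequence cs) {
        this.cs = cs;
        this.pos = 0;
        this.start = 0;
        this.end = 0;
    }

    /** Moves to the next whitespace-delimited token; returns false when none remain. */
    boolean advance() {
        while (pos < cs.length() && Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        if (pos >= cs.length()) {
            return false;
        }
        start = pos;
        while (pos < cs.length() && !Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        end = pos;
        return true;
    }

    int getStart() { return start; }

    int getEnd() { return end; }

    String getText() { return cs.subSequence(start, end).toString(); }

    /** Runs the reset/advance loop and collects "text[start,end)" strings. */
    static List<String> drain(WhitespaceCursor t, CharSequence text) {
        List<String> out = new ArrayList<>();
        t.reset(text);
        while (t.advance()) {
            out.add(t.getText() + "[" + t.getStart() + "," + t.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(new WhitespaceCursor(), "hello  big world"));
    }
}
```

Because reset clears all state, one instance can be reused across many inputs, which is presumably the reason the interface exposes reset rather than requiring a new tokenizer per document.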
protected int maxTokenLength
public UniversalTokenizer(boolean sendPunct)
sendPunct - if sendPunct is true, then the tokenizer will generate punctuation tokens.
public UniversalTokenizer()
public static boolean isLetterOrDigit(char c)
c - The character to check.
public static boolean isDigit(char c)
c - The character to check.
public static boolean isWhitespace(char c)
c - The character to check.
public static boolean isNgram(char c)
c - The character to check.
public boolean isGenerateUnigrams()
public void setGenerateUnigrams(boolean generateUnigrams)
public boolean isGenerateNgrams()
public void setGenerateNgrams(boolean generateNgrams)
public int getMaxTokenLength()
public void setMaxTokenLength(int maxTokenLength)
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
Specified by: getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
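public final boolean advance()
Specified by: advance in interface Tokenizer
protected void handleChar()
protected void addChar()
public int getStart()
Specified by: getStart in interface Tokenizer
public int getEnd()
Specified by: getEnd in interface Tokenizer
public String getText()
Specified by: getText in interface Tokenizer
public Token.TokenType getType()
Specified by: getType in interface Tokenizer
public int getPos()
public Tokenizer clone()
Specified by: clone in interface Tokenizer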
public void reset(CharSequence cs)
protected void makeTokens()
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.