public class WordpieceBasicTokenizer extends SplitFunctionTokenizer
This tokenizer is used together with WordpieceTokenizer and
implements much of the functionality of the 'BasicTokenizer'
implementation in Hugging Face. One minor difference in this implementation is
that there is no set of "never_split" tokens used here. Those are handled by
WordpieceTokenizer.

Nested classes inherited from class SplitFunctionTokenizer:
SplitFunctionTokenizer.SplitFunction, SplitFunctionTokenizer.SplitResult, SplitFunctionTokenizer.SplitType

Fields inherited from class SplitFunctionTokenizer:
splitFunction
Constructor and Description |
---|
WordpieceBasicTokenizer() Constructs a default tokenizer which tokenizes Chinese characters. |
WordpieceBasicTokenizer(boolean tokenizeChineseChars) Constructs a tokenizer. |
Modifier and Type | Method and Description |
---|---|
WordpieceBasicTokenizer | clone() Clones a tokenizer with its configuration. |
static SplitFunctionTokenizer.SplitFunction | createSplitFunction(boolean tokenizeChineseChars) Creates a SplitFunction that is used by the super class SplitFunctionTokenizer to determine how and where the tokenizer splits the input. |
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
static boolean | isChinese(int codepoint) Determines if the provided codepoint is a Chinese character or not. |
static boolean | isControl(int codepoint) Determines if the provided codepoint is a control character or not. |
static boolean | isPunctuation(int codepoint) Determines if the input code point should be considered a character that is punctuation. |
void | postConfig() Used by the OLCUT configuration system, and should not be called by external code. |
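Methods inherited from class SplitFunctionTokenizer: advance, getEnd, getStart, getText, getType, reset
Methods inherited from class java.lang.Object: equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Tokenizer: createSupplier, createThreadLocal, getToken, split, tokenize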
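The per-codepoint split decision that createSplitFunction describes can be illustrated with a small, self-contained sketch. This is not Tribuo's source code: the names `SplitSketch`, `Decision`, `Splitter` and `createSplitter` are hypothetical stand-ins for the library's types, and the punctuation and Chinese checks are deliberately simplified assumptions. It only shows how a function that returns one decision per codepoint can drive tokenization, and what the `tokenizeChineseChars` flag changes.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** Hypothetical stand-in for a split decision; not Tribuo's SplitResult. */
    public enum Decision { NO_SPLIT, SPLIT_AT, SPLIT_BEFORE_AND_AFTER }

    /** Hypothetical stand-in for a split function: one decision per codepoint. */
    public interface Splitter {
        Decision apply(int codepoint);
    }

    /** Simplified assumption: printable ASCII that is neither a letter nor a digit. */
    public static boolean isPunct(int cp) {
        return cp > 32 && cp < 127 && !Character.isLetterOrDigit(cp);
    }

    /** Builds a splitter; the flag controls whether Han characters become single tokens. */
    public static Splitter createSplitter(boolean tokenizeChineseChars) {
        return cp -> {
            if (Character.isWhitespace(cp)) {
                return Decision.SPLIT_AT; // whitespace ends a token and is dropped
            }
            if (isPunct(cp)) {
                return Decision.SPLIT_BEFORE_AND_AFTER; // punctuation is its own token
            }
            if (tokenizeChineseChars
                    && Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                return Decision.SPLIT_BEFORE_AND_AFTER; // each Han character is its own token
            }
            return Decision.NO_SPLIT;
        };
    }

    /** Walks the text by codepoint and applies the splitter's decisions. */
    public static List<String> tokenize(String text, Splitter splitter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (splitter.apply(cp)) {
                case NO_SPLIT:
                    current.appendCodePoint(cp);
                    break;
                case SPLIT_AT:
                    flush(current, tokens);
                    break;
                case SPLIT_BEFORE_AND_AFTER:
                    flush(current, tokens);
                    tokens.add(new String(Character.toChars(cp)));
                    break;
            }
        }
        flush(current, tokens);
        return tokens;
    }

    /** Emits the pending token, if any, and clears the buffer. */
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String text = "Hello, \u4e2d\u6587 world!";
        // Flag on: the two Han characters come out as separate tokens.
        System.out.println(tokenize(text, createSplitter(true)));
        // Flag off: the two Han characters stay together as one token.
        System.out.println(tokenize(text, createSplitter(false)));
    }
}
```

In this sketch the flag only affects runs of Han characters; whitespace and punctuation are handled the same way in both modes.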
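public WordpieceBasicTokenizer()
Constructs a default tokenizer which tokenizes Chinese characters.

public WordpieceBasicTokenizer(boolean tokenizeChineseChars)
Constructs a tokenizer.
Parameters:
tokenizeChineseChars - Should the Chinese characters be split into individual tokens.

public static SplitFunctionTokenizer.SplitFunction createSplitFunction(boolean tokenizeChineseChars)
Creates a SplitFunction that is used by the super class SplitFunctionTokenizer to determine how and where the tokenizer splits the input.
Parameters:
tokenizeChineseChars - split Chinese characters into separate tokens?

public static boolean isPunctuation(int codepoint)
Determines if the input code point should be considered a character that is punctuation. See Character.getType(int).
Parameters:
codepoint - The codepoint to check.

public static boolean isChinese(int codepoint)
Determines if the provided codepoint is a Chinese character or not.
Parameters:
codepoint - a codepoint

public static boolean isControl(int codepoint)
Determines if the provided codepoint is a control character or not.
Parameters:
codepoint - The codepoint to check.

public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.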
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()

public WordpieceBasicTokenizer clone()
Clones a tokenizer with its configuration.
Specified by:
clone in interface Tokenizer
Overrides:
clone in class SplitFunctionTokenizer
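The three static codepoint checks documented above (isPunctuation, isChinese, isControl) can be approximated in plain Java. The sketch below is an assumption modelled on the rules used by the Hugging Face 'BasicTokenizer', not a copy of this class's implementation: the class name `CodepointChecks` is hypothetical, the CJK ranges are the ones that tokenizer uses, and the exact set of categories this library treats as control characters may differ.

```java
final class CodepointChecks {
    private CodepointChecks() { }

    /** ASCII symbols plus any codepoint in a Unicode punctuation category. */
    public static boolean isPunctuation(int cp) {
        // ASCII symbols such as '$' or '^' are not Unicode punctuation,
        // but BERT-style tokenizers split on them anyway.
        if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
                || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** CJK ideograph blocks (assumed ranges, taken from the BERT basic tokenizer). */
    public static boolean isChinese(int cp) {
        return (cp >= 0x4E00 && cp <= 0x9FFF)
                || (cp >= 0x3400 && cp <= 0x4DBF)
                || (cp >= 0x20000 && cp <= 0x2A6DF)
                || (cp >= 0x2A700 && cp <= 0x2B73F)
                || (cp >= 0x2B740 && cp <= 0x2B81F)
                || (cp >= 0x2B820 && cp <= 0x2CEAF)
                || (cp >= 0xF900 && cp <= 0xFAFF)
                || (cp >= 0x2F800 && cp <= 0x2FA1F);
    }

    /** Control and format characters, excluding tab, newline and carriage return. */
    public static boolean isControl(int cp) {
        if (cp == '\t' || cp == '\n' || cp == '\r') {
            return false; // treated as whitespace rather than control
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL || type == Character.FORMAT;
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation('!'));   // true
        System.out.println(isChinese(0x4E2D));    // true
        System.out.println(isControl(0x0007));    // true
    }
}
```

Because the range check in `isChinese` covers only ideograph blocks, kana and Hangul codepoints return false in this sketch.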
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.