Class WordpieceBasicTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This tokenizer is used "upstream" of WordpieceTokenizer and implements much of the functionality of the 'BasicTokenizer' implementation in Hugging Face. One minor difference is that this implementation has no set of "never_split" tokens. Those are handled by WordpieceTokenizer.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
SplitFunctionTokenizer.SplitFunction, SplitFunctionTokenizer.SplitResult, SplitFunctionTokenizer.SplitType
-
Field Summary
Fields inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
splitFunction
-
Constructor Summary
Constructor: Description
WordpieceBasicTokenizer(): Constructs a default tokenizer which tokenizes Chinese characters.
WordpieceBasicTokenizer(boolean tokenizeChineseChars): Constructs a tokenizer.
-
Method Summary
Modifier and Type, Method: Description
clone(): Clones a tokenizer with its configuration.
static SplitFunctionTokenizer.SplitFunction createSplitFunction(boolean tokenizeChineseChars): Creates a SplitFunctionTokenizer.SplitFunction that is used by the superclass SplitFunctionTokenizer to determine how and where the tokenizer splits the input.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
static boolean isChinese(int codepoint): Determines if the provided codepoint is a Chinese character or not.
static boolean isControl(int codepoint): Determines if the provided codepoint is a control character or not.
static boolean isPunctuation(int codepoint): Determines if the input code point should be considered a character that is punctuation.
void postConfig(): Used by the OLCUT configuration system, and should not be called by external code.
Methods inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
advance, getEnd, getStart, getText, getType, reset
-
Constructor Details
-
WordpieceBasicTokenizer
public WordpieceBasicTokenizer()
Constructs a default tokenizer which tokenizes Chinese characters.
-
WordpieceBasicTokenizer
public WordpieceBasicTokenizer(boolean tokenizeChineseChars)
Constructs a tokenizer.
- Parameters:
tokenizeChineseChars - Whether Chinese characters should be split into individual tokens.
-
-
Method Details
-
createSplitFunction
public static SplitFunctionTokenizer.SplitFunction createSplitFunction(boolean tokenizeChineseChars)
Creates a SplitFunctionTokenizer.SplitFunction that is used by the superclass SplitFunctionTokenizer to determine how and where the tokenizer splits the input.
- Parameters:
tokenizeChineseChars - Whether to split Chinese characters into separate tokens.
- Returns:
The splitting function.
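To illustrate the kind of per-codepoint decision a split function makes, here is a minimal sketch that uses only the JDK. It is not Tribuo's implementation: the class `SplitSketch`, the enum `Action`, and both method names are hypothetical, and the punctuation test is simplified to the Unicode punctuation categories reported by `Character.getType(int)`.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /** Hypothetical split decisions; Tribuo's SplitType values may differ. */
    public enum Action { KEEP, DROP, ISOLATE }

    /** Classifies one codepoint: whitespace is dropped, punctuation becomes its own token. */
    public static Action classify(int cp) {
        if (Character.isWhitespace(cp)) {
            return Action.DROP;
        }
        int type = Character.getType(cp);
        boolean punct = type == Character.DASH_PUNCTUATION
                || type == Character.START_PUNCTUATION
                || type == Character.END_PUNCTUATION
                || type == Character.CONNECTOR_PUNCTUATION
                || type == Character.OTHER_PUNCTUATION
                || type == Character.INITIAL_QUOTE_PUNCTUATION
                || type == Character.FINAL_QUOTE_PUNCTUATION;
        return punct ? Action.ISOLATE : Action.KEEP;
    }

    /** Walks the text by codepoint and emits tokens according to classify. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Action action = classify(cp);
            if (action == Action.KEEP) {
                current.appendCodePoint(cp);
                continue;
            }
            // Any split closes the token currently being built.
            if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (action == Action.ISOLATE) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!"));
    }
}
```

Keeping the decision in a function of a single codepoint means the loop that assembles tokens does not need to know which characters trigger a split.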
-
isPunctuation
public static boolean isPunctuation(int codepoint)
Determines if the input code point should be considered a character that is punctuation. This will return true for all ASCII characters that are not letters or digits and for any character whose Character type is defined as punctuation. See Character.getType(int).
- Parameters:
codepoint - The codepoint to check.
- Returns:
True if the codepoint is punctuation, false otherwise.
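The rule described above can be sketched with the JDK alone. This is an illustration, not Tribuo's source: the class name `PunctuationSketch` is hypothetical, and it assumes the ASCII part of the rule covers the printable symbol ranges 33-47, 58-64, 91-96, and 123-126, as in the BERT basic tokenizer.

```java
public class PunctuationSketch {
    /** Returns true for ASCII symbols and for Unicode punctuation categories. */
    public static boolean isPunctuation(int cp) {
        // Assumed ranges: printable ASCII that is neither a letter nor a digit.
        // Characters such as '$' and '^' are Unicode symbols, not punctuation,
        // so they need this explicit check.
        if ((cp >= 33 && cp <= 47) || (cp >= 58 && cp <= 64)
                || (cp >= 91 && cp <= 96) || (cp >= 123 && cp <= 126)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }
}
```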
-
isChinese
public static boolean isChinese(int codepoint)
Determines if the provided codepoint is a Chinese character or not.
- Parameters:
codepoint - The codepoint to check.
- Returns:
True if the codepoint is a Chinese character, false otherwise.
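The documentation does not say which codepoints count as Chinese. A minimal sketch, assuming the check follows the CJK ideograph ranges used by the BERT basic tokenizer, might look like this. The class name `ChineseSketch` and the range table are assumptions, and Tribuo's actual ranges may differ.

```java
public class ChineseSketch {
    // Inclusive {start, end} codepoint ranges (assumed, following BERT).
    private static final int[][] RANGES = {
        {0x4E00, 0x9FFF},   // CJK Unified Ideographs
        {0x3400, 0x4DBF},   // Extension A
        {0x20000, 0x2A6DF}, // Extension B
        {0x2A700, 0x2B73F}, // Extension C
        {0x2B740, 0x2B81F}, // Extension D
        {0x2B820, 0x2CEAF}, // Extension E
        {0xF900, 0xFAFF},   // CJK Compatibility Ideographs
        {0x2F800, 0x2FA1F}  // CJK Compatibility Ideographs Supplement
    };

    /** Returns true if the codepoint falls in one of the assumed CJK ranges. */
    public static boolean isChinese(int cp) {
        for (int[] range : RANGES) {
            if (cp >= range[0] && cp <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

Several of these ranges lie outside the Basic Multilingual Plane, which is why the check takes an `int` codepoint rather than a `char`. The JDK's `Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN` is an alternative test, though it covers a somewhat different set of characters.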
-
isControl
public static boolean isControl(int codepoint)
Determines if the provided codepoint is a control character or not.
- Parameters:
codepoint - The codepoint to check.
- Returns:
True if it's a control character, false otherwise.
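One way such a check can be written with the JDK is sketched below. It assumes, as the BERT basic tokenizer does, that tab, newline, and carriage return are treated as whitespace rather than control characters, and it only considers the Unicode control and format categories. The class name `ControlSketch` is hypothetical, and Tribuo's implementation may include further categories.

```java
public class ControlSketch {
    /** Returns true for control or format codepoints, except tab, LF, and CR. */
    public static boolean isControl(int cp) {
        // Assumption: these three are handled as whitespace by the tokenizer.
        if (cp == '\t' || cp == '\n' || cp == '\r') {
            return false;
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL || type == Character.FORMAT;
    }
}
```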
-
postConfig
public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
-
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. A cloned tokenizer does not process the same text as the original tokenizer and needs to be reset with a fresh CharSequence.
- Specified by:
clone in interface Tokenizer
- Overrides:
clone in class SplitFunctionTokenizer
- Returns:
A tokenizer with the same configuration, but independent state.
-