Class WordpieceTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This Tokenizer is meant to be a reasonable approximation of the BertTokenizer
from the reference Python implementation; see the class definition of
BertTokenizer there (line numbers in that source may change). The notes in
WordpieceTokenizerTest describe how the similarity between this tokenizer and
the Python implementation was tested, and include regression test examples
that fail. In short, there are outstanding discrepancies for texts that
include Arabic and other non-Latin scripts. Such texts generate so many
"[UNK]" tokens with an English-based BPE vocabulary that the discrepancies
are practically meaningless.
As in the reference implementation, the input text is whitespace tokenized,
and each token is further tokenized to account for things like punctuation
and Chinese characters. The resulting tokens are then passed to the
wordpiece algorithm implemented in Wordpiece, which is driven by an input
vocabulary and matches tokens and token suffixes where it can. Any tokens
that are not found in the input vocabulary are marked as "unknown".
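The matching step described above can be illustrated with a minimal, self-contained sketch of greedy longest-match-first wordpiece splitting. This is not the code of the Wordpiece class: the class name WordpieceSketch, the toy vocabulary, and the "##" prefix for suffix pieces (the convention used by BERT vocabularies) are assumptions made for illustration, and the punctuation and Chinese-character handling is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; hypothetical class, not the Tribuo Wordpiece implementation.
class WordpieceSketch {
    static final String UNKNOWN = "[UNK]";

    // Splits one token into wordpieces by greedy longest-match-first lookup.
    static List<String> wordpiece(String token, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            int end = token.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = token.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // assumed marker for suffix pieces
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matched at this position: the whole token is unknown.
                return Arrays.asList(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    // Whitespace-splits the text, then applies wordpiece to each token.
    static List<String> tokenize(String text, Set<String> vocab) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.addAll(wordpiece(token, vocab));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able", "play", "##ing"));
        // prints [un, ##aff, ##able, play, ##ing, [UNK]]
        System.out.println(tokenize("unaffable playing xyz", vocab));
    }
}
```

Note that a token with no matching piece at any position collapses to a single "[UNK]", which is why texts outside the vocabulary's script produce long runs of unknown tokens.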
-
Constructor Summary
Constructor: WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit)
Description: Constructs a wordpiece tokenizer.
-
Method Summary
Modifier and Type, Method, Description
boolean advance()
Advances the tokenizer to the next token.
WordpieceTokenizer clone()
Clones a tokenizer with its configuration.
int getEnd()
Gets the ending offset (exclusive) of the current token in the character sequence.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
int getStart()
Gets the starting character offset of the current token in the character sequence.
String getText()
Gets the text of the current token, as a string.
Token getToken()
Generates a Token object from the current state of the tokenizer.
Token.TokenType getType()
Gets the type of the current token.
void reset(CharSequence cs)
Resets the tokenizer so that it operates on a new sequence of characters.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
-
Constructor Details
-
WordpieceTokenizer
public WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit)
Constructs a wordpiece tokenizer.
Parameters:
wordpiece - an instance of Wordpiece
tokenizer - Wordpiece is run on the tokens generated by the tokenizer provided here.
toLowerCase - determines whether or not to lowercase each token before running Wordpiece on it
stripAccents - determines whether or not to strip out accents from each token before running Wordpiece on it
neverSplit - a set of token values that we will not apply Wordpiece to.
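To make the toLowerCase and stripAccents flags concrete, here is a hedged sketch of one common way to perform this kind of per-token normalization with the standard library. The class NormalizeSketch and its normalize method are hypothetical; the actual normalization inside WordpieceTokenizer may differ in detail (for example in locale handling or in which Unicode categories are dropped).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the Tribuo API.
class NormalizeSketch {
    // Applies optional lowercasing and accent stripping to a single token.
    static String normalize(String token, boolean toLowerCase, boolean stripAccents) {
        String result = token;
        if (toLowerCase) {
            // Locale.ROOT avoids locale-specific case rules (an assumption of this sketch).
            result = result.toLowerCase(Locale.ROOT);
        }
        if (stripAccents) {
            // Decompose accented characters (NFD), then drop the nonspacing combining marks.
            result = Normalizer.normalize(result, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");
        }
        return result;
    }

    public static void main(String[] args) {
        // "Caf\u00e9" is "Café"; prints cafe
        System.out.println(normalize("Caf\u00e9", true, true));
    }
}
```

Normalizing before the vocabulary lookup matters because an uncased vocabulary typically contains only lowercase, unaccented entries, so an unnormalized token would otherwise fall through to "unknown".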
-
-
Method Details
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-
reset
public void reset(CharSequence cs)
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
-
advance
public boolean advance()
Description copied from interface: Tokenizer
Advances the tokenizer to the next token.
-
getToken
public Token getToken()
Description copied from interface: Tokenizer
Generates a Token object from the current state of the tokenizer.
-
getText
public String getText()
Description copied from interface: Tokenizer
Gets the text of the current token, as a string.
-
getStart
public int getStart()
Description copied from interface: Tokenizer
Gets the starting character offset of the current token in the character sequence.
-
getEnd
public int getEnd()
Description copied from interface: Tokenizer
Gets the ending offset (exclusive) of the current token in the character sequence.
-
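The reset, advance, getText, getStart and getEnd methods together form a cursor-style protocol: reset on a character sequence, then call advance until it returns false, reading the current token's state after each successful call. The sketch below mimics that protocol with a hypothetical whitespace-only stand-in, CursorSketch, so that it runs without Tribuo on the classpath; it is not the WordpieceTokenizer implementation, and only the method names are taken from this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics the reset/advance/getStart/getEnd/getText protocol.
class CursorSketch {
    private CharSequence cs = "";
    private int start = 0;
    private int end = 0;

    // Points the cursor at a new character sequence and clears the current token.
    void reset(CharSequence cs) {
        this.cs = cs;
        this.start = 0;
        this.end = 0;
    }

    // Moves to the next whitespace-delimited token; returns false when none is left.
    boolean advance() {
        int i = end;
        while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) {
            i++;
        }
        if (i >= cs.length()) {
            return false;
        }
        start = i;
        while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) {
            i++;
        }
        end = i; // exclusive: one past the last character of the token
        return true;
    }

    int getStart() { return start; }

    int getEnd() { return end; }

    String getText() { return cs.subSequence(start, end).toString(); }

    // Drives the cursor over the text and records each token with its offsets.
    static List<String> describe(CharSequence text) {
        CursorSketch t = new CursorSketch();
        t.reset(text);
        List<String> out = new ArrayList<>();
        while (t.advance()) {
            out.add(t.getText() + "[" + t.getStart() + "," + t.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hello[0,5), world[7,12)]
        System.out.println(describe("hello  world"));
    }
}
```

Because the end offset is exclusive, the pair (getStart(), getEnd()) can be passed directly to CharSequence.subSequence to recover the token's span in the original text.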
getType
public Token.TokenType getType()
Description copied from interface: Tokenizer
Gets the type of the current token.
-
clone
public WordpieceTokenizer clone()
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
-