This tokenizer follows the behavior of the reference Python BertTokenizer implementation (line numbers in the referenced source may change). Please see notes in WordpieceTokenizerTest for information about how we tested the similarity between this tokenizer and the referenced Python implementation, and for regression-test examples that fail. In short, there are outstanding discrepancies for texts that include Arabic and other non-Latin scripts: these generate so many "[UNK]" tokens for an English-based BPE vocabulary as to render the discrepancies practically meaningless.
As in the reference implementation, the input text is whitespace tokenized
and each token is further tokenized to account for things like punctuation
and Chinese characters. The resulting tokens are then passed to the
wordpiece algorithm implemented in
Wordpiece, which is driven by an
input vocabulary that matches tokens and token suffixes where it can. Any
tokens that are not found in the input vocabulary are marked as "unknown".
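The greedy longest-match-first matching that a wordpiece vocabulary performs can be sketched as follows. This is an illustrative assumption in the style of BERT wordpiece tokenization, not Tribuo's actual Wordpiece API: the vocabulary, the "##" continuation prefix, and the class and method names here are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of greedy longest-match-first wordpiece splitting.
public class WordpieceSketch {
    // Splits a single whitespace-delimited token against the vocabulary.
    public static List<String> wordpiece(String token, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            int end = token.length();
            String match = null;
            // Greedily find the longest vocabulary entry starting at 'start'.
            while (end > start) {
                String piece = token.substring(start, end);
                if (start > 0) {
                    piece = "##" + piece; // suffix pieces carry the continuation marker
                }
                if (vocab.contains(piece)) {
                    match = piece;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matched at this position: the whole token is "unknown".
                return List.of("[UNK]");
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing "un", "##aff", and "##able", the token "unaffable" splits into those three pieces, while a token with no matching pieces collapses to "[UNK]".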
Method Summary

boolean advance()
Advances the tokenizer to the next token.

WordpieceTokenizer clone()
Clones a tokenizer with its configuration.

int getEnd()
Gets the ending offset (exclusive) of the current token in the character sequence.

int getStart()
Gets the starting character offset of the current token in the character sequence.

String getText()
Gets the text of the current token, as a string.

Token getToken()
Generates a Token object from the current state of the tokenizer.

Token.TokenType getType()
Gets the type of the current token.

void reset(CharSequence cs)
Resets the tokenizer so that it operates on a new sequence of characters.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
WordpieceTokenizer
public WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit)
Constructs a wordpiece tokenizer.
Parameters:
wordpiece - an instance of Wordpiece
tokenizer - Wordpiece is run on the tokens generated by the tokenizer provided here.
toLowerCase - determines whether or not to lowercase each token before running Wordpiece on it.
stripAccents - determines whether or not to strip accents from each token before running Wordpiece on it.
neverSplit - a set of token values that Wordpiece will not be applied to.
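The effect of the toLowerCase and stripAccents options on an individual token can be sketched as below. The class and method names are hypothetical, not Tribuo API, but NFD decomposition followed by removal of combining marks is a standard way to strip accents in Java.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch of per-token preprocessing controlled by the
// toLowerCase and stripAccents options.
public class TokenPreprocess {
    public static String preprocess(String token, boolean toLowerCase, boolean stripAccents) {
        String s = token;
        if (toLowerCase) {
            s = s.toLowerCase(Locale.ROOT);
        }
        if (stripAccents) {
            // Decompose accented characters (NFD), then drop the combining marks.
            s = Normalizer.normalize(s, Normalizer.Form.NFD)
                 .replaceAll("\\p{M}", "");
        }
        return s;
    }
}
```

For example, with both options enabled "Résumé" becomes "resume" before wordpiece matching is attempted.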
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
reset
public void reset(CharSequence cs)
Resets the tokenizer so that it operates on a new sequence of characters.

advance
public boolean advance()
Advances the tokenizer to the next token.

getToken
public Token getToken()
Generates a Token object from the current state of the tokenizer.

getText
public String getText()
Gets the text of the current token, as a string.

getStart
public int getStart()
Gets the starting character offset of the current token in the character sequence.

getEnd
public int getEnd()
Gets the ending offset (exclusive) of the current token in the character sequence.

getType
public Token.TokenType getType()
Gets the type of the current token.
clone
public WordpieceTokenizer clone()
Clones a tokenizer with its configuration. Cloned tokenizers do not process the same text as the original tokenizer and need to be reset with a fresh CharSequence.
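The streaming contract described by reset, advance, getText, getStart, and getEnd can be illustrated with a toy whitespace tokenizer. This is a hypothetical sketch of how callers drive such an API, not Tribuo's implementation; the class name and internals are invented for illustration.

```java
// Hypothetical toy tokenizer implementing the same streaming contract:
// reset with a CharSequence, then call advance() until it returns false,
// reading the current token via getText()/getStart()/getEnd().
public class ToyStreamingTokenizer {
    private CharSequence cs;
    private int start;
    private int end;

    public void reset(CharSequence cs) {
        this.cs = cs;
        this.start = 0;
        this.end = 0;
    }

    public boolean advance() {
        int i = end;
        // Skip whitespace before the next token.
        while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) {
            i++;
        }
        if (i >= cs.length()) {
            return false; // no more tokens
        }
        start = i;
        // Consume the token characters.
        while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) {
            i++;
        }
        end = i;
        return true;
    }

    public int getStart() { return start; }
    public int getEnd() { return end; } // exclusive
    public String getText() { return cs.subSequence(start, end).toString(); }
}
```

A caller loop looks like: reset the tokenizer with the text, then `while (t.advance()) { use(t.getText()); }`. A cloned tokenizer would likewise need a reset before such a loop.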