public class WordpieceTokenizer extends Object implements Tokenizer
This tokenizer approximates the behavior of the Python BertTokenizer reference implementation (the line numbers in that implementation may change). Please see the notes in WordpieceTokenizerTest for information about how we tested the similarity between this tokenizer and the referenced Python implementation, and for regression test examples that fail. In short, there are outstanding discrepancies for texts that include Arabic and other non-Latin scripts; such texts generate so many "[UNK]" tokens for an English-based BPE vocabulary as to render the discrepancies practically meaningless.

As in the reference implementation, the input text is whitespace tokenized and each token is further tokenized to account for things like punctuation and Chinese characters. The resulting tokens are then passed to the wordpiece algorithm implemented in Wordpiece, which is driven by an input vocabulary that matches tokens and token suffixes where it can. Any tokens that are not found in the input vocabulary are marked as "unknown".
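To make the pipeline above concrete, here is a minimal usage sketch. The WordpieceTokenizer constructor matches the signature documented below; the Wordpiece vocabulary-file constructor, the WordpieceBasicTokenizer pre-tokenizer, the package names in the imports, and the vocab.txt path are illustrative assumptions that should be checked against their own javadoc.

```java
import java.util.Collections;
import java.util.List;

import org.tribuo.util.tokens.Token;
import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.impl.wordpiece.Wordpiece;
import org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer;
import org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer;

public class WordpieceExample {
    public static void main(String[] args) {
        // Assumed: a Wordpiece built from a BERT-style vocabulary file;
        // "vocab.txt" is a placeholder path, and the exact Wordpiece
        // constructors should be checked against its javadoc.
        Wordpiece wordpiece = new Wordpiece("vocab.txt");

        // Assumed pre-tokenizer performing the whitespace/punctuation/CJK
        // splitting described above.
        Tokenizer basic = new WordpieceBasicTokenizer();

        WordpieceTokenizer tokenizer = new WordpieceTokenizer(
                wordpiece,
                basic,
                true,                      // toLowerCase
                true,                      // stripAccents
                Collections.emptySet());   // neverSplit

        // tokenize(CharSequence) is a default method inherited from Tokenizer.
        List<Token> tokens = tokenizer.tokenize("Wordpiece tokenization example");
        for (Token token : tokens) {
            System.out.println(token);
        }
    }
}
```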
| Constructor and Description |
|---|
| WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit) Constructs a wordpiece tokenizer. |
| Modifier and Type | Method and Description |
|---|---|
| boolean | advance() Advances the tokenizer to the next token. |
| WordpieceTokenizer | clone() Clones a tokenizer with its configuration. |
| int | getEnd() Gets the ending offset (exclusive) of the current token in the character sequence. |
| com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
| int | getStart() Gets the starting character offset of the current token in the character sequence. |
| String | getText() Gets the text of the current token, as a string. |
| Token | getToken() Generates a Token object from the current state of the tokenizer. |
| Token.TokenType | getType() Gets the type of the current token. |
| void | reset(CharSequence cs) Resets the tokenizer so that it operates on a new sequence of characters. |
Methods inherited from class java.lang.Object: equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface Tokenizer: createSupplier, createThreadLocal, split, tokenize
public WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit)

Constructs a wordpiece tokenizer.

Parameters:
wordpiece - an instance of Wordpiece
tokenizer - Wordpiece is run on the tokens generated by the tokenizer provided here
toLowerCase - determines whether or not to lowercase each token before running Wordpiece on it
stripAccents - determines whether or not to strip accents from each token before running Wordpiece on it
neverSplit - a set of token values that Wordpiece will not be applied to

public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()

Specified by: getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
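One way to fill in the constructor arguments above is sketched below, using the neverSplit set to pass BERT-style control tokens through unsplit. The particular marker strings, the WordpieceBasicTokenizer pre-tokenizer, and the helper name buildBertStyleTokenizer are illustrative assumptions, not requirements of this class.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.tribuo.util.tokens.Tokenizer;
import org.tribuo.util.tokens.impl.wordpiece.Wordpiece;
import org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer;
import org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer;

public class NeverSplitExample {
    // Builds a tokenizer that leaves BERT-style control tokens untouched.
    public static Tokenizer buildBertStyleTokenizer(Wordpiece wordpiece) {
        // Illustrative marker tokens; use whatever control tokens your
        // vocabulary actually defines.
        Set<String> neverSplit = new HashSet<>(
                Arrays.asList("[CLS]", "[SEP]", "[MASK]", "[UNK]"));

        return new WordpieceTokenizer(
                wordpiece,
                new WordpieceBasicTokenizer(), // assumed pre-tokenizer
                true,        // lowercase each token before running Wordpiece
                true,        // strip accents before running Wordpiece
                neverSplit);
    }
}
```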
public void reset(CharSequence cs)
Resets the tokenizer so that it operates on a new sequence of characters.
Specified by: reset in interface Tokenizer

public boolean advance()
Advances the tokenizer to the next token.
Specified by: advance in interface Tokenizer

public Token getToken()
Generates a Token object from the current state of the tokenizer.
Specified by: getToken in interface Tokenizer

public String getText()
Gets the text of the current token, as a string.
Specified by: getText in interface Tokenizer

public int getStart()
Gets the starting character offset of the current token in the character sequence.
Specified by: getStart in interface Tokenizer

public int getEnd()
Gets the ending offset (exclusive) of the current token in the character sequence.
Specified by: getEnd in interface Tokenizer

public Token.TokenType getType()
Gets the type of the current token.
Specified by: getType in interface Tokenizer

public WordpieceTokenizer clone()
Clones a tokenizer with its configuration.
Specified by: clone in interface Tokenizer
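The method details above describe a streaming protocol: reset(CharSequence) installs the input, advance() moves to the next token, and the accessors report on the current token until advance() returns false. A minimal sketch of that loop, using only the methods documented here (the helper name printTokens is illustrative):

```java
import org.tribuo.util.tokens.Tokenizer;

public class TokenStreamExample {
    // Walks a character sequence with the streaming Tokenizer API:
    // reset the tokenizer on the input, then advance until no tokens remain.
    public static void printTokens(Tokenizer tokenizer, CharSequence text) {
        tokenizer.reset(text);
        while (tokenizer.advance()) {
            System.out.printf("%s [%d,%d) %s%n",
                    tokenizer.getText(),   // text of the current token
                    tokenizer.getStart(),  // starting character offset
                    tokenizer.getEnd(),    // ending offset (exclusive)
                    tokenizer.getType());  // Token.TokenType of the token
        }
    }
}
```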
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.