public class Wordpiece extends Object implements com.oracle.labs.mlrg.olcut.config.Configurable
Please refer to the class definition for WordpieceTokenizer. This class
does not include any of the tokenization work that is typically performed
before wordpiece is called, as is done in the above-referenced implementation.
That functionality is provided by WordpieceTokenizer and WordpieceBasicTokenizer.
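To illustrate the division of labor described above, here is a minimal sketch in plain Java. It is not the library's code: `PreTokenizeSketch`, `tokenize`, and the `wordpieceStep` parameter are hypothetical names, and whitespace splitting plus lowercasing is only an assumed stand-in for whatever WordpieceBasicTokenizer actually does. The point is that the wordpiece step only ever sees one pre-split token at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

final class PreTokenizeSketch {
    /**
     * Splits text into words first, then hands each word to the wordpiece step.
     * Whitespace splitting and lowercasing are assumptions for illustration.
     */
    static List<String> tokenize(String text, Function<String, List<String>> wordpieceStep) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input is blank
            }
            out.addAll(wordpieceStep.apply(word.toLowerCase(Locale.ROOT)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for a real wordpiece step: cut after three characters.
        Function<String, List<String>> stub = w -> w.length() > 3
                ? Arrays.asList(w.substring(0, 3), "##" + w.substring(3))
                : Collections.singletonList(w);
        System.out.println(tokenize("Hello big World", stub));
        // prints [hel, ##lo, big, wor, ##ld]
    }
}
```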
Modifier and Type | Field and Description |
---|---|
static String | DEFAULT_UNKNOWN_TOKEN |
Constructor and Description |
---|
Wordpiece(Set<String> vocab) Constructs a Wordpiece using the supplied vocab. |
Wordpiece(Set<String> vocab, String unknownToken) Constructs a Wordpiece using the supplied vocabulary and unknown token. |
Wordpiece(Set<String> vocab, String unknownToken, int maxInputCharactersPerWord) Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length. |
Wordpiece(String vocabPath) Constructs a wordpiece by reading the vocabulary from the supplied path. |
Wordpiece(String vocabPath, String unknownToken, int maxInputCharactersPerWord) Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length. |
Modifier and Type | Method and Description |
---|---|
int | getMaxInputCharactersPerWord() A getter for the maximum character count for a token to consider when wordpiece(String) is applied to a token. |
String | getUnknownToken() A getter for the "unknown" token specified during initialization. |
void | postConfig() Used by the OLCUT configuration system, and should not be called by external code. |
List<String> | wordpiece(String token) Executes Wordpiece tokenization on the provided token. |
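As a rough illustration of what wordpiece(String) does to a single token, the following is a minimal sketch of greedy longest-match-first splitting, the approach commonly used by BERT-style Wordpiece tokenizers. It is not this class's source code: `WordpieceSketch` is a hypothetical name, and the `##` continuation prefix is an assumption based on BERT vocabularies rather than something this page states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WordpieceSketch {
    /** Splits one token into the longest vocabulary pieces, left to right. */
    static List<String> wordpiece(String token, Set<String> vocab, String unknownToken) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            int end = token.length();
            String match = null;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (start < end) {
                String candidate = token.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // assumed continuation marker
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Nothing matches at this position, so the whole token is unknown.
                return Collections.singletonList(unknownToken);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(wordpiece("unaffable", vocab, "[UNK]")); // prints [un, ##aff, ##able]
        System.out.println(wordpiece("xyz", vocab, "[UNK]"));       // prints [[UNK]]
    }
}
```

Note that in this sketch a token is replaced by the unknown token as a whole as soon as any position fails to match; no partial split is returned.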
public static final String DEFAULT_UNKNOWN_TOKEN

public Wordpiece(Set<String> vocab)
Sets the unknown token to DEFAULT_UNKNOWN_TOKEN.
Parameters:
vocab - The wordpiece vocabulary.

public Wordpiece(Set<String> vocab, String unknownToken)
Parameters:
vocab - The wordpiece vocabulary.
unknownToken - The unknown token.

public Wordpiece(Set<String> vocab, String unknownToken, int maxInputCharactersPerWord)
Parameters:
vocab - the pre-trained wordpiece vocabulary. See the contents of e.g. https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
unknownToken - a string used to indicate a token was not found in the vocabulary - typically "[UNK]"
maxInputCharactersPerWord - a maximum that guards against looping character by character over pathologically long "tokens"

public Wordpiece(String vocabPath)
Parameters:
vocabPath - The path to the wordpiece vocabulary.

public Wordpiece(String vocabPath, String unknownToken, int maxInputCharactersPerWord)
Parameters:
vocabPath - Path to the pre-trained wordpiece vocabulary. See the contents of e.g. https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
unknownToken - a string used to indicate a token was not found in the vocabulary - typically "[UNK]"
maxInputCharactersPerWord - a maximum that guards against looping character by character over pathologically long "tokens"

public void postConfig() throws IOException
Specified by:
postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
Throws:
IOException
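The path-based constructors read the vocabulary from a file, which is presumably why an IOException can surface. The sketch below shows one way such a file could be loaded into a Set<String>. It assumes the BERT vocab.txt layout of one entry per line in UTF-8; `VocabLoaderSketch` and `loadVocab` are hypothetical names and do not reflect how this class reads its file internally.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class VocabLoaderSketch {
    /** Reads one vocabulary entry per line, skipping blank lines. */
    static Set<String> loadVocab(Path vocabPath) throws IOException {
        Set<String> vocab = new LinkedHashSet<>(); // keeps file order for readability
        for (String line : Files.readAllLines(vocabPath, StandardCharsets.UTF_8)) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                vocab.add(entry);
            }
        }
        return vocab;
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny vocabulary to a temporary file, then load it back.
        Path tmp = Files.createTempFile("vocab", ".txt");
        Files.write(tmp, Arrays.asList("[UNK]", "un", "##aff", "##able"), StandardCharsets.UTF_8);
        System.out.println(loadVocab(tmp)); // prints [[UNK], un, ##aff, ##able]
        Files.delete(tmp);
    }
}
```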
public List<String> wordpiece(String token)
Parameters:
token - the token to apply Wordpiece tokenization to.

public String getUnknownToken()

public int getMaxInputCharactersPerWord()
A getter for the maximum character count for a token to consider when
wordpiece(String) is applied to a token. This value is set at
initialization and defaults to 100. Token values passed to that method that
are longer than this maximum are not tokenized, and the result of
getUnknownToken() is returned instead.
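
The length limit can be pictured as a guard placed in front of the actual splitting. The following is a minimal sketch under stated assumptions: `MaxLengthGuardSketch`, `guarded`, and the `splitter` parameter are hypothetical names, and the strict greater-than comparison is an assumption, since the text above does not say whether a token of exactly the maximum length is still tokenized.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

final class MaxLengthGuardSketch {
    /** Returns the unknown token for over-long input, else delegates to the splitter. */
    static List<String> guarded(String token, int maxChars, String unknownToken,
                                Function<String, List<String>> splitter) {
        if (token.length() > maxChars) { // assumed comparison, see lead-in
            return Collections.singletonList(unknownToken);
        }
        return splitter.apply(token);
    }

    public static void main(String[] args) {
        char[] chars = new char[101];
        Arrays.fill(chars, 'a');
        String tooLong = new String(chars); // 101 characters, one over the default of 100

        // Identity splitter as a stand-in for the real wordpiece step.
        Function<String, List<String>> identity = Collections::singletonList;
        System.out.println(guarded(tooLong, 100, "[UNK]", identity)); // prints [[UNK]]
        System.out.println(guarded("short", 100, "[UNK]", identity)); // prints [short]
    }
}
```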
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.