Class Wordpiece
java.lang.Object
org.tribuo.util.tokens.impl.wordpiece.Wordpiece
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable
This is a vanilla implementation of the Wordpiece algorithm as found here:
https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py
Please refer to the class definition for WordpieceTokenizer in that file. This class does not include any of the tokenization work that is typically performed before wordpiece is called, as is done in the above-referenced implementation. That functionality is provided by WordpieceTokenizer and WordpieceBasicTokenizer.
-
Field Summary
-
Constructor Summary
Constructor / Description
Wordpiece(vocabPath): Constructs a wordpiece by reading the vocabulary from the supplied path.
Wordpiece(vocabPath, unknownToken, maxInputCharactersPerWord): Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
Wordpiece(vocab): Constructs a Wordpiece using the supplied vocab.
Wordpiece(vocab, unknownToken): Constructs a Wordpiece using the supplied vocabulary and unknown token.
Wordpiece(vocab, unknownToken, maxInputCharactersPerWord): Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
-
Method Summary
Modifier and Type / Method / Description
int getMaxInputCharactersPerWord(): A getter for the maximum character count for a token to consider when wordpiece(String) is applied to a token.
getUnknownToken(): A getter for the "unknown" token specified during initialization.
void postConfig(): Used by the OLCUT configuration system, and should not be called by external code.
wordpiece(String): Executes Wordpiece tokenization on the provided token.
-
Field Details
-
DEFAULT_UNKNOWN_TOKEN
The default unknown token string.
-
Constructor Details
-
Wordpiece
Constructs a Wordpiece using the supplied vocab. Sets the unknown token to DEFAULT_UNKNOWN_TOKEN.
- Parameters:
vocab - The wordpiece vocabulary.
-
Wordpiece
Constructs a Wordpiece using the supplied vocabulary and unknown token.
- Parameters:
vocab - The wordpiece vocabulary.
unknownToken - The unknown token.
-
Wordpiece
Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
- Parameters:
vocab - the pre-trained wordpiece vocabulary. For an example of the contents, see https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
unknownToken - a string used to indicate that a token was not found in the vocabulary, typically "[UNK]"
maxInputCharactersPerWord - a maximum length that guards against looping character-by-character over pathologically long "tokens"
-
Wordpiece
Constructs a wordpiece by reading the vocabulary from the supplied path.
- Parameters:
vocabPath - The path to the wordpiece vocabulary.
-
Wordpiece
Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
- Parameters:
vocabPath - path to the pre-trained wordpiece vocabulary. For an example of the contents, see https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
unknownToken - a string used to indicate that a token was not found in the vocabulary, typically "[UNK]"
maxInputCharactersPerWord - a maximum length that guards against looping character-by-character over pathologically long "tokens"
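
For orientation, the vocabulary file linked above is plain text with one entry per line. The following is a minimal sketch of reading such a file into a set, assuming UTF-8 encoding and skipping blank lines. The class VocabLoaderSketch and its methods parseVocab and loadVocab are hypothetical names used for illustration only; this is not the code the path-based constructors use, and their actual parsing rules may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Tribuo.
final class VocabLoaderSketch {

    // Turns the lines of a vocabulary file into a set of entries.
    // Assumption: one entry per line; blank lines are ignored.
    static Set<String> parseVocab(List<String> lines) {
        // LinkedHashSet keeps file order and drops duplicate entries.
        Set<String> vocab = new LinkedHashSet<>();
        for (String line : lines) {
            if (!line.isEmpty()) {
                vocab.add(line);
            }
        }
        return vocab;
    }

    // Reads a vocabulary file, assuming it is UTF-8 encoded.
    static Set<String> loadVocab(Path vocabPath) {
        try {
            return parseVocab(Files.readAllLines(vocabPath, StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Rethrown unchecked so callers of this sketch need no throws clause.
            throw new UncheckedIOException(e);
        }
    }
}
```

A set built this way has the shape expected by the constructors that take a vocab argument, although their exact parameter type is not shown on this page.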
-
-
Method Details
-
postConfig
Used by the OLCUT configuration system, and should not be called by external code.
- Specified by:
postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
- Throws:
IOException
-
wordpiece
Executes Wordpiece tokenization on the provided token. Note that this method may return tokens that correspond to word suffixes, which the provided vocabulary marks by prepending the sequence "##" to the entry. This method does not perform whitespace tokenization or any other preprocessing. It does not lowercase the input token or otherwise modify it in any way.
- Parameters:
token - the token to apply Wordpiece tokenization to.
- Returns:
tokens corresponding to Wordpiece tokenization applied to the input text. Some tokens may have a "##" prefix as described above. Some tokens may correspond to the unknown token specified during initialization (default "[UNK]").
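
To make the behavior described above concrete, here is a minimal sketch of the greedy longest-match-first procedure used by the referenced BERT implementation, written as a standalone static method. The class name WordpieceSketch and its parameter list are assumptions for illustration; this is not Tribuo's source, and details such as how character length is measured may differ in the real class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the Tribuo implementation.
final class WordpieceSketch {

    // Splits one pre-tokenized token into vocabulary pieces, greedily taking
    // the longest matching piece at each position.
    static List<String> wordpiece(String token, Set<String> vocab,
                                  String unknownToken, int maxInputCharactersPerWord) {
        // Simplification: length is counted in UTF-16 code units.
        if (token.length() > maxInputCharactersPerWord) {
            return List.of(unknownToken); // overly long tokens are not analyzed
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            int end = token.length();
            String match = null;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (start < end) {
                String candidate = token.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // non-initial pieces carry the "##" prefix
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // Nothing matches at this position, so the whole token is unknown.
                return List.of(unknownToken);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able");
        System.out.println(wordpiece("unaffable", vocab, "[UNK]", 100)); // [un, ##aff, ##able]
        System.out.println(wordpiece("Unaffable", vocab, "[UNK]", 100)); // [[UNK]]
    }
}
```

The second call returns the unknown token because the input is not lowercased and no vocabulary entry starts with "U", which is why case folding and whitespace splitting have to happen before this step.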
-
getUnknownToken
A getter for the "unknown" token specified during initialization.
- Returns:
the "unknown" token name, which defaults to "[UNK]"
-
getMaxInputCharactersPerWord
public int getMaxInputCharactersPerWord()
A getter for the maximum character count for a token to consider when wordpiece(String) is applied to a token. This value is set at initialization and defaults to 100. Tokens longer than this value that are passed to that method are not tokenized, and the result of getUnknownToken() is returned instead.
- Returns:
the maximum length of a token that will be analyzed by wordpiece(String).
-