Class Wordpiece

java.lang.Object
org.tribuo.util.tokens.impl.wordpiece.Wordpiece
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable

public class Wordpiece extends Object implements com.oracle.labs.mlrg.olcut.config.Configurable
This is a vanilla implementation of the Wordpiece algorithm, as found here: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py

This class does not include any of the tokenization work that is typically performed before wordpiece is called in the above-referenced implementation. That functionality is provided by WordpieceTokenizer and WordpieceBasicTokenizer; please refer to the class definition for WordpieceTokenizer.

  • Field Details

    • DEFAULT_UNKNOWN_TOKEN

      public static final String DEFAULT_UNKNOWN_TOKEN
      The default unknown token string.
  • Constructor Details

    • Wordpiece

      public Wordpiece(Set<String> vocab)
      Constructs a Wordpiece using the supplied vocab.

      Sets the unknown token to DEFAULT_UNKNOWN_TOKEN.

      Parameters:
      vocab - The wordpiece vocabulary.
    • Wordpiece

      public Wordpiece(Set<String> vocab, String unknownToken)
      Constructs a Wordpiece using the supplied vocabulary and unknown token.
      Parameters:
      vocab - The wordpiece vocabulary.
      unknownToken - The unknown token.
    • Wordpiece

      public Wordpiece(Set<String> vocab, String unknownToken, int maxInputCharactersPerWord)
      Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
      Parameters:
      vocab - the pre-trained wordpiece vocabulary. See, e.g., the contents of https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
      unknownToken - a string used to indicate that a token was not found in the vocabulary, typically "[UNK]"
      maxInputCharactersPerWord - a maximum token length, which guards against looping character by character over pathologically long "tokens"
    • Wordpiece

      public Wordpiece(String vocabPath)
      Constructs a Wordpiece by reading the vocabulary from the supplied path.
      Parameters:
      vocabPath - The path to the wordpiece vocabulary.
    • Wordpiece

      public Wordpiece(String vocabPath, String unknownToken, int maxInputCharactersPerWord)
      Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
      Parameters:
      vocabPath - the path to the pre-trained wordpiece vocabulary. See, e.g., the contents of https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
      unknownToken - a string used to indicate that a token was not found in the vocabulary, typically "[UNK]"
      maxInputCharactersPerWord - a maximum token length, which guards against looping character by character over pathologically long "tokens"
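
The vocabulary file referenced above lists one entry per line. As a rough illustration of how such a file could be turned into the Set<String> accepted by the vocab-based constructors, here is a minimal sketch. The class VocabLoaderSketch and its methods parseVocab and loadVocab are hypothetical names introduced for this example; they are not part of Tribuo, and the trimming and blank-line handling are assumptions rather than a description of what this class does internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class VocabLoaderSketch {

    // Hypothetical helper: turns the lines of a vocab.txt-style file into a set.
    // Assumption: one vocabulary entry per line; blank lines are skipped.
    static Set<String> parseVocab(List<String> lines) {
        Set<String> vocab = new LinkedHashSet<>();
        for (String line : lines) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                vocab.add(entry);
            }
        }
        return vocab;
    }

    // Hypothetical helper: reads the file as UTF-8 and delegates to parseVocab.
    static Set<String> loadVocab(Path vocabPath) {
        try {
            return parseVocab(Files.readAllLines(vocabPath, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Set<String> vocab = parseVocab(Arrays.asList("[UNK]", "un", "##able", ""));
        System.out.println(vocab.size() + " entries, contains ##able: " + vocab.contains("##able"));
    }
}
```

The resulting set could then be passed to Wordpiece(Set<String> vocab); the String-path constructors exist so that callers do not need to do this themselves.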
  • Method Details

    • postConfig

      public void postConfig() throws IOException
      Used by the OLCUT configuration system, and should not be called by external code.
      Specified by:
      postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
      Throws:
      IOException
    • wordpiece

      public List<String> wordpiece(String token)
      Executes Wordpiece tokenization on the provided token. Note that this method may return tokens corresponding to word suffixes, which are indicated in the provided vocabulary by the sequence "##" prepended to the entry. This method does not perform whitespace tokenization or any other preprocessing, and it does not lowercase the input token or otherwise modify it in any way.
      Parameters:
      token - the token to apply Wordpiece tokenization to.
      Returns:
      tokens corresponding to Wordpiece tokenization applied to the input token. Some tokens may have the prefix "##" as described above. Some tokens may correspond to the unknown token specified during initialization (default "[UNK]").
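
To make the behaviour described above concrete, the following is a minimal, self-contained sketch of greedy longest-match-first wordpiece splitting in the style of the referenced BERT tokenizer. It is not the source of this class: WordpieceSketch and its static wordpiece method are hypothetical names, the tiny vocabulary is invented for illustration, and measuring length with String.length() (UTF-16 code units) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordpieceSketch {

    // Hypothetical stand-alone version of the algorithm, for illustration only.
    static List<String> wordpiece(String token, Set<String> vocab,
                                  String unknownToken, int maxInputCharactersPerWord) {
        // Over-long tokens are not analysed; the unknown token is returned instead.
        if (token.length() > maxInputCharactersPerWord) {
            return Collections.singletonList(unknownToken);
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            int end = token.length();
            String match = null;
            // Try the longest remaining substring first, shrinking from the right.
            while (start < end) {
                String candidate = token.substring(start, end);
                if (start > 0) {
                    // Pieces that do not begin the word carry the "##" prefix.
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No vocabulary entry covers this position: the whole token is unknown.
                return Collections.singletonList(unknownToken);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("un", "##aff", "##able"));
        System.out.println(wordpiece("unaffable", vocab, "[UNK]", 100)); // [un, ##aff, ##able]
        System.out.println(wordpiece("xyz", vocab, "[UNK]", 100));       // [[UNK]]
    }
}
```

Because the sketch, like the method documented here, does no lowercasing or whitespace splitting, an input such as "Unaffable" would come back as the unknown token with this vocabulary.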
    • getUnknownToken

      public String getUnknownToken()
      A getter for the "unknown" token specified during initialization.
      Returns:
      the "unknown" token name - defaults to "[UNK]"
    • getMaxInputCharactersPerWord

      public int getMaxInputCharactersPerWord()
      A getter for the maximum character count for a token to consider when wordpiece(String) is applied to a token. This value is set at initialization and defaults to 100. Tokens longer than this value that are passed to that method are not tokenized, and the result of getUnknownToken() is returned instead.
      Returns:
      the maximum length of a token that will be analyzed by wordpiece(String).