org.tribuo.util.tokens.impl.wordpiece.Wordpiece

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable

public class Wordpiece extends Object implements com.oracle.labs.mlrg.olcut.config.Configurable

This is a vanilla implementation of the Wordpiece algorithm as found here: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py

Please refer to the class definition for WordpieceTokenizer. This class does not include any of the tokenization work that is typically performed before wordpiece is called, as is done in the above-referenced implementation. That functionality is provided by WordpieceTokenizer and WordpieceBasicTokenizer.

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final String | DEFAULT_UNKNOWN_TOKEN | |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| Wordpiece(String vocabPath) | Constructs a wordpiece by reading the vocabulary from the supplied path. |
| Wordpiece(String vocabPath, String unknownToken, int maxInputCharactersPerWord) | Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length. |
| Wordpiece(Set<String> vocab) | Constructs a Wordpiece using the supplied vocab. |
| Wordpiece(Set<String> vocab, String unknownToken) | Constructs a Wordpiece using the supplied vocabulary and unknown token. |
| Wordpiece(Set<String> vocab, String unknownToken, int maxInputCharactersPerWord) | Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int | getMaxInputCharactersPerWord() | a getter for the maximum character count for a token to consider when wordpiece(String) is applied to a token. |
| String | getUnknownToken() | a getter for the "unknown" token specified during initialization. |
| void | postConfig() | Used by the OLCUT configuration system, and should not be called by external code. |
| List<String> | wordpiece(String token) | Executes Wordpiece tokenization on the provided token. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_UNKNOWN_TOKEN
  public static final String DEFAULT_UNKNOWN_TOKEN
  
  See Also:
  
  Constant Field Values
Constructor Details
- Wordpiece
  
  public Wordpiece(Set<String> vocab)
  
  Constructs a Wordpiece using the supplied vocab.
  Sets the unknown token to DEFAULT_UNKNOWN_TOKEN.
  
  Parameters:
  
  vocab - The wordpiece vocabulary.
- Wordpiece
  
  public Wordpiece(Set<String> vocab, String unknownToken)
  
  Constructs a Wordpiece using the supplied vocabulary and unknown token.
  
  Parameters:
  
  vocab - The wordpiece vocabulary.
  
  unknownToken - The unknown token.
- Wordpiece
  
  public Wordpiece(Set<String> vocab, String unknownToken, int maxInputCharactersPerWord)
  
  Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
  
  Parameters:
  
  vocab - the pre-trained wordpiece vocabulary. See the contents of e.g., https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
  
  unknownToken - a string used to indicate a token was not found in the vocabulary - typically "[UNK]"
  
  maxInputCharactersPerWord - an upper bound on token length that guards against looping character by character over pathologically long "tokens"
- Wordpiece
  
  public Wordpiece(String vocabPath)
  
  Constructs a wordpiece by reading the vocabulary from the supplied path.
  
  Parameters:
  
  vocabPath - The path to the wordpiece vocabulary.
- Wordpiece
  
  public Wordpiece(String vocabPath, String unknownToken, int maxInputCharactersPerWord)
  
  Initializes an instance of Wordpiece with the given vocabulary, unknown token, and max word length.
  
  Parameters:
  
  vocabPath - Path to the pre-trained wordpiece vocabulary. See the contents of e.g. https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
  
  unknownToken - a string used to indicate a token was not found in the vocabulary - typically "[UNK]"
  
  maxInputCharactersPerWord - an upper bound on token length that guards against looping character by character over pathologically long "tokens"
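
The path-based constructors read the vocabulary from a file. As a rough illustration, assuming the file follows the layout of the vocab.txt linked above (one entry per line, UTF-8), a set suitable for the Set<String> constructors could be built along these lines. VocabLoaderSketch and loadVocab are hypothetical names used only here, and trimming lines and skipping blank ones are assumptions; this is not necessarily how Wordpiece itself parses the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Tribuo.
final class VocabLoaderSketch {

    // Reads one vocabulary entry per line.
    // Assumption: surrounding whitespace is trimmed and blank lines are skipped.
    static Set<String> loadVocab(Path vocabPath) throws IOException {
        Set<String> vocab = new LinkedHashSet<>();
        for (String line : Files.readAllLines(vocabPath, StandardCharsets.UTF_8)) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                vocab.add(entry);
            }
        }
        return vocab;
    }
}
```

The resulting set could then be passed to Wordpiece(Set<String> vocab) or one of its overloads.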
Method Details
- postConfig
  
  public void postConfig() throws IOException
  
  Used by the OLCUT configuration system, and should not be called by external code.
  
  Specified by:
  
  postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
  
  Throws:
  
  IOException
- wordpiece
  
  public List<String> wordpiece(String token)
  
  Executes Wordpiece tokenization on the provided token. Note that this method may return tokens corresponding to word suffixes, which are indicated in the provided vocabulary by the sequence "##" prepended to the entry. This method does not perform whitespace tokenization or any other preprocessing. This method does not lowercase the input token or otherwise modify it in any way.
  
  Parameters:
  
  token - the token to apply Wordpiece tokenization to.
  
  Returns:
  
  tokens corresponding to Wordpiece tokenization applied to the input text. Some tokens may have a prefix "##" as described above. Some tokens may correspond to an unknown token as specified during initialization (default "[UNK]")
- getUnknownToken
  
  public String getUnknownToken()
  
  a getter for the "unknown" token specified during initialization.
  
  Returns:
  
  the "unknown" token name - defaults to "[UNK]"
- getMaxInputCharactersPerWord
  
  public int getMaxInputCharactersPerWord()
  
  a getter for the maximum character count for a token to consider when wordpiece(String) is applied to a token. This value is set at initialization and defaults to 100. Tokens longer than this value that are passed to that method are not tokenized, and the result of getUnknownToken() is returned instead.
  
  Returns:
  
  the maximum length of a token that will be analyzed by wordpiece(String).
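
The behaviour described for wordpiece(String), getUnknownToken() and getMaxInputCharactersPerWord() can be pictured with a minimal sketch of greedy longest-match-first splitting, modelled on the Python implementation referenced in the class description. This is an illustration, not Tribuo's source: WordpieceSketch and its parameter names are hypothetical, and length is counted in UTF-16 code units here, which may differ from the real class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch, not the Tribuo class.
final class WordpieceSketch {

    static List<String> wordpiece(String token, Set<String> vocab,
                                  String unknownToken, int maxInputCharactersPerWord) {
        // Overly long tokens are not analysed at all.
        if (token.length() > maxInputCharactersPerWord) {
            return List.of(unknownToken);
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            int end = token.length();
            String match = null;
            // Try the longest remaining substring first, then shrink from the right.
            while (start < end) {
                String candidate = token.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // non-initial pieces carry the "##" prefix
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No vocabulary entry covers this position: the whole token is unknown.
                return List.of(unknownToken);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing "un", "##aff" and "##able", the sketch splits "unaffable" into "un", "##aff", "##able". Because the input is not lowercased, "Unaffable" would map to the unknown token under the same vocabulary.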
