org.tribuo.util.tokens.impl.wordpiece.WordpieceTokenizer

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class WordpieceTokenizer extends Object implements Tokenizer

This Tokenizer is meant to be a reasonable approximation of the BertTokenizer in the reference Python implementation. Please see the class definition of BertTokenizer there (the line numbers may change). Please see the notes in WordpieceTokenizerTest for information about how we tested the similarity between this tokenizer and the reference Python implementation, and for regression test examples that fail. In short, there are outstanding discrepancies for texts that include Arabic and other non-Latin scripts; these texts generate so many "[UNK]" tokens for an English-based BPE vocabulary that the discrepancies are practically meaningless.

As in the reference implementation, the input text is whitespace tokenized and each token is further tokenized to account for things like punctuation and Chinese characters. The resulting tokens are then passed to the wordpiece algorithm implemented in Wordpiece, which is driven by an input vocabulary and matches tokens and token suffixes against it where it can. Any tokens that are not found in the input vocabulary are marked as "unknown".
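
The matching step described above can be illustrated with a minimal, self-contained sketch of greedy longest-match-first wordpiece splitting. This is not the Tribuo Wordpiece class: the class name WordpieceSketch, the "##" suffix marker, and the toy vocabulary are assumptions made for illustration, following the convention commonly used with BERT vocabularies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the Tribuo implementation.
final class WordpieceSketch {
    static final String UNKNOWN = "[UNK]";

    // Splits one pre-tokenized word into wordpieces, always taking the
    // longest vocabulary entry that matches at the current position.
    static List<String> wordpiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    // Assumed convention: non-initial pieces carry a "##" prefix.
                    candidate = "##" + candidate;
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matches at this position: the whole word is unknown.
                return Collections.singletonList(UNKNOWN);
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    // Whitespace-tokenizes the text, then applies wordpiece to each word.
    static List<String> tokenize(String text, Set<String> vocab) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.addAll(wordpiece(word, vocab));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(
            Arrays.asList("un", "##aff", "##able", "play", "##ing"));
        // prints [un, ##aff, ##able, play, ##ing, [UNK]]
        System.out.println(tokenize("unaffable playing xyz", vocab));
    }
}
```

The sketch omits the punctuation and Chinese character handling mentioned above, and it marks a word as unknown as soon as any position fails to match.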

Constructor Summary

WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit) - Constructs a wordpiece tokenizer.

Method Summary

boolean advance() - Advances the tokenizer to the next token.

WordpieceTokenizer clone() - Clones a tokenizer with its configuration.

int getEnd() - Gets the ending offset (exclusive) of the current token in the character sequence.

com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()

int getStart() - Gets the starting character offset of the current token in the character sequence.

String getText() - Gets the text of the current token, as a string.

Token getToken() - Generates a Token object from the current state of the tokenizer.

Token.TokenType getType() - Gets the type of the current token.

void reset(CharSequence cs) - Resets the tokenizer so that it operates on a new sequence of characters.

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig

Methods inherited from interface org.tribuo.util.tokens.Tokenizer
split, tokenize

Constructor Details
- WordpieceTokenizer
  
  public WordpieceTokenizer(Wordpiece wordpiece, Tokenizer tokenizer, boolean toLowerCase, boolean stripAccents, Set<String> neverSplit)
  
  Constructs a wordpiece tokenizer.
  
  Parameters:
  
  wordpiece - an instance of Wordpiece
  
  tokenizer - Wordpiece is run on the tokens generated by the tokenizer provided here.
  
  toLowerCase - determines whether or not to lowercase each token before running Wordpiece on it
  
  stripAccents - determines whether or not to strip out accents from each token before running Wordpiece on it
  
  neverSplit - a set of token values that we will not apply Wordpiece to.
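
To make the toLowerCase and stripAccents flags concrete, here is a minimal sketch of one common way to implement them with the standard library: lowercase with a fixed locale, then decompose to NFD and drop combining marks. The class NormalizeSketch, its method, and the order of the two steps are assumptions for illustration; the source does not state how WordpieceTokenizer performs these steps internally.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration only; not the Tribuo implementation.
final class NormalizeSketch {

    // Applies the two optional normalization steps to a single token.
    static String normalize(String token, boolean toLowerCase, boolean stripAccents) {
        String result = token;
        if (toLowerCase) {
            // Locale.ROOT avoids locale-specific case mappings.
            result = result.toLowerCase(Locale.ROOT);
        }
        if (stripAccents) {
            // NFD splits an accented letter into base letter + combining mark;
            // \p{Mn} then removes the nonspacing combining marks.
            String decomposed = Normalizer.normalize(result, Normalizer.Form.NFD);
            result = decomposed.replaceAll("\\p{Mn}", "");
        }
        return result;
    }

    public static void main(String[] args) {
        // "Caf\u00e9" is "Cafe" with an acute accent on the final letter.
        System.out.println(normalize("Caf\u00e9", true, true));   // prints cafe
        System.out.println(normalize("Caf\u00e9", false, true));  // prints Cafe
    }
}
```
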
Method Details
- getProvenance
  
  public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
  
  Specified by:
  
  getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
- reset
  
  public void reset(CharSequence cs)
  
  Description copied from interface: Tokenizer
  
  Resets the tokenizer so that it operates on a new sequence of characters.
  
  Specified by:
  
  reset in interface Tokenizer
  
  Parameters:
  
  cs - a character sequence to tokenize
- advance
  
  public boolean advance()
  
  Description copied from interface: Tokenizer
  
  Advances the tokenizer to the next token.
  
  Specified by:
  
  advance in interface Tokenizer
  
  Returns:
  
  true if there is such a token, false otherwise.
- getToken
  
  public Token getToken()
  
  Description copied from interface: Tokenizer
  
  Generates a Token object from the current state of the tokenizer.
  
  Specified by:
  
  getToken in interface Tokenizer
  
  Returns:
  
  The token object from the current state.
- getText
  
  public String getText()
  
  Description copied from interface: Tokenizer
  
  Gets the text of the current token, as a string.
  
  Specified by:
  
  getText in interface Tokenizer
  
  Returns:
  
  the text of the current token
- getStart
  
  public int getStart()
  
  Description copied from interface: Tokenizer
  
  Gets the starting character offset of the current token in the character sequence.
  
  Specified by:
  
  getStart in interface Tokenizer
  
  Returns:
  
  the starting character offset of the token
- getEnd
  
  public int getEnd()
  
  Description copied from interface: Tokenizer
  
  Gets the ending offset (exclusive) of the current token in the character sequence.
  
  Specified by:
  
  getEnd in interface Tokenizer
  
  Returns:
  
  the exclusive ending character offset for the current token.
- getType
  
  public Token.TokenType getType()
  
  Description copied from interface: Tokenizer
  
  Gets the type of the current token.
  
  Specified by:
  
  getType in interface Tokenizer
  
  Returns:
  
  the type of the current token.
- clone
  
  public WordpieceTokenizer clone()
  
  Description copied from interface: Tokenizer
  
  Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
  
  Specified by:
  
  clone in interface Tokenizer
  
  Overrides:
  
  clone in class Object
  
  Returns:
  
  A tokenizer with the same configuration, but independent state.
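
The methods above follow a cursor-style protocol: call reset with the text, then call advance until it returns false, reading the current token's text and offsets after each successful advance. The sketch below shows that loop against a stand-in. The CursorTokenizer interface and WhitespaceCursor class are hypothetical and only mirror the method names documented here; they are not Tribuo types, and a real WordpieceTokenizer would emit wordpieces rather than whitespace-delimited words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; drives the reset/advance/get* loop.
final class CursorDemo {

    // Collects "text@start-end" for every token; end offsets are exclusive.
    static List<String> collect(CursorTokenizer tokenizer, CharSequence text) {
        List<String> out = new ArrayList<>();
        tokenizer.reset(text);
        while (tokenizer.advance()) {
            out.add(tokenizer.getText() + "@"
                + tokenizer.getStart() + "-" + tokenizer.getEnd());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hello@0-5, world@7-12]
        System.out.println(collect(new WhitespaceCursor(), "hello  world"));
    }
}

// Stand-in interface mirroring a subset of the methods documented above.
interface CursorTokenizer {
    void reset(CharSequence cs);
    boolean advance();
    String getText();
    int getStart();
    int getEnd();
}

// Toy implementation that treats each run of non-whitespace as one token.
final class WhitespaceCursor implements CursorTokenizer {
    private CharSequence cs = "";
    private int start = 0;
    private int end = 0;

    @Override
    public void reset(CharSequence cs) {
        this.cs = cs;
        this.start = 0;
        this.end = 0;
    }

    @Override
    public boolean advance() {
        int i = end;
        while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) {
            i++; // skip whitespace before the next token
        }
        if (i >= cs.length()) {
            return false; // no further token
        }
        start = i;
        while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) {
            i++; // consume the token
        }
        end = i;
        return true;
    }

    @Override
    public String getText() {
        return cs.subSequence(start, end).toString();
    }

    @Override
    public int getStart() {
        return start;
    }

    @Override
    public int getEnd() {
        return end;
    }
}
```

Because the end offset is exclusive, the pair (getStart(), getEnd()) can be passed directly to subSequence or substring on the original text.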
