Class WordpieceBasicTokenizer

java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class WordpieceBasicTokenizer extends SplitFunctionTokenizer
A tokenizer used "upstream" of WordpieceTokenizer that implements much of the functionality of the 'BasicTokenizer' implementation in Hugging Face. One minor difference is that this implementation has no set of "never_split" tokens; those are handled by WordpieceTokenizer.
  • Constructor Details

    • WordpieceBasicTokenizer

      public WordpieceBasicTokenizer()
      Constructs a default tokenizer which tokenizes Chinese characters.
    • WordpieceBasicTokenizer

      public WordpieceBasicTokenizer(boolean tokenizeChineseChars)
      Constructs a tokenizer, optionally splitting Chinese characters into individual tokens.
      Parameters:
      tokenizeChineseChars - Whether Chinese characters should be split into individual tokens.
  • Method Details

    • createSplitFunction

      public static SplitFunctionTokenizer.SplitFunction createSplitFunction(boolean tokenizeChineseChars)
      Creates a SplitFunctionTokenizer.SplitFunction that is used by the super class SplitFunctionTokenizer to determine how and where the tokenizer splits the input.
      Parameters:
      tokenizeChineseChars - Whether to split Chinese characters into separate tokens.
      Returns:
      The splitting function.
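The following is a minimal stand-alone sketch of the kind of per-codepoint splitting decision described above. It does not use the Tribuo API: the class `BasicSplitSketch` and its methods are hypothetical, the punctuation test is limited to ASCII symbols, and the Chinese test is approximated with `Character.UnicodeScript.HAN`. The actual split function may classify characters differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Tribuo. */
class BasicSplitSketch {

    /** Crude stand-in for a punctuation test: printable ASCII symbols only. */
    static boolean isAsciiSymbol(int cp) {
        return cp > 32 && cp < 127 && !Character.isLetterOrDigit(cp);
    }

    /** Approximate stand-in for a Chinese character test, using the Han script. */
    static boolean isHan(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
    }

    /** Splits on whitespace and emits punctuation (and optionally Han characters) as single tokens. */
    static List<String> split(String text, boolean tokenizeChineseChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean standalone = isAsciiSymbol(cp) || (tokenizeChineseChars && isHan(cp));
            if (Character.isWhitespace(cp) || standalone) {
                // Close the token that was being accumulated, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                // Whitespace is dropped; standalone codepoints become their own token.
                if (standalone) {
                    tokens.add(new String(Character.toChars(cp)));
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello, world!", true));
        // "\u6211\u7231" is two Han characters followed by the Latin word "Java".
        System.out.println(split("\u6211\u7231Java", true));
        System.out.println(split("\u6211\u7231Java", false));
    }
}
```

With `tokenizeChineseChars` set to false, the Han characters in this sketch stay attached to neighbouring non-whitespace text, which is the behaviour the flag is meant to control.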
    • isPunctuation

      public static boolean isPunctuation(int codepoint)
      Determines if the input codepoint should be treated as punctuation. This returns true for all ASCII characters that are not letters or digits, and for any character whose Character type is defined as punctuation. See Character.getType(int).
      Parameters:
      codepoint - The codepoint to check.
      Returns:
      True if the codepoint is punctuation, false otherwise.
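The rule described above can be sketched with `Character.getType(int)` from the standard library. This is a hypothetical stand-alone class, not the Tribuo source. It assumes that "ASCII characters that are not letters or digits" means the printable symbol ranges used by the Hugging Face reference (so space and control characters are excluded), and that "punctuation" covers the seven Unicode P* general categories.

```java
/** Hypothetical illustration only; not part of Tribuo. */
class PunctuationSketch {

    static boolean isPunctuation(int codepoint) {
        // Printable ASCII symbols: ! to /, : to @, [ to `, { to ~ (assumed ranges).
        if ((codepoint >= 33 && codepoint <= 47)
                || (codepoint >= 58 && codepoint <= 64)
                || (codepoint >= 91 && codepoint <= 96)
                || (codepoint >= 123 && codepoint <= 126)) {
            return true;
        }
        // Otherwise defer to the Unicode general category.
        switch (Character.getType(codepoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation('$'));    // ASCII symbol, although Unicode classes it as a currency symbol
        System.out.println(isPunctuation(0x3002)); // ideographic full stop
        System.out.println(isPunctuation('a'));
    }
}
```

Under these assumptions, ASCII symbols such as `$` or `+` count as punctuation even though their Unicode category is a symbol category, while non-ASCII symbols such as the euro sign do not.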
    • isChinese

      public static boolean isChinese(int codepoint)
      Determines if the provided codepoint is a Chinese character or not.
      Parameters:
      codepoint - The codepoint to check.
      Returns:
      True if the codepoint is a Chinese character, false otherwise.
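The documentation does not say which codepoints count as Chinese. As a rough sketch, the check below uses the CJK ideograph ranges from the Hugging Face BERT reference tokenizer; treating those ranges as the definition is an assumption, and the class `ChineseCharSketch` is hypothetical rather than part of Tribuo.

```java
/** Hypothetical illustration only; not part of Tribuo. */
class ChineseCharSketch {

    static boolean isChinese(int codepoint) {
        return (codepoint >= 0x4E00 && codepoint <= 0x9FFF)      // CJK Unified Ideographs
            || (codepoint >= 0x3400 && codepoint <= 0x4DBF)      // Extension A
            || (codepoint >= 0x20000 && codepoint <= 0x2A6DF)    // Extension B
            || (codepoint >= 0x2A700 && codepoint <= 0x2B73F)    // Extension C
            || (codepoint >= 0x2B740 && codepoint <= 0x2B81F)    // Extension D
            || (codepoint >= 0x2B820 && codepoint <= 0x2CEAF)    // Extension E
            || (codepoint >= 0xF900 && codepoint <= 0xFAFF)      // CJK Compatibility Ideographs
            || (codepoint >= 0x2F800 && codepoint <= 0x2FA1F);   // Compatibility Ideographs Supplement
    }

    public static void main(String[] args) {
        System.out.println(isChinese(0x4E2D)); // a CJK unified ideograph
        System.out.println(isChinese(0x3042)); // hiragana letter A
        System.out.println(isChinese('A'));
    }
}
```

Under this definition, "Chinese" means CJK ideographs regardless of the language they appear in; Hiragana, Katakana, and Hangul fall outside these ranges.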
    • isControl

      public static boolean isControl(int codepoint)
      Determines if the provided codepoint is a control character or not.
      Parameters:
      codepoint - The codepoint to check.
      Returns:
      True if it's a control character, false otherwise.
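The documentation does not state which characters count as control characters. One plausible reading, sketched below, follows the Hugging Face convention: tab, newline, and carriage return are treated as whitespace rather than control characters, and any codepoint in a Unicode "C" general category is a control character. Both points are assumptions, and `ControlCharSketch` is a hypothetical class, not the Tribuo source.

```java
/** Hypothetical illustration only; not part of Tribuo. */
class ControlCharSketch {

    static boolean isControl(int codepoint) {
        // Assumed: these are handled as whitespace, not as control characters.
        if (codepoint == '\t' || codepoint == '\n' || codepoint == '\r') {
            return false;
        }
        // Unicode general categories Cc, Cf, Co, Cs and Cn.
        switch (Character.getType(codepoint)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.PRIVATE_USE:
            case Character.SURROGATE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isControl(0x00));   // NUL
        System.out.println(isControl('\t'));   // tab, treated as whitespace here
        System.out.println(isControl(0x200B)); // zero width space, a format character
    }
}
```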
    • postConfig

      public void postConfig()
      Used by the OLCUT configuration system, and should not be called by external code.
    • getProvenance

      public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
    • clone

      public WordpieceBasicTokenizer clone()
      Description copied from interface: Tokenizer
      Clones a tokenizer with its configuration. Cloned tokenizers do not process the same text as the original tokenizer and must be reset with a fresh CharSequence.
      Specified by:
      clone in interface Tokenizer
      Overrides:
      clone in class SplitFunctionTokenizer
      Returns:
      A tokenizer with the same configuration, but independent state.