org.tribuo.util.tokens.impl.SplitFunctionTokenizer

org.tribuo.util.tokens.impl.wordpiece.WordpieceBasicTokenizer

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class WordpieceBasicTokenizer extends SplitFunctionTokenizer

This tokenizer runs "upstream" of WordpieceTokenizer and implements much of the functionality of the 'BasicTokenizer' implementation in Hugging Face. One minor difference is that this implementation has no set of "never_split" tokens; those are handled by WordpieceTokenizer.

Nested Class Summary

Nested classes/interfaces inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
SplitFunctionTokenizer.SplitFunction, SplitFunctionTokenizer.SplitResult, SplitFunctionTokenizer.SplitType
Field Summary

Fields inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
splitFunction
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| WordpieceBasicTokenizer() | Constructs a default tokenizer which tokenizes Chinese characters. |
| WordpieceBasicTokenizer(boolean tokenizeChineseChars) | Constructs a tokenizer. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| WordpieceBasicTokenizer | clone() | Clones a tokenizer with its configuration. |
| static SplitFunctionTokenizer.SplitFunction | createSplitFunction(boolean tokenizeChineseChars) | Creates a SplitFunctionTokenizer.SplitFunction that is used by the super class SplitFunctionTokenizer to determine how and where the tokenizer splits the input. |
| com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() | |
| static boolean | isChinese(int codepoint) | Determines if the provided codepoint is a Chinese character or not. |
| static boolean | isControl(int codepoint) | Determines if the provided codepoint is a control character or not. |
| static boolean | isPunctuation(int codepoint) | Determines if the input code point should be considered a character that is punctuation. |
| void | postConfig() | Used by the OLCUT configuration system, and should not be called by external code. |

Methods inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
advance, getEnd, getStart, getText, getType, reset

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.tribuo.util.tokens.Tokenizer
getToken, split, tokenize

Constructor Details
- WordpieceBasicTokenizer
  
  public WordpieceBasicTokenizer()
  
  Constructs a default tokenizer which tokenizes Chinese characters.
- WordpieceBasicTokenizer
  
  public WordpieceBasicTokenizer(boolean tokenizeChineseChars)
  
  Constructs a tokenizer.
  
  Parameters:
  
  tokenizeChineseChars - Whether Chinese characters should be split into individual tokens.
Method Details
- createSplitFunction
  
  public static SplitFunctionTokenizer.SplitFunction createSplitFunction(boolean tokenizeChineseChars)
  
  Creates a SplitFunctionTokenizer.SplitFunction that is used by the super class SplitFunctionTokenizer to determine how and where the tokenizer splits the input.
  
  Parameters:
  
  tokenizeChineseChars - Whether to split Chinese characters into separate tokens.
  
  Returns:
  
  The splitting function.
- isPunctuation
  
  public static boolean isPunctuation(int codepoint)
  
  Determines if the input code point should be considered punctuation. Returns true for all ASCII characters that are not letters or digits, and for any character whose Character type is defined as punctuation. See Character.getType(int).
  
  Parameters:
  
  codepoint - The codepoint to check.
  
  Returns:
  
  True if the codepoint is punctuation, false otherwise.
- isChinese
  
  public static boolean isChinese(int codepoint)
  
  Determines if the provided codepoint is a Chinese character or not.
  
  Parameters:
  
  codepoint - The codepoint to check.
  
  Returns:
  
  True if the codepoint is a Chinese character, false otherwise.
- isControl
  
  public static boolean isControl(int codepoint)
  
  Determines if the provided codepoint is a control character or not.
  
  Parameters:
  
  codepoint - The codepoint to check.
  
  Returns:
  
  True if it's a control character, false otherwise.
- postConfig
  
  public void postConfig()
  
  Used by the OLCUT configuration system, and should not be called by external code.
- getProvenance
  
  public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- clone
  
  public WordpieceBasicTokenizer clone()
  
  Description copied from interface: Tokenizer
  
  Clones a tokenizer with its configuration. A cloned tokenizer does not process the same text as the original tokenizer and must be reset with a fresh CharSequence.
  
  Specified by:
  
  clone in interface Tokenizer
  
  Overrides:
  
  clone in class SplitFunctionTokenizer
  
  Returns:
  
  A tokenizer with the same configuration, but independent state.
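
The three static predicates documented above (isPunctuation, isChinese, isControl) classify a single codepoint. The following is a minimal, standard-library-only sketch of how such checks could be written; it is not the Tribuo source. The class name CodepointPredicatesSketch is hypothetical, and the exact ASCII ranges, CJK ranges, and the handling of tab, newline, and carriage return are assumptions modeled on the reference BERT BasicTokenizer, so the real methods may differ in detail.

```java
/**
 * Illustrative sketch only; NOT the Tribuo implementation.
 * The ASCII and CJK ranges are assumptions based on the reference BERT BasicTokenizer.
 */
final class CodepointPredicatesSketch {

    /** True for printable ASCII symbols and for any Unicode punctuation category. */
    static boolean isPunctuation(int codepoint) {
        // Assumption: only printable ASCII symbols count, so space and control codes are excluded.
        if ((codepoint >= 33 && codepoint <= 47) || (codepoint >= 58 && codepoint <= 64)
                || (codepoint >= 91 && codepoint <= 96) || (codepoint >= 123 && codepoint <= 126)) {
            return true;
        }
        switch (Character.getType(codepoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** True if the codepoint falls in one of the CJK ideograph ranges (assumed ranges). */
    static boolean isChinese(int codepoint) {
        return (codepoint >= 0x4E00 && codepoint <= 0x9FFF)
                || (codepoint >= 0x3400 && codepoint <= 0x4DBF)
                || (codepoint >= 0x20000 && codepoint <= 0x2A6DF)
                || (codepoint >= 0x2A700 && codepoint <= 0x2B73F)
                || (codepoint >= 0x2B740 && codepoint <= 0x2B81F)
                || (codepoint >= 0x2B820 && codepoint <= 0x2CEAF)
                || (codepoint >= 0xF900 && codepoint <= 0xFAFF)
                || (codepoint >= 0x2F800 && codepoint <= 0x2FA1F);
    }

    /** True for control and format characters, except common whitespace controls. */
    static boolean isControl(int codepoint) {
        // Assumption: tab, newline and carriage return are treated as whitespace, not control.
        if (codepoint == '\t' || codepoint == '\n' || codepoint == '\r') {
            return false;
        }
        int type = Character.getType(codepoint);
        return type == Character.CONTROL || type == Character.FORMAT;
    }

    public static void main(String[] args) {
        // Classify each codepoint of a small mixed-script sample.
        String text = "Hi, \u4E16\u754C!";
        text.codePoints().forEach(cp -> System.out.printf(
                "U+%04X punct=%b chinese=%b control=%b%n",
                cp, isPunctuation(cp), isChinese(cp), isControl(cp)));
    }
}
```

All three methods take an int rather than a char so that supplementary codepoints, such as the CJK extension ranges above U+FFFF, can be classified without surrogate handling.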
