Class SplitFunctionTokenizer

java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
Direct Known Subclasses:
SplitCharactersTokenizer, WhitespaceTokenizer, WordpieceBasicTokenizer

public abstract class SplitFunctionTokenizer extends Object implements Tokenizer
This class supports character-by-character (that is, codepoint-by-codepoint) iteration over input text to create tokens. Extensions of this class are initialized with a SplitFunctionTokenizer.SplitFunction, which is called for each character and returns a SplitFunctionTokenizer.SplitResult consisting of a SplitFunctionTokenizer.SplitType and a Token.TokenType. Tokenization is driven by the SplitFunctionTokenizer.SplitResult returned for each character. See the notes on each SplitFunctionTokenizer.SplitType and SplitFunctionTokenizer.SplitResult for details.
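The codepoint-by-codepoint iteration described above can be sketched without the library. The following is a simplified, hypothetical model, not the Tribuo implementation: the names CodepointSplitSketch, Span and tokenize are invented for illustration, and the per-character decision is reduced to a boolean "is this codepoint a separator?" rather than a full SplitResult with a split type and token type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Simplified, hypothetical model of split-function tokenization.
// This is NOT the Tribuo API: the split decision is a plain boolean.
final class CodepointSplitSketch {

    /** A token's text with its [start, end) offsets in chars. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Walks the input one codepoint at a time, cutting tokens at separators. */
    static List<Span> tokenize(CharSequence cs, IntPredicate isSeparator) {
        List<Span> tokens = new ArrayList<>();
        int tokenStart = -1; // -1 means no token is currently open
        int i = 0;
        while (i < cs.length()) {
            int cp = Character.codePointAt(cs, i);
            if (isSeparator.test(cp)) {
                if (tokenStart >= 0) {
                    // Close the open token just before the separator.
                    tokens.add(new Span(cs.subSequence(tokenStart, i).toString(), tokenStart, i));
                    tokenStart = -1;
                }
            } else if (tokenStart < 0) {
                tokenStart = i; // first codepoint of a new token
            }
            // Advance by 1 or 2 chars so surrogate pairs stay intact.
            i += Character.charCount(cp);
        }
        if (tokenStart >= 0) {
            // Flush the token that runs to the end of the input.
            tokens.add(new Span(cs.subSequence(tokenStart, cs.length()).toString(), tokenStart, cs.length()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Span s : tokenize("hello  big world", Character::isWhitespace)) {
            System.out.println(s);
        }
    }
}
```

Stepping by Character.charCount keeps a supplementary character (one codepoint, two chars) inside a single token, which is the practical difference between codepoint iteration and plain char iteration.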
  • Field Details

  • Constructor Details

    • SplitFunctionTokenizer

      protected SplitFunctionTokenizer()
      Constructs a tokenizer, used by OLCUT.
    • SplitFunctionTokenizer

      public SplitFunctionTokenizer(SplitFunctionTokenizer.SplitFunction splitFunction)
      Creates a new tokenizer using the supplied split function.
      Parameters:
      splitFunction - The split function.
  • Method Details

    • reset

      public void reset(CharSequence cs)
      Description copied from interface: Tokenizer
      Resets the tokenizer so that it operates on a new sequence of characters.
      Specified by:
      reset in interface Tokenizer
      Parameters:
      cs - a character sequence to tokenize
    • advance

      public boolean advance()
      Description copied from interface: Tokenizer
      Advances the tokenizer to the next token.
      Specified by:
      advance in interface Tokenizer
      Returns:
      true if there is such a token, false otherwise.
    • getText

      public String getText()
      Description copied from interface: Tokenizer
      Gets the text of the current token, as a string.
      Specified by:
      getText in interface Tokenizer
      Returns:
      the text of the current token
    • getStart

      public int getStart()
      Description copied from interface: Tokenizer
      Gets the starting character offset of the current token in the character sequence.
      Specified by:
      getStart in interface Tokenizer
      Returns:
      the starting character offset of the token
    • getEnd

      public int getEnd()
      Description copied from interface: Tokenizer
      Gets the ending offset (exclusive) of the current token in the character sequence.
      Specified by:
      getEnd in interface Tokenizer
      Returns:
      the exclusive ending character offset for the current token.
    • getType

      public Token.TokenType getType()
      Description copied from interface: Tokenizer
      Gets the type of the current token.
      Specified by:
      getType in interface Tokenizer
      Returns:
      the type of the current token.
    • clone

      public Tokenizer clone() throws CloneNotSupportedException
      Description copied from interface: Tokenizer
      Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
      Specified by:
      clone in interface Tokenizer
      Overrides:
      clone in class Object
      Returns:
      A tokenizer with the same configuration, but independent state.
      Throws:
      CloneNotSupportedException - if the tokenizer isn't cloneable.
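
The methods above imply a usage pattern: call reset with the text, call advance until it returns false, and read the current token through the getters after each successful advance. A clone must be reset before use. The sketch below illustrates that pattern with stand-in types so that it runs without the library. MiniTokenizer, MiniWhitespaceTokenizer and TokenizerLoopSketch are hypothetical names, not Tribuo classes, and the stand-in iterates over chars rather than codepoints for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Drives the reset/advance/getter protocol against a stand-in tokenizer.
final class TokenizerLoopSketch {

    /** Collects "text[start,end)" for every token found in the input. */
    static List<String> describe(MiniTokenizer t, CharSequence text) {
        List<String> out = new ArrayList<>();
        t.reset(text);          // point the tokenizer at new input
        while (t.advance()) {   // false once the input is exhausted
            out.add(t.getText() + "[" + t.getStart() + "," + t.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) throws CloneNotSupportedException {
        MiniTokenizer original = new MiniWhitespaceTokenizer();
        System.out.println(describe(original, "split on  spaces"));
        // The clone shares configuration only; describe() resets it first.
        MiniTokenizer copy = original.clone();
        System.out.println(describe(copy, "fresh text"));
    }
}

// Stand-in mirroring part of the documented method set; NOT org.tribuo's Tokenizer.
interface MiniTokenizer extends Cloneable {
    void reset(CharSequence cs);
    boolean advance();
    String getText();
    int getStart();
    int getEnd();
    MiniTokenizer clone() throws CloneNotSupportedException;
}

// Hypothetical whitespace splitter used only to exercise the loop.
final class MiniWhitespaceTokenizer implements MiniTokenizer {
    private CharSequence cs = "";
    private int pos;
    private int start;
    private int end;

    @Override
    public void reset(CharSequence cs) {
        this.cs = cs;
        this.pos = 0;
        this.start = 0;
        this.end = 0;
    }

    @Override
    public boolean advance() {
        // Skip any run of separators before the next token.
        while (pos < cs.length() && Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        if (pos >= cs.length()) {
            return false;
        }
        start = pos;
        while (pos < cs.length() && !Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        end = pos; // exclusive end offset
        return true;
    }

    @Override public String getText() { return cs.subSequence(start, end).toString(); }
    @Override public int getStart() { return start; }
    @Override public int getEnd() { return end; }

    @Override
    public MiniTokenizer clone() throws CloneNotSupportedException {
        MiniWhitespaceTokenizer copy = (MiniWhitespaceTokenizer) super.clone();
        copy.reset(""); // independent state: the copy holds no text until reset
        return copy;
    }
}
```

Because each tokenizer carries its own position and current-token state, giving every thread its own clone is one way to avoid sharing that state.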