Class SplitCharactersTokenizer

java.lang.Object
org.tribuo.util.tokens.impl.SplitCharactersTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class SplitCharactersTokenizer extends Object implements Tokenizer
This implementation of Tokenizer is instantiated with an array of split characters, which define where the input text is split. It is a simple tokenizer that handles one special case: split characters that appear between digits (e.g., 3/5 and 3.1415). It is not a general-purpose tokenizer, but may suffice for some use cases.

In addition to the specified split characters, it also splits on any character that Character.isWhitespace(char) considers whitespace.
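
Which characters count as whitespace is decided by the JDK rather than by the tokenizer. The short illustration below uses only `Character.isWhitespace(char)` from the standard library; the class name `WhitespaceCheck` and the helper `isWs` are made up for this example. One point worth noticing is that no-break spaces such as U+00A0 are not whitespace under this definition, so they would presumably need to be listed as split characters if splitting on them is wanted.

```java
class WhitespaceCheck {
    /** Returns true if the JDK treats c as whitespace. */
    static boolean isWs(char c) {
        return Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        System.out.println(isWs(' '));      // ordinary space
        System.out.println(isWs('\t'));     // horizontal tab
        System.out.println(isWs('\u3000')); // ideographic space, U+3000
        System.out.println(isWs('\u00A0')); // no-break space, U+00A0
    }
}
```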

Author:
Philip Ogren
  • Field Details

  • Constructor Details

    • SplitCharactersTokenizer

    • SplitCharactersTokenizer

      public SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters)
      Parameters:
      splitCharacters - characters to be replaced with a space in the input text (e.g., "abc|def" becomes "abc def")
      splitXDigitsCharacters - characters to be replaced with a space in the input text, except when the characters immediately adjacent on the left and right are both digits (e.g., "abc.def" becomes "abc def" but "3.1415" remains "3.1415").
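
The two parameters can be read as two splitting rules. The following is a minimal, self-contained sketch of those rules as described above; it is not the Tribuo source. The class `SplitSketch` and its `split` helper are hypothetical names, and using `Character.isDigit` for the digit test is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    private static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    /**
     * Splits text on whitespace, on every character in splitChars, and on
     * characters in splitXDigitsChars unless both neighbours are digits.
     */
    static List<String> split(String text, char[] splitChars, char[] splitXDigitsChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean boundary = Character.isWhitespace(c) || contains(splitChars, c);
            if (!boundary && contains(splitXDigitsChars, c)) {
                // The digit exception: keep the character when it sits between two digits.
                boolean digitLeft = i > 0 && Character.isDigit(text.charAt(i - 1));
                boolean digitRight = i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
                boundary = !(digitLeft && digitRight);
            }
            if (boundary) {
                // Close the token in progress, if any; boundary characters are dropped.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] splitChars = {'|'};
        char[] splitXDigitsChars = {'.', '/'};
        System.out.println(split("abc|def", splitChars, splitXDigitsChars));
        System.out.println(split("abc.def", splitChars, splitXDigitsChars));
        System.out.println(split("pi is 3.1415", splitChars, splitXDigitsChars));
    }
}
```

In this sketch a character such as '.' at the very start or end of the text is always treated as a boundary, because it cannot have a digit on both sides.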
  • Method Details

    • createWhitespaceTokenizer

      Creates a tokenizer that splits on whitespace.
      Returns:
      A whitespace tokenizer.
    • getProvenance

      public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
    • reset

      public void reset(CharSequence cs)
      Description copied from interface: Tokenizer
      Resets the tokenizer so that it operates on a new sequence of characters.
      Specified by:
      reset in interface Tokenizer
      Parameters:
      cs - a character sequence to tokenize
    • advance

      public boolean advance()
      Description copied from interface: Tokenizer
      Advances the tokenizer to the next token.
      Specified by:
      advance in interface Tokenizer
      Returns:
      true if there is such a token, false otherwise.
    • getText

      public String getText()
      Description copied from interface: Tokenizer
      Gets the text of the current token as a string.
      Specified by:
      getText in interface Tokenizer
      Returns:
      the text of the current token
    • getStart

      public int getStart()
      Description copied from interface: Tokenizer
      Gets the starting character offset of the current token in the character sequence.
      Specified by:
      getStart in interface Tokenizer
      Returns:
      the starting character offset of the token
    • getEnd

      public int getEnd()
      Description copied from interface: Tokenizer
      Gets the ending offset (exclusive) of the current token in the character sequence.
      Specified by:
      getEnd in interface Tokenizer
      Returns:
      the exclusive ending character offset for the current token.
    • getType

      Description copied from interface: Tokenizer
      Gets the type of the current token.
      Specified by:
      getType in interface Tokenizer
      Returns:
      the type of the current token.
    • clone

      Description copied from interface: Tokenizer
      Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
      Specified by:
      clone in interface Tokenizer
      Overrides:
      clone in class Object
      Returns:
      A tokenizer with the same configuration, but independent state.
    • isSplitCharacter

      public boolean isSplitCharacter(char c)
      Checks whether this character is a split character for this tokenizer instance.
      Parameters:
      c - The character to check.
      Returns:
      True if it's a split character.
    • isSplitXDigitCharacter

      public boolean isSplitXDigitCharacter(char c)
      Checks whether this character is a split character, except when it appears between digits, for this tokenizer instance.
      Parameters:
      c - The character to check.
      Returns:
      True if it's a split character.
    • getSplitCharacters

      public char[] getSplitCharacters()
      Returns a copy of the split characters.
      Returns:
      A copy of the split characters.
    • getSplitXDigitsCharacters

      public char[] getSplitXDigitsCharacters()
      Returns a copy of the characters that split the text except when they appear between digits.
      Returns:
      A copy of the split characters.
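
The reset, advance, getText, getStart and getEnd methods listed above are meant to be driven in a loop: reset once per input, then call advance until it returns false. The sketch below shows that calling pattern against a hypothetical stand-in, `MiniTokenizer`, which splits on whitespace only and merely mirrors the method names documented here; it is not the Tribuo class, and `TokenLoopSketch` and `describe` are likewise invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TokenLoopSketch {
    /** Hypothetical stand-in that splits on whitespace only; not the Tribuo class. */
    static final class MiniTokenizer {
        private CharSequence cs = "";
        private int start = 0;
        private int end = 0;

        void reset(CharSequence newCs) {
            this.cs = newCs;
            this.start = 0;
            this.end = 0;
        }

        boolean advance() {
            int i = end;
            // Skip boundary characters before the next token.
            while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) {
                i++;
            }
            if (i >= cs.length()) {
                return false;
            }
            start = i;
            while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) {
                i++;
            }
            end = i; // exclusive end offset
            return true;
        }

        String getText() { return cs.subSequence(start, end).toString(); }
        int getStart() { return start; }
        int getEnd() { return end; }
    }

    /** Collects "text[start,end)" for every token in the input. */
    static List<String> describe(CharSequence input) {
        MiniTokenizer tokenizer = new MiniTokenizer();
        tokenizer.reset(input);
        List<String> out = new ArrayList<>();
        while (tokenizer.advance()) {
            out.add(tokenizer.getText() + "[" + tokenizer.getStart() + "," + tokenizer.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("hello  big world"));
    }
}
```

Because the end offset is exclusive, `input.subSequence(getStart(), getEnd())` recovers the token text directly, which is how the stand-in implements `getText`.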