Class SplitCharactersTokenizer

java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
org.tribuo.util.tokens.impl.SplitCharactersTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class SplitCharactersTokenizer extends SplitFunctionTokenizer
This implementation of Tokenizer is instantiated with an array of split characters, which define where the input text is split. It is a simple tokenizer that handles one exceptional case: split characters that appear between digits (e.g., 3/5 and 3.1415). It is not general purpose, but may suffice for some use cases.

In addition to the specified split characters, it also splits on any character that Character.isWhitespace(char) considers whitespace.
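
The splitting rule described above can be sketched in plain Java without any Tribuo dependency. This is a minimal illustration under stated assumptions, not the library's implementation: the class name SplitRuleSketch, its helper methods, and the example character sets are hypothetical, and a split character at the very start or end of the text is assumed to split because it has no digit on both sides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of the documented rule; not Tribuo code.
final class SplitRuleSketch {

    // Returns true if c occurs in chars.
    public static boolean contains(char[] chars, char c) {
        for (char candidate : chars) {
            if (candidate == c) {
                return true;
            }
        }
        return false;
    }

    // Splits text on whitespace, on splitChars, and on splitExceptInDigits
    // unless the character sits directly between two digits.
    public static List<String> split(String text, char[] splitChars, char[] splitExceptInDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean split = Character.isWhitespace(c) || contains(splitChars, c);
            if (!split && contains(splitExceptInDigits, c)) {
                // Assumption: both neighbours must exist and be digits to suppress the split.
                boolean betweenDigits = i > 0 && i < text.length() - 1
                        && Character.isDigit(text.charAt(i - 1))
                        && Character.isDigit(text.charAt(i + 1));
                split = !betweenDigits;
            }
            if (split) {
                // Emit the token collected so far, skipping empty tokens.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] splitChars = {'|'};            // example values, not the library defaults
        char[] exceptInDigits = {'.', '/'};   // example values, not the library defaults
        // prints [abc, def, abc, def, 3.1415, 3/5]
        System.out.println(split("abc|def abc.def 3.1415 3/5", splitChars, exceptInDigits));
    }
}
```

Here '.' and '/' split "abc.def" but leave "3.1415" and "3/5" intact, because in the latter cases both neighbouring characters are digits.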

Author:
Philip Ogren
  • Field Details

    • DEFAULT_SPLIT_CHARACTERS

      public static final char[] DEFAULT_SPLIT_CHARACTERS
      The default split characters.
    • DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS

      public static final char[] DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
      The default characters that cause splits except when they appear between digits.
  • Constructor Details

    • SplitCharactersTokenizer

      public SplitCharactersTokenizer()
      Creates a default split characters tokenizer using DEFAULT_SPLIT_CHARACTERS and DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS.
    • SplitCharactersTokenizer

      public SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters)
      Parameters:
      splitCharacters - characters to be replaced with a space in the input text (e.g., "abc|def" becomes "abc def")
      splitXDigitsCharacters - characters to be replaced with a space in the input text, except when the characters immediately to the left and right are both digits (e.g., "abc.def" becomes "abc def" but "3.1415" remains "3.1415").
  • Method Details

    • postConfig

      public void postConfig()
    • createWhitespaceTokenizer

      public static SplitCharactersTokenizer createWhitespaceTokenizer()
      Creates a tokenizer that splits on whitespace.
      Returns:
      A whitespace tokenizer.
    • getProvenance

      public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
    • isSplitCharacter

      @Deprecated public boolean isSplitCharacter(char c)
      Deprecated.
      Is this character a split character for this tokenizer instance?
      Parameters:
      c - The character to check.
      Returns:
      True if it's a split character.
    • isSplitXDigitCharacter

      @Deprecated public boolean isSplitXDigitCharacter(char c)
      Deprecated.
      Is this character a split character, except between digits, for this tokenizer instance?
      Parameters:
      c - The character to check.
      Returns:
      True if it's a split character except inside digits.
    • getSplitCharacters

      @Deprecated public char[] getSplitCharacters()
      Deprecated.
      Returns a copy of the split characters.
      Returns:
      A copy of the split characters.
    • getSplitXDigitsCharacters

      @Deprecated public char[] getSplitXDigitsCharacters()
      Deprecated.
      Returns a copy of the split characters except inside digits.
      Returns:
      A copy of the split characters except inside digits.
    • clone

      public SplitCharactersTokenizer clone()
      Description copied from interface: Tokenizer
      Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
      Specified by:
      clone in interface Tokenizer
      Overrides:
      clone in class SplitFunctionTokenizer
      Returns:
      A tokenizer with the same configuration, but independent state.
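
The clone contract above (same configuration, independent state) can be illustrated with a small stand-alone class. This is a hedged sketch only: CloneSketch, reset, hasText, and getSplitCharacters are hypothetical names in a toy class, not the Tribuo types, and building the clone through the constructor is one possible design rather than the library's approach.

```java
import java.util.Arrays;

// Hypothetical toy class showing "same configuration, independent state"; not Tribuo code.
final class CloneSketch implements Cloneable {
    private final char[] splitCharacters; // configuration, copied into every clone
    private CharSequence text;            // per-instance state, never shared with clones

    public CloneSketch(char[] splitCharacters) {
        // Copy the array so later changes by the caller cannot alter the configuration.
        this.splitCharacters = Arrays.copyOf(splitCharacters, splitCharacters.length);
    }

    // Points this instance at a fresh piece of text.
    public void reset(CharSequence cs) {
        this.text = cs;
    }

    // True once reset has been called with some text.
    public boolean hasText() {
        return text != null;
    }

    // Returns a copy, so callers cannot mutate the internal array.
    public char[] getSplitCharacters() {
        return Arrays.copyOf(splitCharacters, splitCharacters.length);
    }

    @Override
    public CloneSketch clone() {
        // Assumption: the clone carries the configuration but starts with no text.
        return new CloneSketch(splitCharacters);
    }

    public static void main(String[] args) {
        CloneSketch original = new CloneSketch(new char[]{'|'});
        original.reset("abc|def");
        CloneSketch copy = original.clone();
        System.out.println(copy.hasText());                 // prints false
        copy.reset("ghi|jkl");
        System.out.println(Arrays.equals(original.getSplitCharacters(),
                copy.getSplitCharacters()));                // prints true
    }
}
```

In this sketch the clone must be reset with its own text before use, and resetting it leaves the original untouched.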