Class ShapeTokenizer

java.lang.Object
org.tribuo.util.tokens.impl.ShapeTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class ShapeTokenizer extends Object implements Tokenizer
This tokenizer is loosely based on the notion of word shape, a common feature in NLP. Contiguous runs of characters in the same character class are grouped together, and whitespace characters act as delimiters. The character classes are uppercase letters, lowercase letters, and digits; every other character forms its own character class. For example, "1234abcd" is split into "1234" and "abcd", and "!@#$" yields four tokens. See the unit tests for further examples.

Strings are split on whitespace and at the boundaries between contiguous runs of characters in the same character class, with one exception: uppercase letters immediately followed by lowercase letters are kept together. This recognizes camel case, so "CamelCase" is split into "Camel" and "Case". Likewise, "ABCdef AAbb" is split into "ABCdef" and "AAbb".
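
The splitting rule described above can be sketched in plain Java. This is an illustrative re-implementation written from the description only, not the Tribuo source: the class name ShapeSplitSketch and its helpers are hypothetical, it works on char values rather than code points, and it assumes that every character outside the three named classes becomes a single-character token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the word-shape rule; not the Tribuo implementation.
final class ShapeSplitSketch {
    private enum CharClass { UPPER, LOWER, DIGIT, OTHER }

    private static CharClass classify(char c) {
        if (Character.isUpperCase(c)) return CharClass.UPPER;
        if (Character.isLowerCase(c)) return CharClass.LOWER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    static List<String> split(CharSequence cs) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        CharClass prev = null;
        for (int i = 0; i < cs.length(); i++) {
            char c = cs.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace is a delimiter and never part of a token.
                flush(current, tokens);
                prev = null;
                continue;
            }
            CharClass cls = classify(c);
            // Assumption: OTHER characters never extend a run.
            boolean sameRun = cls == prev && cls != CharClass.OTHER;
            // The one exception: uppercase followed by lowercase stays together.
            boolean upperThenLower = prev == CharClass.UPPER && cls == CharClass.LOWER;
            if (!sameRun && !upperThenLower) {
                flush(current, tokens);
            }
            current.append(c);
            prev = cls;
        }
        flush(current, tokens);
        return tokens;
    }

    // Emits the pending token, if any, and clears the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("1234abcd"));    // [1234, abcd]
        System.out.println(split("!@#$"));        // [!, @, #, $]
        System.out.println(split("CamelCase"));   // [Camel, Case]
        System.out.println(split("ABCdef AAbb")); // [ABCdef, AAbb]
    }
}
```

The four calls in main reproduce the examples given in the class description. Inputs not covered by those examples, such as repeated punctuation, depend on the assumption stated above and may differ from the behavior of the real class.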

  • Constructor Summary

    Constructors
    Constructor - Description
    ShapeTokenizer() - Constructs a ShapeTokenizer.
  • Method Summary

    Modifier and Type, Method - Description
    boolean advance() - Advances the tokenizer to the next token.
    ShapeTokenizer clone() - Clones a tokenizer with its configuration.
    int getEnd() - Gets the ending offset (exclusive) of the current token in the character sequence.
    com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
    int getStart() - Gets the starting character offset of the current token in the character sequence.
    String getText() - Gets the text of the current token, as a string.
    Token.TokenType getType() - Gets the type of the current token.
    void reset(CharSequence cs) - Resets the tokenizer so that it operates on a new sequence of characters.

    Methods inherited from class java.lang.Object

    equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable

    postConfig

    Methods inherited from interface org.tribuo.util.tokens.Tokenizer

    getToken, split, tokenize
  • Constructor Details

    • ShapeTokenizer

      public ShapeTokenizer()
      Constructs a ShapeTokenizer.
  • Method Details

    • getProvenance

      public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
    • reset

      public void reset(CharSequence cs)
      Description copied from interface: Tokenizer
      Resets the tokenizer so that it operates on a new sequence of characters.
      Specified by:
      reset in interface Tokenizer
      Parameters:
      cs - a character sequence to tokenize
    • advance

      public boolean advance()
      Description copied from interface: Tokenizer
      Advances the tokenizer to the next token.
      Specified by:
      advance in interface Tokenizer
      Returns:
      true if there is such a token, false otherwise.
    • getText

      public String getText()
      Description copied from interface: Tokenizer
      Gets the text of the current token, as a string.
      Specified by:
      getText in interface Tokenizer
      Returns:
      the text of the current token
    • getStart

      public int getStart()
      Description copied from interface: Tokenizer
      Gets the starting character offset of the current token in the character sequence.
      Specified by:
      getStart in interface Tokenizer
      Returns:
      the starting character offset of the token
    • getEnd

      public int getEnd()
      Description copied from interface: Tokenizer
      Gets the ending offset (exclusive) of the current token in the character sequence.
      Specified by:
      getEnd in interface Tokenizer
      Returns:
      the exclusive ending character offset for the current token.
    • getType

      public Token.TokenType getType()
      Description copied from interface: Tokenizer
      Gets the type of the current token.
      Specified by:
      getType in interface Tokenizer
      Returns:
      the type of the current token.
    • clone

      public ShapeTokenizer clone()
      Description copied from interface: Tokenizer
      Clones a tokenizer with its configuration. A cloned tokenizer does not process the same text as the original tokenizer and must be reset with a fresh CharSequence.
      Specified by:
      clone in interface Tokenizer
      Overrides:
      clone in class Object
      Returns:
      A tokenizer with the same configuration, but independent state.
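
The methods above form a cursor-style protocol: reset supplies the text, each successful advance moves to the next token, and getText, getStart and getEnd describe the current token. The sketch below shows how these pieces might fit together, including the exclusive end offset. TokenCursorSketch is a hypothetical, self-contained stand-in that only mirrors the method names documented here; it is not the Tribuo class, and its behavior before reset is called is a choice made for this sketch.

```java
// Hypothetical stand-in mirroring reset/advance/getText/getStart/getEnd.
final class TokenCursorSketch {
    private static final int UPPER = 0, LOWER = 1, DIGIT = 2, OTHER = 3;

    private CharSequence cs;
    private int pos;   // next character to examine
    private int start; // inclusive start of the current token
    private int end;   // exclusive end of the current token

    void reset(CharSequence cs) {
        this.cs = cs;
        pos = 0;
        start = 0;
        end = 0;
    }

    boolean advance() {
        if (cs == null) {
            // Sketch-only choice: require reset() before the first advance().
            throw new IllegalStateException("reset() must be called first");
        }
        while (pos < cs.length() && Character.isWhitespace(cs.charAt(pos))) {
            pos++; // skip delimiters
        }
        if (pos >= cs.length()) {
            return false; // no further token
        }
        start = pos;
        int prev = shape(cs.charAt(pos++));
        while (pos < cs.length() && !Character.isWhitespace(cs.charAt(pos))) {
            int cls = shape(cs.charAt(pos));
            boolean sameRun = cls == prev && cls != OTHER;
            boolean upperThenLower = prev == UPPER && cls == LOWER;
            if (!sameRun && !upperThenLower) {
                break; // shape boundary ends the token
            }
            prev = cls;
            pos++;
        }
        end = pos;
        return true;
    }

    String getText() { return cs.subSequence(start, end).toString(); }
    int getStart() { return start; }
    int getEnd() { return end; }

    private static int shape(char c) {
        if (Character.isUpperCase(c)) return UPPER;
        if (Character.isLowerCase(c)) return LOWER;
        if (Character.isDigit(c)) return DIGIT;
        return OTHER;
    }

    public static void main(String[] args) {
        TokenCursorSketch t = new TokenCursorSketch();
        t.reset("CamelCase 1234");
        while (t.advance()) {
            // Prints: Camel [0,5)  then  Case [5,9)  then  1234 [10,14)
            System.out.println(t.getText() + " [" + t.getStart() + "," + t.getEnd() + ")");
        }
    }
}
```

Because the end offset is exclusive, the current token text equals the subsequence of the input from getStart() to getEnd(). The same reset-then-loop shape should apply to a ShapeTokenizer or to a clone of one, with the clone reset on its own CharSequence before use.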