Interface Tokenizer

All Superinterfaces:
Cloneable, com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
All Known Implementing Classes:
BreakIteratorTokenizer, NonTokenizer, ShapeTokenizer, SplitCharactersTokenizer, SplitFunctionTokenizer, SplitPatternTokenizer, UniversalTokenizer, WhitespaceTokenizer, WordpieceBasicTokenizer, WordpieceTokenizer

public interface Tokenizer extends com.oracle.labs.mlrg.olcut.config.Configurable, Cloneable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
An interface for things that tokenize text: breaking it into words according to some set of rules.

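Note that tokenizers are not guaranteed to be thread safe! A tokenizer holds mutable state for the text it is currently processing, so using the same instance from multiple threads may produce incorrect tokens or offsets. Give each thread its own instance, for example via clone() or createThreadLocal(Tokenizer).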

A tokenizer that is not ready, typically because reset(CharSequence) has not yet been called, throws IllegalStateException when advance() or any get method is called.

Most Tokenizers can be cloned via clone(); one that cannot throws CloneNotSupportedException.
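The reset/advance/get cycle described above can be sketched as follows. This is a minimal illustration only: SimpleWhitespaceTokenizer and describe are hypothetical names written for this example (the class is not one of the implementing classes listed above and does not implement the real interface), and the sketch assumes that "ready" means reset has been called.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a stand-in mimicking the cursor-style methods documented on this page. */
final class TokenizerLoopSketch {

    /** Hypothetical whitespace tokenizer; NOT one of the library's implementations. */
    static final class SimpleWhitespaceTokenizer {
        private CharSequence cs;
        private int start;
        private int end;
        private boolean ready; // assumption: becomes true once reset() has been called

        void reset(CharSequence cs) {
            this.cs = cs;
            this.start = 0;
            this.end = 0;
            this.ready = true;
        }

        boolean advance() {
            checkReady();
            int i = end;
            // Skip whitespace that precedes the next token.
            while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) {
                i++;
            }
            if (i == cs.length()) {
                start = i;
                end = i;
                return false; // no more tokens
            }
            start = i;
            // Consume characters up to the next whitespace or the end of the input.
            while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) {
                i++;
            }
            end = i; // exclusive end offset
            return true;
        }

        String getText() { checkReady(); return cs.subSequence(start, end).toString(); }
        int getStart() { checkReady(); return start; }
        int getEnd() { checkReady(); return end; }

        private void checkReady() {
            if (!ready) {
                throw new IllegalStateException("Tokenizer is not ready; call reset first.");
            }
        }
    }

    /** Walks the tokens of the given text, rendering each one as text[start,end). */
    static List<String> describe(CharSequence text) {
        SimpleWhitespaceTokenizer tokenizer = new SimpleWhitespaceTokenizer();
        tokenizer.reset(text);
        List<String> out = new ArrayList<>();
        while (tokenizer.advance()) {
            out.add(tokenizer.getText() + "[" + tokenizer.getStart() + "," + tokenizer.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [hello[0,5), big[6,9), world[10,15)]
        System.out.println(describe("hello big world"));
    }
}
```

Calling advance() on a fresh SimpleWhitespaceTokenizer before reset throws IllegalStateException, mirroring the behavior described for tokenizers that are not ready.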

  • Method Summary

    Modifier and Type
    Method
    Description
    boolean
    advance()
    Advances the tokenizer to the next token.
    Tokenizer
    clone()
    Clones a tokenizer with its configuration.
    static Supplier<Tokenizer>
    createSupplier(Tokenizer tokenizer)
    Creates a supplier from the specified tokenizer by cloning it.
    static ThreadLocal<Tokenizer>
    createThreadLocal(Tokenizer tokenizer)
    Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).
    int
    getEnd()
    Gets the ending offset (exclusive) of the current token in the character sequence.
    int
    getStart()
    Gets the starting character offset of the current token in the character sequence.
    String
    getText()
    Gets the text of the current token, as a string.
    default Token
    getToken()
    Generates a Token object from the current state of the tokenizer.
    Token.TokenType
    getType()
    Gets the type of the current token.
    void
    reset(CharSequence cs)
    Resets the tokenizer so that it operates on a new sequence of characters.
    default List<String>
    split(CharSequence cs)
    Uses this tokenizer to split a string into its component substrings.
    default List<Token>
    tokenize(CharSequence cs)
    Uses this tokenizer to tokenize a string and return the list of tokens that were generated.

    Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable

    postConfig

    Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable

    getProvenance
  • Method Details

    • createSupplier

      static Supplier<Tokenizer> createSupplier(Tokenizer tokenizer)
      Creates a supplier from the specified tokenizer by cloning it.
      Parameters:
      tokenizer - The tokenizer to copy.
      Returns:
      A supplier of tokenizers.
    • createThreadLocal

      static ThreadLocal<Tokenizer> createThreadLocal(Tokenizer tokenizer)
      Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).
      Parameters:
      tokenizer - The tokenizer to copy.
      Returns:
      A thread local for tokenizers.
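One plausible way to build such a supplier and thread local is to clone a prototype on each request and pass the supplier to ThreadLocal.withInitial. The sketch below is an assumption about the approach, not the library's source: StubTokenizer is a hypothetical stand-in for a cloneable tokenizer, and wrapping CloneNotSupportedException in an IllegalStateException is a choice made for this example.

```java
import java.util.function.Supplier;

/** Illustrative only: cloning a prototype to hand each caller or thread its own instance. */
final class TokenizerSupplierSketch {

    /** Hypothetical cloneable tokenizer holding mutable per-instance state. */
    static class StubTokenizer implements Cloneable {
        private CharSequence current; // the text this instance is working on

        void reset(CharSequence cs) { this.current = cs; }

        CharSequence current() { return current; }

        @Override
        public StubTokenizer clone() throws CloneNotSupportedException {
            StubTokenizer copy = (StubTokenizer) super.clone();
            copy.current = null; // the copy does not share text and must be reset before use
            return copy;
        }
    }

    /** Each call to get() on the returned supplier yields a fresh clone of the prototype. */
    static Supplier<StubTokenizer> createSupplier(StubTokenizer tokenizer) {
        return () -> {
            try {
                return tokenizer.clone();
            } catch (CloneNotSupportedException e) {
                // Assumption for this sketch: surface the failure as an unchecked exception.
                throw new IllegalStateException("Tokenizer is not cloneable", e);
            }
        };
    }

    /** Each thread lazily receives its own clone the first time it calls get(). */
    static ThreadLocal<StubTokenizer> createThreadLocal(StubTokenizer tokenizer) {
        return ThreadLocal.withInitial(createSupplier(tokenizer));
    }

    /** Returns true if two threads observe distinct instances, neither being the prototype. */
    static boolean threadsGetDistinctCopies() {
        StubTokenizer prototype = new StubTokenizer();
        ThreadLocal<StubTokenizer> local = createThreadLocal(prototype);
        StubTokenizer[] seen = new StubTokenizer[2];
        Thread a = new Thread(() -> seen[0] = local.get());
        Thread b = new Thread(() -> seen[1] = local.get());
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for worker threads", e);
        }
        return seen[0] != seen[1] && seen[0] != prototype && seen[1] != prototype;
    }
}
```

Because every thread works on its own clone, no synchronization is needed around reset and advance calls made through the thread local.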
    • reset

      void reset(CharSequence cs)
      Resets the tokenizer so that it operates on a new sequence of characters.
      Parameters:
      cs - a character sequence to tokenize
    • advance

      boolean advance()
      Advances the tokenizer to the next token.
      Returns:
      true if there is such a token, false otherwise.
    • getText

      String getText()
      Gets the text of the current token, as a string.
      Returns:
      the text of the current token
    • getStart

      int getStart()
      Gets the starting character offset of the current token in the character sequence.
      Returns:
      the starting character offset of the token
    • getEnd

      int getEnd()
      Gets the ending offset (exclusive) of the current token in the character sequence.
      Returns:
      the exclusive ending character offset for the current token.
    • getType

      Token.TokenType getType()
      Gets the type of the current token.
      Returns:
      the type of the current token.
    • clone

      Tokenizer clone() throws CloneNotSupportedException
      Clones a tokenizer with its configuration. A cloned tokenizer does not share the text being processed by the original and must be reset with a fresh CharSequence before use.
      Returns:
      A tokenizer with the same configuration, but independent state.
      Throws:
      CloneNotSupportedException - if the tokenizer isn't cloneable.
    • getToken

      default Token getToken()
      Generates a Token object from the current state of the tokenizer.
      Returns:
      The token object from the current state.
    • tokenize

      default List<Token> tokenize(CharSequence cs)
      Uses this tokenizer to tokenize a string and return the list of tokens that were generated. This is a convenience for the many applications that simply want to turn a character sequence into a list of tokens.

      Here is the contract of the tokenize function:

      • all returned tokens correspond to substrings of the input text
      • the tokens do not overlap
      • the tokens are returned in the order that they appear in the text
      • the value of Token.text should be the same as calling cs.subSequence(token.start, token.end).toString()
      Parameters:
      cs - a sequence of characters to tokenize
      Returns:
      the tokens discovered in the character sequence, in the order they appear.
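The contract listed above can be checked mechanically. The following is a minimal sketch under stated assumptions: Tok is a hypothetical stand-in carrying only the text, start, and end values that the contract mentions (it is not the library's Token class), and satisfiesContract is a helper name invented for this example.

```java
import java.util.List;

/** Illustrative only: verifies the four tokenize contract points on a list of tokens. */
final class TokenizeContractSketch {

    /** Hypothetical token holder with the three values the contract refers to. */
    static final class Tok {
        final String text;
        final int start; // inclusive
        final int end;   // exclusive

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    static boolean satisfiesContract(CharSequence cs, List<Tok> tokens) {
        int previousEnd = 0;
        for (Tok t : tokens) {
            // In order and non-overlapping: each token starts at or after the previous end.
            if (t.start < previousEnd) {
                return false;
            }
            // A substring of the input: offsets must describe a valid range.
            if (t.start > t.end || t.end > cs.length()) {
                return false;
            }
            // The token text must match the characters at its offsets.
            if (!cs.subSequence(t.start, t.end).toString().equals(t.text)) {
                return false;
            }
            previousEnd = t.end;
        }
        return true;
    }
}
```

A check like this is a natural fit for a unit test of a new tokenizer implementation: tokenize a few sample inputs and assert that the result satisfies the contract.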
    • split

      default List<String> split(CharSequence cs)
      Uses this tokenizer to split a string into its component substrings. This is a convenience for applications that only need the token strings, without offsets or types.
      Parameters:
      cs - the character sequence to tokenize
      Returns:
      a list of strings making up the character sequence.
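A default method like split can plausibly be layered on the cursor-style methods: reset on the input, then collect getText() for every successful advance(). The sketch below assumes that layering and is not the library's source; CursorTokenizer and NonWhitespaceRuns are hypothetical stand-ins, with the latter delegating to java.util.regex.Matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a split-style default method built on reset/advance/getText. */
final class SplitDefaultSketch {

    /** Hypothetical stand-in exposing just the methods that split needs. */
    interface CursorTokenizer {
        void reset(CharSequence cs);
        boolean advance();
        String getText();

        /** Assumed layering: drive the cursor over the input and collect each token's text. */
        default List<String> split(CharSequence cs) {
            reset(cs);
            List<String> parts = new ArrayList<>();
            while (advance()) {
                parts.add(getText());
            }
            return parts;
        }
    }

    /** Hypothetical implementation: each token is a maximal run of non-whitespace characters. */
    static final class NonWhitespaceRuns implements CursorTokenizer {
        private static final Pattern RUN = Pattern.compile("\\S+");
        private Matcher matcher; // null until reset() is called

        @Override
        public void reset(CharSequence cs) {
            matcher = RUN.matcher(cs);
        }

        @Override
        public boolean advance() {
            requireReady();
            return matcher.find();
        }

        @Override
        public String getText() {
            requireReady();
            return matcher.group();
        }

        private void requireReady() {
            if (matcher == null) {
                throw new IllegalStateException("Tokenizer is not ready; call reset first.");
            }
        }
    }

    public static void main(String[] args) {
        // Prints [to, be, or, not]
        System.out.println(new NonWhitespaceRuns().split("to be  or not"));
    }
}
```

In this sketch the whitespace between tokens is discarded, so concatenating the returned strings does not necessarily reproduce the input.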