Class BreakIteratorTokenizer

java.lang.Object
org.tribuo.util.tokens.impl.BreakIteratorTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class BreakIteratorTokenizer extends Object implements Tokenizer
A tokenizer wrapping a BreakIterator instance.
  • Constructor Details

    • BreakIteratorTokenizer

      public BreakIteratorTokenizer(Locale locale)
      Constructs a BreakIteratorTokenizer using the specified locale.
      Parameters:
      locale - The locale to use.
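
      The constructor takes a java.util.Locale. As a minimal sketch using only the JDK, the example below builds a Locale from a BCP 47 language tag and converts it back. It assumes that the string reported by getLanguageTag corresponds to Locale.toLanguageTag(), which this page does not confirm. LanguageTagSketch and roundTrip are hypothetical names, not part of Tribuo.

      ```java
      import java.util.Locale;

      // Hypothetical helper for illustration only; not part of Tribuo.
      final class LanguageTagSketch {

          // Parses a BCP 47 language tag into a Locale and formats it back as a tag.
          static String roundTrip(String tag) {
              Locale locale = Locale.forLanguageTag(tag);
              return locale.toLanguageTag();
          }

          public static void main(String[] args) {
              // A Locale built this way could be passed to the constructor documented above.
              System.out.println(roundTrip("en-US"));
              // Predefined constants such as Locale.US carry the same tag.
              System.out.println(Locale.US.toLanguageTag());
          }
      }
      ```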
  • Method Details

    • postConfig

      public void postConfig()
      Used by the OLCUT configuration system, and should not be called by external code.
      Specified by:
      postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
    • getLanguageTag

      public String getLanguageTag()
      Returns the language tag of the locale this tokenizer uses.
      Returns:
      The language tag of the locale, as a string.
    • getProvenance

      public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
    • reset

      public void reset(CharSequence cs)
      Description copied from interface: Tokenizer
      Resets the tokenizer so that it operates on a new sequence of characters.
      Specified by:
      reset in interface Tokenizer
      Parameters:
      cs - a character sequence to tokenize
    • advance

      public boolean advance()
      Description copied from interface: Tokenizer
      Advances the tokenizer to the next token.
      Specified by:
      advance in interface Tokenizer
      Returns:
      true if there is such a token, false otherwise.
    • getText

      public String getText()
      Description copied from interface: Tokenizer
      Gets the text of the current token, as a string.
      Specified by:
      getText in interface Tokenizer
      Returns:
      the text of the current token
    • getStart

      public int getStart()
      Description copied from interface: Tokenizer
      Gets the starting character offset of the current token in the character sequence.
      Specified by:
      getStart in interface Tokenizer
      Returns:
      the starting character offset of the token
    • getEnd

      public int getEnd()
      Description copied from interface: Tokenizer
      Gets the ending offset (exclusive) of the current token in the character sequence.
      Specified by:
      getEnd in interface Tokenizer
      Returns:
      the exclusive ending character offset for the current token.
    • getType

      public Token.TokenType getType()
      Description copied from interface: Tokenizer
      Gets the type of the current token.
      Specified by:
      getType in interface Tokenizer
      Returns:
      the type of the current token.
    • clone

      public BreakIteratorTokenizer clone()
      Description copied from interface: Tokenizer
      Clones a tokenizer with its configuration. A cloned tokenizer does not share the text being processed by the original and must be reset with a fresh CharSequence before use.
      Specified by:
      clone in interface Tokenizer
      Overrides:
      clone in class Object
      Returns:
      A tokenizer with the same configuration, but independent state.
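
The reset, advance, getText, getStart and getEnd methods above follow a cursor pattern: reset supplies the text, and each successful advance exposes one token. The sketch below imitates that pattern with java.text.BreakIterator from the JDK so that it runs without Tribuo on the classpath. It is not the Tribuo implementation. BreakIteratorSketch is a hypothetical stand-in, and both the choice of a word-level BreakIterator and the skipping of whitespace-only segments are assumptions.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical stand-in that mirrors the reset/advance cursor pattern; not the Tribuo class.
final class BreakIteratorSketch {
    private final BreakIterator breakIterator;
    private String text = "";
    private int start = BreakIterator.DONE;
    private int end = BreakIterator.DONE;

    BreakIteratorSketch(Locale locale) {
        // Assumption: word-level boundaries for the given locale.
        this.breakIterator = BreakIterator.getWordInstance(locale);
    }

    // Points the cursor at a new character sequence, before the first token.
    void reset(CharSequence cs) {
        text = cs.toString();
        breakIterator.setText(text);
        start = BreakIterator.DONE;
        end = breakIterator.first();
    }

    // Moves to the next non-whitespace segment; returns false when the text is exhausted.
    boolean advance() {
        if (end == BreakIterator.DONE) {
            return false;
        }
        int candidateStart = end;
        int candidateEnd = breakIterator.next();
        while (candidateEnd != BreakIterator.DONE) {
            // Assumption: segments consisting only of whitespace are not tokens.
            if (!text.substring(candidateStart, candidateEnd).trim().isEmpty()) {
                start = candidateStart;
                end = candidateEnd;
                return true;
            }
            candidateStart = candidateEnd;
            candidateEnd = breakIterator.next();
        }
        end = BreakIterator.DONE;
        return false;
    }

    // The accessors below are only meaningful after advance() has returned true.
    String getText() {
        return text.substring(start, end);
    }

    int getStart() {
        return start;
    }

    // Exclusive end offset, matching the contract described for getEnd above.
    int getEnd() {
        return end;
    }

    public static void main(String[] args) {
        BreakIteratorSketch tokenizer = new BreakIteratorSketch(Locale.US);
        tokenizer.reset("Hello world");
        while (tokenizer.advance()) {
            System.out.println(tokenizer.getText()
                    + " [" + tokenizer.getStart() + ", " + tokenizer.getEnd() + ")");
        }
    }
}
```

With the real class the calling loop would presumably have the same shape: construct the tokenizer with a Locale, call reset with the text, then read the accessors after each advance that returns true. Because a tokenizer holds state for the text it is processing, the clone method above is the documented way to obtain an independent instance with the same configuration.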