All Superinterfaces: Cloneable, com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>

All Known Implementing Classes: BreakIteratorTokenizer, NonTokenizer, ShapeTokenizer, SplitCharactersTokenizer, SplitFunctionTokenizer, SplitPatternTokenizer, UniversalTokenizer, WhitespaceTokenizer, WordpieceBasicTokenizer, WordpieceTokenizer

public interface Tokenizer extends com.oracle.labs.mlrg.olcut.config.Configurable, Cloneable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>

An interface for things that tokenize text: breaking it into words according to some set of rules.

Note that tokenizers are not guaranteed to be thread safe. Using the same tokenizer from multiple threads may produce incorrect results; give each thread its own instance, for example via createThreadLocal(Tokenizer).

A tokenizer that is not ready (for example, one that has not yet been reset with a character sequence) throws IllegalStateException when advance() or any get method is called.

Most tokenizers support cloning; clone() throws CloneNotSupportedException for those that do not.

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| boolean | advance() | Advances the tokenizer to the next token. |
| Tokenizer | clone() | Clones a tokenizer with its configuration. |
| static Supplier<Tokenizer> | createSupplier(Tokenizer tokenizer) | Creates a supplier from the specified tokenizer by cloning it. |
| static ThreadLocal<Tokenizer> | createThreadLocal(Tokenizer tokenizer) | Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer). |
| int | getEnd() | Gets the ending offset (exclusive) of the current token in the character sequence. |
| int | getStart() | Gets the starting character offset of the current token in the character sequence. |
| String | getText() | Gets the text of the current token, as a string. |
| default Token | getToken() | Generates a Token object from the current state of the tokenizer. |
| Token.TokenType | getType() | Gets the type of the current token. |
| void | reset(CharSequence cs) | Resets the tokenizer so that it operates on a new sequence of characters. |
| default List<String> | split(CharSequence cs) | Uses this tokenizer to split a string into its component substrings. |
| default List<Token> | tokenize(CharSequence cs) | Uses this tokenizer to tokenize a string and return the list of tokens that were generated. |

Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig

Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance

Method Details
- createSupplier
  
  static Supplier<Tokenizer> createSupplier(Tokenizer tokenizer)
  
  Creates a supplier from the specified tokenizer by cloning it.
  
  Parameters:
  
  tokenizer - The tokenizer to copy.
  
  Returns:
  
  A supplier of tokenizers.
- createThreadLocal
  
  static ThreadLocal<Tokenizer> createThreadLocal(Tokenizer tokenizer)
  
  Creates a thread local source of tokenizers by making a Tokenizer supplier using createSupplier(Tokenizer).
  
  Parameters:
  
  tokenizer - The tokenizer to copy.
  
  Returns:
  
  A thread local for tokenizers.
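
The relationship between createSupplier and createThreadLocal can be illustrated with plain JDK types. The following is a minimal sketch, not the library's implementation: StubTokenizer, supplierOf, threadLocalOf and instanceSeenByOtherThread are hypothetical names, and the stub only models the documented behaviour that a clone starts without text and must be reset.

```java
import java.util.function.Supplier;

final class ThreadLocalSketch {
    /** Hypothetical stand-in for a cloneable, stateful tokenizer. */
    static final class StubTokenizer implements Cloneable {
        CharSequence text; // per-instance state, which is why sharing across threads is unsafe

        void reset(CharSequence cs) {
            text = cs;
        }

        @Override
        public StubTokenizer clone() {
            try {
                StubTokenizer copy = (StubTokenizer) super.clone();
                copy.text = null; // a clone starts unattached and must be reset before use
                return copy;
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: the class implements Cloneable
            }
        }
    }

    /** Each call to get() on the returned supplier clones the prototype. */
    static Supplier<StubTokenizer> supplierOf(StubTokenizer prototype) {
        return prototype::clone;
    }

    /** Each thread lazily receives its own clone on its first call to get(). */
    static ThreadLocal<StubTokenizer> threadLocalOf(StubTokenizer prototype) {
        return ThreadLocal.withInitial(supplierOf(prototype));
    }

    /** Reads the thread local from a second thread and returns what that thread saw. */
    static StubTokenizer instanceSeenByOtherThread(ThreadLocal<StubTokenizer> local) {
        StubTokenizer[] seen = new StubTokenizer[1];
        Thread worker = new Thread(() -> seen[0] = local.get());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        ThreadLocal<StubTokenizer> local = threadLocalOf(new StubTokenizer());
        StubTokenizer mine = local.get();
        StubTokenizer theirs = instanceSeenByOtherThread(local);
        System.out.println("same thread reuses its instance: " + (mine == local.get()));
        System.out.println("other thread got its own instance: " + (mine != theirs));
    }
}
```

With this pattern a single configured prototype can be shared safely, because no thread ever touches the prototype's mutable state directly.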
- reset
  
  void reset(CharSequence cs)
  
  Resets the tokenizer so that it operates on a new sequence of characters.
  
  Parameters:
  
  cs - a character sequence to tokenize
- advance
  
  boolean advance()
  
  Advances the tokenizer to the next token.
  
  Returns:
  
  true if there is such a token, false otherwise.
- getText
  
  String getText()
  
  Gets the text of the current token, as a string.
  
  Returns:
  
  the text of the current token
- getStart
  
  int getStart()
  
  Gets the starting character offset of the current token in the character sequence.
  
  Returns:
  
  the starting character offset of the token
- getEnd
  
  int getEnd()
  
  Gets the ending offset (exclusive) of the current token in the character sequence.
  
  Returns:
  
  the exclusive ending character offset for the current token.
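
The reset, advance and get methods form a streaming cycle: reset attaches the text, each successful advance moves to the next token, and the getters describe that token. The sketch below walks through the cycle with a hypothetical whitespace-splitting StubTokenizer written for this example; it is not one of the implementing classes listed above, and the exact conditions under which a real tokenizer throws IllegalStateException may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class StreamingLoopSketch {
    /** Hypothetical stand-in that mimics the documented reset/advance/get cycle. */
    static final class StubTokenizer {
        private CharSequence cs; // null until reset(...) is called
        private int start = -1;  // -1 while there is no current token
        private int end = 0;     // exclusive end of the current token

        void reset(CharSequence cs) {
            this.cs = cs;
            this.start = -1;
            this.end = 0;
        }

        boolean advance() {
            if (cs == null) {
                throw new IllegalStateException("call reset(...) first");
            }
            int i = end;
            // Skip the whitespace between tokens.
            while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) i++;
            if (i == cs.length()) {
                start = -1;
                return false; // input exhausted
            }
            start = i;
            // Consume the token itself.
            while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) i++;
            end = i;
            return true;
        }

        int getStart() { requireToken(); return start; }

        int getEnd() { requireToken(); return end; }

        String getText() { requireToken(); return cs.subSequence(start, end).toString(); }

        private void requireToken() {
            if (cs == null || start < 0) {
                throw new IllegalStateException("no current token");
            }
        }
    }

    /** Renders every token as text[start,end) using the streaming cycle. */
    static List<String> describe(CharSequence input) {
        StubTokenizer tokenizer = new StubTokenizer();
        tokenizer.reset(input);
        List<String> out = new ArrayList<>();
        while (tokenizer.advance()) {
            out.add(tokenizer.getText() + "[" + tokenizer.getStart() + "," + tokenizer.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("to be  or not"));
    }
}
```

Because the end offset is exclusive, the length of a token is always getEnd() minus getStart(), and adjacent tokens can share a boundary without overlapping.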
- getType
  
  Token.TokenType getType()
  
  Gets the type of the current token.
  
  Returns:
  
  the type of the current token.
- clone
  
  Tokenizer clone() throws CloneNotSupportedException
  
  Clones a tokenizer with its configuration. A cloned tokenizer does not share the text being processed by the original and must be reset with a fresh CharSequence before use.
  
  Returns:
  
  A tokenizer with the same configuration, but independent state.
  
  Throws:
  
  CloneNotSupportedException - if the tokenizer isn't cloneable.
- getToken
  
  default Token getToken()
  
  Generates a Token object from the current state of the tokenizer.
  
  Returns:
  
  The token object from the current state.
- tokenize
  
  default List<Token> tokenize(CharSequence cs)
  
  Uses this tokenizer to tokenize a string and return the list of tokens that were generated. Many applications simply want to pass in a character sequence and get back a list of tokens; this method does that.
  
  The contract of the tokenize function is:
  
  - all returned tokens correspond to substrings of the input text
  - the tokens do not overlap
  - the tokens are returned in the order that they appear in the text
  - the value of Token.text should be the same as calling text.substring(token.start, token.end), where text is the input
  
  Parameters:
  
  cs - a sequence of characters to tokenize
  
  Returns:
  
  the tokens discovered in the character sequence, in the order they appear.
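
The four points of the tokenize contract can be expressed as a small checker, which may be useful when testing a custom tokenizer. This is a sketch under assumptions: Span is a hypothetical stand-in holding a token's text and its start and end offsets, and satisfiesContract is an illustrative helper rather than part of this interface.

```java
import java.util.List;

final class TokenizeContractSketch {
    /** Hypothetical stand-in for a token: its text plus [start, end) offsets into the input. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns true if the spans satisfy the four contract points for the given input. */
    static boolean satisfiesContract(String input, List<Span> spans) {
        int previousEnd = 0;
        for (Span s : spans) {
            // The offsets must select a valid substring of the input.
            if (s.start < 0 || s.end > input.length() || s.start > s.end) {
                return false;
            }
            // Text order and no overlap: each token starts at or after the previous end.
            if (s.start < previousEnd) {
                return false;
            }
            // The token text must equal the substring selected by its offsets.
            if (!input.substring(s.start, s.end).equals(s.text)) {
                return false;
            }
            previousEnd = s.end;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(new Span("a", 0, 1), new Span("bc", 2, 4));
        System.out.println(satisfiesContract("a bc", spans));
    }
}
```

Checking that each start is at or after the previous end covers both the ordering and the non-overlap requirements in a single pass.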
- split
  
  default List<String> split(CharSequence cs)
  
  Uses this tokenizer to split a string into its component substrings. Many applications will simply want the component strings making up a larger character sequence.
  
  Parameters:
  
  cs - the character sequence to tokenize
  
  Returns:
  
  a list of strings making up the character sequence.
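
Since split is a default method, it can plausibly be layered on the streaming methods. The sketch below shows one way such a default could be written; MiniTokenizer is a hypothetical cut-down interface and CharTokenizer a hypothetical implementation that treats every non-space character as a token, so neither should be read as the actual source of this interface.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** Hypothetical cut-down interface: only the methods that split needs. */
    interface MiniTokenizer {
        void reset(CharSequence cs);

        boolean advance();

        String getText();

        /** Convenience method built entirely on reset, advance and getText. */
        default List<String> split(CharSequence cs) {
            reset(cs);
            List<String> parts = new ArrayList<>();
            while (advance()) {
                parts.add(getText());
            }
            return parts;
        }
    }

    /** Hypothetical implementation: every non-space character is its own token. */
    static final class CharTokenizer implements MiniTokenizer {
        private CharSequence cs = "";
        private int pos = -1;

        @Override
        public void reset(CharSequence cs) {
            this.cs = cs;
            this.pos = -1;
        }

        @Override
        public boolean advance() {
            // Step forward at least once, then skip over any spaces.
            do {
                pos++;
            } while (pos < cs.length() && cs.charAt(pos) == ' ');
            return pos < cs.length();
        }

        @Override
        public String getText() {
            return String.valueOf(cs.charAt(pos));
        }
    }

    public static void main(String[] args) {
        System.out.println(new CharTokenizer().split("ab c"));
    }
}
```

In this sketch split begins by calling reset, so invoking it discards any iteration already in progress on the same instance.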
