java.lang.Object

org.tribuo.util.tokens.impl.SplitFunctionTokenizer

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

Direct Known Subclasses: SplitCharactersTokenizer, WhitespaceTokenizer, WordpieceBasicTokenizer

public abstract class SplitFunctionTokenizer extends Object implements Tokenizer

This class iterates over the input text one codepoint at a time to create tokens. Subclasses are initialized with a SplitFunctionTokenizer.SplitFunction, which is called for each codepoint and returns a SplitFunctionTokenizer.SplitResult consisting of a SplitFunctionTokenizer.SplitType and a Token.TokenType. Token boundaries and token types are determined by the SplitFunctionTokenizer.SplitResult returned for each codepoint. See the documentation of SplitFunctionTokenizer.SplitType and SplitFunctionTokenizer.SplitResult for the meaning of each value.
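The per-codepoint split idea described above can be sketched in plain Java. This is a minimal, self-contained illustration and not Tribuo's implementation: the `Split` enum, the `SplitFn` interface, and the `tokenize` helper are hypothetical stand-ins invented for this example, and the token type carried by the real SplitResult is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /** Hypothetical stand-in for a split decision; not Tribuo's enum. */
    enum Split { NO_SPLIT, SPLIT_AT, SPLIT_BEFORE_AND_AFTER }

    /** Hypothetical stand-in for the per-codepoint callback. */
    interface SplitFn {
        Split apply(int codepoint);
    }

    /** Walks the text codepoint by codepoint and cuts tokens according to fn. */
    static List<String> tokenize(CharSequence cs, SplitFn fn) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < cs.length()) {
            int cp = Character.codePointAt(cs, i);
            switch (fn.apply(cp)) {
                case NO_SPLIT:
                    // The codepoint belongs to the token being built.
                    current.appendCodePoint(cp);
                    break;
                case SPLIT_AT:
                    // The codepoint is a separator: end the token, drop the codepoint.
                    flush(current, tokens);
                    break;
                case SPLIT_BEFORE_AND_AFTER:
                    // The codepoint becomes a token of its own.
                    flush(current, tokens);
                    tokens.add(new String(Character.toChars(cp)));
                    break;
            }
            // Step over both halves of a surrogate pair when needed.
            i += Character.charCount(cp);
        }
        flush(current, tokens);
        return tokens;
    }

    /** Emits the pending token, if any, and clears the buffer. */
    private static void flush(StringBuilder sb, List<String> out) {
        if (sb.length() > 0) {
            out.add(sb.toString());
            sb.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Whitespace separates tokens; ',' and '!' stand alone as tokens.
        SplitFn fn = cp -> Character.isWhitespace(cp) ? Split.SPLIT_AT
                : (cp == ',' || cp == '!') ? Split.SPLIT_BEFORE_AND_AFTER
                : Split.NO_SPLIT;
        System.out.println(tokenize("Hello, world!", fn));
    }
}
```

In this sketch the split function sees only the codepoint. A richer callback could also receive the index and the full character sequence so that it can look at neighbouring characters.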

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static interface | SplitFunctionTokenizer.SplitFunction | An interface for checking if the text should be split at the supplied codepoint. |
| static enum | SplitFunctionTokenizer.SplitResult | A combination of a SplitFunctionTokenizer.SplitType and a Token.TokenType. |
| static enum | SplitFunctionTokenizer.SplitType | Defines different ways that a tokenizer can split the input text at a given character. |

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected SplitFunctionTokenizer.SplitFunction | splitFunction | |

Constructor Summary

Constructors

| Modifier | Constructor | Description |
| --- | --- | --- |
| protected | SplitFunctionTokenizer() | Constructs a tokenizer, used by OLCUT. |
| public | SplitFunctionTokenizer(SplitFunctionTokenizer.SplitFunction splitFunction) | Creates a new tokenizer using the supplied split function. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| boolean | advance() | Advances the tokenizer to the next token. |
| Tokenizer | clone() | Clones a tokenizer with its configuration. |
| int | getEnd() | Gets the ending offset (exclusive) of the current token in the character sequence. |
| int | getStart() | Gets the starting character offset of the current token in the character sequence. |
| String | getText() | Gets the text of the current token, as a string. |
| Token.TokenType | getType() | Gets the type of the current token. |
| void | reset(CharSequence cs) | Resets the tokenizer so that it operates on a new sequence of characters. |

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig

Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance

Methods inherited from interface org.tribuo.util.tokens.Tokenizer
getToken, split, tokenize

Field Details
- splitFunction
  
  protected SplitFunctionTokenizer.SplitFunction splitFunction
Constructor Details
- SplitFunctionTokenizer
  
  protected SplitFunctionTokenizer()
  
  Constructs a tokenizer, used by OLCUT.
- SplitFunctionTokenizer
  
  public SplitFunctionTokenizer(SplitFunctionTokenizer.SplitFunction splitFunction)
  
  Creates a new tokenizer using the supplied split function.
  
  Parameters:
  
  splitFunction - The split function.
Method Details
- reset
  
  public void reset(CharSequence cs)
  
  Description copied from interface: Tokenizer
  
  Resets the tokenizer so that it operates on a new sequence of characters.
  
  Specified by:
  
  reset in interface Tokenizer
  
  Parameters:
  
  cs - a character sequence to tokenize
- advance
  
  public boolean advance()
  
  Description copied from interface: Tokenizer
  
  Advances the tokenizer to the next token.
  
  Specified by:
  
  advance in interface Tokenizer
  
  Returns:
  
  true if there is such a token, false otherwise.
- getText
  
  public String getText()
  
  Description copied from interface: Tokenizer
  
  Gets the text of the current token, as a string.
  
  Specified by:
  
  getText in interface Tokenizer
  
  Returns:
  
  the text of the current token
- getStart
  
  public int getStart()
  
  Description copied from interface: Tokenizer
  
  Gets the starting character offset of the current token in the character sequence.
  
  Specified by:
  
  getStart in interface Tokenizer
  
  Returns:
  
  the starting character offset of the token
- getEnd
  
  public int getEnd()
  
  Description copied from interface: Tokenizer
  
  Gets the ending offset (exclusive) of the current token in the character sequence.
  
  Specified by:
  
  getEnd in interface Tokenizer
  
  Returns:
  
  the exclusive ending character offset for the current token.
- getType
  
  public Token.TokenType getType()
  
  Description copied from interface: Tokenizer
  
  Gets the type of the current token.
  
  Specified by:
  
  getType in interface Tokenizer
  
  Returns:
  
  the type of the current token.
- clone
  
  public Tokenizer clone() throws CloneNotSupportedException
  
  Description copied from interface: Tokenizer
  
  Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
  
  Specified by:
  
  clone in interface Tokenizer
  
  Overrides:
  
  clone in class Object
  
  Returns:
  
  A tokenizer with the same configuration, but independent state.
  
  Throws:
  
  CloneNotSupportedException - if the tokenizer isn't cloneable.
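
The methods above form a cursor-style protocol: reset the tokenizer on a text, call advance until it returns false, and read the current token through the getters after each successful advance. The following self-contained sketch illustrates that calling pattern and the "reset after clone" rule. It assumes a hypothetical `MiniTokenizer` written only for this example; it mirrors the documented method names but is not Tribuo code, it splits on whitespace only, and it omits getType.

```java
import java.util.ArrayList;
import java.util.List;

public class CursorSketch {
    /** Hypothetical whitespace tokenizer that mirrors the documented method names only. */
    static final class MiniTokenizer {
        private CharSequence cs = "";
        private int start = 0;
        private int end = 0;

        /** Points the tokenizer at a new text and rewinds it. */
        void reset(CharSequence newCs) {
            cs = newCs;
            start = 0;
            end = 0;
        }

        /** Moves to the next token; returns false when the text is exhausted. */
        boolean advance() {
            int i = end;
            // Skip separators (char-based for brevity; a real tokenizer works on codepoints).
            while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) {
                i++;
            }
            if (i >= cs.length()) {
                return false;
            }
            start = i;
            while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) {
                i++;
            }
            end = i;
            return true;
        }

        String getText() { return cs.subSequence(start, end).toString(); }

        int getStart() { return start; }

        /** Exclusive end offset of the current token. */
        int getEnd() { return end; }

        /** Same configuration (none in this sketch), independent state, no text attached. */
        @Override
        public MiniTokenizer clone() {
            return new MiniTokenizer();
        }
    }

    /** Runs the reset/advance loop and records each token with its offsets. */
    static List<String> describe(MiniTokenizer tok, CharSequence text) {
        List<String> out = new ArrayList<>();
        tok.reset(text);
        while (tok.advance()) {
            out.add(tok.getText() + "[" + tok.getStart() + "," + tok.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        MiniTokenizer tok = new MiniTokenizer();
        System.out.println(describe(tok, "split  these words"));

        // A clone has no text of its own until reset is called on it.
        MiniTokenizer copy = tok.clone();
        System.out.println(copy.advance());
        System.out.println(describe(copy, "new text"));
    }
}
```

Because each tokenizer holds the current position as mutable state, giving each thread its own clone and resetting it before use avoids sharing that state.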
