org.tribuo.util.tokens.impl.SplitCharactersTokenizer

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class SplitCharactersTokenizer extends Object implements Tokenizer

This implementation of Tokenizer is instantiated with an array of split characters, which define where the input text is split. It is a simple tokenizer that handles one special case: split characters that appear between digits (e.g., 3/5 and 3.1415) can be kept as part of the token. It is not general purpose, but may suffice for some use cases.

In addition to the specified split characters, it also splits on any character that Character.isWhitespace(char) considers whitespace.
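The splitting rule described above can be illustrated with a small standalone sketch. This is not the Tribuo source: the class `SplitRuleSketch`, its helper methods, and the two example character arrays are hypothetical names chosen for illustration, and the logic only follows the behaviour stated in this description (split on whitespace and split characters, and keep a split-except-in-digits character only when both neighbours are digits).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative re-implementation of the documented splitting rule; not the Tribuo source. */
final class SplitRuleSketch {

    /** Returns true if c occurs in chars. */
    static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the character at index i separates tokens. */
    static boolean isBoundary(CharSequence text, int i, char[] splitChars, char[] exceptInDigits) {
        char c = text.charAt(i);
        if (Character.isWhitespace(c) || contains(splitChars, c)) {
            return true;
        }
        if (contains(exceptInDigits, c)) {
            // Kept only when the characters on both sides are digits.
            boolean digitBefore = i > 0 && Character.isDigit(text.charAt(i - 1));
            boolean digitAfter = i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
            return !(digitBefore && digitAfter);
        }
        return false;
    }

    /** Splits text into tokens, dropping boundary characters and empty tokens. */
    static List<String> split(CharSequence text, char[] splitChars, char[] exceptInDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (isBoundary(text, i, splitChars, exceptInDigits)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(text.charAt(i));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] splitChars = {'|', ';'};      // example values, not the library defaults
        char[] exceptInDigits = {'.', '/'};  // example values, not the library defaults
        // prints [pi, is, 3.1415, say, 3/5, done]
        System.out.println(split("pi is 3.1415; say 3/5|done.", splitChars, exceptInDigits));
    }
}
```

Note that the trailing "." in the example is dropped because it has no digit on its right, while the "." in 3.1415 and the "/" in 3/5 survive.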

Author: Philip Ogren

Field Summary

Fields

Modifier and Type

Field

Description

static final char[]

DEFAULT_SPLIT_CHARACTERS

static final char[]

DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS

Constructor Summary

Constructors

Constructor

Description

SplitCharactersTokenizer()

SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters)

Method Summary

Modifier and Type

Method

Description

boolean

advance()

Advances the tokenizer to the next token.

SplitCharactersTokenizer

clone()

Clones a tokenizer with its configuration.

static SplitCharactersTokenizer

createWhitespaceTokenizer()

Creates a tokenizer that splits on whitespace.

int

getEnd()

Gets the ending offset (exclusive) of the current token in the character sequence.

com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance

getProvenance()

char[]

getSplitCharacters()

Returns a copy of the split characters.

char[]

getSplitXDigitsCharacters()

Returns a copy of the split characters except inside digits.

int

getStart()

Gets the starting character offset of the current token in the character sequence.

String

getText()

Gets the text of the current token, as a string.

Token.TokenType

getType()

Gets the type of the current token.

boolean

isSplitCharacter(char c)

Checks whether this character is a split character for this tokenizer instance.

boolean

isSplitXDigitCharacter(char c)

Checks whether this character is a split character, except when it appears between digits, for this tokenizer instance.

void

reset(CharSequence cs)

Resets the tokenizer so that it operates on a new sequence of characters.

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig

Methods inherited from interface org.tribuo.util.tokens.Tokenizer
getToken, split, tokenize

Field Details
- DEFAULT_SPLIT_CHARACTERS
  
  public static final char[] DEFAULT_SPLIT_CHARACTERS
- DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
  
  public static final char[] DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS

Constructor Details
- SplitCharactersTokenizer
  
  public SplitCharactersTokenizer()
- SplitCharactersTokenizer
  
  public SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters)
  
  Parameters:
  
  splitCharacters - characters to be replaced with a space in the input text (e.g., "abc|def" becomes "abc def")
  
  splitXDigitsCharacters - characters to be replaced with a space in the input text, except when the characters immediately to the left and right are both digits (e.g., "abc.def" becomes "abc def" but "3.1415" remains "3.1415").
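The "replaced with a space" wording in the two parameter descriptions can be made concrete with a short standalone sketch. It is an illustration under assumptions, not the library's implementation: the class `ReplaceWithSpaceSketch` and the method `replaceSplitCharacters` are hypothetical names, and the character arrays in `main` are example values rather than the defaults.

```java
/** Illustrative sketch of the "replace with a space" reading of the constructor parameters. */
final class ReplaceWithSpaceSketch {

    /** Returns true if c occurs in chars. */
    static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    /** Replaces split characters with spaces, sparing splitXDigits characters that sit between two digits. */
    static String replaceSplitCharacters(String text, char[] splitCharacters, char[] splitXDigitsCharacters) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean replace = contains(splitCharacters, c);
            if (!replace && contains(splitXDigitsCharacters, c)) {
                boolean betweenDigits = i > 0 && i + 1 < text.length()
                        && Character.isDigit(text.charAt(i - 1))
                        && Character.isDigit(text.charAt(i + 1));
                replace = !betweenDigits;
            }
            out.append(replace ? ' ' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] splitCharacters = {'|'};         // example value
        char[] splitXDigitsCharacters = {'.'};  // example value
        System.out.println(replaceSplitCharacters("abc|def", splitCharacters, splitXDigitsCharacters)); // abc def
        System.out.println(replaceSplitCharacters("abc.def", splitCharacters, splitXDigitsCharacters)); // abc def
        System.out.println(replaceSplitCharacters("3.1415", splitCharacters, splitXDigitsCharacters));  // 3.1415
    }
}
```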
Method Details
- createWhitespaceTokenizer
  
  public static SplitCharactersTokenizer createWhitespaceTokenizer()
  
  Creates a tokenizer that splits on whitespace.
  
  Returns:
  
  A whitespace tokenizer.
- getProvenance
  
  public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
  
  Specified by:
  
  getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
- reset
  
  public void reset(CharSequence cs)
  
  Description copied from interface: Tokenizer
  
  Resets the tokenizer so that it operates on a new sequence of characters.
  
  Specified by:
  
  reset in interface Tokenizer
  
  Parameters:
  
  cs - a character sequence to tokenize
- advance
  
  public boolean advance()
  
  Description copied from interface: Tokenizer
  
  Advances the tokenizer to the next token.
  
  Specified by:
  
  advance in interface Tokenizer
  
  Returns:
  
  true if there is such a token, false otherwise.
- getText
  
  public String getText()
  
  Description copied from interface: Tokenizer
  
  Gets the text of the current token, as a string.
  
  Specified by:
  
  getText in interface Tokenizer
  
  Returns:
  
  the text of the current token
- getStart
  
  public int getStart()
  
  Description copied from interface: Tokenizer
  
  Gets the starting character offset of the current token in the character sequence.
  
  Specified by:
  
  getStart in interface Tokenizer
  
  Returns:
  
  the starting character offset of the token
- getEnd
  
  public int getEnd()
  
  Description copied from interface: Tokenizer
  
  Gets the ending offset (exclusive) of the current token in the character sequence.
  
  Specified by:
  
  getEnd in interface Tokenizer
  
  Returns:
  
  the exclusive ending character offset for the current token.
- getType
  
  public Token.TokenType getType()
  
  Description copied from interface: Tokenizer
  
  Gets the type of the current token.
  
  Specified by:
  
  getType in interface Tokenizer
  
  Returns:
  
  the type of the current token.
- clone
  
  public SplitCharactersTokenizer clone()
  
  Description copied from interface: Tokenizer
  
  Clones a tokenizer with its configuration. A cloned tokenizer does not process the same text as the original tokenizer and must be reset with a fresh CharSequence.
  
  Specified by:
  
  clone in interface Tokenizer
  
  Overrides:
  
  clone in class Object
  
  Returns:
  
  A tokenizer with the same configuration, but independent state.
- isSplitCharacter
  
  public boolean isSplitCharacter(char c)
  
  Checks whether this character is a split character for this tokenizer instance.
  
  Parameters:
  
  c - The character to check.
  
  Returns:
  
  True if it's a split character.
- isSplitXDigitCharacter
  
  public boolean isSplitXDigitCharacter(char c)
  
  Checks whether this character is a split character, except when it appears between digits, for this tokenizer instance.
  
  Parameters:
  
  c - The character to check.
  
  Returns:
  
  True if it's a split character except inside digits.
- getSplitCharacters
  
  public char[] getSplitCharacters()
  
  Returns a copy of the split characters.
  
  Returns:
  
  A copy of the split characters.
- getSplitXDigitsCharacters
  
  public char[] getSplitXDigitsCharacters()
  
  Returns a copy of the split characters except inside digits.
  
  Returns:
  
  A copy of the split characters except inside digits.
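The cursor-style methods documented above (reset, advance, getText, getStart, getEnd) are meant to be driven in a loop: reset with the text, then call advance until it returns false, reading the current token after each successful call. The sketch below shows that pattern with a hypothetical stand-in class, `WhitespaceCursor`, which splits on whitespace only. It is not the Tribuo class and only mirrors the method names and the inclusive-start, exclusive-end offset convention described here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that mirrors the reset/advance/getText/getStart/getEnd protocol. */
final class WhitespaceCursor {
    private CharSequence cs = "";
    private int pos;    // next index to examine
    private int start;  // inclusive start of the current token
    private int end;    // exclusive end of the current token

    /** Points the cursor at a new character sequence and discards old state. */
    void reset(CharSequence newText) {
        cs = newText;
        pos = 0;
        start = 0;
        end = 0;
    }

    /** Moves to the next token; returns false when no tokens remain. */
    boolean advance() {
        while (pos < cs.length() && Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        if (pos >= cs.length()) {
            return false;
        }
        start = pos;
        while (pos < cs.length() && !Character.isWhitespace(cs.charAt(pos))) {
            pos++;
        }
        end = pos;
        return true;
    }

    int getStart() { return start; }

    int getEnd() { return end; }

    String getText() { return cs.subSequence(start, end).toString(); }

    /** Drives the cursor over text and records each token with its half-open offset range. */
    static List<String> describe(CharSequence text) {
        WhitespaceCursor cursor = new WhitespaceCursor();
        cursor.reset(text);
        List<String> out = new ArrayList<>();
        while (cursor.advance()) {
            out.add(cursor.getText() + "[" + cursor.getStart() + "," + cursor.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [to[0,2), be[3,5), or[7,9)]
        System.out.println(describe("to be  or"));
    }
}
```

Because the end offset is exclusive, `cs.subSequence(getStart(), getEnd())` yields exactly the token text, and `getEnd() - getStart()` is its length.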
