org.tribuo.util.tokens.impl.SplitFunctionTokenizer

org.tribuo.util.tokens.impl.SplitCharactersTokenizer

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class SplitCharactersTokenizer extends SplitFunctionTokenizer

This Tokenizer implementation is constructed with an array of split characters, which define where the input text is split. It is a deliberately simple tokenizer that handles one special case: characters that would normally cause a split but appear between two digits (e.g., 3/5 and 3.1415) can be kept inside the token. It is not general purpose, but may suffice for some use cases.

In addition to the specified split characters, it also splits on anything that Character.isWhitespace(char) considers whitespace.
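
The splitting rule described above can be illustrated with a small standalone sketch. It does not use Tribuo and is not the library's implementation; the class name `SplitRuleSketch`, its method names, and the choice to drop empty tokens are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the documented rule; not Tribuo code.
final class SplitRuleSketch {

    /** Linear scan, which is adequate for short character arrays. */
    static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    /**
     * Splits at whitespace and at splitChars. Characters in
     * splitExceptInDigits also split, unless both neighbours are digits.
     * Empty tokens are dropped (an assumption of this sketch).
     */
    static List<String> split(String text, char[] splitChars, char[] splitExceptInDigits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isSplit = Character.isWhitespace(c) || contains(splitChars, c);
            if (!isSplit && contains(splitExceptInDigits, c)) {
                boolean digitLeft = i > 0 && Character.isDigit(text.charAt(i - 1));
                boolean digitRight = i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
                isSplit = !(digitLeft && digitRight);
            }
            if (isSplit) {
                // Close the current token, if any, and start a new one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] splitChars = {'|'};
        char[] splitExceptInDigits = {'.', '/'};
        // prints [abc, def, scored, 3/5, Pi, is, 3.1415]
        System.out.println(split("abc|def scored 3/5. Pi is 3.1415", splitChars, splitExceptInDigits));
    }
}
```

Note that the '.' after "3/5" still splits because its right-hand neighbour is a space, while the '.' in "3.1415" is kept because both neighbours are digits.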

Author: Philip Ogren

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static class | SplitCharactersTokenizer.SplitCharactersSplitterFunction | Splits tokens at the supplied characters. |

Nested classes/interfaces inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
SplitFunctionTokenizer.SplitFunction, SplitFunctionTokenizer.SplitResult, SplitFunctionTokenizer.SplitType
Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final char[] | DEFAULT_SPLIT_CHARACTERS | The default split characters. |
| static final char[] | DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS | The default characters which don't cause splits inside digits. |

Fields inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
splitFunction
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| SplitCharactersTokenizer() | Creates a default split characters tokenizer using DEFAULT_SPLIT_CHARACTERS and DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS. |
| SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters) | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| SplitCharactersTokenizer | clone() | Clones a tokenizer with its configuration. |
| static SplitCharactersTokenizer | createWhitespaceTokenizer() | Creates a tokenizer that splits on whitespace. |
| com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() | |
| char[] | getSplitCharacters() | Deprecated. |
| char[] | getSplitXDigitsCharacters() | Deprecated. |
| boolean | isSplitCharacter(char c) | Deprecated. |
| boolean | isSplitXDigitCharacter(char c) | Deprecated. |
| void | postConfig() | |

Methods inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
advance, getEnd, getStart, getText, getType, reset

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.tribuo.util.tokens.Tokenizer
getToken, split, tokenize

Field Details
- DEFAULT_SPLIT_CHARACTERS
  
  public static final char[] DEFAULT_SPLIT_CHARACTERS
  
  The default split characters.
- DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
  
  public static final char[] DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS
  
  The default characters which don't cause splits inside digits.
Constructor Details
- SplitCharactersTokenizer
  
  public SplitCharactersTokenizer()
  
  Creates a default split characters tokenizer using DEFAULT_SPLIT_CHARACTERS and DEFAULT_SPLIT_EXCEPTING_IN_DIGITS_CHARACTERS.
- SplitCharactersTokenizer
  
  public SplitCharactersTokenizer(char[] splitCharacters, char[] splitXDigitsCharacters)
  
  Parameters:
  
  splitCharacters - characters to be replaced with a space in the input text (e.g., "abc|def" becomes "abc def")
  
  splitXDigitsCharacters - characters to be replaced with a space in the input text, except when the characters immediately to the left and right are both digits (e.g., "abc.def" becomes "abc def" but "3.1415" remains "3.1415").
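
The "replaced with a space" wording of the two parameters can be read as a character-by-character rewrite of the input. The sketch below reproduces the three examples from the parameter descriptions under that reading. It is standalone and hypothetical: `SplitReplacementSketch` and `replaceSplits` are illustrative names, not part of Tribuo.

```java
// Hypothetical illustration of the parameter descriptions; not Tribuo code.
final class SplitReplacementSketch {

    static boolean contains(char[] chars, char c) {
        for (char x : chars) {
            if (x == c) {
                return true;
            }
        }
        return false;
    }

    /** Returns a copy of text with every splitting character replaced by a space. */
    static String replaceSplits(String text, char[] splitCharacters, char[] splitXDigitsCharacters) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean replace = contains(splitCharacters, c);
            if (!replace && contains(splitXDigitsCharacters, c)) {
                boolean digitLeft = i > 0 && Character.isDigit(text.charAt(i - 1));
                boolean digitRight = i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
                // Kept only when both neighbours are digits.
                replace = !(digitLeft && digitRight);
            }
            out.append(replace ? ' ' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] splitCharacters = {'|'};
        char[] splitXDigitsCharacters = {'.'};
        System.out.println(replaceSplits("abc|def", splitCharacters, splitXDigitsCharacters)); // abc def
        System.out.println(replaceSplits("abc.def", splitCharacters, splitXDigitsCharacters)); // abc def
        System.out.println(replaceSplits("3.1415", splitCharacters, splitXDigitsCharacters));  // 3.1415
    }
}
```

Splitting the rewritten string on whitespace then yields the tokens.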
Method Details
- postConfig
  
  public void postConfig()
- createWhitespaceTokenizer
  
  public static SplitCharactersTokenizer createWhitespaceTokenizer()
  
  Creates a tokenizer that splits on whitespace.
  
  Returns:
  
  A whitespace tokenizer.
- getProvenance
  
  public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- isSplitCharacter
  
  @Deprecated public boolean isSplitCharacter(char c)
  
  Deprecated.
  
  Is this character a split character for this tokenizer instance.
  
  Parameters:
  
  c - The character to check.
  
  Returns:
  
  True if it's a split character.
- isSplitXDigitCharacter
  
  @Deprecated public boolean isSplitXDigitCharacter(char c)
  
  Deprecated.
  
  Checks whether this character is a split character, except when it appears between digits, for this tokenizer instance.
  
  Parameters:
  
  c - The character to check.
  
  Returns:
  
  True if it's a split character except inside digits.
- getSplitCharacters
  
  @Deprecated public char[] getSplitCharacters()
  
  Deprecated.
  
  Returns a copy of the split characters.
  
  Returns:
  
  A copy of the split characters.
- getSplitXDigitsCharacters
  
  @Deprecated public char[] getSplitXDigitsCharacters()
  
  Deprecated.
  
  Returns a copy of the split characters except inside digits.
  
  Returns:
  
  A copy of the split characters except inside digits.
- clone
  
  public SplitCharactersTokenizer clone()
  
  Description copied from interface: Tokenizer
  
  Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
  
  Specified by:
  
  clone in interface Tokenizer
  
  Overrides:
  
  clone in class SplitFunctionTokenizer
  
  Returns:
  
  A tokenizer with the same configuration, but independent state.
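
The clone contract (same configuration, independent state, reset required before use) can be sketched with a toy stateful tokenizer. `MiniTokenizer` below is a hypothetical stand-in written for this example. It is not the Tribuo class, and its `next()` method is an invented simplification of the real advance/getText cycle.

```java
// Hypothetical illustration of the clone contract; not Tribuo code.
final class CloneStateSketch {

    static final class MiniTokenizer implements Cloneable {
        private final char splitChar; // configuration, carried over by clone()
        private String text = "";     // processing state, not carried over
        private int pos = 0;

        MiniTokenizer(char splitChar) {
            this.splitChar = splitChar;
        }

        /** Starts tokenizing a new piece of text. */
        void reset(String newText) {
            text = newText;
            pos = 0;
        }

        /** Returns the next token, or null when the text is exhausted. */
        String next() {
            while (pos < text.length() && text.charAt(pos) == splitChar) {
                pos++;
            }
            if (pos >= text.length()) {
                return null;
            }
            int start = pos;
            while (pos < text.length() && text.charAt(pos) != splitChar) {
                pos++;
            }
            return text.substring(start, pos);
        }

        /** Copies the configuration only; the clone has no text until reset. */
        @Override
        public MiniTokenizer clone() {
            return new MiniTokenizer(splitChar);
        }
    }

    public static void main(String[] args) {
        MiniTokenizer original = new MiniTokenizer(' ');
        original.reset("a b c");
        original.next();                       // consumes "a"

        MiniTokenizer copy = original.clone();
        System.out.println(copy.next());       // null: the clone has no text yet
        copy.reset("x y");
        System.out.println(copy.next());       // x
        System.out.println(original.next());   // b: the original is unaffected
    }
}
```

A clone of this kind is useful when each thread needs its own tokenizer, since the configuration is shared by value while the position in the text is not.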
