org.tribuo.util.tokens.impl.SplitPatternTokenizer

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

public class SplitPatternTokenizer extends Object implements Tokenizer

This implementation of Tokenizer is constructed with a regular expression that determines how a string is split into tokens. The pattern matches the separators (the "splits") between tokens, not the tokens themselves. For example, to tokenize on whitespace, supply the pattern "\s+".

Author: Philip Ogren
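As a rough illustration of what it means for the pattern to define the splits rather than the tokens, the sketch below uses java.util.regex.Pattern directly instead of Tribuo. The class name SplitVersusMatchDemo and the helper splitOn are made up for this example, and Pattern.split is only assumed to approximate the tokenizer's behaviour on simple input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: uses java.util.regex directly, not Tribuo's SplitPatternTokenizer.
class SplitVersusMatchDemo {

    // Treats the regex as the separator: tokens are whatever lies between matches.
    static List<String> splitOn(String splitRegex, String text) {
        return Arrays.asList(Pattern.compile(splitRegex).split(text));
    }

    public static void main(String[] args) {
        // "\\s+" is the Java string literal for the regex \s+
        System.out.println(splitOn("\\s+", "the quick  brown\tfox"));
        // prints [the, quick, brown, fox]
    }
}
```

Note that the regex never describes what a token looks like; any run of characters that the separator pattern does not consume becomes a token.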

Field Summary

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final String | SIMPLE_DEFAULT_PATTERN | The default split pattern, which is [\.,]?\s+. |

Constructor Summary

| Constructor | Description |
| --- | --- |
| SplitPatternTokenizer() | Initializes a case-insensitive tokenizer with the pattern [\.,]?\s+. |
| SplitPatternTokenizer(String splitPatternRegex) | Constructs a splitting tokenizer using the supplied regex. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| boolean | advance() | Advances the tokenizer to the next token. |
| SplitPatternTokenizer | clone() | Clones a tokenizer with its configuration. |
| int | getEnd() | Gets the ending offset (exclusive) of the current token in the character sequence. |
| com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() | |
| String | getSplitPatternRegex() | Gets the String form of the regex in use. |
| int | getStart() | Gets the starting character offset of the current token in the character sequence. |
| String | getText() | Gets the text of the current token, as a string. |
| Token.TokenType | getType() | Gets the type of the current token. |
| void | postConfig() | Used by the OLCUT configuration system, and should not be called by external code. |
| void | reset(CharSequence cs) | Resets the tokenizer so that it operates on a new sequence of characters. |

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.tribuo.util.tokens.Tokenizer
getToken, split, tokenize

Field Details
- SIMPLE_DEFAULT_PATTERN
  public static final String SIMPLE_DEFAULT_PATTERN
  
  The default split pattern, which is [\.,]?\s+.
  
  See Also:
  
  Constant Field Values
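To make the default pattern concrete, here is a minimal sketch that applies the documented regex [\.,]?\s+ with java.util.regex.Pattern rather than with Tribuo itself. The class DefaultPatternDemo and its split helper are hypothetical names, the regex is copied by hand from the description above, and the real tokenizer may differ on edge cases such as empty tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: the pattern is copied from the documented default value,
// and Pattern.split stands in for the tokenizer.
class DefaultPatternDemo {
    // Java string literal for the regex [\.,]?\s+
    static final String DEFAULT_SPLIT = "[\\.,]?\\s+";

    static List<String> split(String text) {
        return Arrays.asList(Pattern.compile(DEFAULT_SPLIT).split(text));
    }

    public static void main(String[] args) {
        // A period or comma is consumed only when whitespace follows it.
        System.out.println(String.join(" | ", split("Hello, world. Goodbye")));
        // prints Hello | world | Goodbye

        // No whitespace after the comma or the final period, so both stay in the tokens.
        System.out.println(String.join(" | ", split("one,two three.")));
        // prints one,two | three.
    }
}
```

The second call shows the main consequence of the pattern: punctuation is stripped only as part of a whitespace boundary, so a trailing period at the end of the input remains attached to the last token.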

Constructor Details
- SplitPatternTokenizer
  
  public SplitPatternTokenizer()
  
  Initializes a case-insensitive tokenizer with the pattern [\.,]?\s+.
- SplitPatternTokenizer
  
  public SplitPatternTokenizer(String splitPatternRegex)
  
  Constructs a splitting tokenizer using the supplied regex.
  
  Parameters:
  
  splitPatternRegex - The regex to use.

Method Details
- postConfig
  
  public void postConfig()
  
  Used by the OLCUT configuration system, and should not be called by external code.
  
  Specified by:
  
  postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
- getProvenance
  
  public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
  
  Specified by:
  
  getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
- getSplitPatternRegex
  
  public String getSplitPatternRegex()
  
  Gets the String form of the regex in use.
  
  Returns:
  
  The regex.
- reset
  
  public void reset(CharSequence cs)
  
  Description copied from interface: Tokenizer
  
  Resets the tokenizer so that it operates on a new sequence of characters.
  
  Specified by:
  
  reset in interface Tokenizer
  
  Parameters:
  
  cs - a character sequence to tokenize
- advance
  
  public boolean advance()
  
  Description copied from interface: Tokenizer
  
  Advances the tokenizer to the next token.
  
  Specified by:
  
  advance in interface Tokenizer
  
  Returns:
  
  true if there is such a token, false otherwise.
- getText
  
  public String getText()
  
  Description copied from interface: Tokenizer
  
  Gets the text of the current token, as a string.
  
  Specified by:
  
  getText in interface Tokenizer
  
  Returns:
  
  the text of the current token
- getStart
  
  public int getStart()
  
  Description copied from interface: Tokenizer
  
  Gets the starting character offset of the current token in the character sequence.
  
  Specified by:
  
  getStart in interface Tokenizer
  
  Returns:
  
  the starting character offset of the token
- getEnd
  
  public int getEnd()
  
  Description copied from interface: Tokenizer
  
  Gets the ending offset (exclusive) of the current token in the character sequence.
  
  Specified by:
  
  getEnd in interface Tokenizer
  
  Returns:
  
  the exclusive ending character offset for the current token.
- getType
  
  public Token.TokenType getType()
  
  Description copied from interface: Tokenizer
  
  Gets the type of the current token.
  
  Specified by:
  
  getType in interface Tokenizer
  
  Returns:
  
  the type of the current token.
- clone
  
  public SplitPatternTokenizer clone()
  
  Description copied from interface: Tokenizer
  
  Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
  
  Specified by:
  
  clone in interface Tokenizer
  
  Overrides:
  
  clone in class Object
  
  Returns:
  
  A tokenizer with the same configuration, but independent state.
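
The reset, advance, getText, getStart and getEnd methods described above form a cursor-style protocol: reset supplies the text, each successful advance moves to the next token, and the getters describe that token. The sketch below imitates that protocol with plain java.util.regex so it can run without Tribuo. MiniSplitCursor is a hypothetical stand-in written for this example, not the library's implementation, and its handling of zero-length matches is a simplification chosen here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; this is not Tribuo's implementation.
class MiniSplitCursor {
    private final Pattern splitPattern;
    private CharSequence cs;
    private Matcher matcher;
    private int searchFrom; // index at which the next token may begin
    private int start;      // inclusive start of the current token, -1 if none
    private int end;        // exclusive end of the current token, -1 if none

    MiniSplitCursor(String splitRegex) {
        this.splitPattern = Pattern.compile(splitRegex);
        reset("");
    }

    // Point the cursor at new text and forget any current token.
    void reset(CharSequence newCs) {
        cs = newCs;
        matcher = splitPattern.matcher(cs);
        searchFrom = 0;
        start = -1;
        end = -1;
    }

    // Move to the next non-empty stretch of text between two split matches.
    boolean advance() {
        while (searchFrom < cs.length()) {
            int tokenStart = searchFrom;
            int tokenEnd;
            // Simplification: a zero-length match is treated as "no further split"
            // so that the loop always makes progress.
            if (matcher.find(searchFrom) && matcher.end() > matcher.start()) {
                tokenEnd = matcher.start();
                searchFrom = matcher.end();
            } else {
                tokenEnd = cs.length();
                searchFrom = cs.length();
            }
            if (tokenEnd > tokenStart) {
                start = tokenStart;
                end = tokenEnd;
                return true;
            }
        }
        start = -1;
        end = -1;
        return false;
    }

    int getStart() { return start; }

    int getEnd() { return end; }

    String getText() {
        if (start < 0) {
            throw new IllegalStateException("advance() has not returned true");
        }
        return cs.subSequence(start, end).toString();
    }

    public static void main(String[] args) {
        MiniSplitCursor cursor = new MiniSplitCursor("[\\.,]?\\s+");
        cursor.reset("Hello, world");
        while (cursor.advance()) {
            System.out.println(cursor.getText() + " [" + cursor.getStart() + ", " + cursor.getEnd() + ")");
        }
        // prints:
        // Hello [0, 5)
        // world [7, 12)
    }
}
```

The end offset is exclusive, so getEnd() minus getStart() equals the token length, and the comma and space at offsets 5 and 6 belong to no token. A cloned tokenizer would follow the same loop, but only after its own call to reset.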
