Package org.tribuo.util.tokens.impl
Class BreakIteratorTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.BreakIteratorTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer

A tokenizer wrapping a BreakIterator instance.
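To make "wrapping a BreakIterator" concrete, here is a minimal sketch that uses only the JDK's java.text.BreakIterator. It is not Tribuo's implementation: the class name BreakIteratorSketch, the choice of BreakIterator.getWordInstance, and the decision to skip whitespace-only segments are all assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Tribuo.
public class BreakIteratorSketch {

    /** Splits text at the boundaries reported by a locale-specific word BreakIterator. */
    public static List<String> tokenize(String text, Locale locale) {
        // Assumption: a word-level iterator is the one being wrapped.
        BreakIterator breaks = BreakIterator.getWordInstance(locale);
        breaks.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = breaks.first();
        for (int end = breaks.next(); end != BreakIterator.DONE; start = end, end = breaks.next()) {
            String candidate = text.substring(start, end);
            // Assumption of this sketch: whitespace-only segments are not tokens.
            if (!candidate.trim().isEmpty()) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world", Locale.US)); // prints [Hello, world]
    }
}
```

The locale matters because boundary rules differ between languages, which is presumably why the constructor takes a Locale.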

Constructor Summary
Constructor, Description
BreakIteratorTokenizer(Locale locale) - Constructs a BreakIteratorTokenizer using the specified locale.

Method Summary
Modifier and Type, Method, Description
boolean advance() - Advances the tokenizer to the next token.
clone() - Clones a tokenizer with its configuration.
int getEnd() - Gets the ending offset (exclusive) of the current token in the character sequence.
String getLanguageTag() - Returns the locale string this tokenizer uses.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
int getStart() - Gets the starting character offset of the current token in the character sequence.
String getText() - Gets the text of the current token, as a string.
getType() - Gets the type of the current token.
void postConfig() - Used by the OLCUT configuration system, and should not be called by external code.
void reset(CharSequence cs) - Resets the tokenizer so that it operates on a new sequence of characters.

Constructor Details

BreakIteratorTokenizer
public BreakIteratorTokenizer(Locale locale)
Constructs a BreakIteratorTokenizer using the specified locale.
Parameters:
locale - The locale to use.

Method Details

postConfig
public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
Specified by: postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable

getLanguageTag
public String getLanguageTag()
Returns the locale string this tokenizer uses.
Returns: The locale string.

getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
Specified by: getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>

reset
public void reset(CharSequence cs)
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.

advance
public boolean advance()
Description copied from interface: Tokenizer
Advances the tokenizer to the next token.

getText
public String getText()
Description copied from interface: Tokenizer
Gets the text of the current token, as a string.

getStart
public int getStart()
Description copied from interface: Tokenizer
Gets the starting character offset of the current token in the character sequence.

getEnd
public int getEnd()
Description copied from interface: Tokenizer
Gets the ending offset (exclusive) of the current token in the character sequence.

getType
Description copied from interface: Tokenizer
Gets the type of the current token.

clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. A cloned tokenizer does not share the text being processed by the original and must be reset with a fresh CharSequence before use.
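
The reset/advance cycle and the offset conventions described above can be sketched with a small stand-alone cursor built on the JDK's BreakIterator. WordCursor and its copy() method are hypothetical names invented for this example and are not Tribuo classes. The sketch assumes a word-level iterator, skips whitespace-only segments, and throws if advance() is called before reset(); none of these details are confirmed by this page.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical stand-in that mimics the documented reset/advance contract.
public class WordCursor {
    private final Locale locale;
    private final BreakIterator breaks;
    private CharSequence text; // null until reset(...) is called
    private int boundary;      // last boundary returned by the iterator
    private int start;         // start offset of the current token (inclusive)
    private int end;           // end offset of the current token (exclusive)

    public WordCursor(Locale locale) {
        this.locale = locale;
        this.breaks = BreakIterator.getWordInstance(locale);
    }

    /** Points the cursor at a new character sequence. */
    public void reset(CharSequence cs) {
        text = cs;
        breaks.setText(cs.toString());
        boundary = breaks.first();
        start = end = boundary;
    }

    /** Moves to the next non-whitespace segment; returns false when none remain. */
    public boolean advance() {
        if (text == null) {
            // Assumption of this sketch: using the cursor before reset is an error.
            throw new IllegalStateException("reset must be called before advance");
        }
        int next;
        while ((next = breaks.next()) != BreakIterator.DONE) {
            int from = boundary;
            boundary = next;
            if (!text.subSequence(from, next).toString().trim().isEmpty()) {
                start = from;
                end = next;
                return true;
            }
        }
        return false;
    }

    public String getText() { return text.subSequence(start, end).toString(); }
    public int getStart() { return start; }
    public int getEnd() { return end; }

    /** Same configuration, no text: the copy must be reset before use. */
    public WordCursor copy() { return new WordCursor(locale); }

    public static void main(String[] args) {
        WordCursor cursor = new WordCursor(Locale.US);
        cursor.reset("Hello world");
        while (cursor.advance()) {
            // Prints "Hello [0, 5)" and then "world [6, 11)".
            System.out.println(cursor.getText() + " [" + cursor.getStart() + ", " + cursor.getEnd() + ")");
        }
    }
}
```

Because the end offset is exclusive, the substring of the input from getStart() to getEnd() equals getText(), and getEnd() minus getStart() is the token length.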