Class WhitespaceTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
org.tribuo.util.tokens.impl.WhitespaceTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
A simple tokenizer that splits on whitespace. This tokenizer does not create
tokens that correspond to whitespace - only those spans of text delimited by
whitespace. For example, the text "a b" will result in two tokens "a" and "b".
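The splitting behaviour described above can be illustrated with a JDK-only sketch. This is not Tribuo's implementation: the class name `WhitespaceSplitSketch` and the method `tokens` are hypothetical, and the only detail taken from this page is that whitespace is detected with `Character.isWhitespace(char)` and that whitespace itself never becomes a token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not part of Tribuo: shows which spans a
// whitespace-splitting tokenizer would keep.
public class WhitespaceSplitSketch {

    // Returns the maximal runs of non-whitespace characters, in order.
    public static List<String> tokens(CharSequence input) {
        List<String> out = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < input.length(); i++) {
            if (Character.isWhitespace(input.charAt(i))) {
                if (start >= 0) {
                    // Whitespace closes the current token; the whitespace is dropped.
                    out.add(input.subSequence(start, i).toString());
                    start = -1;
                }
            } else if (start < 0) {
                start = i; // first character of a new token
            }
        }
        if (start >= 0) {
            // Input ended while still inside a token.
            out.add(input.subSequence(start, input.length()).toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a b")); // prints [a, b]
    }
}
```

Leading, trailing, and repeated whitespace produce no empty tokens in this sketch, which matches the statement that only spans delimited by whitespace are returned.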
Nested Class Summary
Nested classes/interfaces inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
SplitFunctionTokenizer.SplitFunction, SplitFunctionTokenizer.SplitResult, SplitFunctionTokenizer.SplitType
Field Summary
Fields
static final SplitFunctionTokenizer.SplitFunction whitespaceSplitCharacterFunction
The splitting function for whitespace, using Character.isWhitespace(char).
Fields inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
splitFunction
Constructor Summary
Constructors
WhitespaceTokenizer()
Constructs a tokenizer that splits on whitespace.
Method Summary
clone()
Clones a tokenizer with its configuration.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
Methods inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
advance, getEnd, getStart, getText, getType, reset
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
Field Details
whitespaceSplitCharacterFunction
The splitting function for whitespace, using Character.isWhitespace(char).
Constructor Details
WhitespaceTokenizer
public WhitespaceTokenizer()
Constructs a tokenizer that splits on whitespace.
Method Details
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. A cloned tokenizer does not share the text being processed by the original tokenizer and must be reset with a fresh CharSequence before use.
Specified by:
clone in interface Tokenizer
Overrides:
clone in class SplitFunctionTokenizer
Returns:
A tokenizer with the same configuration, but independent state.
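The "same configuration, but independent state" contract can be sketched with a small JDK-only stand-in. The class `CloneResetSketch` below is hypothetical and is not Tribuo code; it only borrows the method names `reset`, `advance`, `getText`, and `clone` listed on this page. Its behaviour when `advance` is called before `reset` is an assumption of the sketch and says nothing about how Tribuo handles that case.

```java
// Hypothetical stand-in, not part of Tribuo: a cursor-style whitespace
// tokenizer whose clone starts with no text of its own.
public class CloneResetSketch implements Cloneable {
    private CharSequence text = ""; // per-instance state, never shared
    private int pos = 0;
    private String current = null;

    // Points this instance at a new text and rewinds the cursor.
    public void reset(CharSequence cs) {
        text = cs;
        pos = 0;
        current = null;
    }

    // Moves to the next whitespace-delimited token; false when none remain.
    public boolean advance() {
        int n = text.length();
        while (pos < n && Character.isWhitespace(text.charAt(pos))) {
            pos++; // skip whitespace between tokens
        }
        if (pos >= n) {
            current = null;
            return false;
        }
        int start = pos;
        while (pos < n && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        current = text.subSequence(start, pos).toString();
        return true;
    }

    public String getText() {
        return current;
    }

    @Override
    public CloneResetSketch clone() {
        // Nothing to configure here, so a fresh instance is an equivalent
        // copy; it carries none of the original's text or position.
        return new CloneResetSketch();
    }

    public static void main(String[] args) {
        CloneResetSketch original = new CloneResetSketch();
        original.reset("x y");
        original.advance();                  // original is now on "x"

        CloneResetSketch copy = original.clone();
        copy.reset("p q");                   // the clone needs its own text
        copy.advance();
        original.advance();

        System.out.println(copy.getText());     // prints p
        System.out.println(original.getText()); // prints y
    }
}
```

Because each instance holds its own text and cursor, a clone per thread is one way to tokenize different inputs concurrently without interference.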