Package org.tribuo.util.tokens.impl
Class WhitespaceTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.SplitFunctionTokenizer
org.tribuo.util.tokens.impl.WhitespaceTokenizer
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
A simple tokenizer that splits on whitespace. This tokenizer does not create
tokens that correspond to whitespace - only those spans of text delimited by
whitespace. For example, the text "a b" will result in two tokens "a" and "b".
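The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the Tribuo implementation: the class name WhitespaceSplitSketch and its split method are hypothetical, and the sketch assumes that runs of consecutive whitespace produce no empty tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not part of Tribuo.
public class WhitespaceSplitSketch {

    /** Returns the maximal runs of non-whitespace characters in the input. */
    static List<String> split(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start offset of the current token, or -1 when between tokens
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    // Whitespace closes the current token; the whitespace itself is dropped.
                    tokens.add(text.subSequence(start, i).toString());
                    start = -1;
                }
            } else if (start < 0) {
                start = i; // first non-whitespace character opens a new token
            }
        }
        if (start >= 0) {
            // Flush a token that runs to the end of the input.
            tokens.add(text.subSequence(start, text.length()).toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("a b"));              // prints [a, b]
        System.out.println(split("  hello\tworld\n")); // prints [hello, world]
    }
}
```

The delimiter test uses Character.isWhitespace(char), the same predicate named in the field documentation for this class.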
-
Nested Class Summary
Nested classes/interfaces inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
SplitFunctionTokenizer.SplitFunction, SplitFunctionTokenizer.SplitResult, SplitFunctionTokenizer.SplitType
-
Field Summary
Modifier and Type: static final SplitFunctionTokenizer.SplitFunction
Field: whitespaceSplitCharacterFunction
Description: The splitting function for whitespace, using Character.isWhitespace(char).
Fields inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
splitFunction
-
Constructor Summary
-
Method Summary
Method: clone()
Description: Clones a tokenizer with its configuration.
Modifier and Type: com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance
Method: getProvenance()
Methods inherited from class org.tribuo.util.tokens.impl.SplitFunctionTokenizer
advance, getEnd, getStart, getText, getType, reset
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
-
Field Details
-
whitespaceSplitCharacterFunction
The splitting function for whitespace, using Character.isWhitespace(char).
-
-
Constructor Details
-
WhitespaceTokenizer
public WhitespaceTokenizer()
Constructs a tokenizer that splits on whitespace.
-
-
Method Details
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. Cloned tokenizers are not processing the same text as the original tokenizer and need to be reset with a fresh CharSequence.
Specified by: clone in interface Tokenizer
Overrides: clone in class SplitFunctionTokenizer
Returns: A tokenizer with the same configuration, but independent state.
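The clone contract (same configuration, independent state, reset required before use) can be illustrated with a small self-contained sketch. MiniTokenizer below is a hypothetical stand-in, not the Tribuo class; it borrows only the method names reset, advance and getText from the inherited method list, and its internals are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CloneSketch {

    /** Hypothetical stateful tokenizer; not the Tribuo implementation. */
    static final class MiniTokenizer implements Cloneable {
        private CharSequence text = "";
        private int pos = 0;   // next character to examine
        private int start = 0; // start offset of the current token
        private int end = 0;   // end offset (exclusive) of the current token

        void reset(CharSequence cs) {
            text = cs;
            pos = 0;
            start = 0;
            end = 0;
        }

        boolean advance() {
            // Skip leading whitespace, then consume one run of non-whitespace.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) return false;
            start = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
            end = pos;
            return true;
        }

        String getText() {
            return text.subSequence(start, end).toString();
        }

        @Override
        public MiniTokenizer clone() {
            // Fresh instance: no text and no position are shared with the original.
            return new MiniTokenizer();
        }
    }

    /** Resets the tokenizer on the given text and collects all remaining tokens. */
    static List<String> tokens(MiniTokenizer t, CharSequence cs) {
        t.reset(cs);
        List<String> out = new ArrayList<>();
        while (t.advance()) out.add(t.getText());
        return out;
    }

    public static void main(String[] args) {
        MiniTokenizer original = new MiniTokenizer();
        original.reset("a b");
        original.advance();                          // original is positioned on "a"

        MiniTokenizer copy = original.clone();
        System.out.println(tokens(copy, "x y z"));   // prints [x, y, z]

        original.advance();                          // the original is unaffected by the copy
        System.out.println(original.getText());      // prints b
    }
}
```

Because the copy starts with no text, calling advance on it before reset yields no tokens, which mirrors the requirement that a cloned tokenizer be reset with a fresh CharSequence.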
-