Package org.tribuo.util.tokens.impl
Class ShapeTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.ShapeTokenizer
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This tokenizer is loosely based on the notion of word shape, a common
feature in NLP. Contiguous runs of characters in the same character class
are grouped together into a single token, and whitespace characters act as
delimiters. The character classes are uppercase letters, lowercase letters,
and digits; every other character forms a class of its own. For example,
"1234abcd" is split into "1234" and "abcd", and "!@#$" results in four
tokens. See the unit tests for further examples.
Strings are split on whitespace and at the boundaries between contiguous runs of characters in the same character class. There is one exception: if uppercase letters are immediately followed by lowercase letters, they are kept together. This has the effect of recognizing camel case, so "CamelCase" is split into "Camel" and "Case". Likewise, "ABCdef AAbb" is split into "ABCdef" and "AAbb".
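The splitting rule described above can be illustrated with a small standalone sketch. This is not the Tribuo implementation: the class name ShapeSplitSketch and its helper methods are hypothetical, and the handling of repeated non-alphanumeric characters (keyed here by the character itself) is an assumption that the description does not settle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the word-shape splitting rule; not Tribuo code.
final class ShapeSplitSketch {
    private static final int UPPER = -1;
    private static final int LOWER = -2;
    private static final int DIGIT = -3;

    // Maps a character to its class. Any other character is keyed by its own
    // value (an assumption), so distinct symbols never share a class.
    static int charClass(char c) {
        if (Character.isUpperCase(c)) return UPPER;
        if (Character.isLowerCase(c)) return LOWER;
        if (Character.isDigit(c)) return DIGIT;
        return c;
    }

    static List<String> split(CharSequence cs) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prevClass = 0; // only consulted once current is non-empty
        for (int i = 0; i < cs.length(); i++) {
            char c = cs.charAt(i);
            if (Character.isWhitespace(c)) {
                flush(current, tokens); // whitespace always ends a token
                continue;
            }
            int cls = charClass(c);
            // Same class continues the run; uppercase followed by lowercase
            // is the one exception that also continues it.
            boolean sameRun = cls == prevClass || (prevClass == UPPER && cls == LOWER);
            if (current.length() > 0 && !sameRun) {
                flush(current, tokens);
            }
            current.append(c);
            prevClass = cls;
        }
        flush(current, tokens);
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("1234abcd"));    // [1234, abcd]
        System.out.println(split("!@#$"));        // [!, @, #, $]
        System.out.println(split("CamelCase"));   // [Camel, Case]
        System.out.println(split("ABCdef AAbb")); // [ABCdef, AAbb]
    }
}
```

The sketch works on char values for brevity; it makes no attempt to handle supplementary code points.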
-
Constructor Summary
Constructor | Description
ShapeTokenizer() | Constructs a ShapeTokenizer.
-
Method Summary
Modifier and Type | Method | Description
boolean | advance() | Advances the tokenizer to the next token.
ShapeTokenizer | clone() | Clones a tokenizer with its configuration.
int | getEnd() | Gets the ending offset (exclusive) of the current token in the character sequence.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
int | getStart() | Gets the starting character offset of the current token in the character sequence.
String | getText() | Gets the text of the current token, as a string.
Token.TokenType | getType() | Gets the type of the current token.
void | reset(CharSequence cs) | Resets the tokenizer so that it operates on a new sequence of characters.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
-
Constructor Details
-
ShapeTokenizer
public ShapeTokenizer()
Constructs a ShapeTokenizer.
-
-
Method Details
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- Specified by:
getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
-
reset
public void reset(CharSequence cs)
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
-
advance
public boolean advance()
Description copied from interface: Tokenizer
Advances the tokenizer to the next token.
-
getText
public String getText()
Description copied from interface: Tokenizer
Gets the text of the current token, as a string.
-
getStart
public int getStart()
Description copied from interface: Tokenizer
Gets the starting character offset of the current token in the character sequence.
-
getEnd
public int getEnd()
Description copied from interface: Tokenizer
Gets the ending offset (exclusive) of the current token in the character sequence.
-
getType
public Token.TokenType getType()
Description copied from interface: Tokenizer
Gets the type of the current token.
-
clone
public ShapeTokenizer clone()
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. A cloned tokenizer does not process the same text as the original tokenizer and must be reset with a fresh CharSequence before use.
-
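The methods above follow a cursor-style protocol: call reset with the text, then call advance repeatedly, reading getText, getStart and getEnd after each successful call. The sketch below illustrates that protocol with a hypothetical stand-in class, CursorTokenizerSketch, which splits on whitespace only. It is not the ShapeTokenizer and does not depend on Tribuo; it only mirrors the method names and the exclusive end offset documented here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics the reset/advance/getText cursor protocol.
// It splits on whitespace only and is not the Tribuo ShapeTokenizer.
final class CursorTokenizerSketch {
    private CharSequence cs = "";
    private int start = 0;
    private int end = 0;

    // Points the tokenizer at a new character sequence and rewinds the cursor.
    void reset(CharSequence newCs) {
        cs = newCs;
        start = 0;
        end = 0;
    }

    // Moves to the next token; returns false once the input is exhausted.
    boolean advance() {
        int i = end;
        while (i < cs.length() && Character.isWhitespace(cs.charAt(i))) {
            i++;
        }
        if (i >= cs.length()) {
            return false;
        }
        start = i;
        while (i < cs.length() && !Character.isWhitespace(cs.charAt(i))) {
            i++;
        }
        end = i; // exclusive end offset
        return true;
    }

    String getText() { return cs.subSequence(start, end).toString(); }
    int getStart() { return start; }
    int getEnd() { return end; }

    // Typical consumer loop: reset once, then advance until false.
    static List<String> describe(CharSequence text) {
        List<String> out = new ArrayList<>();
        CursorTokenizerSketch tokenizer = new CursorTokenizerSketch();
        tokenizer.reset(text);
        while (tokenizer.advance()) {
            out.add(tokenizer.getText() + "[" + tokenizer.getStart() + "," + tokenizer.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("ab  cde")); // [ab[0,2), cde[4,7)]
    }
}
```

Because the end offset is exclusive, the current token always spans the characters from getStart() up to but not including getEnd() in the sequence passed to reset.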