Class ShapeTokenizer
java.lang.Object
org.tribuo.util.tokens.impl.ShapeTokenizer
- All Implemented Interfaces:
- com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>, Cloneable, Tokenizer
This tokenizer is loosely based on the notion of word shape, a common
 feature in NLP. Contiguous runs of characters in the same character class
 are grouped together, and whitespace characters act as delimiters. The
 character classes are uppercase letters, lowercase letters, and digits;
 every other character forms a class of its own. For example, "1234abcd"
 is split into "1234" and "abcd", and "!@#$" yields four tokens. See the
 unit tests for further examples.
 
Strings are split on whitespace and into contiguous runs of characters in the same character class, with one exception: if uppercase letters are immediately followed by lowercase letters, they are kept together. This recognizes camel case, so "CamelCase" is split into "Camel" and "Case". Likewise, "ABCdef AAbb" is split into "ABCdef" and "AAbb".
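The splitting rules above can be sketched in a few lines of plain Java. This is an illustrative re-implementation under stated assumptions, not the ShapeTokenizer source: the class name `ShapeSplitSketch` and its `split` method are hypothetical, characters are classified one `char` at a time with `java.lang.Character`, and each "other" character is emitted as its own token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the word-shape splitting rules described above.
final class ShapeSplitSketch {
    private enum CharClass { UPPER, LOWER, DIGIT, SPACE, OTHER }

    // Assumption: classification is per char, using java.lang.Character.
    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.SPACE;
        if (Character.isUpperCase(c)) return CharClass.UPPER;
        if (Character.isLowerCase(c)) return CharClass.LOWER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    static List<String> split(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            CharClass first = classify(text.charAt(i));
            if (first == CharClass.SPACE) {
                i++; // whitespace only delimits, it is never part of a token
                continue;
            }
            int start = i;
            i++;
            if (first != CharClass.OTHER) {
                // Extend over the run of characters in the same class.
                while (i < n && classify(text.charAt(i)) == first) i++;
                // Exception: an uppercase run absorbs the lowercase run that follows it.
                if (first == CharClass.UPPER) {
                    while (i < n && classify(text.charAt(i)) == CharClass.LOWER) i++;
                }
            }
            tokens.add(text.subSequence(start, i).toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("1234abcd"));    // [1234, abcd]
        System.out.println(split("!@#$"));        // [!, @, #, $]
        System.out.println(split("CamelCase"));   // [Camel, Case]
        System.out.println(split("ABCdef AAbb")); // [ABCdef, AAbb]
    }
}
```

Note that the exception only applies in one direction in this sketch: a lowercase run followed by uppercase letters, as in "abcDEF", is still split at the case change.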
- 
Constructor Summary
- ShapeTokenizer(): Constructs a ShapeTokenizer.
- 
Method Summary
- boolean advance(): Advances the tokenizer to the next token.
- clone(): Clones a tokenizer with its configuration.
- int getEnd(): Gets the ending offset (exclusive) of the current token in the character sequence.
- com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- int getStart(): Gets the starting character offset of the current token in the character sequence.
- getText(): Gets the text of the current token, as a string.
- getType(): Gets the type of the current token.
- void reset(CharSequence cs): Resets the tokenizer so that it operates on a new sequence of characters.
Methods inherited from class java.lang.Object: equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable: postConfig
- 
Constructor Details
ShapeTokenizer
public ShapeTokenizer()
Constructs a ShapeTokenizer.
 
- 
- 
Method Details
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
- Specified by:
- getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
 
- 
reset
Description copied from interface: Tokenizer
Resets the tokenizer so that it operates on a new sequence of characters.
- 
advance
- 
getText
- 
getStart
- 
getEnd
- 
getType
Description copied from interface: Tokenizer
Gets the type of the current token.
- 
clone
Description copied from interface: Tokenizer
Clones a tokenizer with its configuration. A cloned tokenizer does not process the same text as the original tokenizer and must be reset with a fresh CharSequence.
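The methods above form a cursor-style protocol: reset supplies the text, advance moves to the next token, and getText, getStart, and getEnd describe the current token. The sketch below illustrates that calling pattern with a self-contained stand-in. `WhitespaceCursorSketch`, `copy`, and `describe` are hypothetical names, the stand-in splits on whitespace only for brevity, and it is not the Tribuo class; only the method names reset, advance, getText, getStart, and getEnd mirror this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics the reset/advance/getText protocol.
// Assumption: it splits on whitespace only, unlike ShapeTokenizer.
final class WhitespaceCursorSketch {
    private CharSequence text = "";
    private int start = 0;
    private int end = 0;

    // Point the cursor at a new character sequence.
    void reset(CharSequence cs) {
        text = cs;
        start = 0;
        end = 0;
    }

    // Move to the next token; returns false once the text is exhausted.
    boolean advance() {
        int i = end;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
        if (i >= text.length()) return false;
        start = i;
        while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
        end = i;
        return true;
    }

    String getText() { return text.subSequence(start, end).toString(); }
    int getStart() { return start; }
    int getEnd() { return end; } // exclusive offset

    // Stand-in for clone(): the copy carries no text and must be reset before use.
    WhitespaceCursorSketch copy() { return new WhitespaceCursorSketch(); }

    // Typical consumption loop: reset once, then advance until it returns false.
    static List<String> describe(WhitespaceCursorSketch t, CharSequence cs) {
        List<String> out = new ArrayList<>();
        t.reset(cs);
        while (t.advance()) {
            out.add(t.getText() + "[" + t.getStart() + "," + t.getEnd() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        WhitespaceCursorSketch original = new WhitespaceCursorSketch();
        System.out.println(describe(original, "ABCdef AAbb")); // [ABCdef[0,6), AAbb[7,11)]

        WhitespaceCursorSketch copy = original.copy();
        System.out.println(copy.advance());                    // false: no text until reset
        System.out.println(describe(copy, "one two"));         // [one[0,3), two[4,7)]
    }
}
```

Because the cursor holds mutable state, giving each thread its own copy and resetting it with that thread's text is one reasonable way to use this kind of interface concurrently.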
 
-