Package org.tribuo.data.text
Class TextDataSource<T extends Output<T>>
java.lang.Object
org.tribuo.data.text.TextDataSource<T>
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>
- Direct Known Subclasses:
SimpleTextDataSource
public abstract class TextDataSource<T extends Output<T>>
extends Object
implements ConfigurableDataSource<T>
A base class for textual data sets. We assume that all textual data is
written and read using UTF-8.
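
To illustrate the UTF-8 assumption, here is a minimal sketch that writes and reads text with an explicit charset rather than the platform default. It uses only the JDK; the class name `Utf8RoundTrip` and its helper methods are hypothetical and are not part of this package.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class Utf8RoundTrip {
    // Writes the lines as UTF-8, independent of the platform default charset.
    static void writeLines(Path path, List<String> lines) throws IOException {
        Files.write(path, lines, StandardCharsets.UTF_8);
    }

    // Reads every line back, decoding the bytes as UTF-8.
    static List<String> readLines(Path path) throws IOException {
        return Files.readAllLines(path, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("text-data", ".txt");
        try {
            // "\u00e9" is a non-ASCII character, so the charset choice matters here.
            List<String> lines = List.of("caf\u00e9", "plain ascii");
            writeLines(tmp, lines);
            System.out.println(readLines(tmp).equals(lines));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Passing the charset explicitly on both sides keeps the round trip stable across machines whose default encodings differ.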
-
Field Summary
Modifier and Type, Field, Description
data
- The actual data read out of the text file.
protected TextFeatureExtractor<T> extractor
- The extractor that we'll use to turn text into examples.
protected OutputFactory<T> outputFactory
- The factory that converts a String into an Output.
protected Path path
- The path that data was read from.
protected List<DocumentPreprocessor> preprocessors
- Document preprocessors that should be run on the documents that make up this data set.
-
Constructor Summary
Modifier, Constructor, Description
protected TextDataSource()
- for olcut
TextDataSource(File file, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
- Creates a text data set by reading it from a file.
TextDataSource(Path path, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
- Creates a text data set by reading it from a path.
-
Method Summary
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
Methods inherited from interface java.lang.Iterable
forEach, spliterator
Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance
-
Field Details
-
preprocessors
@Config(description="The document preprocessors to run on each document in the data source.")
protected List<DocumentPreprocessor> preprocessors
Document preprocessors that should be run on the documents that make up this data set.
-
path
The path that data was read from.
-
outputFactory
@Config(mandatory=true, description="The factory that converts a String into an Output instance.")
protected OutputFactory<T extends Output<T>> outputFactory
The factory that converts a String into an Output.
-
extractor
@Config(mandatory=true, description="The feature extractor that generates Features from text.")
protected TextFeatureExtractor<T extends Output<T>> extractor
The extractor that we'll use to turn text into examples.
-
data
The actual data read out of the text file.
-
-
Constructor Details
-
TextDataSource
protected TextDataSource()
for olcut
-
TextDataSource
public TextDataSource(Path path, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
Creates a text data set by reading it from a path.
- Parameters:
path - The path to read data from.
outputFactory - The output factory used to generate the outputs.
extractor - The feature extractor to run on the text.
preprocessors - Processors that will be run on the data before it is added as examples.
-
TextDataSource
public TextDataSource(File file, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
Creates a text data set by reading it from a file.
- Parameters:
file - The file to read data from.
outputFactory - The output factory used to generate the outputs.
extractor - The feature extractor to run on the text.
preprocessors - Processors that will be run on the data before it is added as examples.
-
-
Method Details
-
iterator
-
toString
-
handleDoc
A method that can be overridden to do different things to each document that we've read. By default it iterates over the preprocessors and applies each one to the document in turn.
- Parameters:
doc - The document to handle.
- Returns:
a (possibly modified) version of the document.
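
The default behaviour described for handleDoc can be sketched as a simple left-to-right chain. This is an illustration under stated assumptions, not the class's actual source: the nested `Preprocessor` interface is a hypothetical stand-in for DocumentPreprocessor, and the sketch takes the list as a parameter instead of reading the `preprocessors` field.

```java
import java.util.List;
import java.util.Locale;

final class HandleDocSketch {
    // Hypothetical stand-in for DocumentPreprocessor: maps one document string to another.
    interface Preprocessor {
        String processDoc(String doc);
    }

    // Applies each preprocessor in list order, feeding the output of one into the next.
    static String handleDoc(String doc, List<Preprocessor> preprocessors) {
        for (Preprocessor p : preprocessors) {
            doc = p.processDoc(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Preprocessor> steps = List.of(
            String::trim,
            d -> d.toLowerCase(Locale.ROOT)
        );
        System.out.println(handleDoc("  Hello World  ", steps)); // prints "hello world"
    }
}
```

Because each step receives the previous step's output, the order of the preprocessors matters. With an empty list the document is returned unchanged.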
-
read
Reads the data from the Path.
- Throws:
IOException - if there is any error reading the data.
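
As a rough sketch of what a read implementation in a subclass might look like, the code below parses a file into label and text pairs and lets any IOException propagate to the caller. The tab-separated line format, the `Row` holder, and the class name `ReadSketch` are all assumptions made for this example; the format a real subclass expects may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class ReadSketch {
    // Hypothetical holder for one parsed line: a raw label string and the document text.
    static final class Row {
        final String label;
        final String text;

        Row(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    // Assumed line format, for this sketch only: "<label>\t<text>".
    // IOException from opening or reading the file propagates to the caller.
    static List<Row> read(Path path) throws IOException {
        List<Row> rows = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that have no label separator
                }
                // Split on the first tab only, so later tabs stay in the text.
                rows.add(new Row(line.substring(0, tab), line.substring(tab + 1)));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("read-sketch", ".txt");
        try {
            Files.write(tmp, List.of("pos\tgreat film", "neg\tdull plot"), StandardCharsets.UTF_8);
            for (Row row : read(tmp)) {
                System.out.println(row.label + " -> " + row.text);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

In the real class each text would then pass through handleDoc and the feature extractor, and each label through the output factory, before being stored as an example.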
-
getOutputFactory
Returns the output factory used to convert the text input into an Output.
- Specified by:
getOutputFactory in interface DataSource<T extends Output<T>>
- Returns:
The output factory.
-