Package org.tribuo.data.text.impl
Class SimpleTextDataSource<T extends Output<T>>
java.lang.Object
org.tribuo.data.text.TextDataSource<T>
org.tribuo.data.text.impl.SimpleTextDataSource<T>
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>
- Direct Known Subclasses:
SimpleStringDataSource
A data source for a simple text format used in text classification experiments. A line
in the file looks like:
OUTPUT##Document text
Each line in the file specifies a single output and document pair. Leading and trailing spaces will be trimmed from outputs and documents. Outputs will be converted to upper case.
As with all of our text data, the file should be in UTF-8.
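The normalisation described above can be sketched in plain Java. This is an illustration only, not Tribuo's implementation: the class name `SimpleFormatSketch`, the method `normalise`, and the choice to split at the first `##` are all assumptions made for the example.

```java
import java.util.Locale;

public class SimpleFormatSketch {
    /**
     * Splits "OUTPUT##Document text" at the first "##" (an assumption of this
     * sketch), trims both halves and upper-cases the output.
     * Returns a two element array: [output, document].
     */
    public static String[] normalise(String line) {
        int idx = line.indexOf("##");
        if (idx < 0) {
            throw new IllegalArgumentException("No ## delimiter in: " + line);
        }
        String output = line.substring(0, idx).trim().toUpperCase(Locale.ROOT);
        String document = line.substring(idx + 2).trim();
        return new String[] {output, document};
    }

    public static void main(String[] args) {
        String[] pair = normalise("  spam ##  Win a free prize now  ");
        // prints: SPAM -> Win a free prize now
        System.out.println(pair[0] + " -> " + pair[1]);
    }
}
```

In real use the data source does this work itself; the sketch only shows what a well-formed line turns into.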
-
Field Summary
protected ConfiguredDataSourceProvenance provenance
    The data source provenance.
Fields inherited from class org.tribuo.data.text.TextDataSource
data, extractor, outputFactory, path, preprocessors
-
Constructor Summary
protected SimpleTextDataSource()
    For OLCUT.
SimpleTextDataSource(File file, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor)
    Constructs a simple text data source by reading lines from the supplied file.
SimpleTextDataSource(Path path, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor)
    Constructs a simple text data source by reading lines from the supplied path.
protected SimpleTextDataSource(OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor)
    Constructs a data source without a path.
-
Method Summary
protected ConfiguredDataSourceProvenance cacheProvenance()
    Computes the provenance.
ConfiguredDataSourceProvenance getProvenance()
protected Optional<Example<T>> parseLine(String line, int n)
    Parses a line in Tribuo's default text format.
void postConfig()
    Used by the OLCUT configuration system, and should not be called by external code.
protected void read()
    Reads the data from the Path.
Methods inherited from class org.tribuo.data.text.TextDataSource
getOutputFactory, handleDoc, iterator, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
provenance
protected ConfiguredDataSourceProvenance provenance
The data source provenance.
-
-
Constructor Details
-
SimpleTextDataSource
protected SimpleTextDataSource()
For OLCUT.
-
SimpleTextDataSource
public SimpleTextDataSource(Path path, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor) throws IOException
Constructs a simple text data source by reading lines from the supplied path.
- Parameters:
path - The path to load.
outputFactory - The output factory to use.
extractor - The feature extractor.
- Throws:
IOException - If the path could not be read.
-
SimpleTextDataSource
public SimpleTextDataSource(File file, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor) throws IOException
Constructs a simple text data source by reading lines from the supplied file.
- Parameters:
file - The file to load.
outputFactory - The output factory to use.
extractor - The feature extractor.
- Throws:
IOException - If the file could not be read.
-
SimpleTextDataSource
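protected SimpleTextDataSource(OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor)
Constructs a data source without a path.
- Parameters:
outputFactory - The output factory.
extractor - The text extraction pipeline.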
-
-
Method Details
-
postConfig
public void postConfig() throws IOException
Used by the OLCUT configuration system, and should not be called by external code.
- Throws:
IOException
-
parseLine
protected Optional<Example<T>> parseLine(String line, int n)
Parses a line in Tribuo's default text format.
- Parameters:
line - The line to parse.
n - The current line number.
- Returns:
An example or an empty optional if it failed to parse.
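The "empty optional on failure" contract lets a reader skip malformed lines without aborting the whole load. The sketch below illustrates that pattern in plain Java under stated assumptions: `ParseLineSketch`, its `parseLine`, and `parseAll` are hypothetical stand-ins, a `String[]` replaces Tribuo's `Example<T>`, and treating any line that does not split into exactly two `##` separated fields as a failure is a guess at the rule, not a statement of what Tribuo does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

public class ParseLineSketch {
    /**
     * Hypothetical parser: returns [output, document], or an empty Optional
     * when the line does not contain exactly two "##" separated fields.
     */
    public static Optional<String[]> parseLine(String line, int n) {
        String[] fields = line.trim().split("##");
        if (fields.length != 2) {
            // The line number makes the skipped line easy to locate.
            System.err.println("Skipping malformed line " + n);
            return Optional.empty();
        }
        String output = fields[0].trim().toUpperCase(Locale.ROOT);
        return Optional.of(new String[] {output, fields[1].trim()});
    }

    /** Parses every line, keeping only those that parsed successfully. */
    public static List<String[]> parseAll(List<String> lines) {
        List<String[]> parsed = new ArrayList<>();
        int n = 0;
        for (String line : lines) {
            n++; // line numbers are 1-based in this sketch
            parseLine(line, n).ifPresent(parsed::add);
        }
        return parsed;
    }

    public static void main(String[] args) {
        List<String[]> parsed = parseAll(
            Arrays.asList("pos##Great film", "no delimiter here", "neg## Dull plot "));
        System.out.println(parsed.size() + " of 3 lines parsed");
    }
}
```

Returning `Optional` rather than throwing keeps the reading loop simple: one bad line costs a log message, not the whole data set.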
-
read
protected void read() throws IOException
Description copied from class: TextDataSource
Reads the data from the Path.
- Specified by:
read in class TextDataSource<T extends Output<T>>
- Throws:
IOException - if there is any error reading the data.
-
getProvenance
-
cacheProvenance
protected ConfiguredDataSourceProvenance cacheProvenance()
Computes the provenance.
- Returns:
The provenance.
-