java.lang.Object

org.tribuo.data.text.TextDataSource<T>

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>

Direct Known Subclasses: SimpleTextDataSource

public abstract class TextDataSource<T extends Output<T>> extends Object implements ConfigurableDataSource<T>

A base class for textual data sets. We assume that all textual data is written and read using UTF-8.
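
Because the class assumes UTF-8 throughout, files prepared for it are best written with an explicit charset rather than the platform default. The following is a minimal sketch using only the JDK; `Utf8RoundTrip` and `roundTrip` are illustrative names and are not part of Tribuo.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tribuo: writes lines as UTF-8 and reads them back as UTF-8.
final class Utf8RoundTrip {

    // Writes the lines to a temporary file and reads them back, always using UTF-8.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("utf8-example", ".txt");
            try {
                // Explicit charset: does not depend on the platform default encoding.
                Files.write(tmp, lines, StandardCharsets.UTF_8);
                return Files.readAllLines(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        List<String> lines = Arrays.asList("na\u00efve caf\u00e9", "\u65e5\u672c\u8a9e");
        System.out.println(roundTrip(lines).equals(lines));
    }
}
```

Reading the same file with a different charset could silently corrupt non-ASCII characters, which is why the charset is passed explicitly on both sides.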

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected final List<Example<T>> | data | The actual data read out of the text file. |
| protected TextFeatureExtractor<T> | extractor | The extractor that we'll use to turn text into examples. |
| protected OutputFactory<T> | outputFactory | The factory that converts a String into an Output. |
| protected Path | path | The path that data was read from. |
| protected List<DocumentPreprocessor> | preprocessors | Document preprocessors that should be run on the documents that make up this data set. |

Constructor Summary

Constructors

| Modifier | Constructor | Description |
| --- | --- | --- |
| protected | TextDataSource() | No-argument constructor for use by the OLCUT configuration system. |
| public | TextDataSource(File file, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors) | Creates a text data set by reading it from a file. |
| public | TextDataSource(Path path, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors) | Creates a text data set by reading it from a path. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| OutputFactory<T> | getOutputFactory() | Returns the output factory used to convert the text input into an Output. |
| protected String | handleDoc(String doc) | A method that can be overridden to do different things to each document that we've read. |
| Iterator<Example<T>> | iterator() | |
| protected abstract void | read() | Reads the data from the Path. |
| String | toString() | |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig

Methods inherited from interface java.lang.Iterable
forEach, spliterator

Methods inherited from interface com.oracle.labs.mlrg.olcut.provenance.Provenancable
getProvenance

Field Details
- preprocessors
  
  @Config(description="The document preprocessors to run on each document in the data source.") protected List<DocumentPreprocessor> preprocessors
  
  Document preprocessors that should be run on the documents that make up this data set.
- path
  
  @Config(mandatory=true, description="The path to read the data from.") protected Path path
  
  The path that data was read from.
- outputFactory
  
  @Config(mandatory=true, description="The factory that converts a String into an Output instance.") protected OutputFactory<T> outputFactory
  
  The factory that converts a String into an Output.
- extractor
  
  @Config(mandatory=true, description="The feature extractor that generates Features from text.") protected TextFeatureExtractor<T> extractor
  
  The extractor that we'll use to turn text into examples.
- data
  
  protected final List<Example<T>> data
  
  The actual data read out of the text file.
Constructor Details
- TextDataSource
  
  protected TextDataSource()
  
  No-argument constructor for use by the OLCUT configuration system.
- TextDataSource
  
  public TextDataSource(Path path, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
  
  Creates a text data set by reading it from a path.
  
  Parameters:
  
  path - The path to read data from.
  
  outputFactory - The output factory used to generate the outputs.
  
  extractor - The feature extractor to run on the text.
  
  preprocessors - Processors that will be run on the data before it is added as examples.
- TextDataSource
  
  public TextDataSource(File file, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
  
  Creates a text data set by reading it from a file.
  
  Parameters:
  
  file - The file to read data from.
  
  outputFactory - The output factory used to generate the outputs.
  
  extractor - The feature extractor to run on the text.
  
  preprocessors - Processors that will be run on the data before it is added as examples.
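
The two public constructors differ only in how the location is supplied, and both accept the preprocessors as varargs. The sketch below shows one way such a pair could be laid out, using a stand-in class so that it compiles without Tribuo. `SourceSketch` and `SketchPreprocessor` are hypothetical names, the output factory and feature extractor parameters are omitted for brevity, and the delegation through `File.toPath()` is an assumption rather than something this page confirms about the real class.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for a document preprocessor (hypothetical, defined so the sketch is self-contained).
interface SketchPreprocessor {
    String processDoc(String doc);
}

// Stand-in for a text data source; NOT Tribuo's TextDataSource.
class SourceSketch {
    final Path path;
    final List<SketchPreprocessor> preprocessors;

    // Path-based constructor: stores the path and copies the varargs into a list.
    SourceSketch(Path path, SketchPreprocessor... preprocessors) {
        this.path = path;
        this.preprocessors = new ArrayList<>(Arrays.asList(preprocessors));
    }

    // File-based constructor: converts to a Path and delegates (an assumed design).
    SourceSketch(File file, SketchPreprocessor... preprocessors) {
        this(file.toPath(), preprocessors);
    }

    public static void main(String[] args) {
        // Preprocessors are optional; zero or more may be passed.
        SourceSketch none = new SourceSketch(Paths.get("data.txt"));
        SourceSketch two = new SourceSketch(new File("data.txt"), String::trim, String::toLowerCase);
        System.out.println(none.preprocessors.size() + " " + two.preprocessors.size());
    }
}
```

Taking the preprocessors as varargs means callers that need no preprocessing can simply leave the trailing arguments off.
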
Method Details
- iterator
  
  public Iterator<Example<T>> iterator()
  
  Specified by:
  
  iterator in interface Iterable<Example<T>>
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
- handleDoc
  
  protected String handleDoc(String doc)
  
  A method that can be overridden to do different things to each document that we've read. By default, it iterates over the preprocessors and applies each one to the document in turn.
  
  Parameters:
  
  doc - The document to handle
  
  Returns:
  
  a (possibly modified) version of the document.
- read
  
  protected abstract void read() throws IOException
  
  Reads the data from the Path.
  
  Throws:
  
  IOException - if there is any error reading the data.
- getOutputFactory
  
  public OutputFactory<T> getOutputFactory()
  
  Returns the output factory used to convert the text input into an Output.
  
  Specified by:
  
  getOutputFactory in interface DataSource<T>
  
  Returns:
  
  The output factory.
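
The split between `handleDoc` (a default that subclasses may override) and `read` (an abstract hook that subclasses must implement) can be sketched with stand-in types. Everything below is illustrative: `Preprocessor`, `TextSourceSketch`, and `LineSourceSketch` are hypothetical names, the documents are kept as plain strings instead of `Example<T>` instances, and the input is an in-memory string instead of a `Path` so that the sketch runs without any files. The one-document-per-line format is likewise an assumption made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for a document preprocessor (hypothetical, so the sketch compiles alone).
interface Preprocessor {
    String processDoc(String doc);
}

// Stand-in base class mirroring the handleDoc/read split; NOT Tribuo's TextDataSource.
abstract class TextSourceSketch {
    protected final List<String> data = new ArrayList<>();
    protected final List<Preprocessor> preprocessors;

    protected TextSourceSketch(Preprocessor... preprocessors) {
        this.preprocessors = Arrays.asList(preprocessors);
    }

    // Default behaviour: apply each preprocessor to the document, in order.
    protected String handleDoc(String doc) {
        for (Preprocessor p : preprocessors) {
            doc = p.processDoc(doc);
        }
        return doc;
    }

    // Subclasses decide how the raw input is split into documents.
    protected abstract void read() throws IOException;
}

// Hypothetical subclass: one document per line, blank lines skipped.
class LineSourceSketch extends TextSourceSketch {
    private final String contents;

    LineSourceSketch(String contents, Preprocessor... preprocessors) {
        super(preprocessors);
        this.contents = contents;
    }

    @Override
    protected void read() throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(contents))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    // Every document passes through handleDoc before it is stored.
                    data.add(handleDoc(line));
                }
            }
        }
    }

    // Convenience for the example: runs read() and returns the processed documents.
    static List<String> load(String contents, Preprocessor... preprocessors) {
        LineSourceSketch source = new LineSourceSketch(contents, preprocessors);
        try {
            source.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.data;
    }

    public static void main(String[] args) {
        System.out.println(load("  Hello World \n\nSecond DOC\n", String::trim, String::toLowerCase));
    }
}
```

Running `main` prints `[hello world, second doc]`: the blank line is skipped, and each remaining line is trimmed and then lower-cased, in the order the preprocessors were supplied.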
