Class TextDataSource<T extends Output<T>>

java.lang.Object
org.tribuo.data.text.TextDataSource<T>
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>
Direct Known Subclasses:
SimpleTextDataSource

public abstract class TextDataSource<T extends Output<T>> extends Object implements ConfigurableDataSource<T>
A base class for textual data sets. We assume that all textual data is written and read using UTF-8.
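As a small illustration of the UTF-8 assumption, the sketch below writes and reads a text file with an explicit charset using only the JDK. The class `Utf8ReadSketch` and its helpers `readLines` and `roundTrip` are hypothetical names that are not part of Tribuo. This shows one way a subclass might honour the convention, not how the library itself reads files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Tribuo.
class Utf8ReadSketch {

    // Reads every line of the file, decoding explicitly as UTF-8
    // instead of relying on the platform default charset.
    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Writes the lines to a temporary file as UTF-8, reads them back, then deletes the file.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("utf8-sketch", ".txt");
            try {
                Files.write(tmp, lines, StandardCharsets.UTF_8);
                return readLines(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII characters are written as Unicode escapes so the source file stays ASCII.
        System.out.println(roundTrip(Arrays.asList("caf\u00e9", "na\u00efve")));
    }
}
```

Passing the charset explicitly keeps the behaviour the same across platforms whose default encoding is not UTF-8.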
  • Field Details

    • preprocessors

      @Config(description="The document preprocessors to run on each document in the data source.") protected List<DocumentPreprocessor> preprocessors
      Document preprocessors that should be run on the documents that make up this data set.
    • path

      @Config(mandatory=true, description="The path to read the data from.") protected Path path
      The path that data was read from.
    • outputFactory

      @Config(mandatory=true, description="The factory that converts a String into an Output instance.") protected OutputFactory<T> outputFactory
      The factory that converts a String into an Output.
    • extractor

      @Config(mandatory=true, description="The feature extractor that generates Features from text.") protected TextFeatureExtractor<T> extractor
      The extractor that we'll use to turn text into examples.
    • data

      protected final List<Example<T>> data
      The actual data read out of the text file.
  • Constructor Details

    • TextDataSource

      protected TextDataSource()
      No-argument constructor for use by the OLCUT configuration system.
    • TextDataSource

      public TextDataSource(Path path, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
      Creates a text data set by reading it from a path.
      Parameters:
      path - The path to read data from.
      outputFactory - The output factory used to generate the outputs.
      extractor - The feature extractor to run on the text.
      preprocessors - Processors that will be run on the data before it is added as examples.
    • TextDataSource

      public TextDataSource(File file, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
      Creates a text data set by reading it from a file.
      Parameters:
      file - The file to read data from.
      outputFactory - The output factory used to generate the outputs.
      extractor - The feature extractor to run on the text.
      preprocessors - Processors that will be run on the data before it is added as examples.
  • Method Details

    • iterator

      public Iterator<Example<T>> iterator()
      Specified by:
      iterator in interface Iterable<Example<T>>
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • handleDoc

      protected String handleDoc(String doc)
      A method that can be overridden to change how each document is processed after it has been read. By default, it iterates over the preprocessors and applies each one to the document.
      Parameters:
      doc - The document to handle
      Returns:
      A (possibly modified) version of the document.
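The default behaviour described for handleDoc can be sketched as a simple chain. This is a minimal, self-contained illustration, not the library's source: the nested `Preprocessor` interface is a stand-in declared here so the example compiles without Tribuo on the classpath, and the sketch assumes the preprocessors run in list order, with each one receiving the previous one's output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch; none of these types come from Tribuo.
class HandleDocSketch {

    // Stand-in for a document preprocessor: one String in, one String out.
    interface Preprocessor {
        String processDoc(String doc);
    }

    private final List<Preprocessor> preprocessors;

    HandleDocSketch(Preprocessor... preprocessors) {
        this.preprocessors = Arrays.asList(preprocessors);
    }

    // Applies each preprocessor in turn, feeding each output into the next one.
    // With no preprocessors the document is returned unchanged.
    String handleDoc(String doc) {
        String current = doc;
        for (Preprocessor p : preprocessors) {
            current = p.processDoc(current);
        }
        return current;
    }

    public static void main(String[] args) {
        HandleDocSketch source = new HandleDocSketch(
                String::trim,
                d -> d.toLowerCase(Locale.ROOT));
        System.out.println(source.handleDoc("  Hello World  "));
    }
}
```

A subclass that needs different per-document handling would override handleDoc rather than change how the file is read.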
    • read

      protected abstract void read() throws IOException
      Reads the data from the Path.
      Throws:
      IOException - if there is any error reading the data.
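The division of labour between the abstract read method and handleDoc can be illustrated with a simplified stand-in. Everything below is hypothetical. `ReadSketch.TextSource` mimics only the shape of the base class and stores processed documents as Strings rather than `Example<T>`. `LinePerDoc` assumes a one-document-per-line format, calls read from its constructor, and skips blank lines. These are choices made for the sketch, not documented behaviour of Tribuo.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; no Tribuo types are used.
class ReadSketch {

    // Simplified stand-in for the base class.
    static abstract class TextSource {
        protected final Path path;
        protected final List<String> data = new ArrayList<>();

        protected TextSource(Path path) {
            this.path = path;
        }

        // Per-document hook; this stand-in only trims surrounding whitespace.
        protected String handleDoc(String doc) {
            return doc.trim();
        }

        // Subclasses decide how the file at 'path' is split into documents.
        protected abstract void read() throws IOException;
    }

    // Assumed format: each non-blank line is one document.
    static class LinePerDoc extends TextSource {
        LinePerDoc(Path path) throws IOException {
            super(path);
            read();
        }

        @Override
        protected void read() throws IOException {
            for (String line : Files.readAllLines(path, StandardCharsets.UTF_8)) {
                String doc = handleDoc(line);
                if (!doc.isEmpty()) {
                    data.add(doc);
                }
            }
        }
    }

    // Demonstration helper: writes lines to a temporary file and reads them back.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("read-sketch", ".txt");
            try {
                Files.write(tmp, lines, StandardCharsets.UTF_8);
                return new LinePerDoc(tmp).data;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("  first doc ", "", "second doc")));
    }
}
```

A real subclass would additionally pass each document through the output factory and feature extractor to build examples, which this sketch leaves out.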
    • getOutputFactory

      public OutputFactory<T> getOutputFactory()
      Returns the output factory used to convert the text input into an Output.
      Specified by:
      getOutputFactory in interface DataSource<T>
      Returns:
      The output factory.