Class SimpleTextDataSource<T extends Output<T>>

java.lang.Object
org.tribuo.data.text.TextDataSource<T>
org.tribuo.data.text.impl.SimpleTextDataSource<T>
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>
Direct Known Subclasses:
SimpleStringDataSource

public class SimpleTextDataSource<T extends Output<T>> extends TextDataSource<T>
A data source for a simple data format for text classification experiments. A line in the file looks like:
 OUTPUT##Document text
 
Each line in the file specifies a single output and document pair. Leading and trailing spaces will be trimmed from outputs and documents. Outputs will be converted to upper case.

As with all of our text data, the file should be in UTF-8.
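
The line format described above can be illustrated with a small standalone sketch. It does not use Tribuo and is not the library's parser: the class `SimpleFormatSketch` and the holder `ParsedLine` are hypothetical names. Splitting on the first `##`, rejecting lines with an empty output or document, and upper-casing with `Locale.ROOT` are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Optional;

public class SimpleFormatSketch {

    /** Hypothetical holder for one parsed line; not a Tribuo type. */
    public static final class ParsedLine {
        public final String output;
        public final String document;

        ParsedLine(String output, String document) {
            this.output = output;
            this.document = document;
        }
    }

    /**
     * Splits "OUTPUT##Document text" into its two parts.
     * Assumption: the first "##" is the delimiter, and a line with no
     * delimiter or an empty part is treated as unparseable.
     */
    public static Optional<ParsedLine> parseLine(String line) {
        int idx = line.indexOf("##");
        if (idx < 0) {
            return Optional.empty();
        }
        // Trim both parts; outputs are additionally upper-cased.
        String output = line.substring(0, idx).trim().toUpperCase(Locale.ROOT);
        String document = line.substring(idx + 2).trim();
        if (output.isEmpty() || document.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new ParsedLine(output, document));
    }

    public static void main(String[] args) {
        ParsedLine p = parseLine("  spam ## Win a free prize now  ").get();
        System.out.println(p.output + " | " + p.document); // prints "SPAM | Win a free prize now"
    }
}
```

In the real class the output string is handed to the `OutputFactory` and the document text to the `TextFeatureExtractor`; the sketch stops at the two trimmed strings.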

  • Field Details

  • Constructor Details

    • SimpleTextDataSource

      protected SimpleTextDataSource()
      For OLCUT.
    • SimpleTextDataSource

      public SimpleTextDataSource(Path path, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor) throws IOException
      Constructs a simple text data source by reading lines from the supplied path.
      Parameters:
      path - The path to load.
      outputFactory - The output factory to use.
      extractor - The feature extractor.
      Throws:
      IOException - If the path could not be read.
    • SimpleTextDataSource

      public SimpleTextDataSource(File file, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor) throws IOException
      Constructs a simple text data source by reading lines from the supplied file.
      Parameters:
      file - The file to load.
      outputFactory - The output factory to use.
      extractor - The feature extractor.
      Throws:
      IOException - If the file could not be read.
    • SimpleTextDataSource

      protected SimpleTextDataSource(OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor)
      Constructs a data source without a path.
      Parameters:
      outputFactory - The output factory.
      extractor - The text extraction pipeline.
  • Method Details

    • postConfig

      public void postConfig() throws IOException
      Used by the OLCUT configuration system, and should not be called by external code.
      Throws:
      IOException
    • parseLine

      protected Optional<Example<T>> parseLine(String line, int n)
      Parses a line in Tribuo's default text format.
      Parameters:
      line - The line to parse.
      n - The current line number.
      Returns:
      An example, or an empty optional if the line failed to parse.
    • read

      protected void read() throws IOException
      Description copied from class: TextDataSource
      Reads the data from the Path.
      Specified by:
      read in class TextDataSource<T extends Output<T>>
      Throws:
      IOException - if there is any error reading the data.
    • getProvenance

      public ConfiguredDataSourceProvenance getProvenance()
    • cacheProvenance

      protected ConfiguredDataSourceProvenance cacheProvenance()
      Computes the provenance.
      Returns:
      The provenance.
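
The relationship between `read` and `parseLine(String line, int n)` can be pictured as a loop that reads UTF-8 lines, passes each one to a parser together with its line number, and keeps only the lines that parsed. The sketch below is standalone and does not reproduce Tribuo's code: `ReadLoopSketch`, `readAll`, and the `BiFunction` stand-in for `parseLine` are hypothetical, and counting lines from 1 is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

public class ReadLoopSketch {

    /**
     * Reads every line of a UTF-8 file and collects the ones the parser accepts.
     * The parser plays the role of parseLine(line, n); lines that yield an
     * empty optional are skipped rather than aborting the read.
     */
    public static <E> List<E> readAll(Path path,
                                      BiFunction<String, Integer, Optional<E>> parser)
            throws IOException {
        List<E> examples = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            int n = 0; // assumption: line numbers are 1-based
            while ((line = reader.readLine()) != null) {
                n++;
                parser.apply(line, n).ifPresent(examples::add);
            }
        }
        return examples;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("simple-text", ".txt");
        Files.write(tmp,
                Arrays.asList("POS##Great film", "no delimiter here", "NEG##Dull plot"),
                StandardCharsets.UTF_8);
        // Toy parser: accept lines containing "##" and tag them with their line number.
        List<String> kept = readAll(tmp, (line, n) ->
                line.contains("##") ? Optional.of(n + ":" + line) : Optional.empty());
        System.out.println(kept); // prints "[1:POS##Great film, 3:NEG##Dull plot]"
        Files.delete(tmp);
    }
}
```

Returning an empty optional for a bad line lets a single malformed row be dropped while the rest of the file still loads, which matches the documented return contract of `parseLine`.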