Class DirectoryFileSource<T extends Output<T>>

java.lang.Object
org.tribuo.data.text.DirectoryFileSource<T>
Type Parameters:
T - The type of the output attached to each example produced by this data source.
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>

public class DirectoryFileSource<T extends Output<T>> extends Object implements ConfigurableDataSource<T>
A data source for a somewhat-common format for text classification datasets: a top-level directory that contains a number of subdirectories. Each of these subdirectories contains the data for an output whose name is the name of the subdirectory.

In these subdirectories are a number of files. Each file represents a single document that should be labeled with the name of the subdirectory.

This data source will produce an appropriately labeled Example<T> from each of these files.
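
The directory layout described above can be illustrated with a small standard-library sketch. It is not this class's implementation and it uses no Tribuo types: the class name DirectoryLayoutSketch, the helper methods, and the sample labels and file names are all made up for illustration. It only shows how the subdirectory name becomes the label for every file found inside it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the on-disk format; not part of Tribuo.
public class DirectoryLayoutSketch {

    /** Reads every regular file under each subdirectory, grouped by subdirectory name (the label). */
    public static Map<String, List<String>> readLabeledDocs(Path dataDir) throws IOException {
        Map<String, List<String>> docsByLabel = new TreeMap<>();
        try (DirectoryStream<Path> labelDirs = Files.newDirectoryStream(dataDir)) {
            for (Path labelDir : labelDirs) {
                if (!Files.isDirectory(labelDir)) {
                    continue; // only subdirectories define labels
                }
                String label = labelDir.getFileName().toString();
                List<String> docs = new ArrayList<>();
                try (DirectoryStream<Path> files = Files.newDirectoryStream(labelDir)) {
                    for (Path file : files) {
                        if (Files.isRegularFile(file)) {
                            // each file is one document
                            docs.add(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
                        }
                    }
                }
                docsByLabel.put(label, docs);
            }
        }
        return docsByLabel;
    }

    /** Writes a tiny two-label sample dataset into a fresh temporary directory. */
    public static Path writeSampleDataset() throws IOException {
        Path root = Files.createTempDirectory("dfs-sketch");
        Path sports = Files.createDirectories(root.resolve("sports"));
        Path politics = Files.createDirectories(root.resolve("politics"));
        Files.write(sports.resolve("doc1.txt"), "The match ended in a draw.".getBytes(StandardCharsets.UTF_8));
        Files.write(sports.resolve("doc2.txt"), "The striker scored twice.".getBytes(StandardCharsets.UTF_8));
        Files.write(politics.resolve("doc1.txt"), "The vote was postponed.".getBytes(StandardCharsets.UTF_8));
        return root;
    }

    public static void main(String[] args) throws IOException {
        Path root = writeSampleDataset();
        readLabeledDocs(root).forEach((label, docs) ->
                System.out.println(label + " -> " + docs.size() + " document(s)"));
    }
}
```

In the sample dataset, the two files under sports/ would both be labeled "sports" and the single file under politics/ would be labeled "politics".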

  • Field Details

    • preprocessors

      @Config(description="The preprocessors to apply to the input documents.") protected List<DocumentPreprocessor> preprocessors
      Document preprocessors that should be run on the documents that make up this data set.
    • outputFactory

      @Config(mandatory=true, description="The output factory to use.") protected OutputFactory<T extends Output<T>> outputFactory
      The factory that converts a String into an Output.
    • extractor

      @Config(mandatory=true, description="The feature extractor that converts text into examples.") protected TextFeatureExtractor<T extends Output<T>> extractor
      The extractor that we'll use to turn text into examples.
  • Constructor Details

    • DirectoryFileSource

      protected DirectoryFileSource()
      For OLCUT.
    • DirectoryFileSource

      public DirectoryFileSource(Path dataDir, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
      Creates a data source that will use the given feature extractor and document preprocessors on the data read from the files in the directories representing classes.
      Parameters:
      dataDir - The directory to inspect.
      outputFactory - The output factory used to generate the outputs.
      extractor - The text feature extractor that will run on the documents.
      preprocessors - Pre-processors that we will run on the documents before extracting their features.
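
The role of the preprocessors parameter can be sketched as a chain of text-to-text transformations applied to each document before feature extraction. The sketch below is standard-library Java only: the Preprocessor interface is a hypothetical stand-in rather than Tribuo's DocumentPreprocessor, and it assumes the preprocessors are applied in the order they were supplied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of a preprocessing chain; not part of Tribuo.
public class PreprocessorChainSketch {

    /** Stand-in for a document preprocessor: takes document text, returns transformed text. */
    public interface Preprocessor {
        String processDoc(String doc);
    }

    /** Applies each preprocessor in list order, feeding the output of one into the next. */
    public static String applyAll(String doc, List<Preprocessor> preprocessors) {
        String current = doc;
        for (Preprocessor p : preprocessors) {
            current = p.processDoc(current);
        }
        return current;
    }

    /** A sample chain: strip surrounding whitespace, then lowercase. */
    public static List<Preprocessor> sampleChain() {
        List<Preprocessor> chain = new ArrayList<>();
        chain.add(doc -> doc.trim());
        chain.add(doc -> doc.toLowerCase(Locale.ROOT));
        return chain;
    }

    public static void main(String[] args) {
        // The cleaned text is what a feature extractor would then receive.
        System.out.println(applyAll("  The Match Ended In A Draw.  ", sampleChain()));
    }
}
```

With an empty list the document passes through unchanged, which mirrors calling the constructor with no preprocessors.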
  • Method Details