org.tribuo.data.text.DirectoryFileSource<T>

Type Parameters: T - The type of the features built by the underlying text processing infrastructure.

All Implemented Interfaces: com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>

public class DirectoryFileSource<T extends Output<T>> extends Object implements ConfigurableDataSource<T>

A data source for a fairly common text classification dataset layout: a top-level directory that contains a number of subdirectories. Each subdirectory holds the data for one output, and the name of the subdirectory is the name of that output.

In these subdirectories are a number of files. Each file represents a single document that should be labeled with the name of the subdirectory.

This data source produces an appropriately labeled Example<T> from each of these files.
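The layout described above can be sketched with the JDK alone. The class below is not Tribuo's implementation; `DirectoryLayoutSketch`, `LabeledDocument`, and `read` are hypothetical names used only to show how each file is paired with the name of its parent subdirectory. It assumes UTF-8 text files and ignores anything that is not a regular file inside a first-level subdirectory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class DirectoryLayoutSketch {

    /** A (label, text) pair; a hypothetical stand-in for a labeled example. */
    public static final class LabeledDocument {
        public final String label;
        public final String text;

        public LabeledDocument(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    /** Reads every file in every first-level subdirectory of dataDir. */
    public static List<LabeledDocument> read(Path dataDir) throws IOException {
        List<LabeledDocument> docs = new ArrayList<>();
        try (DirectoryStream<Path> labelDirs = Files.newDirectoryStream(dataDir)) {
            for (Path labelDir : labelDirs) {
                if (!Files.isDirectory(labelDir)) {
                    continue; // files directly under dataDir carry no label
                }
                // The subdirectory name is the label for every file inside it.
                String label = labelDir.getFileName().toString();
                try (DirectoryStream<Path> files = Files.newDirectoryStream(labelDir)) {
                    for (Path file : files) {
                        if (Files.isRegularFile(file)) {
                            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                            docs.add(new LabeledDocument(label, text));
                        }
                    }
                }
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny dataset: two labels, one document each.
        Path root = Files.createTempDirectory("dfs-sketch");
        Files.createDirectories(root.resolve("sports"));
        Files.createDirectories(root.resolve("politics"));
        Files.write(root.resolve("sports").resolve("doc1.txt"),
                "The match ended in a draw.".getBytes(StandardCharsets.UTF_8));
        Files.write(root.resolve("politics").resolve("doc2.txt"),
                "The bill passed its second reading.".getBytes(StandardCharsets.UTF_8));

        // Iteration order of a DirectoryStream is unspecified.
        for (LabeledDocument d : read(root)) {
            System.out.println(d.label + " <- " + d.text);
        }
    }
}
```

In the real class, the label string would be handed to the OutputFactory and the text to the TextFeatureExtractor instead of being stored as plain strings.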

Nested Class Summary

- static class DirectoryFileSource.DirectoryFileSourceProvenance
  
  Provenance for DirectoryFileSource.
Field Summary

- protected TextFeatureExtractor<T> extractor
  
  The extractor that we'll use to turn text into examples.
- protected OutputFactory<T> outputFactory
  
  The factory that converts a String into an Output.
- protected List<DocumentPreprocessor> preprocessors
  
  Document preprocessors that should be run on the documents that make up this data set.
Constructor Summary

- protected DirectoryFileSource()
  
  No-argument constructor used by the OLCUT configuration system.
- public DirectoryFileSource(Path dataDir, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
  
  Creates a data source that will use the given feature extractor and document preprocessors on the data read from the files in the directories representing classes.
Method Summary

- OutputFactory<T> getOutputFactory()
  
  Returns the OutputFactory associated with this Output subclass.
- ConfiguredDataSourceProvenance getProvenance()
- Iterator<Example<T>> iterator()
- String toString()

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig

Methods inherited from interface java.lang.Iterable
forEach, spliterator

Field Details
- preprocessors
  
  @Config(description="The preprocessors to apply to the input documents.") protected List<DocumentPreprocessor> preprocessors
  
  Document preprocessors that should be run on the documents that make up this data set.
- outputFactory
  
  @Config(mandatory=true, description="The output factory to use.") protected OutputFactory<T> outputFactory
  
  The factory that converts a String into an Output.
- extractor
  
  @Config(mandatory=true, description="The feature extractor that converts text into examples.") protected TextFeatureExtractor<T> extractor
  
  The extractor that we'll use to turn text into examples.
Constructor Details
- DirectoryFileSource
  
  protected DirectoryFileSource()
  
  No-argument constructor used by the OLCUT configuration system.
- DirectoryFileSource
  
  public DirectoryFileSource(Path dataDir, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
  
  Creates a data source that will use the given feature extractor and document preprocessors on the data read from the files in the directories representing classes.
  
  Parameters:
  
  dataDir - The directory to inspect.
  
  outputFactory - The output factory used to generate the outputs.
  
  extractor - The text feature extractor that will run on the documents.
  
  preprocessors - Pre-processors that we will run on the documents before extracting their features.
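The role of the `preprocessors` parameter can be illustrated without Tribuo on the classpath. The sketch below models each preprocessor as a `UnaryOperator<String>` and applies the chain before any feature extraction would happen. `PreprocessorChainSketch` and `preprocess` are hypothetical names, and applying the preprocessors in the order given is an assumption rather than documented behavior of this class.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

public class PreprocessorChainSketch {

    /**
     * Applies each preprocessor to the document text in turn.
     * Assumption: preprocessors run in the order they are supplied.
     */
    @SafeVarargs
    public static String preprocess(String document, UnaryOperator<String>... preprocessors) {
        String current = document;
        for (UnaryOperator<String> preprocessor : preprocessors) {
            current = preprocessor.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Trim surrounding whitespace, then lowercase the text.
        String cleaned = preprocess("  Hello WORLD  ",
                String::trim,
                s -> s.toLowerCase(Locale.ROOT));
        System.out.println(cleaned); // prints "hello world"
    }
}
```

Passing no preprocessors leaves the text unchanged, which matches the varargs shape of the constructor: the parameter can simply be omitted.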
Method Details
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
- getOutputFactory
  
  public OutputFactory<T> getOutputFactory()
  
  Description copied from interface: DataSource
  
  Returns the OutputFactory associated with this Output subclass.
  
  Specified by:
  
  getOutputFactory in interface DataSource<T extends Output<T>>
  
  Returns:
  
  The output factory.
- iterator
  
  public Iterator<Example<T>> iterator()
  
  Specified by:
  
  iterator in interface Iterable<Example<T>>
- getProvenance
  
  public ConfiguredDataSourceProvenance getProvenance()
  
  Specified by:
  
  getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>
