Class DirectoryFileSource<T extends Output<T>>
java.lang.Object
org.tribuo.data.text.DirectoryFileSource<T>
- Type Parameters:
T - The type of the features built by the underlying text processing infrastructure.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>
public class DirectoryFileSource<T extends Output<T>>
extends Object
implements ConfigurableDataSource<T>
A data source for a somewhat-common format for text classification datasets:
a top-level directory that contains a number of subdirectories. Each of these
subdirectories contains the data for an output whose name is the name of the
subdirectory.
Each subdirectory contains a number of files. Each file represents a single document that should be labeled with the name of its subdirectory.
This data source produces an appropriately labeled Example<T>
from each of these files.
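The directory layout described above can be sketched with the JDK alone. This is only an illustration of the layout, not Tribuo's implementation: the class name `DirectoryLayoutSketch`, the method `readByLabel`, and the sample directory and file names are all hypothetical, and the sketch returns raw text grouped by label rather than `Example<T>` instances.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of the on-disk layout; not part of Tribuo.
final class DirectoryLayoutSketch {

    // Maps each subdirectory name (the label) to the text of the files it contains.
    static Map<String, List<String>> readByLabel(Path root) throws IOException {
        Map<String, List<String>> byLabel = new TreeMap<>();
        for (Path labelDir : sortedChildren(root)) {
            if (!Files.isDirectory(labelDir)) {
                continue; // only subdirectories define labels
            }
            List<String> texts = new ArrayList<>();
            for (Path file : sortedChildren(labelDir)) {
                if (Files.isRegularFile(file)) {
                    texts.add(Files.readString(file)); // one file = one document
                }
            }
            byLabel.put(labelDir.getFileName().toString(), texts);
        }
        return byLabel;
    }

    // Lists a directory's entries in sorted order so results are deterministic.
    private static List<Path> sortedChildren(Path dir) throws IOException {
        try (Stream<Path> children = Files.list(dir)) {
            return children.sorted().collect(Collectors.toList());
        }
    }

    // Builds a tiny temporary corpus (made-up labels and text) and reads it back.
    static Map<String, List<String>> demo() {
        try {
            Path root = Files.createTempDirectory("corpus");
            Path sports = Files.createDirectory(root.resolve("sports"));
            Path politics = Files.createDirectory(root.resolve("politics"));
            Files.writeString(sports.resolve("a.txt"), "The match ended in a draw.");
            Files.writeString(sports.resolve("b.txt"), "A new record in the 100m.");
            Files.writeString(politics.resolve("a.txt"), "The bill passed its first reading.");
            return readByLabel(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach((label, texts) ->
                System.out.println(label + " -> " + texts.size() + " document(s)"));
    }
}
```

In the real class, each document's text would additionally pass through the configured preprocessors and feature extractor, and the label string would be converted by the output factory.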
Nested Class Summary
Nested Classes
Field Summary
Fields
- protected TextFeatureExtractor<T> extractor: The extractor that we'll use to turn text into examples.
- protected OutputFactory<T> outputFactory: The factory that converts a String into an Output.
- protected List<DocumentPreprocessor> preprocessors: Document preprocessors that should be run on the documents that make up this data set.
Constructor Summary
Constructors
- protected DirectoryFileSource(): For olcut.
- public DirectoryFileSource(Path dataDir, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors): Creates a data source that will use the given feature extractor and document preprocessors on the data read from the files in the directories representing classes.
Method Summary
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
preprocessors
@Config(description="The preprocessors to apply to the input documents.") protected List<DocumentPreprocessor> preprocessors
Document preprocessors that should be run on the documents that make up this data set.
outputFactory
@Config(mandatory=true, description="The output factory to use.") protected OutputFactory<T extends Output<T>> outputFactory
The factory that converts a String into an Output.
extractor
@Config(mandatory=true, description="The feature extractor that converts text into examples.") protected TextFeatureExtractor<T extends Output<T>> extractor
The extractor that we'll use to turn text into examples.
-
-
Constructor Details
-
DirectoryFileSource
protected DirectoryFileSource()
For olcut.
DirectoryFileSource
public DirectoryFileSource(Path dataDir, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
Creates a data source that will use the given feature extractor and document preprocessors on the data read from the files in the directories representing classes.
- Parameters:
dataDir - The directory to inspect.
outputFactory - The output factory used to generate the outputs.
extractor - The text feature extractor that will run on the documents.
preprocessors - Preprocessors that will be run on the documents before their features are extracted.
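The idea of running preprocessors on a document before feature extraction can be sketched as a simple chain of text transformations. This sketch makes several assumptions: `UnaryOperator<String>` stands in for a document preprocessor, the steps are applied in list order, and the class `PreprocessorChainSketch` and its three sample steps are hypothetical. It is not a description of how Tribuo implements this internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration; UnaryOperator<String> stands in for a document preprocessor.
final class PreprocessorChainSketch {

    // Applies each preprocessing step to the document text, in list order (an assumption).
    static String applyAll(String document, List<UnaryOperator<String>> preprocessors) {
        String current = document;
        for (UnaryOperator<String> step : preprocessors) {
            current = step.apply(current);
        }
        return current;
    }

    // Runs three made-up steps: trim, lowercase, collapse runs of whitespace.
    static String demo() {
        List<UnaryOperator<String>> steps = new ArrayList<>();
        steps.add(String::trim);
        steps.add(s -> s.toLowerCase(Locale.ROOT));
        steps.add(s -> s.replaceAll("\\s+", " "));
        return applyAll("  Hello   WORLD  ", steps);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The cleaned text is what a feature extractor would then receive in place of the raw file contents.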
-
-
Method Details
-
toString
-
getOutputFactory
Description copied from interface: DataSource
Returns the OutputFactory associated with this Output subclass.
- Specified by:
getOutputFactory in interface DataSource<T extends Output<T>>
- Returns:
The output factory.
-
iterator
-
getProvenance
-