Package org.tribuo.data.text
Class DirectoryFileSource<T extends Output<T>>
java.lang.Object
org.tribuo.data.text.DirectoryFileSource<T>
- Type Parameters:
T
- The type of the features built by the underlying text processing infrastructure.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<DataSourceProvenance>, Iterable<Example<T>>, ConfigurableDataSource<T>, DataSource<T>
public class DirectoryFileSource<T extends Output<T>>
extends Object
implements ConfigurableDataSource<T>
A data source for a somewhat-common format for text classification datasets:
a top level directory that contains a number of subdirectories. Each of these
subdirectories contains the data for an output whose name is the name of the
subdirectory.
Each subdirectory contains a number of files. Each file represents a single document that should be labeled with the name of its subdirectory.
This data source produces an appropriately labeled Example<T>
from each of these files.
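The layout described above can be illustrated with a minimal, self-contained sketch. It uses only `java.nio` and is not Tribuo's implementation: the class `DirectoryLayoutSketch`, the nested `LabeledDocument` type, and the method `readLabeledDocuments` are hypothetical names introduced for illustration. It assumes UTF-8 files, ignores nested subdirectories, and sorts entries only to make the output deterministic.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration of the directory layout; not part of Tribuo. */
public class DirectoryLayoutSketch {

    /** A document's text paired with the label taken from its parent directory. */
    public static final class LabeledDocument {
        public final String label;
        public final String text;

        public LabeledDocument(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    /** Reads every regular file one level below dataDir, labeling it with its directory name. */
    public static List<LabeledDocument> readLabeledDocuments(Path dataDir) throws IOException {
        List<LabeledDocument> documents = new ArrayList<>();
        for (Path labelDir : sortedChildren(dataDir)) {
            if (!Files.isDirectory(labelDir)) {
                continue; // stray files at the top level carry no label
            }
            String label = labelDir.getFileName().toString();
            for (Path file : sortedChildren(labelDir)) {
                if (Files.isRegularFile(file)) {
                    // Assumption: documents are UTF-8 encoded text.
                    String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    documents.add(new LabeledDocument(label, text));
                }
            }
        }
        return documents;
    }

    /** Lists a directory's entries in sorted order so results are deterministic. */
    private static List<Path> sortedChildren(Path dir) throws IOException {
        try (Stream<Path> children = Files.list(dir)) {
            return children.sorted().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny two-class corpus in a temporary directory.
        Path root = Files.createTempDirectory("corpus");
        Path sports = Files.createDirectories(root.resolve("sports"));
        Path politics = Files.createDirectories(root.resolve("politics"));
        Files.write(sports.resolve("doc1.txt"), "The match ended in a draw.".getBytes(StandardCharsets.UTF_8));
        Files.write(politics.resolve("doc2.txt"), "The vote is on Tuesday.".getBytes(StandardCharsets.UTF_8));

        for (LabeledDocument doc : readLabeledDocuments(root)) {
            System.out.println(doc.label + " <- " + doc.text);
        }
    }
}
```

In the real class, the label string would be converted into an Output by the configured OutputFactory, and the text would be turned into features by the TextFeatureExtractor, rather than being kept as raw strings.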
-
Nested Class Summary
-
Field Summary
Modifier and Type, Field, Description
protected TextFeatureExtractor<T> extractor
- The extractor that we'll use to turn text into examples.
protected OutputFactory<T> outputFactory
- The factory that converts a String into an Output.
protected List<DocumentPreprocessor> preprocessors
- Document preprocessors that should be run on the documents that make up this data set.
-
Constructor Summary
Modifier, Constructor, Description
protected DirectoryFileSource()
- for olcut
DirectoryFileSource(Path dataDir, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
- Creates a data source that will use the given feature extractor and document preprocessors on the data read from the files in the directories representing classes.
-
Method Summary
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface com.oracle.labs.mlrg.olcut.config.Configurable
postConfig
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Details
-
preprocessors
@Config(description="The preprocessors to apply to the input documents.")
protected List<DocumentPreprocessor> preprocessors
Document preprocessors that should be run on the documents that make up this data set.
-
outputFactory
@Config(mandatory=true, description="The output factory to use.")
protected OutputFactory<T> outputFactory
The factory that converts a String into an Output.
-
extractor
@Config(mandatory=true, description="The feature extractor that converts text into examples.")
protected TextFeatureExtractor<T> extractor
The extractor that we'll use to turn text into examples.
-
-
Constructor Details
-
DirectoryFileSource
protected DirectoryFileSource()
for olcut
-
DirectoryFileSource
public DirectoryFileSource(Path dataDir, OutputFactory<T> outputFactory, TextFeatureExtractor<T> extractor, DocumentPreprocessor... preprocessors)
Creates a data source that will use the given feature extractor and document preprocessors on the data read from the files in the directories representing classes.
- Parameters:
dataDir
- The directory to inspect.
outputFactory
- The output factory used to generate the outputs.
extractor
- The text feature extractor that will run on the documents.
preprocessors
- Pre-processors that we will run on the documents before extracting their features.
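The role of the preprocessors parameter can be sketched as a chain of text transformations applied to each document before feature extraction. The sketch below is a self-contained illustration, not Tribuo code: it uses `UnaryOperator<String>` as a hypothetical stand-in for DocumentPreprocessor, the names `PreprocessorChainSketch` and `applyAll` are invented for this example, and applying the preprocessors in the order given is an assumption rather than something this page states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical illustration of chaining document preprocessors; not part of Tribuo. */
public class PreprocessorChainSketch {

    /**
     * Applies each preprocessor to the document text in list order
     * (assumed ordering) and returns the final text.
     */
    public static String applyAll(String document, List<UnaryOperator<String>> preprocessors) {
        String current = document;
        for (UnaryOperator<String> preprocessor : preprocessors) {
            current = preprocessor.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> preprocessors = new ArrayList<>();
        preprocessors.add(String::trim);                          // strip surrounding whitespace
        preprocessors.add(s -> s.toLowerCase(Locale.ROOT));       // normalize case
        preprocessors.add(s -> s.replaceAll("\\s+", " "));        // collapse runs of whitespace

        // The cleaned text is what a feature extractor would then receive.
        System.out.println(applyAll("  Hello   WORLD \n", preprocessors));
    }
}
```

Passing no preprocessors leaves the text unchanged, which mirrors the varargs constructor being callable with only the directory, output factory, and extractor.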
-
-
Method Details
-
toString
-
getOutputFactory
Description copied from interface: DataSource
Returns the OutputFactory associated with this Output subclass.
- Specified by:
getOutputFactory
in interface DataSource<T extends Output<T>>
- Returns:
- The output factory.
-
iterator
-
getProvenance
-