Class RowProcessor<T extends Output<T>>
- Type Parameters:
T - The output type.
- All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
A processor which converts a row, supplied as a Map from String to String, into an Example.
It accepts a ResponseProcessor which converts the response field into an Output, a Map of FieldProcessors which convert fields into ColumnarFeatures, and a Set of FeatureProcessors which process the list of ColumnarFeatures before Example construction. Optionally, metadata and weights can be extracted using FieldExtractors and written into each example as it is constructed.
If the metadata extractors are invalid (i.e., two extractors write to the same metadata key), the RowProcessor throws PropertyException.
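The duplicate-key rule can be illustrated with a minimal JDK-only sketch. It is not Tribuo's implementation: the class `MetadataKeyCheck` and its `validate` method are hypothetical, each extractor is reduced to the metadata key it writes, and `IllegalArgumentException` stands in for OLCUT's `PropertyException`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: each extractor is represented only by the metadata key it writes. */
final class MetadataKeyCheck {

    /**
     * Rejects a configuration in which two extractors write to the same key.
     * IllegalArgumentException is an assumption standing in for PropertyException.
     */
    static void validate(List<String> metadataKeys) {
        Set<String> seen = new HashSet<>();
        for (String key : metadataKeys) {
            // Set.add returns false when the key is already present.
            if (!seen.add(key)) {
                throw new IllegalArgumentException(
                        "Two metadata extractors write to the key '" + key + "'");
            }
        }
    }

    public static void main(String[] args) {
        validate(List.of("rowNumber", "timestamp")); // distinct keys are accepted
        try {
            validate(List.of("rowNumber", "rowNumber"));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Failing at construction time, rather than when the first row is processed, keeps an invalid configuration from producing partially populated examples.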
-
Nested Class Summary
Modifier and Type | Class | Description
static class | RowProcessor.Builder<T extends Output<T>> | Builder for RowProcessor.
-
Field Summary
Modifier and Type | Field | Description
protected boolean | configured | Has this row processor been configured?
protected Map<String,FieldProcessor> | fieldProcessorMap | The map of field processors.
protected Map<String,FieldProcessor> | regexMappingProcessors | The map of regexes to field processors.
protected boolean | replaceNewlinesWithSpaces | Should newlines be replaced with spaces before processing?
protected ResponseProcessor<T> | responseProcessor | The processor which extracts the response.
protected FieldExtractor<Float> | weightExtractor | The extractor for the example weight.
-
Constructor Summary
Modifier | Constructor | Description
protected | RowProcessor() | For olcut.
public | RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap, Map<String, FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors) | Deprecated. Prefer RowProcessor.Builder to many-argument constructors.
public | RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap, Map<String, FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors, boolean replaceNewlinesWithSpaces) | Deprecated. Prefer RowProcessor.Builder to many-argument constructors.
public | RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors) | Deprecated. Prefer RowProcessor.Builder to many-argument constructors.
public | RowProcessor(List<FieldExtractor<?>> metadataExtractors, ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap) | Deprecated. Prefer RowProcessor.Builder to many-argument constructors.
public | RowProcessor(ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap) | Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.
public | RowProcessor(ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors) | Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.
-
Method Summary
Modifier and Type | Method | Description
RowProcessor<T> | copy() | Deprecated. In a future release this API will change; in the meantime this is the correct way to get a row processor with clean state.
void | expandRegexMapping(Collection<String> fieldNames) | Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied list of field names.
void | expandRegexMapping(ImmutableFeatureMap featureMap) | Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied feature map.
void | expandRegexMapping(Model<T> model) | Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the ImmutableFeatureMap contained in the supplied Model.
Optional<Example<T>> | generateExample(long idx, Map<String, String> row, boolean outputRequired) | Generate an Example from the supplied row.
Optional<Example<T>> | generateExample(Map<String, String> row, boolean outputRequired) | Generate an Example from the supplied row.
Optional<Example<T>> | generateExample(ColumnarIterator.Row row, boolean outputRequired) | Generate an Example from the supplied row.
List<ColumnarFeature> | generateFeatures(Map<String, String> row) | Generates the features from the supplied row.
Map<String,Object> | generateMetadata(ColumnarIterator.Row row) | Generates the example metadata from the supplied row and index.
Set<String> | getColumnNames() | The set of column names this will use for the feature processing.
String | getDescription() | Returns a description of the row processor and its fields.
Set<FeatureProcessor> | getFeatureProcessors() | Returns the set of FeatureProcessors this RowProcessor uses.
Map<String,FieldProcessor> | getFieldProcessors() | Returns the map of FieldProcessors this RowProcessor uses.
Map<String,Class<?>> | getMetadataTypes() | Returns the metadata keys and value types that are extracted by this RowProcessor.
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
ResponseProcessor<T> | getResponseProcessor() | Returns the response processor this RowProcessor uses.
boolean | isConfigured() | Returns true if the regexes have been expanded into field processors.
protected Set<String> | partialExpandRegexMapping(Collection<String> fieldNames) | Caveat Implementor! This method contains the logic of expandRegexMapping(org.tribuo.Model<T>) without any of the checks that ensure the RowProcessor is in a valid state.
void | postConfig() | Used by the OLCUT configuration system, and should not be called by external code.
String | toString() |
-
Field Details
-
weightExtractor
@Config(description="Extractor for the example weight.") protected FieldExtractor<Float> weightExtractor
The extractor for the example weight.
-
responseProcessor
@Config(mandatory=true, description="Processor which extracts the response.") protected ResponseProcessor<T extends Output<T>> responseProcessor
The processor which extracts the response.
-
fieldProcessorMap
The map of field processors.
-
regexMappingProcessors
@Config(description="A map from a regex to field processors to apply to fields matching the regex.") protected Map<String,FieldProcessor> regexMappingProcessors
The map of regexes to field processors.
-
replaceNewlinesWithSpaces
@Config(description="Replace newlines with spaces in values before passing them to field processors.") protected boolean replaceNewlinesWithSpaces
Should newlines be replaced with spaces before processing?
-
configured
protected boolean configured
Has this row processor been configured?
-
-
Constructor Details
-
RowProcessor
public RowProcessor(ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap)
Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.
This processor does not generate any additional metadata for the examples, nor does it set the weight value on generated examples.
- Parameters:
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
-
RowProcessor
public RowProcessor(ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors)
Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.
After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.
This processor does not generate any additional metadata for the examples, nor does it set the weight value on generated examples.
- Parameters:
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
featureProcessors - The feature processors to run on each extracted feature list.
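As a rough illustration of the two uses of feature processors mentioned above (conjunctions and filtering), here is a JDK-only sketch. It does not use Tribuo's `FeatureProcessor` interface: features are modelled as plain strings, `FeatureProcessorSketch`, `conjunction`, `dropPrefix` and `apply` are hypothetical names, and the processors are held in a `List` so that the order of application is deterministic (the constructor itself takes a `Set`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: a feature processor is modelled as a function over feature names. */
final class FeatureProcessorSketch {

    /** Adds the conjunction feature "first AND second" when both inputs are present. */
    static UnaryOperator<List<String>> conjunction(String first, String second) {
        return features -> {
            List<String> out = new ArrayList<>(features);
            if (features.contains(first) && features.contains(second)) {
                out.add(first + " AND " + second);
            }
            return out;
        };
    }

    /** Filters out every feature whose name starts with the given prefix. */
    static UnaryOperator<List<String>> dropPrefix(String prefix) {
        return features -> {
            List<String> out = new ArrayList<>();
            for (String feature : features) {
                if (!feature.startsWith(prefix)) {
                    out.add(feature);
                }
            }
            return out;
        };
    }

    /** Runs each processor in turn over the extracted feature list. */
    static List<String> apply(List<String> features,
                              List<UnaryOperator<List<String>>> processors) {
        List<String> current = features;
        for (UnaryOperator<List<String>> processor : processors) {
            current = processor.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> extracted = List.of("colour=red", "size=large", "debug=id42");
        List<String> processed = apply(extracted,
                List.of(conjunction("colour=red", "size=large"), dropPrefix("debug=")));
        System.out.println(processed);
    }
}
```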
-
RowProcessor
@Deprecated public RowProcessor(List<FieldExtractor<?>> metadataExtractors, ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap)
Deprecated. Prefer RowProcessor.Builder to many-argument constructors.
Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.
Additionally, this processor can extract and populate metadata fields on the generated examples (e.g., the row number, date stamps).
- Parameters:
metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
-
RowProcessor
@Deprecated public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors)
Deprecated. Prefer RowProcessor.Builder to many-argument constructors.
Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.
After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.
Additionally, this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.
- Parameters:
metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
weightExtractor - The weight extractor; if null, the weights are left unset at their default.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
featureProcessors - The feature processors to run on each extracted feature list.
-
RowProcessor
@Deprecated public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap, Map<String, FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors)
Deprecated. Prefer RowProcessor.Builder to many-argument constructors.
Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.
In addition, this processor can instantiate field processors which match the regexes supplied in the regexMappingProcessors. If a regex matches a field which already has a fieldProcessor assigned to it, it throws an IllegalArgumentException.
After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.
Additionally, this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.
- Parameters:
metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
weightExtractor - The weight extractor; if null, the weights are left unset at their default.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
regexMappingProcessors - A set of field processors which can be instantiated if the regexes match the field names.
featureProcessors - The feature processors to run on each extracted feature list.
-
RowProcessor
@Deprecated public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String, FieldProcessor> fieldProcessorMap, Map<String, FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors, boolean replaceNewlinesWithSpaces)
Deprecated. Prefer RowProcessor.Builder to many-argument constructors.
Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.
In addition, this processor can instantiate field processors which match the regexes supplied in the regexMappingProcessors. If a regex matches a field which already has a fieldProcessor assigned to it, it throws an IllegalArgumentException.
After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.
Additionally, this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.
- Parameters:
metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
weightExtractor - The weight extractor; if null, the weights are left unset at their default.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
regexMappingProcessors - A set of field processors which can be instantiated if the regexes match the field names.
featureProcessors - The feature processors to run on each extracted feature list.
replaceNewlinesWithSpaces - Replace newlines with spaces in values before passing them to field processors.
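The effect of replaceNewlinesWithSpaces can be pictured with a small JDK-only sketch. The class `NewlineSketch` and its method are hypothetical, and the exact replacement Tribuo performs (for example, how carriage returns are treated) is not stated here, so the one-for-one `'\n'` to `' '` substitution is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of cleaning row values before they reach the field processors. */
final class NewlineSketch {

    /** Returns a copy of the row with every '\n' in a value replaced by a single space. */
    static Map<String, String> replaceNewlines(Map<String, String> row) {
        Map<String, String> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : row.entrySet()) {
            // Assumption: only the line feed character is replaced.
            cleaned.put(entry.getKey(), entry.getValue().replace('\n', ' '));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, String> row = Map.of("review", "Great product.\nWould buy again.");
        System.out.println(replaceNewlines(row).get("review"));
    }
}
```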
-
RowProcessor
protected RowProcessor()
For olcut.
-
-
Method Details
-
postConfig
public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
- Specified by:
postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
-
getResponseProcessor
Returns the response processor this RowProcessor uses.
- Returns:
The response processor.
-
getFieldProcessors
Returns the map of FieldProcessors this RowProcessor uses.
- Returns:
The field processors.
-
getFeatureProcessors
Returns the set of FeatureProcessors this RowProcessor uses.
- Returns:
The feature processors.
-
generateExample
Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found. The latter case is used at training time.
- Parameters:
row - The row to process.
outputRequired - If an Output must be found in the row to return an Example.
- Returns:
An Optional containing an Example if the row was valid, an empty Optional otherwise.
-
generateExample
Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found.
Supplies -1 as the example index, used in cases where the index isn't meaningful.
- Parameters:
row - The row to process.
outputRequired - If an Output must be found in the row to return an Example.
- Returns:
An Optional containing an Example if the row was valid, an empty Optional otherwise.
-
generateExample
public Optional<Example<T>> generateExample(long idx, Map<String, String> row, boolean outputRequired)
Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found. The latter case is used at training time.
- Parameters:
idx - The index for use in the example metadata if desired.
row - The row to process.
outputRequired - If an Output must be found in the row to return an Example.
- Returns:
An Optional containing an Example if the row was valid, an empty Optional otherwise.
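The return contract of the generateExample overloads can be modelled with a JDK-only sketch. This is not Tribuo's code: `GenerateExampleSketch`, `SketchExample`, and the fixed field names `label`, `height` and `width` are hypothetical, and features are reduced to `name=value` strings. It only shows when the Optional is empty and how the overload without an index delegates with -1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of the Optional-returning contract described above. */
final class GenerateExampleSketch {

    static final String RESPONSE_FIELD = "label";                      // assumed response column
    static final List<String> FEATURE_FIELDS = List.of("height", "width"); // assumed feature columns

    /** Minimal stand-in for an example: index, response (may be null), features. */
    static final class SketchExample {
        final long idx;
        final String response;
        final List<String> features;

        SketchExample(long idx, String response, List<String> features) {
            this.idx = idx;
            this.response = response;
            this.features = features;
        }
    }

    static Optional<SketchExample> generateExample(long idx, Map<String, String> row,
                                                   boolean outputRequired) {
        String response = row.get(RESPONSE_FIELD);
        if (response == null && outputRequired) {
            return Optional.empty(); // response required (training time) but not found
        }
        List<String> features = new ArrayList<>();
        for (String field : FEATURE_FIELDS) {
            String value = row.get(field);
            if (value != null) {
                features.add(field + "=" + value);
            }
        }
        if (features.isEmpty()) {
            return Optional.empty(); // no features were extracted
        }
        return Optional.of(new SketchExample(idx, response, features));
    }

    /** Overload for callers without a meaningful index: supplies -1. */
    static Optional<SketchExample> generateExample(Map<String, String> row,
                                                   boolean outputRequired) {
        return generateExample(-1L, row, outputRequired);
    }

    public static void main(String[] args) {
        Map<String, String> unlabelled = Map.of("height", "1.5");
        System.out.println(generateExample(unlabelled, true).isPresent());
        System.out.println(generateExample(unlabelled, false).isPresent());
    }
}
```

Returning an empty Optional lets a data source skip invalid rows without using exceptions for control flow; at inference time, passing `outputRequired = false` keeps rows that have features but no response.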
-
generateMetadata
Generates the example metadata from the supplied row and index.
- Parameters:
row - The row to process.
- Returns:
A (possibly empty) map containing the metadata.
-
generateFeatures
Generates the features from the supplied row.
- Parameters:
row - The row to process.
- Returns:
A (possibly empty) list of ColumnarFeatures.
-
getColumnNames
The set of column names this will use for the feature processing.
- Returns:
The set of column names it processes.
-
getDescription
Returns a description of the row processor and its fields.
- Returns:
A String description of the RowProcessor.
-
toString
-
getMetadataTypes
Returns the metadata keys and value types that are extracted by this RowProcessor.
- Returns:
The metadata keys and value types.
-
isConfigured
public boolean isConfigured()
Returns true if the regexes have been expanded into field processors.
- Returns:
True if the RowProcessor has seen the set of input fields.
-
expandRegexMapping
Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the ImmutableFeatureMap contained in the supplied Model. Throws an IllegalArgumentException if any regexes overlap with each other, or with the fields already defined in fieldProcessorMap.
- Parameters:
model - The model to use to expand the regexes.
-
expandRegexMapping
Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied feature map. Throws an IllegalArgumentException if any regexes overlap with each other, or with the fields already defined in fieldProcessorMap.
- Parameters:
featureMap - The feature map to use to expand the regexes.
-
expandRegexMapping
Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied list of field names. Throws an IllegalArgumentException if any regexes overlap with each other, or with the fields already defined in fieldProcessorMap, or if there are unmatched regexes after processing.
- Parameters:
fieldNames - The list of field names.
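The checks described for the field-name variant can be sketched with the JDK's regex classes. This is an illustration under assumptions, not Tribuo's implementation: `RegexExpansionSketch` and `expand` are hypothetical, field processors are represented by plain strings, and a regex is assumed to match a field only when it matches the whole name (`Matcher.matches()`).

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of expanding regex mappings into per-field processors. */
final class RegexExpansionSketch {

    /**
     * Returns the explicit field processors plus one entry per field matched by a regex.
     * Throws IllegalArgumentException when a regex matches a field that already has a
     * processor, when two regexes match the same field, or when a regex matches nothing.
     */
    static Map<String, String> expand(Map<String, String> fieldProcessors,
                                      Map<String, String> regexProcessors,
                                      Collection<String> fieldNames) {
        Map<String, String> expanded = new HashMap<>(fieldProcessors);
        Map<String, String> claimedBy = new HashMap<>(); // field name -> regex that matched it
        for (Map.Entry<String, String> entry : regexProcessors.entrySet()) {
            String regex = entry.getKey();
            Pattern pattern = Pattern.compile(regex);
            boolean matchedAny = false;
            for (String field : fieldNames) {
                if (!pattern.matcher(field).matches()) {
                    continue;
                }
                matchedAny = true;
                if (fieldProcessors.containsKey(field)) {
                    throw new IllegalArgumentException("Regex '" + regex
                            + "' matches field '" + field + "' which already has a processor");
                }
                String previous = claimedBy.put(field, regex);
                if (previous != null) {
                    throw new IllegalArgumentException("Regexes '" + previous + "' and '"
                            + regex + "' both match field '" + field + "'");
                }
                expanded.put(field, entry.getValue());
            }
            if (!matchedAny) {
                throw new IllegalArgumentException("Regex '" + regex + "' matched no fields");
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, String> expanded = expand(
                Map.of("id", "idProcessor"),
                Map.of("feat_.*", "numericProcessor"),
                List.of("id", "feat_a", "feat_b"));
        System.out.println(expanded.keySet());
    }
}
```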
-
partialExpandRegexMapping
Caveat Implementor! This method contains the logic of expandRegexMapping(org.tribuo.Model<T>) without any of the checks that ensure the RowProcessor is in a valid state. This can be overridden in a subclass to expand a regex mapping several times for a single instance of RowProcessor. The caller is responsible for ensuring that fieldNames are not duplicated within or between calls.
- Parameters:
fieldNames - The list of field names; it should contain only previously unseen field names.
- Returns:
The set of regexes that were matched by fieldNames.
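To make the "several times for a single instance" usage concrete, here is a JDK-only sketch of an unchecked, incremental expansion. It is a model only: `PartialExpansionSketch` and `partialExpand` are hypothetical, processors are plain strings, whole-name matching is assumed, and, like the method described above, the sketch performs no duplicate or overlap checks.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch of expanding regexes batch by batch with no validity checks. */
final class PartialExpansionSketch {

    private final Map<String, String> regexProcessors;
    private final Map<String, String> fieldProcessors = new HashMap<>();

    PartialExpansionSketch(Map<String, String> regexProcessors) {
        this.regexProcessors = new HashMap<>(regexProcessors);
    }

    /**
     * Adds a processor for every new field matched by a regex and returns the regexes
     * that matched. The caller must pass only previously unseen field names.
     */
    Set<String> partialExpand(Collection<String> fieldNames) {
        Set<String> matchedRegexes = new HashSet<>();
        for (Map.Entry<String, String> entry : regexProcessors.entrySet()) {
            Pattern pattern = Pattern.compile(entry.getKey());
            for (String field : fieldNames) {
                if (pattern.matcher(field).matches()) {
                    fieldProcessors.put(field, entry.getValue());
                    matchedRegexes.add(entry.getKey());
                }
            }
        }
        return matchedRegexes;
    }

    /** Read-only view of the processors accumulated across all calls. */
    Map<String, String> fieldProcessors() {
        return Collections.unmodifiableMap(fieldProcessors);
    }

    public static void main(String[] args) {
        PartialExpansionSketch sketch =
                new PartialExpansionSketch(Map.of("feat_.*", "numericProcessor"));
        System.out.println(sketch.partialExpand(List.of("feat_a", "id")));
        System.out.println(sketch.partialExpand(List.of("feat_b")));
        System.out.println(sketch.fieldProcessors().size());
    }
}
```

Because nothing is validated, the returned set is the only signal a subclass gets about which regexes have been used so far; any check for unmatched regexes has to be done by the caller after the final batch.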
-
copy
Deprecated. In a future release this API will change; in the meantime this is the correct way to get a row processor with clean state.
When using regexMappingProcessors, RowProcessor is stateful in a way that can sometimes make it fail the second time it is used. Concretely:
    RowProcessor rp;
    Dataset ds1 = new MutableDataset(new CSVDataSource(csvfile1, rp));
    Dataset ds2 = new MutableDataset(new CSVDataSource(csvfile2, rp)); // this may fail due to state in rp
This method returns a RowProcessor with clean state and the same configuration as this row processor.
- Returns:
A RowProcessor instance with clean state and the same configuration as this row processor.
-
getProvenance
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
-