public class RowProcessor<T extends Output<T>> extends Object implements com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
A processor which takes a Map of String to String and returns an Example.
It accepts a ResponseProcessor, which converts the response field into an Output; a Map of FieldProcessors, which convert fields into ColumnarFeatures; and a Set of FeatureProcessors, which process the list of ColumnarFeatures before Example construction. Optionally, metadata and weights can be extracted using FieldExtractors and written into each example as it is constructed.
If the metadata extractors are invalid (i.e., two extractors write to the same metadata key), the RowProcessor throws PropertyException.
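The validity rule for metadata extractors can be illustrated with a minimal sketch. It uses only the standard library: `MetadataKeyCheck` and `validate` are hypothetical names, the extractors are reduced to the metadata keys they write, and `IllegalArgumentException` stands in for OLCUT's PropertyException.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the duplicate-key check; not part of Tribuo. */
final class MetadataKeyCheck {
    private MetadataKeyCheck() {}

    /**
     * Throws if two extractors would write to the same metadata key.
     * IllegalArgumentException is used here in place of PropertyException
     * so the sketch has no dependencies.
     */
    static void validate(List<String> metadataKeys) {
        Set<String> seen = new HashSet<>();
        for (String key : metadataKeys) {
            // Set.add returns false when the key was already present.
            if (!seen.add(key)) {
                throw new IllegalArgumentException(
                    "Two metadata extractors write to the key '" + key + "'");
            }
        }
    }

    public static void main(String[] args) {
        validate(Arrays.asList("id", "timestamp")); // distinct keys are accepted
        try {
            validate(Arrays.asList("id", "id"));    // a repeated key is rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```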
Modifier and Type | Field and Description |
---|---|
protected boolean | configured |
protected Map<String,FieldProcessor> | fieldProcessorMap |
protected Map<String,FieldProcessor> | regexMappingProcessors |
protected boolean | replaceNewlinesWithSpaces |
protected ResponseProcessor<T> | responseProcessor |
protected FieldExtractor<Float> | weightExtractor |
Modifier | Constructor and Description |
---|---|
protected | RowProcessor() For OLCUT. |
public | RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Map<String,FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors) Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed. |
public | RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Map<String,FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors, boolean replaceNewlinesWithSpaces) Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed. |
public | RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors) Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed. |
public | RowProcessor(List<FieldExtractor<?>> metadataExtractors, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap) Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed. |
public | RowProcessor(ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap) Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed. |
public | RowProcessor(ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors) Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed. |
Modifier and Type | Method and Description |
---|---|
RowProcessor<T> | copy() Deprecated. In a future release this API will change; in the meantime this is the correct way to get a row processor with clean state. When using regexMappingProcessors, RowProcessor is stateful in a way that can sometimes make it fail the second time it is used. Concretely, given RowProcessor rp; Dataset ds1 = new MutableDataset(new CSVDataSource(csvfile1, rp)); Dataset ds2 = new MutableDataset(new CSVDataSource(csvfile2, rp)); the construction of ds2 may fail due to state in rp. This method returns a RowProcessor with clean state and the same configuration as this row processor. |
void | expandRegexMapping(Collection<String> fieldNames) Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied list of field names. |
void | expandRegexMapping(ImmutableFeatureMap featureMap) Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied feature map. |
void | expandRegexMapping(Model<T> model) Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the ImmutableFeatureMap contained in the supplied Model. |
Optional<Example<T>> | generateExample(ColumnarIterator.Row row, boolean outputRequired) Generate an Example from the supplied row. |
Optional<Example<T>> | generateExample(long idx, Map<String,String> row, boolean outputRequired) Generate an Example from the supplied row. |
Optional<Example<T>> | generateExample(Map<String,String> row, boolean outputRequired) Generate an Example from the supplied row. |
List<ColumnarFeature> | generateFeatures(Map<String,String> row) Generates the features from the supplied row. |
Map<String,Object> | generateMetadata(ColumnarIterator.Row row) Generates the example metadata from the supplied row and index. |
Set<String> | getColumnNames() The set of column names this will use for the feature processing. |
String | getDescription() Returns a description of the row processor and its fields. |
Set<FeatureProcessor> | getFeatureProcessors() Returns the set of FeatureProcessors this RowProcessor uses. |
Map<String,FieldProcessor> | getFieldProcessors() Returns the map of FieldProcessors this RowProcessor uses. |
Map<String,Class<?>> | getMetadataTypes() Returns the metadata keys and value types that are extracted by this RowProcessor. |
com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance | getProvenance() |
ResponseProcessor<T> | getResponseProcessor() Returns the response processor this RowProcessor uses. |
boolean | isConfigured() Returns true if the regexes have been expanded into field processors. |
protected Set<String> | partialExpandRegexMapping(Collection<String> fieldNames) Caveat Implementor! This method contains the logic of expandRegexMapping(org.tribuo.Model<T>) without any of the checks that ensure the RowProcessor is in a valid state. |
void | postConfig() Used by the OLCUT configuration system, and should not be called by external code. |
String | toString() |
@Config(description="Extractor for the example weight.") protected FieldExtractor<Float> weightExtractor
@Config(mandatory=true, description="Processor which extracts the response.") protected ResponseProcessor<T> responseProcessor
protected Map<String,FieldProcessor> fieldProcessorMap
@Config(description="A map from a regex to field processors to apply to fields matching the regex.") protected Map<String,FieldProcessor> regexMappingProcessors
@Config(description="Replace newlines with spaces in values before passing them to field processors.") protected boolean replaceNewlinesWithSpaces
protected boolean configured
public RowProcessor(ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap)
This processor does not generate any additional metadata for the examples, nor does it set the weight value on generated examples.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
public RowProcessor(ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors)
After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.
This processor does not generate any additional metadata for the examples, nor does it set the weight value on generated examples.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
featureProcessors - The feature processors to run on each extracted feature list.
public RowProcessor(List<FieldExtractor<?>> metadataExtractors, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap)
Additionally this processor can extract and populate metadata fields on the generated examples (e.g., the row number, date stamps).
metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors)
After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.
Additionally this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.
metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
weightExtractor - The weight extractor; if null, the weights are left unset at their default.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
featureProcessors - The feature processors to run on each extracted feature list.
public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Map<String,FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors)
In addition this processor can instantiate field processors which match the regexes supplied in the regexMappingProcessors. If a regex matches a field which already has a fieldProcessor assigned to it, it throws an IllegalArgumentException.
After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.
Additionally this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.
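The regex mapping rule described for this constructor can be sketched with plain collections. This is a model under stated assumptions, not Tribuo's implementation: `RegexExpansionSketch` and `expand` are hypothetical names, processors are represented by strings, and a regex is assumed to need to match the whole field name.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical model of regex expansion; processors are represented by plain names. */
final class RegexExpansionSketch {
    private RegexExpansionSketch() {}

    /**
     * Returns a map from field name to processor name holding the explicit
     * field processors plus one entry for every field matched by a regex.
     * Throws IllegalArgumentException if a matched field already has a
     * processor, whether from the explicit map or from an earlier regex.
     */
    static Map<String, String> expand(Map<String, String> fieldProcessors,
                                      Map<String, String> regexProcessors,
                                      Collection<String> fieldNames) {
        Map<String, String> expanded = new LinkedHashMap<>(fieldProcessors);
        for (Map.Entry<String, String> regex : regexProcessors.entrySet()) {
            // Assumption: the regex must match the entire field name.
            Pattern pattern = Pattern.compile(regex.getKey());
            for (String field : fieldNames) {
                if (pattern.matcher(field).matches()) {
                    if (expanded.containsKey(field)) {
                        throw new IllegalArgumentException(
                            "Regex '" + regex.getKey() + "' matches field '" + field
                            + "', which already has a processor");
                    }
                    expanded.put(field, regex.getValue());
                }
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("age", "double");
        Map<String, String> regexes = new LinkedHashMap<>();
        regexes.put("word_.*", "text");
        // "id" matches nothing and is simply not given a processor.
        System.out.println(expand(fields, regexes, Arrays.asList("age", "word_1", "word_2", "id")));
    }
}
```

Fields that no regex matches are left without a processor in this sketch, which mirrors the idea that only the mapped fields are parsed.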
metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
weightExtractor - The weight extractor; if null, the weights are left unset at their default.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
regexMappingProcessors - A map from regexes to field processors, which are instantiated for the field names the regexes match.
featureProcessors - The feature processors to run on each extracted feature list.
public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Map<String,FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors, boolean replaceNewlinesWithSpaces)
In addition this processor can instantiate field processors which match the regexes supplied in the regexMappingProcessors. If a regex matches a field which already has a fieldProcessor assigned to it, it throws an IllegalArgumentException.
After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.
Additionally this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.
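The effect of the replaceNewlinesWithSpaces flag accepted by this constructor can be pictured with a small sketch. `NewlineSketch` and `clean` are hypothetical names, and the choice to turn each line break (including a `\r\n` pair) into exactly one space is an assumption; the documentation only states that newlines are replaced with spaces before values reach the field processors.

```java
/** Hypothetical illustration of newline replacement on a field value. */
final class NewlineSketch {
    private NewlineSketch() {}

    /**
     * Returns the value with each line break replaced by one space when the
     * flag is set, and the value unchanged otherwise.
     * Assumption: "\R" (any Unicode line break) is the unit being replaced.
     */
    static String clean(String value, boolean replaceNewlinesWithSpaces) {
        return replaceNewlinesWithSpaces ? value.replaceAll("\\R", " ") : value;
    }

    public static void main(String[] args) {
        System.out.println(clean("first line\nsecond line", true));
    }
}
```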
metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
weightExtractor - The weight extractor; if null, the weights are left unset at their default.
responseProcessor - The response processor to use.
fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
regexMappingProcessors - A map from regexes to field processors, which are instantiated for the field names the regexes match.
featureProcessors - The feature processors to run on each extracted feature list.
replaceNewlinesWithSpaces - Replace newlines with spaces in values before passing them to field processors.
protected RowProcessor()
For OLCUT.
public void postConfig()
Used by the OLCUT configuration system, and should not be called by external code.
Specified by: postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
public ResponseProcessor<T> getResponseProcessor()
Returns the response processor this RowProcessor uses.
public Map<String,FieldProcessor> getFieldProcessors()
Returns the map of FieldProcessors this RowProcessor uses.
public Set<FeatureProcessor> getFeatureProcessors()
Returns the set of FeatureProcessors this RowProcessor uses.
public Optional<Example<T>> generateExample(ColumnarIterator.Row row, boolean outputRequired)
Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found. The latter case is used at training time.
row - The row to process.
outputRequired - If an Output must be found in the row to return an Example.
public Optional<Example<T>> generateExample(Map<String,String> row, boolean outputRequired)
Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found.
Supplies -1 as the example index, used in cases where the index isn't meaningful.
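The return contract described here (an empty Optional when there are no features or when a required response is missing, and -1 as the index when none is supplied) can be modelled in a few lines. Everything in this sketch is hypothetical: the class name, the fixed response field "label", and the use of a String in place of an Example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of the generateExample return contract; not Tribuo code. */
final class GenerateExampleSketch {
    /** Assumed name of the response column in this sketch. */
    static final String RESPONSE_FIELD = "label";

    private GenerateExampleSketch() {}

    /** Returns a description of the example, or empty if it cannot be built. */
    static Optional<String> generateExample(long idx, Map<String, String> row, boolean outputRequired) {
        String response = row.get(RESPONSE_FIELD);
        if (outputRequired && response == null) {
            return Optional.empty(); // response required but not found
        }
        int featureCount = 0;
        for (String field : row.keySet()) {
            if (!field.equals(RESPONSE_FIELD)) {
                featureCount++;
            }
        }
        if (featureCount == 0) {
            return Optional.empty(); // no features
        }
        return Optional.of("idx=" + idx + " features=" + featureCount + " response=" + response);
    }

    /** Overload without an index: delegates with -1, as the text above describes. */
    static Optional<String> generateExample(Map<String, String> row, boolean outputRequired) {
        return generateExample(-1, row, outputRequired);
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("x", "1.0");
        System.out.println(generateExample(row, true).isPresent());  // no response, required
        System.out.println(generateExample(row, false).isPresent()); // no response, not required
    }
}
```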
row - The row to process.
outputRequired - If an Output must be found in the row to return an Example.
public Optional<Example<T>> generateExample(long idx, Map<String,String> row, boolean outputRequired)
Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found. The latter case is used at training time.
idx - The index for use in the example metadata if desired.
row - The row to process.
outputRequired - If an Output must be found in the row to return an Example.
public Map<String,Object> generateMetadata(ColumnarIterator.Row row)
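Generates the example metadata from the supplied row and index.
row - The row to process.
public List<ColumnarFeature> generateFeatures(Map<String,String> row)
Generates the features from the supplied row.
row - The row to process.
Returns a list of ColumnarFeatures.
public Set<String> getColumnNames()
The set of column names this will use for the feature processing.
public String getDescription()
Returns a description of the row processor and its fields.
public Map<String,Class<?>> getMetadataTypes()
Returns the metadata keys and value types that are extracted by this RowProcessor.
public boolean isConfigured()
Returns true if the regexes have been expanded into field processors.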
public void expandRegexMapping(Model<T> model)
Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the ImmutableFeatureMap contained in the supplied Model. Throws an IllegalArgumentException if any regexes overlap with each other or with the fields currently defined in fieldProcessorMap.
model - The model to use to expand the regexes.
public void expandRegexMapping(ImmutableFeatureMap featureMap)
Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied feature map. Throws an IllegalArgumentException if any regexes overlap with each other or with the fields currently defined in fieldProcessorMap.
featureMap - The feature map to use to expand the regexes.
public void expandRegexMapping(Collection<String> fieldNames)
Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied list of field names. Throws an IllegalArgumentException if any regexes overlap with each other or with the fields currently defined in fieldProcessorMap, or if there are unmatched regexes after processing.
fieldNames - The list of field names.
protected Set<String> partialExpandRegexMapping(Collection<String> fieldNames)
Caveat Implementor! This method contains the logic of expandRegexMapping(org.tribuo.Model<T>) without any of the checks that ensure the RowProcessor is in a valid state. This can be used in a subclass to expand a regex mapping several times for a single instance of RowProcessor. The caller is responsible for ensuring that fieldNames are not duplicated within or between calls.
fieldNames - The list of field names; should contain only previously unseen field names.
@Deprecated public RowProcessor<T> copy()
Deprecated. In a future release this API will change; in the meantime this is the correct way to get a row processor with clean state.
When using regexMappingProcessors, RowProcessor is stateful in a way that can sometimes make it fail the second time it is used. Concretely:
    RowProcessor rp;
    Dataset ds1 = new MutableDataset(new CSVDataSource(csvfile1, rp));
    Dataset ds2 = new MutableDataset(new CSVDataSource(csvfile2, rp)); // this may fail due to state in rp
This method returns a RowProcessor with clean state and the same configuration as this row processor.
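The distinction between configuration (kept by a copy) and state (reset by a copy) can be illustrated with a toy class. `CopySketch` is hypothetical and far simpler than RowProcessor; it assumes the relevant state is the set of expanded fields and a configured flag, which is a guess based on isConfigured() and not a description of the library's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical toy processor separating configuration from accumulated state. */
final class CopySketch {
    private final List<String> regexes;                            // configuration
    private final List<String> expandedFields = new ArrayList<>(); // state
    private boolean configured = false;                            // state

    CopySketch(List<String> regexes) {
        this.regexes = new ArrayList<>(regexes);
    }

    /** Records every field name matched by a regex and marks this instance configured. */
    void expand(Collection<String> fieldNames) {
        for (String regex : regexes) {
            for (String field : fieldNames) {
                if (Pattern.matches(regex, field)) {
                    expandedFields.add(field);
                }
            }
        }
        configured = true;
    }

    boolean isConfigured() { return configured; }

    List<String> regexes() { return new ArrayList<>(regexes); }

    List<String> expandedFields() { return new ArrayList<>(expandedFields); }

    /** Returns a new instance with the same regexes and none of the accumulated state. */
    CopySketch copy() {
        return new CopySketch(regexes);
    }

    public static void main(String[] args) {
        CopySketch first = new CopySketch(Arrays.asList("word_.*"));
        first.expand(Arrays.asList("word_1", "id"));
        CopySketch fresh = first.copy();
        System.out.println(first.isConfigured() + " " + fresh.isConfigured());
    }
}
```

In terms of the snippet above, this suggests handing the second data source `rp.copy()` instead of `rp`, so that each source starts from a processor that has not yet expanded its regexes.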
public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
Specified by: getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
Copyright © 2015–2021 Oracle and/or its affiliates. All rights reserved.