Class RowProcessor<T extends Output<T>>

java.lang.Object
org.tribuo.data.columnar.RowProcessor<T>
All Implemented Interfaces:
com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>

public class RowProcessor<T extends Output<T>> extends Object implements com.oracle.labs.mlrg.olcut.config.Configurable, com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>
A processor which takes a Map of String to String and returns an Example.

It accepts a ResponseProcessor which converts the response field into an Output, a Map of FieldProcessors which convert fields into ColumnarFeatures, and a Set of FeatureProcessors which process the list of ColumnarFeatures before Example construction. Optionally, metadata and weights can be extracted using FieldExtractors and written into each example as it is constructed.

If the metadata extractors are invalid (i.e., two extractors write to the same metadata key), the RowProcessor throws PropertyException.
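
The duplicate-key rule above can be illustrated with a small stand-alone sketch. It does not use Tribuo's classes: MetadataKeyCheck and findDuplicateKeys are hypothetical names introduced only for illustration, and the real RowProcessor signals the problem with a PropertyException rather than returning the offending names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Tribuo: finds metadata names
// that more than one extractor would write to.
final class MetadataKeyCheck {
    private MetadataKeyCheck() {}

    /** Returns the metadata names that occur more than once, in sorted order. */
    static Set<String> findDuplicateKeys(List<String> metadataNames) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String name : metadataNames) {
            // Set.add returns false when the name was already present.
            if (!seen.add(name)) {
                duplicates.add(name);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Two extractors both claim the "row-id" key, so it is reported.
        System.out.println(findDuplicateKeys(List.of("row-id", "date", "row-id")));
    }
}
```

A non-empty result corresponds to the invalid configuration described above.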

  • Field Details

    • weightExtractor

      @Config(description="Extractor for the example weight.") protected FieldExtractor<Float> weightExtractor
    • responseProcessor

      @Config(mandatory=true, description="Processor which extracts the response.") protected ResponseProcessor<T> responseProcessor
    • fieldProcessorMap

      protected Map<String,FieldProcessor> fieldProcessorMap
    • regexMappingProcessors

      @Config(description="A map from a regex to field processors to apply to fields matching the regex.") protected Map<String,FieldProcessor> regexMappingProcessors
    • replaceNewlinesWithSpaces

      @Config(description="Replace newlines with spaces in values before passing them to field processors.") protected boolean replaceNewlinesWithSpaces
    • configured

      protected boolean configured
  • Constructor Details

    • RowProcessor

      public RowProcessor(ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap)
      Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.

      This processor does not generate any additional metadata for the examples, nor does it set the weight value on generated examples.

      Parameters:
      responseProcessor - The response processor to use.
      fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
    • RowProcessor

      public RowProcessor(ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors)
      Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.

      After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.

      This processor does not generate any additional metadata for the examples, nor does it set the weight value on generated examples.

      Parameters:
      responseProcessor - The response processor to use.
      fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
      featureProcessors - The feature processors to run on each extracted feature list.
    • RowProcessor

      public RowProcessor(List<FieldExtractor<?>> metadataExtractors, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap)
      Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.

      Additionally this processor can extract and populate metadata fields on the generated examples (e.g., the row number, date stamps).

      Parameters:
      metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
      responseProcessor - The response processor to use.
      fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
    • RowProcessor

      public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Set<FeatureProcessor> featureProcessors)
      Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.

      After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.

      Additionally this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.

      Parameters:
      metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
      weightExtractor - The weight extractor. If null, the weights are left at their default value.
      responseProcessor - The response processor to use.
      fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
      featureProcessors - The feature processors to run on each extracted feature list.
    • RowProcessor

      public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Map<String,FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors)
      Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.

      In addition, this processor can instantiate field processors for fields which match the regexes supplied in regexMappingProcessors. If a regex matches a field which already has a field processor assigned to it, an IllegalArgumentException is thrown.

      After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.

      Additionally this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.

      Parameters:
      metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
      weightExtractor - The weight extractor. If null, the weights are left at their default value.
      responseProcessor - The response processor to use.
      fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
      regexMappingProcessors - A map from regexes to field processors; a processor is instantiated for each field name its regex matches.
      featureProcessors - The feature processors to run on each extracted feature list.
    • RowProcessor

      public RowProcessor(List<FieldExtractor<?>> metadataExtractors, FieldExtractor<Float> weightExtractor, ResponseProcessor<T> responseProcessor, Map<String,FieldProcessor> fieldProcessorMap, Map<String,FieldProcessor> regexMappingProcessors, Set<FeatureProcessor> featureProcessors, boolean replaceNewlinesWithSpaces)
      Constructs a RowProcessor using the supplied responseProcessor to extract the response variable, and the supplied fieldProcessorMap to control which fields are parsed and how they are parsed.

      In addition, this processor can instantiate field processors for fields which match the regexes supplied in regexMappingProcessors. If a regex matches a field which already has a field processor assigned to it, an IllegalArgumentException is thrown.

      After extraction the features are then processed using the supplied set of feature processors. These processors can be used to insert conjunction features which are triggered when multiple features appear, or to filter out unnecessary features.

      Additionally this processor can extract a weight from each row and insert it into the example, along with more general metadata fields (e.g., the row number, date stamps). The weightExtractor can be null, and if so the weights are left unset.

      Parameters:
      metadataExtractors - The metadata extractors to run per example. If two metadata extractors emit the same metadata name then the constructor throws a PropertyException.
      weightExtractor - The weight extractor. If null, the weights are left at their default value.
      responseProcessor - The response processor to use.
      fieldProcessorMap - The keys are the field names and the values are the field processors to apply to those fields.
      regexMappingProcessors - A map from regexes to field processors; a processor is instantiated for each field name its regex matches.
      featureProcessors - The feature processors to run on each extracted feature list.
      replaceNewlinesWithSpaces - Replace newlines with spaces in values before passing them to field processors.
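
      The effect of replaceNewlinesWithSpaces can be pictured with the stand-alone sketch below. It is an illustration only: NewlineCleaner and clean are hypothetical names, and the exact set of characters Tribuo treats as a newline is an assumption here (the sketch uses \r\n, \r and \n).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tribuo.
final class NewlineCleaner {
    // Assumption: \r\n counts as a single newline, as do lone \r and \n.
    private static final Pattern NEWLINES = Pattern.compile("\\r\\n|\\r|\\n");

    private NewlineCleaner() {}

    /** Returns a copy of the row, with newlines in each value replaced by a space if requested. */
    static Map<String, String> clean(Map<String, String> row, boolean replaceNewlinesWithSpaces) {
        Map<String, String> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            String value = e.getValue();
            if (replaceNewlinesWithSpaces && value != null) {
                value = NEWLINES.matcher(value).replaceAll(" ");
            }
            cleaned.put(e.getKey(), value);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(clean(Map.of("text", "first line\r\nsecond line"), true));
    }
}
```

      Cleaning values this way keeps multi-line text fields from leaking line breaks into whatever the field processors do with the value.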
    • RowProcessor

      protected RowProcessor()
      For OLCUT.
  • Method Details

    • postConfig

      public void postConfig()
      Used by the OLCUT configuration system, and should not be called by external code.
      Specified by:
      postConfig in interface com.oracle.labs.mlrg.olcut.config.Configurable
    • getResponseProcessor

      public ResponseProcessor<T> getResponseProcessor()
      Returns the response processor this RowProcessor uses.
      Returns:
      The response processor.
    • getFieldProcessors

      public Map<String,FieldProcessor> getFieldProcessors()
      Returns the map of FieldProcessors this RowProcessor uses.
      Returns:
      The field processors.
    • getFeatureProcessors

      public Set<FeatureProcessor> getFeatureProcessors()
      Returns the set of FeatureProcessors this RowProcessor uses.
      Returns:
      The feature processors.
    • generateExample

      public Optional<Example<T>> generateExample(ColumnarIterator.Row row, boolean outputRequired)
      Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found. The latter case is used at training time.
      Parameters:
      row - The row to process.
      outputRequired - If an Output must be found in the row to return an Example.
      Returns:
      An Optional containing an Example if the row was valid, an empty Optional otherwise.
    • generateExample

      public Optional<Example<T>> generateExample(Map<String,String> row, boolean outputRequired)
      Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found.

      Supplies -1 as the example index, used in cases where the index isn't meaningful.

      Parameters:
      row - The row to process.
      outputRequired - If an Output must be found in the row to return an Example.
      Returns:
      An Optional containing an Example if the row was valid, an empty Optional otherwise.
    • generateExample

      public Optional<Example<T>> generateExample(long idx, Map<String,String> row, boolean outputRequired)
      Generate an Example from the supplied row. Returns an empty Optional if there are no features, or the response is required but it was not found. The latter case is used at training time.
      Parameters:
      idx - The index for use in the example metadata if desired.
      row - The row to process.
      outputRequired - If an Output must be found in the row to return an Example.
      Returns:
      An Optional containing an Example if the row was valid, an empty Optional otherwise.
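
      The return contract shared by the generateExample overloads can be sketched without Tribuo on the classpath. Everything in the block below is hypothetical (GenerateExampleSketch, SketchExample, the "key=value" feature strings, and treating an empty string as a missing response); it only mirrors the documented rules: an empty Optional when the response is required but absent, or when no features are produced, and -1 as the index when none is supplied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the documented behaviour, not Tribuo's implementation.
final class GenerateExampleSketch {
    /** Minimal stand-in for an Example: an index, an optional response and some features. */
    static final class SketchExample {
        final long idx;
        final String response; // null when no response was found and none was required
        final List<String> features;

        SketchExample(long idx, String response, List<String> features) {
            this.idx = idx;
            this.response = response;
            this.features = features;
        }
    }

    private GenerateExampleSketch() {}

    /** Convenience overload: uses -1 when the index is not meaningful. */
    static Optional<SketchExample> generate(Map<String, String> row, String responseField, boolean outputRequired) {
        return generate(-1, row, responseField, outputRequired);
    }

    static Optional<SketchExample> generate(long idx, Map<String, String> row, String responseField, boolean outputRequired) {
        String response = row.get(responseField);
        boolean hasResponse = response != null && !response.isEmpty();
        if (outputRequired && !hasResponse) {
            return Optional.empty(); // training-time case: a row without a response is skipped
        }
        List<String> features = new ArrayList<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            boolean isResponse = e.getKey().equals(responseField);
            if (!isResponse && e.getValue() != null && !e.getValue().isEmpty()) {
                features.add(e.getKey() + "=" + e.getValue());
            }
        }
        if (features.isEmpty()) {
            return Optional.empty(); // a row with no features never yields an example
        }
        return Optional.of(new SketchExample(idx, hasResponse ? response : null, features));
    }
}
```

      Callers would typically pass outputRequired as true when building training data and false when scoring unlabelled rows.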
    • generateMetadata

      public Map<String,Object> generateMetadata(ColumnarIterator.Row row)
      Generates the example metadata from the supplied row.
      Parameters:
      row - The row to process.
      Returns:
      A (possibly empty) map containing the metadata.
    • generateFeatures

      public List<ColumnarFeature> generateFeatures(Map<String,String> row)
      Generates the features from the supplied row.
      Parameters:
      row - The row to process.
      Returns:
      A (possibly empty) list of ColumnarFeatures.
    • getColumnNames

      public Set<String> getColumnNames()
      The set of column names this will use for the feature processing.
      Returns:
      The set of column names it processes.
    • getDescription

      public String getDescription()
      Returns a description of the row processor and its fields.
      Returns:
      A String description of the RowProcessor.
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getMetadataTypes

      public Map<String,Class<?>> getMetadataTypes()
      Returns the metadata keys and value types that are extracted by this RowProcessor.
      Returns:
      The metadata keys and value types.
    • isConfigured

      public boolean isConfigured()
      Returns true if the regexes have been expanded into field processors.
      Returns:
      True if the RowProcessor has seen the set of input fields.
    • expandRegexMapping

      public void expandRegexMapping(Model<T> model)
      Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the ImmutableFeatureMap contained in the supplied Model. Throws an IllegalArgumentException if any regexes overlap with each other, or with the field processors already defined in fieldProcessorMap.
      Parameters:
      model - The model to use to expand the regexes.
    • expandRegexMapping

      public void expandRegexMapping(ImmutableFeatureMap featureMap)
      Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied feature map. Throws an IllegalArgumentException if any regexes overlap with each other, or with the field processors already defined in fieldProcessorMap.
      Parameters:
      featureMap - The feature map to use to expand the regexes.
    • expandRegexMapping

      public void expandRegexMapping(Collection<String> fieldNames)
      Uses similar logic to TransformationMap.validateTransformations(org.tribuo.FeatureMap) to check the regexes against the supplied list of field names. Throws an IllegalArgumentException if any regexes overlap with each other or with the field processors already defined in fieldProcessorMap, or if there are unmatched regexes after processing.
      Parameters:
      fieldNames - The list of field names.
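
      The checks described for the expandRegexMapping overloads can be sketched in plain Java. This is not Tribuo's implementation: RegexExpansionSketch and expand are hypothetical names, processors are represented by plain strings, and the use of a full match (Matcher.matches) rather than a partial one is an assumption.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of regex expansion with overlap and unmatched-regex checks.
final class RegexExpansionSketch {
    private RegexExpansionSketch() {}

    /**
     * Returns the explicit field-to-processor map extended with one entry per field matched by a regex.
     * Throws IllegalArgumentException if a field would receive two processors, or a regex matches nothing.
     */
    static Map<String, String> expand(Map<String, String> explicitProcessors,
                                      Map<String, String> regexProcessors,
                                      Collection<String> fieldNames) {
        Map<String, String> expanded = new LinkedHashMap<>(explicitProcessors);
        for (Map.Entry<String, String> e : regexProcessors.entrySet()) {
            Pattern pattern = Pattern.compile(e.getKey());
            boolean matched = false;
            for (String field : fieldNames) {
                if (pattern.matcher(field).matches()) { // assumption: whole-name match
                    // Trips on overlap with an explicit processor or another regex;
                    // a field name repeated in fieldNames would also trip it.
                    if (expanded.containsKey(field)) {
                        throw new IllegalArgumentException("Field '" + field + "' already has a processor");
                    }
                    expanded.put(field, e.getValue());
                    matched = true;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("Regex '" + e.getKey() + "' matched no field names");
            }
        }
        return expanded;
    }
}
```

      For instance, with an explicit processor for "id" and the regex "feat_.*", the field names "id", "feat_a" and "feat_b" expand to three entries, while adding an explicit processor for "feat_a" would trigger the overlap exception.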
    • partialExpandRegexMapping

      protected Set<String> partialExpandRegexMapping(Collection<String> fieldNames)
      Caveat Implementor! This method contains the logic of expandRegexMapping(org.tribuo.Model<T>) without any of the checks that ensure the RowProcessor is in a valid state. This can be used in a subclass to expand a regex mapping several times for a single instance of RowProcessor. The caller is responsible for ensuring that fieldNames are not duplicated within or between calls.
      Parameters:
      fieldNames - The list of field names - should contain only previously unseen field names.
      Returns:
      The set of regexes that were matched by fieldNames.
    • copy

      @Deprecated public RowProcessor<T> copy()
      Deprecated.
      In a future release this API will change; in the meantime, this is the correct way to get a row processor with clean state.

      When using regexMappingProcessors, RowProcessor is stateful in a way that can sometimes make it fail the second time it is used. Concretely:

           RowProcessor rp;
           Dataset ds1 = new MutableDataset(new CSVDataSource(csvfile1, rp));
           Dataset ds2 = new MutableDataset(new CSVDataSource(csvfile2, rp)); // this may fail due to state in rp
       
      This method returns a RowProcessor with clean state and the same configuration as this row processor.
      Returns:
      A RowProcessor instance with clean state and the same configuration as this row processor.
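
      One way such state can cause a second use to fail, and why a clean copy avoids it, is sketched below. StatefulMapperSketch is a hypothetical miniature, not Tribuo's RowProcessor: it keeps the regex configuration separate from the expanded field mapping, and its copy method carries over only the configuration.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical miniature of a stateful regex-expanding processor.
final class StatefulMapperSketch {
    private final Map<String, String> regexProcessors;                       // configuration, never mutated
    private final Map<String, String> fieldProcessors = new HashMap<>();     // state filled in by expand

    StatefulMapperSketch(Map<String, String> regexProcessors) {
        this.regexProcessors = new HashMap<>(regexProcessors);
    }

    /** Expands the regexes against the field names, failing if a field was already expanded. */
    void expand(Collection<String> fieldNames) {
        for (Map.Entry<String, String> e : regexProcessors.entrySet()) {
            Pattern pattern = Pattern.compile(e.getKey());
            for (String field : fieldNames) {
                // putIfAbsent returns the earlier value when the field is already mapped.
                if (pattern.matcher(field).matches()
                        && fieldProcessors.putIfAbsent(field, e.getValue()) != null) {
                    throw new IllegalArgumentException("Field '" + field + "' was already expanded");
                }
            }
        }
    }

    /** Returns an instance with the same configuration and no expanded state. */
    StatefulMapperSketch copy() {
        return new StatefulMapperSketch(regexProcessors);
    }
}
```

      In this sketch, calling expand twice on the same instance with the same field names throws, whereas calling it on the result of copy() succeeds.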
    • getProvenance

      public com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance getProvenance()
      Specified by:
      getProvenance in interface com.oracle.labs.mlrg.olcut.provenance.Provenancable<com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance>