Class ReplaceMissingValues

All Implemented Interfaces:
LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator

public class ReplaceMissingValues extends CompositeOperator implements RecordPipelineOperator
Replaces missing values in the input data according to the given replacement specifications. Each specification provides an action to take and specifies the affected fields. Some actions require a first pass through the data to calculate needed column values such as the minimum value, maximum value, mean, or most frequent value. If any of these actions are specified, the data is read once to calculate the required values. A second pass over the data then applies the specified replacements using the calculated values.
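
The two-pass idea described above can be sketched in plain Java. This is only an illustration of the technique, assuming a single numeric column held in memory with null marking a missing value. The class TwoPassMeanFill and its methods are hypothetical names and are not part of this library's API; the operator itself works on record flows rather than arrays.

```java
import java.util.Arrays;

// Illustration only: plain Java, not the ReplaceMissingValues operator or its API.
// All names here are hypothetical.
final class TwoPassMeanFill {

    // Pass 1: compute the mean of the non-missing (non-null) values.
    static double meanOfPresent(Double[] column) {
        double sum = 0.0;
        int count = 0;
        for (Double v : column) {
            if (v != null) {
                sum += v;
                count++;
            }
        }
        if (count == 0) {
            throw new IllegalArgumentException("column has no non-missing values");
        }
        return sum / count;
    }

    // Pass 2: copy the column, substituting the precomputed value for each null.
    static Double[] fill(Double[] column, double replacement) {
        Double[] out = new Double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = (column[i] != null) ? column[i] : replacement;
        }
        return out;
    }

    public static void main(String[] args) {
        Double[] column = {1.0, null, 3.0, null};
        double mean = meanOfPresent(column);                      // first pass
        System.out.println(Arrays.toString(fill(column, mean)));  // second pass
    }
}
```

Replacements that use a constant need no first pass. Only statistics such as the mean force the data to be read twice, which is why the sketch keeps the two passes as separate methods.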

The order of the input data is preserved where possible. However, when using the action to skip records with missing data, records may be reordered. This is due to how the data is partitioned for parallelization.

A PMML model is created that contains statistics about the number of records skipped and the number of field values replaced. This model is similar to the one created by the SummaryStatistics operator.

  • Constructor Details

    • ReplaceMissingValues

      public ReplaceMissingValues()
      Defines a replacement with an empty specification. That is, no missing input values are replaced.
  • Method Details

    • getInput

      public RecordPort getInput()
      Gets the record port providing the input data to the operation.
      Specified by:
      getInput in interface PipelineOperator<RecordPort>
      Returns:
      the input port for the operation
    • getStatisticsInput

      public PMMLPort getStatisticsInput()
      Gets the optional model port providing statistics for replace specifications based on column statistics. If not connected and some specification depends on statistics, statistics will automatically be calculated as part of the operation.
      Returns:
      the statistics port for the operation
    • getOutput

      public RecordPort getOutput()
      Gets the record port providing the output from the operation. This will be the input data with null values replaced as specified.
      Specified by:
      getOutput in interface PipelineOperator<RecordPort>
      Returns:
      the output port for the operation
    • getModel

      public PMMLPort getModel()
      Returns a port that will output a PMMLSummaryStatisticsModel. The model will be populated with the following information:
      1. totalFrequency: total number of rows
      2. invalidFrequency: total number of rows for which at least one field with a skip condition was found
      3. missingFrequency: total number of rows for which at least one field with a replace condition was found
      4. testFailureCounts: per-test failure counts for each condition involving the given field
      Returns:
      a port that will output a PMMLSummaryStatisticsModel.
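
To make the first three counters concrete, the following plain-Java sketch tallies them over an in-memory table. It is an assumption-laden illustration, not the model class itself: MissingValueCounts, anyNull, and count are hypothetical names, rows are modelled as Object[] with null as the missing value, and a row that matches both a skip field and a replace field is assumed to be counted in both totals, which the description above does not confirm.

```java
import java.util.Set;

// Illustration only: plain Java counters, not PMMLSummaryStatisticsModel.
// All names here are hypothetical.
final class MissingValueCounts {
    long totalFrequency;    // every row seen
    long invalidFrequency;  // rows with a missing value in at least one "skip" field
    long missingFrequency;  // rows with a missing value in at least one "replace" field

    // True if the row has a null in any of the given field indexes.
    static boolean anyNull(Object[] row, Set<Integer> fields) {
        for (int f : fields) {
            if (row[f] == null) {
                return true;
            }
        }
        return false;
    }

    // Assumption: the two conditions are counted independently of each other.
    static MissingValueCounts count(Object[][] rows,
                                    Set<Integer> skipFields,
                                    Set<Integer> replaceFields) {
        MissingValueCounts c = new MissingValueCounts();
        for (Object[] row : rows) {
            c.totalFrequency++;
            if (anyNull(row, skipFields)) {
                c.invalidFrequency++;
            }
            if (anyNull(row, replaceFields)) {
                c.missingFrequency++;
            }
        }
        return c;
    }
}
```

For example, with field 0 configured to skip and field 1 configured to replace, the rows {"a", 1}, {null, 2}, {"c", null}, {null, null} would give a total of 4, with 2 rows counted as invalid and 2 as missing under the assumption stated above.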
    • getReplaceSpecifications

      public List<ReplaceSpecification> getReplaceSpecifications()
      Gets the specifications currently configured for the operation.
      Returns:
      the replacement specifications being applied to the input data
    • setReplaceSpecifications

      public void setReplaceSpecifications(List<ReplaceSpecification> specifications)
      Sets the replacement specifications to apply to the input data.
      Parameters:
      specifications - the value replacement specifications to apply
    • compose

      protected void compose(CompositionContext ctx)
      Description copied from class: CompositeOperator
      Compose the body of this operator. Implementations should do the following:
      1. Perform any validation of configuration, input types, etc.
      2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
      3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
      Specified by:
      compose in class CompositeOperator
      Parameters:
      ctx - the context