Class RemoveDuplicates

All Implemented Interfaces:
LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator

public class RemoveDuplicates extends CompositeOperator implements RecordPipelineOperator
Removes duplicate rows based on a specified set of group keys. For each group of records that share the same key values, only the "first" record is pushed to the output; all other records in the group are discarded. The "first" record of a key group is determined by sorting the rows of that group by the specified sortKeys. If sortKeys is unspecified, an arbitrary row from each group is output.
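The selection rule described above can be illustrated in plain Java, independent of this library. The following is a minimal sketch only: the class RemoveDuplicatesSketch, the Row type, and its fields are hypothetical names introduced for illustration, and the sketch does not reflect how the operator is actually implemented. A null comparator stands in for unspecified sortKeys; the sketch then keeps the first row it encounters, whereas the operator only promises an arbitrary row.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RemoveDuplicatesSketch {

    /** Hypothetical row: customerId plays the group key, updatedAt the sort key. */
    static final class Row {
        final String customerId;
        final long updatedAt;
        final String email;

        Row(String customerId, long updatedAt, String email) {
            this.customerId = customerId;
            this.updatedAt = updatedAt;
            this.email = email;
        }

        @Override
        public String toString() {
            return customerId + "," + updatedAt + "," + email;
        }
    }

    /**
     * Keeps one row per customerId: the row that sorts first under sortOrder.
     * A null sortOrder models unspecified sort keys (assumption of this sketch:
     * the first row encountered is kept).
     */
    static List<Row> removeDuplicates(List<Row> input, Comparator<Row> sortOrder) {
        Map<String, Row> firstPerKey = new LinkedHashMap<>();
        for (Row row : input) {
            // Replace the kept row only if the candidate sorts strictly before it.
            firstPerKey.merge(row.customerId, row, (kept, candidate) ->
                sortOrder != null && sortOrder.compare(candidate, kept) < 0 ? candidate : kept);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("c1", 10L, "old@example.com"),
            new Row("c2", 5L, "only@example.com"),
            new Row("c1", 20L, "new@example.com"));

        // Descending by updatedAt, so the most recent row of each group is "first".
        Comparator<Row> newestFirst =
            Comparator.comparingLong((Row r) -> r.updatedAt).reversed();

        for (Row row : removeDuplicates(rows, newestFirst)) {
            System.out.println(row);
        }
    }
}
```

With the descending comparator, the row with the later updatedAt survives for customer c1. Choosing the sort direction is therefore what decides whether the oldest or the newest duplicate is kept.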
  • Constructor Details

    • RemoveDuplicates

      public RemoveDuplicates()
      Default constructor. The groupKeys property must be set prior to graph compilation.
    • RemoveDuplicates

      public RemoveDuplicates(List<String> groupKeys)
      Creates a RemoveDuplicates operator with the specified group keys.
      Parameters:
      groupKeys - the names of the key fields. Must not be empty or null.
  • Method Details

    • getInput

      public RecordPort getInput()
      Returns the input port carrying the data to be de-duplicated.
      Specified by:
      getInput in interface PipelineOperator<RecordPort>
      Returns:
      the data to be de-duplicated
    • getOutput

      public RecordPort getOutput()
      Returns the output port carrying the de-duplicated data.
      Specified by:
      getOutput in interface PipelineOperator<RecordPort>
      Returns:
      the de-duplicated data
    • compose

      protected void compose(CompositionContext ctx)
      Description copied from class: CompositeOperator
      Compose the body of this operator. Implementations should do the following:
      1. Perform any validation of configuration, input types, etc.
      2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
      3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
      Specified by:
      compose in class CompositeOperator
      Parameters:
      ctx - the context
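The three composition steps listed above (validate, add sub-operators, connect ports) can be sketched with stand-in types. Everything in this example is hypothetical: Port, Op, and Context are minimal placeholders written for illustration, not the library's RecordPort, operator, or CompositionContext classes, and the "sort" and "pickFirst" sub-operators are an assumed decomposition, not a statement of what RemoveDuplicates builds internally.

```java
import java.util.ArrayList;
import java.util.List;

public class ComposeSketch {

    /** Stand-in for a port (hypothetical; not the library's RecordPort). */
    static final class Port {
        final String name;
        Port(String name) { this.name = name; }
    }

    /** Stand-in for a sub-operator with one input and one output port. */
    static final class Op {
        final String name;
        final Port in;
        final Port out;
        Op(String name) {
            this.name = name;
            this.in = new Port(name + ".in");
            this.out = new Port(name + ".out");
        }
    }

    /** Stand-in for a composition context; it only records what was requested. */
    static final class Context {
        final List<String> log = new ArrayList<>();

        Op add(Op op) {
            log.add("add " + op.name);
            return op;
        }

        void connect(Port from, Port to) {
            log.add("connect " + from.name + " -> " + to.name);
        }
    }

    /** Follows the three steps: validate, add sub-operators, connect ports. */
    static List<String> compose(Context ctx, Port compositeIn, Port compositeOut,
                                List<String> groupKeys) {
        // 1. Validate configuration before building anything.
        if (groupKeys == null || groupKeys.isEmpty()) {
            throw new IllegalArgumentException("groupKeys must not be empty or null");
        }
        // 2. Instantiate sub-operators and register them with the context.
        Op sort = ctx.add(new Op("sort"));
        Op pickFirst = ctx.add(new Op("pickFirst"));
        // 3. Wire composite input -> sub-operators -> composite output.
        ctx.connect(compositeIn, sort.in);
        ctx.connect(sort.out, pickFirst.in);
        ctx.connect(pickFirst.out, compositeOut);
        return ctx.log;
    }

    public static void main(String[] args) {
        List<String> log = compose(new Context(), new Port("in"), new Port("out"),
                                   List.of("customerId"));
        log.forEach(System.out::println);
    }
}
```

Validating first means a misconfigured operator fails before any sub-operator is added, so the context is never left with a partially wired graph.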
    • getGroupKeys

      public String[] getGroupKeys()
      Returns the keys by which to de-duplicate.
      Returns:
      the keys by which to de-duplicate.
    • setGroupKeys

      public void setGroupKeys(String[] groupKeys)
      Sets the keys by which to de-duplicate.
      Parameters:
      groupKeys - the keys by which to de-duplicate.
    • getSortKeys

      public SortKey[] getSortKeys()
      Returns the additional keys by which to sort the data to determine which row to output in the event of a duplicate. This is an optional property; if left unspecified, an arbitrary row will be output.
      Returns:
      the additional keys by which to sort the data
    • setSortKeys

      public void setSortKeys(SortKey[] sortKeys)
      Sets the additional keys by which to sort the data to determine which row to output in the event of a duplicate. This is an optional property; if left unspecified, an arbitrary row will be output.
      Parameters:
      sortKeys - the additional keys by which to sort the data
    • setSortKeys

      public void setSortKeys(String... keys)
      Sets the additional keys by which to sort the data to determine which row to output in the event of a duplicate. This is an optional property; if left unspecified, an arbitrary row will be output.
      Parameters:
      keys - the additional keys by which to sort the data