Class ClusterDuplicates

All Implemented Interfaces:
LogicalOperator

public class ClusterDuplicates extends CompositeOperator
Transforms record pairs into clusters of like records, where both sides of each pair come from the same source. The output of the DiscoverDuplicates operator is a stream of record pairs, each of which has passed the given qualifications for being a potential match. This operator takes that record pair input and finds clusters of records that are alike. For example, if one row contains records A and B and another contains records B and C, this operator creates a single cluster for records A, B and C, generates a unique cluster identifier for the grouping, and outputs a row for each of records A, B and C with the generated cluster identifier.

A cluster may contain any number of records. Note that neither the original record pairings nor their scores are preserved in the output.
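
The grouping described above amounts to finding connected components over the record pairs. The following is a minimal, library-independent sketch of that idea using a union-find structure. It is not this operator's actual implementation, which the documentation does not describe. The class name ClusterSketch, the cluster(...) method, and the use of sequential integers as cluster identifiers are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: groups record ids into clusters from pairs.
public class ClusterSketch {

    // Maps each record id to its parent in the union-find forest.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative id of the set containing the given id.
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every visited id directly at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Merges the sets containing the two ids.
    private void union(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    // Takes pairs of record ids and returns record id -> cluster id.
    // Cluster ids are assigned sequentially in first-seen order (an assumption).
    public static Map<String, Integer> cluster(List<String[]> pairs) {
        ClusterSketch sets = new ClusterSketch();
        for (String[] pair : pairs) {
            sets.union(pair[0], pair[1]);
        }
        Map<String, Integer> clusterIdByRoot = new HashMap<>();
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            for (String id : pair) {
                String root = sets.find(id);
                Integer clusterId = clusterIdByRoot.get(root);
                if (clusterId == null) {
                    clusterId = clusterIdByRoot.size();
                    clusterIdByRoot.put(root, clusterId);
                }
                result.put(id, clusterId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
                new String[] {"A", "B"},
                new String[] {"B", "C"},
                new String[] {"D", "E"});
        // A, B and C share one cluster id; D and E share another.
        System.out.println(cluster(pairs)); // prints {A=0, B=0, C=0, D=1, E=1}
    }
}
```

As in the operator's output, the result carries one entry per record with its cluster identifier, and the pairings that produced the cluster are no longer visible.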

  • Constructor Details

    • ClusterDuplicates

      public ClusterDuplicates()
      Cluster record pairs using a default record id field of "id". Use setDataIdField(String) to change this setting.
    • ClusterDuplicates

      public ClusterDuplicates(String dataIdField)
      Cluster record pairs using the specified record id field.
      Parameters:
      dataIdField - the field uniquely identifying source records
  • Method Details

    • getInput

      public RecordPort getInput()
      Gets the record port providing the input to the clustering operation.
      Returns:
      the input port for the operation
    • getOutput

      public RecordPort getOutput()
      Gets the record port providing the results of the clustering operation.
      Returns:
      the output port for the operation
    • getDataIdField

      public String getDataIdField()
      Gets the name of the field uniquely identifying records in the original source.

      This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.

      Returns:
      the field uniquely identifying source records
    • setDataIdField

      public void setDataIdField(String name)
      Sets the name of the field uniquely identifying records in the original source.

      This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.

      Parameters:
      name - the field uniquely identifying source records
    • compose

      protected void compose(CompositionContext ctx)
      Description copied from class: CompositeOperator
      Compose the body of this operator. Implementations should do the following:
      1. Perform any validation of configuration, input types, etc.
      2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
      3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
      Specified by:
      compose in class CompositeOperator
      Parameters:
      ctx - the context