Class ClusterDuplicates

  • All Implemented Interfaces:
    LogicalOperator

    public class ClusterDuplicates
    extends CompositeOperator
    Transforms record pairs into clusters of like records, where both sides of each pair come from the same source. The output of the DiscoverDuplicates operator is a stream of record pairs, each of which has passed the given qualifications for being a potential match. This operator takes that record pair input and finds clusters of records that are alike. For example, if one row contains records A and B and another contains records B and C, this operator creates a single cluster for records A, B and C, generates a unique cluster identifier for the grouping, and outputs a row for each of records A, B and C with the generated cluster identifier.

    A cluster may contain any number of records. Note that the original record pairings and their scores are not preserved in the output.
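
    The grouping described above amounts to finding connected components over the record pairs. The following is a minimal, library-independent sketch of that idea using a union-find structure. It is not the operator's actual implementation, which this page does not describe; the class name PairClusteringSketch, the methods find and cluster, and the use of plain strings as record ids and integers as cluster identifiers are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of the library and not its implementation.
class PairClusteringSketch {

    /** Returns the representative id of the cluster containing the given id. */
    static String find(Map<String, String> parent, String id) {
        parent.putIfAbsent(id, id); // first sighting: the record is its own cluster
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited id directly at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    /** Maps each record id seen in the pairs to a generated cluster identifier. */
    static Map<String, Integer> cluster(List<String[]> pairs) {
        Map<String, String> parent = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            String left = find(parent, pair[0]);
            String right = find(parent, pair[1]);
            if (!left.equals(right)) {
                parent.put(right, left); // merge the two clusters
            }
        }
        Map<String, Integer> clusterIdByRoot = new HashMap<>();
        Map<String, Integer> clusterIdByRecord = new LinkedHashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            String root = find(parent, id);
            Integer clusterId = clusterIdByRoot.get(root);
            if (clusterId == null) {
                clusterId = clusterIdByRoot.size(); // next unused identifier
                clusterIdByRoot.put(root, clusterId);
            }
            clusterIdByRecord.put(id, clusterId);
        }
        return clusterIdByRecord;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"A", "B"},
                new String[] {"B", "C"},
                new String[] {"D", "E"});
        // One output line per record; the pairings themselves are not retained.
        cluster(pairs).forEach((id, clusterId) ->
                System.out.println(id + " -> cluster " + clusterId));
    }
}
```

    In this sketch the pairs (A, B) and (B, C) place A, B and C under one identifier, while D and E receive a different one. As in the operator's output, only the membership of each record survives; which records were originally paired together is not retained.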

    • Constructor Detail

      • ClusterDuplicates

        public ClusterDuplicates()
        Cluster record pairs using a default record id field of "id". Use setDataIdField(String) to change this setting.
      • ClusterDuplicates

        public ClusterDuplicates(String dataIdField)
        Cluster record pairs using the specified record id field.
        Parameters:
        dataIdField - the field uniquely identifying source records
    • Method Detail

      • getInput

        public RecordPort getInput()
        Gets the record port providing the input to the clustering operation.
        Returns:
        the input port for the operation
      • getOutput

        public RecordPort getOutput()
        Gets the record port providing the results of the clustering operation.
        Returns:
        the output port for the operation
      • getDataIdField

        public String getDataIdField()
        Gets the name of the field uniquely identifying records in the original source.

        This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.

        Returns:
        the field uniquely identifying source records
      • setDataIdField

        public void setDataIdField(String name)
        Sets the name of the field uniquely identifying records in the original source.

        This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.

        Parameters:
        name - the field uniquely identifying source records
      • compose

        protected void compose(CompositionContext ctx)
        Description copied from class: CompositeOperator
        Compose the body of this operator. Implementations should do the following:
        1. Perform any validation of configuration, input types, etc.
        2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
        3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
        Specified by:
        compose in class CompositeOperator
        Parameters:
        ctx - the context