com.pervasive.datarush.matching.cluster.ClusterDuplicates

All Implemented Interfaces: LogicalOperator

public class ClusterDuplicates extends CompositeOperator

Transforms record pairs into clusters of like records, where both sides of each pair come from the same source. The output of the DiscoverDuplicates operator is a stream of record pairs, each of which has passed the given qualifications for being a potential match. This operator takes that record pair input and finds clusters of records that are alike. For example, if one row contains records A and B and another contains records B and C, this operator creates a cluster for records A, B and C, generates a unique cluster identifier for the grouping, and outputs a row for each of records A, B and C with the generated cluster identifier.

A cluster may contain any number of records. Note that the original record pairings are lost, as are the scores.
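The pair-to-cluster transformation described above can be pictured as finding connected components over record identifiers. The following is a minimal, self-contained sketch of that idea using a union-find structure. It is an illustration only, not the DataRush implementation: the class name PairClustering, the cluster method, and the use of sequential integers as cluster identifiers are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of the DataRush API.
class PairClustering {
    // Maps each record id to its parent in the union-find forest.
    private final Map<String, String> parent = new LinkedHashMap<>();

    // Returns the representative id of the set containing the given id,
    // registering the id as its own set if it has not been seen before.
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every node on the path directly at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Merges the sets containing the two ids of a matched pair.
    private void union(String left, String right) {
        String leftRoot = find(left);
        String rightRoot = find(right);
        if (!leftRoot.equals(rightRoot)) {
            parent.put(rightRoot, leftRoot);
        }
    }

    // Takes pairs of record ids and returns one entry per distinct record id,
    // mapped to an (assumed) integer cluster identifier. The pairings
    // themselves are not retained in the result.
    static Map<String, Integer> cluster(List<String[]> pairs) {
        PairClustering sets = new PairClustering();
        for (String[] pair : pairs) {
            sets.union(pair[0], pair[1]);
        }
        Map<String, Integer> clusterOfRoot = new HashMap<>();
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String id : new ArrayList<>(sets.parent.keySet())) {
            String root = sets.find(id);
            Integer clusterId = clusterOfRoot.get(root);
            if (clusterId == null) {
                clusterId = clusterOfRoot.size();
                clusterOfRoot.put(root, clusterId);
            }
            result.put(id, clusterId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
                new String[] {"A", "B"},
                new String[] {"B", "C"},
                new String[] {"D", "E"});
        System.out.println(cluster(pairs));
    }
}
```

With the pairs (A, B), (B, C) and (D, E), the sketch prints {A=0, B=0, C=0, D=1, E=1}: records A, B and C share one cluster identifier even though A and C were never paired directly, and neither the pairings nor any scores appear in the result.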

Constructor Summary

Constructors

Constructor

Description

ClusterDuplicates()

Cluster record pairs using a default record id field of "id".

ClusterDuplicates(String dataIdField)

Cluster record pairs using the specified record id field.
Method Summary

Modifier and Type

Method

Description

protected void

compose(CompositionContext ctx)

Compose the body of this operator.

String

getDataIdField()

Gets the name of the field uniquely identifying records in the original source.

RecordPort

getInput()

Gets the record port providing the input to the clustering operation.

RecordPort

getOutput()

Gets the record port providing the results of the clustering operation.

void

setDataIdField(String name)

Sets the name of the field uniquely identifying records in the original source.

Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- ClusterDuplicates
  
  public ClusterDuplicates()
  
  Cluster record pairs using a default record id field of "id". Use setDataIdField(String) to change this setting.
- ClusterDuplicates
  
  public ClusterDuplicates(String dataIdField)
  
  Cluster record pairs using the specified record id field.
  
  Parameters:
  
  dataIdField - the field uniquely identifying source records
Method Details
- getInput
  
  public RecordPort getInput()
  
  Gets the record port providing the input to the clustering operation.
  
  Returns:
  
  the input port for the operation
- getOutput
  
  public RecordPort getOutput()
  
  Gets the record port providing the results of the clustering operation.
  
  Returns:
  
  the output port for the operation
- getDataIdField
  
  public String getDataIdField()
  
  Gets the name of the field uniquely identifying records in the original source.
  This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
  
  Returns:
  
  the field uniquely identifying source records
- setDataIdField
  
  public void setDataIdField(String name)
  
  Sets the name of the field uniquely identifying records in the original source.
  This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
  
  Parameters:
  
  name - the field uniquely identifying source records
- compose
  
  protected void compose(CompositionContext ctx)
  
  Description copied from class: CompositeOperator
  Compose the body of this operator. Implementations should do the following:
  
  Perform any validation of configuration, input types, etc.
  
  Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
  
  Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
  Specified by:
  
  compose in class CompositeOperator
  
  Parameters:
  
  ctx - the context
