- java.lang.Object
- com.pervasive.datarush.operators.AbstractLogicalOperator
- com.pervasive.datarush.operators.CompositeOperator
- com.pervasive.datarush.matching.cluster.ClusterDuplicates

All Implemented Interfaces: LogicalOperator
public class ClusterDuplicates extends CompositeOperator
Transforms record pairs into clusters of like records, where both sides of each pair come from the same source. The output of the DiscoverDuplicates operator is a stream of record pairs, each of which has passed the given qualifications for being a potential match. This operator takes that record pair input and finds clusters of records that are alike. For example, suppose one row contains records A and B and another contains records B and C. This operator creates a single cluster for records A, B and C, generates a unique cluster identifier for the grouping, and outputs a row for each of records A, B and C carrying the generated cluster identifier. A cluster may contain any number of records. Note that the original record pairings are lost, as are the scores.
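The grouping described above behaves like finding connected components in a graph whose edges are the record pairs. The following is a minimal sketch in plain Java under that assumption, using a union-find structure. It is not the DataRush implementation: the class `PairClusterSketch`, its `cluster` method, and the choice of sequential integers as cluster identifiers are all hypothetical and serve only to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of DataRush. */
public class PairClusterSketch {
    // Maps each record id to its parent in the union-find forest.
    private final Map<String, String> parent = new LinkedHashMap<>();

    // Returns the representative id of the set containing the given id.
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited id directly at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Merges the sets containing the two ids of a pair.
    private void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(rb, ra);
        }
    }

    /** Assigns a cluster id (assumed here to be a sequential int) to every record id seen in a pair. */
    public static Map<String, Integer> cluster(List<String[]> pairs) {
        PairClusterSketch uf = new PairClusterSketch();
        for (String[] pair : pairs) {
            uf.union(pair[0], pair[1]);
        }
        Map<String, Integer> clusterIdByRoot = new HashMap<>();
        Map<String, Integer> result = new LinkedHashMap<>();
        // Copy the keys first so path compression cannot disturb the iteration.
        for (String id : new ArrayList<>(uf.parent.keySet())) {
            String root = uf.find(id);
            Integer clusterId = clusterIdByRoot.get(root);
            if (clusterId == null) {
                clusterId = clusterIdByRoot.size();
                clusterIdByRoot.put(root, clusterId);
            }
            result.put(id, clusterId);
        }
        return result;
    }
}
```

Given the pairs (A, B) and (B, C), this sketch places A, B and C under one cluster identifier. As with the operator's documented output, the result keeps only cluster membership and discards the pairings themselves.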
Constructor Summary
ClusterDuplicates()
Cluster record pairs using a default record id field of "id".
ClusterDuplicates(String dataIdField)
Cluster record pairs using the specified record id field.
Method Summary
protected void compose(CompositionContext ctx)
Compose the body of this operator.
String getDataIdField()
Gets the name of the field uniquely identifying records in the original source.
RecordPort getInput()
Gets the record port providing the input to the clustering operation.
RecordPort getOutput()
Gets the record port providing the results of the clustering operation.
void setDataIdField(String name)
Sets the name of the field uniquely identifying records in the original source.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
Constructor Detail
ClusterDuplicates
public ClusterDuplicates()
Cluster record pairs using a default record id field of "id". Use setDataIdField(String) to change this setting.
ClusterDuplicates
public ClusterDuplicates(String dataIdField)
Cluster record pairs using the specified record id field.
Parameters:
dataIdField - the field uniquely identifying source records
Method Detail
getInput
public RecordPort getInput()
Gets the record port providing the input to the clustering operation.
Returns:
the input port for the operation
getOutput
public RecordPort getOutput()
Gets the record port providing the results of the clustering operation.
Returns:
the output port for the operation
getDataIdField
public String getDataIdField()
Gets the name of the field uniquely identifying records in the original source. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
Returns:
the field uniquely identifying source records
setDataIdField
public void setDataIdField(String name)
Sets the name of the field uniquely identifying records in the original source. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
Parameters:
name - the field uniquely identifying source records
compose
protected void compose(CompositionContext ctx)
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
Specified by:
compose in class CompositeOperator
Parameters:
ctx - the context