java.lang.Object
com.pervasive.datarush.operators.AbstractLogicalOperator
com.pervasive.datarush.operators.CompositeOperator
com.pervasive.datarush.matching.cluster.ClusterDuplicates
- All Implemented Interfaces:
LogicalOperator
Transform record pairs into clusters of like records, where the two sides of
the pair are from the same source. The output of the
DiscoverDuplicates operator is a stream of record pairs. Each pair of
records has passed the given qualifications for being
a potential match. This operator takes the record pair input and finds clusters
of records that are alike. For example, if one row pairs records A and B and
another pairs records B and C, this operator creates a single cluster for
records A, B and C, generates a unique cluster identifier for the grouping, and
outputs a row for each of records A, B and C with the generated cluster identifier.
A cluster may contain any number of records. Note that the original record pairings and their scores are not preserved in the output.
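The grouping described above amounts to collecting records that are linked, directly or transitively, by at least one pair. The following is a minimal sketch of that idea in plain Java using a union-find structure. It is not the operator's actual implementation and does not use the DataRush API; the class name PairClustering, the cluster method, and the integer cluster identifiers are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: groups record ids linked by pairs.
class PairClustering {
    // Union-find parent pointers, keyed by record id.
    private final Map<String, String> parent = new LinkedHashMap<>();

    // Returns the representative id of the set containing the given id.
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every visited id directly at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Merges the sets containing a and b.
    private void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(rb, ra);
        }
    }

    // Maps each record id seen in the pairs to an assumed integer cluster id.
    static Map<String, Integer> cluster(List<String[]> pairs) {
        PairClustering uf = new PairClustering();
        for (String[] pair : pairs) {
            uf.union(pair[0], pair[1]);
        }
        Map<String, Integer> rootToCluster = new HashMap<>();
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String id : new ArrayList<>(uf.parent.keySet())) {
            String root = uf.find(id);
            Integer clusterId = rootToCluster.get(root);
            if (clusterId == null) {
                clusterId = rootToCluster.size();
                rootToCluster.put(root, clusterId);
            }
            result.put(id, clusterId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
            new String[] {"A", "B"},
            new String[] {"B", "C"},
            new String[] {"D", "E"});
        // A, B and C share one cluster id; D and E share another.
        System.out.println(cluster(pairs));
    }
}
```

As in the operator's output, the result of this sketch keeps only the record-to-cluster assignment; which pairs produced a cluster, and any scores attached to them, are discarded.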
-
Constructor Summary
Constructors
Constructor - Description
ClusterDuplicates() - Cluster record pairs using a default record id field of "id".
ClusterDuplicates(String dataIdField) - Cluster record pairs using the specified record id field.
-
Method Summary
Modifier and Type, Method - Description
protected void compose - Compose the body of this operator.
getDataIdField() - Gets the name of the field uniquely identifying records in the original source.
getInput() - Gets the record port providing the input to the clustering operation.
getOutput() - Gets the record port providing the results of the clustering operation.
void setDataIdField(String name) - Sets the name of the field uniquely identifying records in the original source.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
Constructor Details
-
ClusterDuplicates
public ClusterDuplicates()
Cluster record pairs using a default record id field of "id". Use setDataIdField(String) to change this setting.
-
ClusterDuplicates
public ClusterDuplicates(String dataIdField)
Cluster record pairs using the specified record id field.
- Parameters:
dataIdField - the field uniquely identifying source records
-
-
Method Details
-
getInput
Gets the record port providing the input to the clustering operation.
- Returns:
- the input port for the operation
-
getOutput
Gets the record port providing the results of the clustering operation.
- Returns:
- the output port for the operation
-
getDataIdField
Gets the name of the field uniquely identifying records in the original source. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
- Returns:
- the field uniquely identifying source records
-
setDataIdField
Sets the name of the field uniquely identifying records in the original source. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
- Parameters:
name - the field uniquely identifying source records
-
compose
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
- Specified by:
compose in class CompositeOperator
- Parameters:
ctx - the context
-