public class ClusterDuplicates extends CompositeOperator
The output of the DiscoverDuplicates
operator is a stream of record pairs. Each pair of
records has passed the given qualifications for being
a potential match. This operator takes the record pair input and finds clusters
of records that are alike. For example, if one row contains records A and B and another
contains records B and C, this operator creates a cluster for records A, B,
and C, generates a unique cluster identifier for the grouping, and outputs a row
for each of records A, B, and C with the generated cluster identifier.
A cluster may contain any number of records. Note that the original record pairings and their scores are not preserved in the output.
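The clustering behavior described above amounts to finding connected groups among linked record ids. The following is a minimal, self-contained sketch of that idea using a union-find structure. It is not the operator's actual implementation, which this page does not describe. The class name `PairClusteringSketch`, its methods, and the use of small integers as cluster identifiers are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the DataFlow API.
public class PairClusteringSketch {
    // Maps each record id to its parent in the union-find forest.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative id of the set containing the given record id.
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every visited id directly at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Registers one input row: the two records are potential duplicates.
    public void addPair(String left, String right) {
        String a = find(left);
        String b = find(right);
        if (!a.equals(b)) {
            parent.put(a, b);
        }
    }

    // Returns one entry per record: record id -> generated cluster id.
    // Cluster ids are assigned in sorted order of record id (an assumption).
    public Map<String, Integer> clusters() {
        Map<String, Integer> rootToCluster = new HashMap<>();
        Map<String, Integer> result = new TreeMap<>();
        for (String id : new TreeSet<>(parent.keySet())) {
            String root = find(id);
            Integer clusterId = rootToCluster.get(root);
            if (clusterId == null) {
                clusterId = rootToCluster.size();
                rootToCluster.put(root, clusterId);
            }
            result.put(id, clusterId);
        }
        return result;
    }

    public static void main(String[] args) {
        PairClusteringSketch sketch = new PairClusteringSketch();
        sketch.addPair("A", "B");
        sketch.addPair("B", "C");
        sketch.addPair("D", "E");
        // prints {A=0, B=0, C=0, D=1, E=1}
        System.out.println(sketch.clusters());
    }
}
```

As in the operator's output, the result keeps only cluster membership: the individual pairings and any match scores are gone.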
Constructor and Description |
---|
ClusterDuplicates() Cluster record pairs using a default record id field of "id". |
ClusterDuplicates(String dataIdField) Cluster record pairs using the specified record id field. |
Modifier and Type | Method and Description |
---|---|
protected void | compose(CompositionContext ctx) Compose the body of this operator. |
String | getDataIdField() Gets the name of the field uniquely identifying records in the original source. |
RecordPort | getInput() Gets the record port providing the input to the clustering operation. |
RecordPort | getOutput() Gets the record port providing the results of the clustering operation. |
void | setDataIdField(String name) Sets the name of the field uniquely identifying records in the original source. |
Inherited methods: disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
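public ClusterDuplicates()
Cluster record pairs using a default record id field of "id". Use setDataIdField(String) to change this setting.
public ClusterDuplicates(String dataIdField)
Cluster record pairs using the specified record id field.
dataIdField - the field uniquely identifying source records
public RecordPort getInput()
Gets the record port providing the input to the clustering operation.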
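public RecordPort getOutput()
Gets the record port providing the results of the clustering operation.
public String getDataIdField()
Gets the name of the field uniquely identifying records in the original source. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.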
public void setDataIdField(String name)
Sets the name of the field uniquely identifying records in the original source. This name is the one used in the original record data producing the pairs, not the formatted name used in the input pair data.
name - the field uniquely identifying source records
protected void compose(CompositionContext ctx)
Description copied from class: CompositeOperator
Compose the body of this operator. Sub-operators are added to the provided context via OperatorComposable.add(O), and connections are created via OperatorComposable.connect(P, P). This includes
connections from the composite's input ports to sub-operators, connections between sub-operators, and
connections from sub-operators' output ports to the composite's output ports.
compose in class CompositeOperator
ctx - the context
Copyright © 2020 Actian Corporation. All rights reserved.