public class DiscoverDuplicates extends CompositeOperator
The first step in a matching operation is to index the input records into groups for processing by the configured phases of field comparisons, classifiers, and filters. Indexing can reduce the number of record pairs that must be compared. The output of this step is a stream of candidate record pairs to be compared, classified, and filtered.
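The effect of indexing can be illustrated with a small, self-contained sketch. It does not use the DataFlow API: the class `BlockingSketch`, the method `candidatePairs`, and the first-letter blocking key are hypothetical names chosen only for illustration. The sketch assumes a simple blocking scheme in which records sharing a key fall into the same group and pairs are generated only within a group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Illustrative only: none of these names belong to the DataFlow API.
final class BlockingSketch {

    /**
     * Groups records by a blocking key and returns the index pairs (i, j),
     * with i < j, of records that share a group.
     */
    static <R> List<int[]> candidatePairs(List<R> records, Function<R, String> blockingKey) {
        // TreeMap keeps the groups in a deterministic (sorted) order.
        Map<String, List<Integer>> blocks = new TreeMap<>();
        for (int i = 0; i < records.size(); i++) {
            String key = blockingKey.apply(records.get(i));
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> block : blocks.values()) {
            // Only records inside the same block are paired with each other.
            for (int a = 0; a < block.size(); a++) {
                for (int b = a + 1; b < block.size(); b++) {
                    pairs.add(new int[] {block.get(a), block.get(b)});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Smith, John", "Smyth, Jon", "Jones, Mary", "Smith, Jane");
        // Hypothetical blocking key: the first letter of the name.
        List<int[]> pairs = candidatePairs(names, n -> n.substring(0, 1));
        for (int[] p : pairs) {
            System.out.println(names.get(p[0]) + " <-> " + names.get(p[1]));
        }
        System.out.println(pairs.size() + " candidate pairs");
    }
}
```

With the four sample records, comparing every record against every other would require six pairs; grouping by first letter leaves three, all within the "S" group.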
Record pair comparisons happen in configured phases; a matching operation may consist of a single phase. Each phase consists of a set of field comparisons, classifiers, and a filter.

Field comparisons compare a field from each record in a pair using a fuzzy matching comparison operator. Each comparison outputs a field comparison score. A classifier may be used to classify or aggregate multiple field scores into a single composite score. A phase may use zero or more classifiers, and a classifier can aggregate the scores of other classifiers. The filter is the last step of a phase: it ensures that record pairs are pushed to the output stream only if they meet the filter criteria.

The output of the matching operation is a stream of record pairs deemed likely matches. Each record pair contains a record score between zero and one that indicates the strength of the match: a score approaching 0 is an unlikely match, and a score approaching 1 is a very likely match.
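The flow within a single phase can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the DataFlow implementation: `PhaseSketch`, `similarity`, `weightedAverage`, and `passes` are hypothetical names, the field comparison is assumed to be an edit-distance similarity, the classifier is assumed to be a weighted average, and the filter is assumed to be a simple threshold on the record score.

```java
// Illustrative only: none of these names belong to the DataFlow API.
final class PhaseSketch {

    /** Field comparison: edit-distance similarity in [0, 1]; 1 means identical strings. */
    static double similarity(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                // Minimum of insertion, deletion, and substitution.
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) prev[b.length()] / longest;
    }

    /** Classifier: aggregates field scores into one composite score by weighted average. */
    static double weightedAverage(double[] scores, double[] weights) {
        double sum = 0.0;
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            sum += scores[i] * weights[i];
            total += weights[i];
        }
        return sum / total;
    }

    /** Filter: keeps a record pair only if its score meets the threshold. */
    static boolean passes(double recordScore, double threshold) {
        return recordScore >= threshold;
    }

    public static void main(String[] args) {
        String[] left = {"Jon Smith", "Boston"};
        String[] right = {"John Smith", "Boston"};
        double[] fieldScores = {
            similarity(left[0], right[0]),   // name comparison
            similarity(left[1], right[1])    // city comparison
        };
        // Hypothetical weights: the name counts twice as much as the city.
        double recordScore = weightedAverage(fieldScores, new double[] {2.0, 1.0});
        System.out.println("record score = " + recordScore
                + ", kept = " + passes(recordScore, 0.8));
    }
}
```

The design mirrors the description above: each comparison yields a score between zero and one, the classifier reduces several scores to one, and the filter decides whether the pair reaches the output.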
Constructor and Description |
---|
DiscoverDuplicates(): Discover duplicates using initial defaults. |
DiscoverDuplicates(Index index, List<Phase> phases): Discover duplicates using multiple phases of comparison, classifying and filtering. |
Modifier and Type | Method and Description |
---|---|
protected void | compose(CompositionContext ctx): Compose the body of this operator. |
Index | getIndex(): Gets the pair generation method for determining initial candidate matches. |
RecordPort | getInput(): Gets the record port providing input to the operation. |
RecordPort | getOutput(): Gets the record port providing the output from the operation. |
List<Phase> | getPhases(): Gets the phases of comparison, classifying and filtering used to determine matches. |
void | setIndex(Index index): Sets the pair generation method for determining initial candidate matches. |
void | setPhases(List<Phase> phases): Sets the phases of comparison, classifying and filtering used to determine matches. |
Inherited methods: disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
public DiscoverDuplicates()

public RecordPort getInput()

public void setPhases(List<Phase> phases)
phases - definition of phases for field comparisons

public RecordPort getOutput()

public void setIndex(Index index)
index - properties used to index the input data

public Index getIndex()

public List<Phase> getPhases()

protected void compose(CompositionContext ctx)
Description copied from class: CompositeOperator
Compose the body of this operator. Sub-operators are added to the provided context via OperatorComposable.add(O), and connections are created via OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
compose in class CompositeOperator
ctx - the context

Copyright © 2016 Actian Corporation. All rights reserved.