java.lang.Object
  com.pervasive.datarush.operators.AbstractLogicalOperator
    com.pervasive.datarush.operators.CompositeOperator
      com.pervasive.datarush.matching.DiscoverDuplicates
All Implemented Interfaces: LogicalOperator
public class DiscoverDuplicates extends CompositeOperator
Discover duplicate records within a single source using fuzzy matching operators.
The first step in a matching operation is to index the input data records into groups for processing by the configured phases of field comparisons, classifiers and filters. Indexing can reduce the number of records that must be compared, because only records that fall into the same group are paired. The output of this step is a stream of record pairs that must be compared, classified and filtered.
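The indexing step described above can be pictured with a small, library-independent sketch. It does not use the DataRush API: the class name BlockingSketch, the method candidatePairs and the choice of a postal code as the grouping key are all assumptions made for illustration. The sketch groups record positions by key and emits only the pairs that fall inside the same group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the DataRush API.
class BlockingSketch {

    /** Returns position pairs (i, j) with i < j for records that share the same grouping key. */
    static List<int[]> candidatePairs(List<String> keys) {
        // Group record positions by key; LinkedHashMap keeps keys in first-seen order.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            groups.computeIfAbsent(keys.get(i), k -> new ArrayList<>()).add(i);
        }
        // Emit every pair inside each group; records in different groups are never paired.
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> group : groups.values()) {
            for (int a = 0; a < group.size(); a++) {
                for (int b = a + 1; b < group.size(); b++) {
                    pairs.add(new int[] { group.get(a), group.get(b) });
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Assumed grouping key for the example: the postal code of each record.
        List<String> postalCodes = List.of("78701", "10001", "78701", "78701", "10001");
        for (int[] pair : candidatePairs(postalCodes)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

For the five sample records this yields four candidate pairs, where comparing every record against every other would yield ten.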
Record pair comparisons happen in configured phases. A matching operation may consist of a single phase. Each phase consists of a set of field comparisons, classifiers and a filter. Field comparisons compare a field from each record of the pair using a fuzzy matching comparison operator. Each comparison outputs a field comparison score. A classifier may be used to classify or aggregate multiple field scores into a single score. A classifier outputs a single value representing the composite score. A phase may use zero or more classifiers, and a classifier can also aggregate the scores produced by other classifiers. A filter is the last step of a phase. The filter ensures that record pairs are pushed to the output stream only if they meet the filter criteria.
The output of this matching operation is a stream of record pairs that are deemed to be likely matches. Each record pair contains a record score that indicates the strength of the match on a scale from zero to one. A score approaching 0 is an unlikely match. A score approaching 1 is a very likely match.
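The flow of a single phase (field comparisons, then a classifier, then a filter) can be sketched in plain Java. This is not the DataRush API: the class PhaseSketch, its method names, the edit-distance similarity, the weighted-average classifier and the 0.8 threshold are all assumptions chosen for illustration. The operator's actual comparison and classifier types are configured through its Phase objects.

```java
// Hypothetical illustration only; none of these names come from the DataRush API.
class PhaseSketch {

    /** Levenshtein edit distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Field comparison (assumed): maps edit distance to a score between 0 and 1. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    /** Classifier (assumed): weighted average of the field scores. */
    static double classify(double[] scores, double[] weights) {
        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weighted += scores[i] * weights[i];
            total += weights[i];
        }
        return weighted / total;
    }

    /** Filter (assumed): keep the pair only if its record score reaches the threshold. */
    static boolean passes(double recordScore, double threshold) {
        return recordScore >= threshold;
    }

    public static void main(String[] args) {
        // One candidate pair with two fields each: name and city.
        String[] left = { "Jon Smith", "Austin" };
        String[] right = { "John Smith", "Austin" };
        double[] fieldScores = { similarity(left[0], right[0]), similarity(left[1], right[1]) };
        // The name field is weighted twice as heavily as the city field.
        double recordScore = classify(fieldScores, new double[] { 2.0, 1.0 });
        System.out.println("record score = " + recordScore + ", kept = " + passes(recordScore, 0.8));
    }
}
```

The weighted average keeps the record score between 0 and 1 as long as each field score is in that range, which matches the score scale described above.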
-
Constructor Summary
Constructors
DiscoverDuplicates()
Discover duplicates using initial defaults.
DiscoverDuplicates(Index index, List<Phase> phases)
Discover duplicates using multiple phases of comparison, classifying and filtering.
-
Method Summary
protected void compose(CompositionContext ctx)
Compose the body of this operator.
Index getIndex()
Gets the pair generation method for determining initial candidate matches.
RecordPort getInput()
Gets the record port providing input to the operation.
RecordPort getOutput()
Gets the record port providing the output from the operation.
List<Phase> getPhases()
Gets the phases of comparison, classifying and filtering used to determine matches.
void setIndex(Index index)
Sets the pair generation method for determining initial candidate matches.
void setPhases(List<Phase> phases)
Sets the phases of comparison, classifying and filtering used to determine matches.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
Method Detail
-
getInput
public RecordPort getInput()
Gets the record port providing input to the operation.
- Returns: the input port for the operation
-
setPhases
public void setPhases(List<Phase> phases)
Sets the phases of comparison, classifying and filtering used to determine matches.
- Parameters: phases - definition of phases for field comparisons
-
getOutput
public RecordPort getOutput()
Gets the record port providing the output from the operation.
- Returns: the output port for the operation
-
setIndex
public void setIndex(Index index)
Sets the pair generation method for determining initial candidate matches.
- Parameters: index - properties used to index the input data
-
getIndex
public Index getIndex()
Gets the pair generation method for determining initial candidate matches.
- Returns: properties used to index the input data
-
getPhases
public List<Phase> getPhases()
Gets the phases of comparison, classifying and filtering used to determine matches.
- Returns: definition of phases for field comparisons
-
compose
protected void compose(CompositionContext ctx)
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
- Specified by: compose in class CompositeOperator
- Parameters: ctx - the context