Class DiscoverDuplicates

All Implemented Interfaces:
LogicalOperator

public class DiscoverDuplicates extends CompositeOperator
Discover duplicate records within a single source using fuzzy matching operators.

The first step in a matching operation is to index the input data records into groups for processing by the configured phases of field comparisons, classifiers and filters. Indexing can reduce the number of record pairs that must be compared, since only records that fall into the same group are paired. The output of this step is a stream of candidate record pairs that must be compared, classified and filtered.
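As a rough illustration of why indexing reduces the number of comparisons, the following minimal sketch groups records by a key and pairs only records within the same group. It is not this operator's implementation: the class `BlockingSketch`, the method `candidatePairs`, and the use of a plain string key per record are all hypothetical, and the actual grouping behavior is determined by the configured `Index`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the library's API.
class BlockingSketch {

    /**
     * Returns index pairs (i, j) with i < j for records that share the same key.
     * keys.get(i) is the grouping key assumed to be derived from record i.
     */
    static List<int[]> candidatePairs(List<String> keys) {
        // Group record positions by key, preserving first-seen key order.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            groups.computeIfAbsent(keys.get(i), k -> new ArrayList<>()).add(i);
        }

        // Emit every unordered pair inside each group; records in
        // different groups are never paired.
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> group : groups.values()) {
            for (int a = 0; a < group.size(); a++) {
                for (int b = a + 1; b < group.size(); b++) {
                    pairs.add(new int[] {group.get(a), group.get(b)});
                }
            }
        }
        return pairs;
    }
}
```

With five records keyed `SMI, JON, SMI, SMI, JON`, this sketch yields four candidate pairs, where comparing every record against every other would yield ten.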

Record pair comparisons happen in configured phases. A matching operation may consist of a single phase. Each phase consists of a set of field comparisons, classifiers and a filter.

Each field comparison compares a field from each record of the pair using a fuzzy matching comparison operator and outputs a field comparison score. A classifier may be used to classify or aggregate multiple field scores into a single value representing the composite score. A phase may use zero or more classifiers, and a classifier can also aggregate the scores produced by other classifiers.

A filter is the last step of a phase. The filter ensures that record pairs are pushed to the output stream only if they meet the filter criteria.

The output of this matching operation is a stream of record pairs that are deemed likely matches. Each record pair contains a record score that indicates the strength of the match on a scale from 0 to 1. A score approaching 0 is an unlikely match. A score approaching 1 is a very likely match.
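The flow of a single phase (field comparisons, then a classifier, then a filter) can be sketched as follows. This is a self-contained illustration under stated assumptions, not the library's API: the class `PhaseSketch` and its methods are hypothetical, a case-insensitive equality check stands in for a fuzzy comparison operator, and a weighted average stands in for whichever classifier is configured.

```java
// Hypothetical illustration only; not part of the library's API.
class PhaseSketch {

    /**
     * Stand-in for a fuzzy field comparison: 1.0 when the two values are
     * equal ignoring case, otherwise 0.0. A real comparison operator would
     * typically return intermediate scores as well.
     */
    static double compare(String left, String right) {
        return left.equalsIgnoreCase(right) ? 1.0 : 0.0;
    }

    /**
     * Stand-in classifier: aggregates field scores into one composite
     * score using a weighted average (an assumed aggregation method).
     */
    static double weightedAverage(double[] scores, double[] weights) {
        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weighted += scores[i] * weights[i];
            total += weights[i];
        }
        return total == 0.0 ? 0.0 : weighted / total;
    }

    /** Stand-in filter: keeps a pair only if its score meets the threshold. */
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    /** Runs one phase over a record pair with two fields (name, city). */
    static boolean isLikelyMatch(String[] a, String[] b, double threshold) {
        double[] scores = {compare(a[0], b[0]), compare(a[1], b[1])};
        double[] weights = {3.0, 1.0}; // name weighted more heavily than city
        return passes(weightedAverage(scores, weights), threshold);
    }
}
```

In this sketch, a pair whose names match but whose cities differ receives a composite score of 0.75, so it passes a filter threshold of 0.7 but not one of 0.8.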

  • Constructor Details

    • DiscoverDuplicates

      public DiscoverDuplicates()
      Discover duplicates using initial defaults.
    • DiscoverDuplicates

      public DiscoverDuplicates(Index index, List<Phase> phases)
      Discover duplicates using multiple phases of comparison, classifying and filtering.
      Parameters:
      index - properties used to index the input data
      phases - definition of phases for field comparisons
  • Method Details

    • getInput

      public RecordPort getInput()
      Gets the record port providing input to the operation.
      Returns:
      the input port for the operation
    • setPhases

      public void setPhases(List<Phase> phases)
      Sets the phases of comparison, classifying and filtering used to determine matches.
      Parameters:
      phases - definition of phases for field comparisons
    • getOutput

      public RecordPort getOutput()
      Gets the record port providing the output from the operation.
      Returns:
      the output port for the operation
    • setIndex

      public void setIndex(Index index)
      Sets the pair generation method for determining initial candidate matches.
      Parameters:
      index - properties used to index the input data
    • getIndex

      public Index getIndex()
      Gets the pair generation method for determining initial candidate matches.
      Returns:
      properties used to index the input data
    • getPhases

      public List<Phase> getPhases()
      Gets the phases of comparison, classifying and filtering used to determine matches.
      Returns:
      definition of phases for field comparisons
    • compose

      protected void compose(CompositionContext ctx)
      Description copied from class: CompositeOperator
      Compose the body of this operator. Implementations should do the following:
      1. Perform any validation of configuration, input types, etc.
      2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
      3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
      Specified by:
      compose in class CompositeOperator
      Parameters:
      ctx - the context