Class DataQualityAnalyzer

  • All Implemented Interfaces:
    LogicalOperator

    public final class DataQualityAnalyzer
    extends CompositeOperator
    Evaluates a set of quality tests on an input dataset. Rows for which all tests pass are considered "clean" and are sent to the clean output. Rows for which any test fails are considered "dirty" and are sent to the dirty output. In addition, the operator produces a summary model that includes the following statistics for a given field:
    1. totalFrequency: total number of rows
    2. invalidFrequency: total number of rows for which at least one test involving the given field failed
    3. testFailureCounts: per-test failure counts for each test involving the given field
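    The clean/dirty split and the summary counts described above can be illustrated with a small plain-Java sketch. It does not use the operator or any DataFlow API: the class QualitySplitSketch, its Result type, and the analyze method are hypothetical names introduced only for this illustration, and the counts are kept for the whole dataset rather than per field, which is a simplification of the summary model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; not part of the documented API.
public class QualitySplitSketch {

    /** Holds the two row partitions and the summary counts. */
    public static final class Result<R> {
        public final List<R> clean = new ArrayList<>();
        public final List<R> dirty = new ArrayList<>();
        public final Map<String, Long> testFailureCounts = new LinkedHashMap<>();
        public long totalFrequency;    // total number of rows seen
        public long invalidFrequency;  // rows for which at least one test failed
    }

    /** Applies every named test to every row and partitions the rows. */
    public static <R> Result<R> analyze(List<R> rows, Map<String, Predicate<R>> tests) {
        Result<R> result = new Result<>();
        // Start every test at zero so tests that never fail still appear in the summary.
        for (String name : tests.keySet()) {
            result.testFailureCounts.put(name, 0L);
        }
        for (R row : rows) {
            result.totalFrequency++;
            boolean anyFailed = false;
            for (Map.Entry<String, Predicate<R>> test : tests.entrySet()) {
                if (!test.getValue().test(row)) {
                    anyFailed = true;
                    result.testFailureCounts.merge(test.getKey(), 1L, Long::sum);
                }
            }
            if (anyFailed) {
                // A single failing test is enough to make the row dirty.
                result.invalidFrequency++;
                result.dirty.add(row);
            } else {
                // A row is clean only when every test passes.
                result.clean.add(row);
            }
        }
        return result;
    }
}
```

    Note that a row failing several tests increments each of those per-test counters but is counted only once in invalidFrequency, so the per-test counts can sum to more than the number of dirty rows.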
    • Constructor Detail

      • DataQualityAnalyzer

        public DataQualityAnalyzer()
        Evaluates a set of quality tests on an input dataset. By default the set of tests is empty; prior to graph compilation the tests property must be set (see setTests).
    • Method Detail

      • getInput

        public RecordPort getInput()
        Returns a port for the input dataset to be tested.
        Returns:
        a port for the input dataset to be tested.
      • getTests

        public List<DataQualityAnalyzer.QualityTest> getTests()
        Returns the set of tests to apply to the input dataset.
        Returns:
        the set of tests to apply to the input dataset
      • setTests

        public void setTests(List<DataQualityAnalyzer.QualityTest> tests)
        Sets the tests to apply to the input dataset.
        Parameters:
        tests - the set of tests to apply to the input dataset
      • setTests

        public void setTests(String expression)
        Sets the tests to apply to the input dataset. The tests are expressed using the field derivation expression language. The general format is: {expression1} as {metric1}[, {expression2} as {metric2}, ...]. The expressions themselves are predicate functions that return a boolean value.
        Parameters:
        expression - an expression that evaluates to a set of quality tests
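        As a rough illustration of the "{expression} as {metric}" format, the sketch below assembles such a string from predicate/metric pairs in plain Java. The class TestExpressionSketch and the method toExpression are hypothetical helpers, and the predicates "age > 0" and "age < 150" are placeholder text that has not been checked against the expression language's actual syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the documented API.
public class TestExpressionSketch {

    /**
     * Joins entries as "{expression} as {metric}", separated by ", ".
     * Keys are metric names, values are predicate expressions.
     */
    public static String toExpression(Map<String, String> predicateByMetric) {
        return predicateByMetric.entrySet().stream()
                .map(e -> e.getValue() + " as " + e.getKey())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the tests in insertion order.
        Map<String, String> tests = new LinkedHashMap<>();
        // Placeholder predicates: assumed syntax, not verified.
        tests.put("agePositive", "age > 0");
        tests.put("ageBelowLimit", "age < 150");
        System.out.println(toExpression(tests));
    }
}
```

        Running main prints "age > 0 as agePositive, age < 150 as ageBelowLimit". A string of this shape is what setTests(String) accepts, provided each predicate is valid in the expression language.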
      • getClean

        public RecordPort getClean()
        Returns a port that will output the "clean" rows. A row is considered clean if all tests pass.
        Returns:
        a port that will output the "clean" rows
      • getDirty

        public RecordPort getDirty()
        Returns a port that will output the "dirty" rows. A row is considered dirty if any test fails.
        Returns:
        a port that will output the "dirty" rows
      • compose

        protected void compose(CompositionContext ctx)
        Description copied from class: CompositeOperator
        Compose the body of this operator. Implementations should do the following:
        1. Perform any validation of configuration, input types, etc.
        2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
        3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
        Specified by:
        compose in class CompositeOperator
        Parameters:
        ctx - the context