- java.lang.Object
-
- com.pervasive.datarush.operators.AbstractLogicalOperator
-
- com.pervasive.datarush.operators.CompositeOperator
-
- com.pervasive.datarush.analytics.stats.DataQualityAnalyzer
-
- All Implemented Interfaces:
LogicalOperator
public final class DataQualityAnalyzer extends CompositeOperator
Evaluates a set of quality tests on an input dataset. Those rows for which all tests pass are considered "clean" and thus sent to the clean output. Those rows for which any tests fail are considered "dirty" and thus sent to the dirty output. In addition, this produces a summary model that includes the following statistics:
- totalFrequency: total number of rows
- invalidFrequency: total number of rows for which at least one test involving the given field failed
- testFailureCounts: per-test failure counts for each test involving the given field
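The clean/dirty routing rule described above can be sketched in plain Java. This is only an illustration of the semantics and does not use the DataRush API: the class `QualitySplitSketch`, the `NamedTest` and `Split` types, the `row` helper, and the sample field names are all hypothetical, and rows are modeled as simple maps rather than record ports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Plain-Java model of the clean/dirty split; NOT the DataRush API. */
class QualitySplitSketch {

    /** Hypothetical stand-in for a quality test: a name plus a boolean predicate. */
    static final class NamedTest {
        final String name;
        final Predicate<Map<String, Object>> predicate;

        NamedTest(String name, Predicate<Map<String, Object>> predicate) {
            this.name = name;
            this.predicate = predicate;
        }
    }

    /** Holds the two outputs: rows that passed every test and rows that failed any. */
    static final class Split {
        final List<Map<String, Object>> clean = new ArrayList<>();
        final List<Map<String, Object>> dirty = new ArrayList<>();
    }

    /** Routes each row to "clean" if all tests pass, otherwise to "dirty". */
    static Split split(List<Map<String, Object>> rows, List<NamedTest> tests) {
        Split result = new Split();
        for (Map<String, Object> row : rows) {
            boolean allPassed = true;
            for (NamedTest test : tests) {
                if (!test.predicate.test(row)) {
                    allPassed = false;
                    break; // one failure is enough to mark the row dirty
                }
            }
            (allPassed ? result.clean : result.dirty).add(row);
        }
        return result;
    }

    /** Helper that builds a row from alternating key/value arguments. */
    static Map<String, Object> row(Object... keyValues) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            fields.put((String) keyValues[i], keyValues[i + 1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = Arrays.asList(
                row("name", "Ann", "age", 30),
                row("name", "Bob", "age", -5),
                row("name", "", "age", 41));
        List<NamedTest> tests = Arrays.asList(
                new NamedTest("ageNonNegative", r -> (Integer) r.get("age") >= 0),
                new NamedTest("nameNotEmpty", r -> !((String) r.get("name")).isEmpty()));

        Split result = split(rows, tests);
        System.out.println("clean: " + result.clean);
        System.out.println("dirty: " + result.dirty);
    }
}
```

With the sample data, only the first row passes both hypothetical tests, so it is the only row routed to the clean list.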
-
-
Nested Class Summary
Nested Classes
static class DataQualityAnalyzer.QualityTest
A quality test consists of a test name (used to reference the test in the statistics) plus a boolean predicate.
-
Constructor Summary
Constructors
DataQualityAnalyzer()
Evaluates a set of quality tests on an input dataset.
-
Method Summary
protected void compose(CompositionContext ctx)
Compose the body of this operator.
RecordPort getClean()
Returns a port that will output the "clean" rows.
RecordPort getDirty()
Returns a port that will output the "dirty" rows.
RecordPort getInput()
Returns a port for the input dataset to be tested.
PMMLPort getModel()
Returns a port that will output a PMMLSummaryStatisticsModel.
List<DataQualityAnalyzer.QualityTest> getTests()
Returns the set of tests to apply to the input dataset.
void setTests(String expression)
Sets the set of tests to apply to the input dataset.
void setTests(List<DataQualityAnalyzer.QualityTest> tests)
Sets the set of tests to apply to the input dataset.
-
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
-
-
-
Method Detail
-
getInput
public RecordPort getInput()
Returns a port for the input dataset to be tested.
- Returns:
- a port for the input dataset to be tested.
-
getTests
public List<DataQualityAnalyzer.QualityTest> getTests()
Returns the set of tests to apply to the input dataset.
- Returns:
- the set of tests to apply to the input dataset
-
setTests
public void setTests(List<DataQualityAnalyzer.QualityTest> tests)
Sets the set of tests to apply to the input dataset.
- Parameters:
tests
- the set of tests to apply to the input dataset
-
setTests
public void setTests(String expression)
Sets the set of tests to apply to the input dataset. The tests are expressed using the field derivation expression language. The general format of the expression language is: {expression1} as {metric1}[, {expression2} as {metric2}, ...]. The expressions themselves are predicate functions that return a boolean value.
- Parameters:
expression
- an expression that evaluates to a set of quality tests
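To make the `{expression} as {metric}` format concrete, the sketch below assembles such a string from metric-name/predicate pairs. It is a plain-Java illustration only: the class `TestExpressionSketch`, the helper `buildExpression`, and the sample predicate text (`age >= 0`, `income < 1000000`) are assumptions, and the actual predicate syntax is defined by the field derivation expression language, which is not described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that formats tests as "{expression} as {metric}, ...". */
class TestExpressionSketch {

    /**
     * Joins entries (metric name -> predicate text) into one expression string.
     * The predicate text is passed through unchanged; it is not validated here.
     */
    static String buildExpression(Map<String, String> predicatesByMetric) {
        StringJoiner joiner = new StringJoiner(", ");
        for (Map.Entry<String, String> entry : predicatesByMetric.entrySet()) {
            joiner.add(entry.getValue() + " as " + entry.getKey());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the tests in insertion order.
        Map<String, String> tests = new LinkedHashMap<>();
        tests.put("ageNonNegative", "age >= 0");          // assumed predicate syntax
        tests.put("incomeBelowCap", "income < 1000000");  // assumed predicate syntax

        // prints: age >= 0 as ageNonNegative, income < 1000000 as incomeBelowCap
        System.out.println(buildExpression(tests));
    }
}
```

A string of this shape is what the description above suggests passing to setTests(String); each metric name then identifies its test in the summary statistics.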
-
getClean
public RecordPort getClean()
Returns a port that will output the "clean" rows. A row is considered clean if all tests pass.
- Returns:
- a port that will output the "clean" rows
-
getDirty
public RecordPort getDirty()
Returns a port that will output the "dirty" rows. A row is considered dirty if any tests fail.
- Returns:
- a port that will output the "dirty" rows
-
getModel
public PMMLPort getModel()
Returns a port that will output a PMMLSummaryStatisticsModel. The model will be populated with the following information:
- totalFrequency: total number of rows
- invalidFrequency: total number of rows for which at least one test involving the given field failed
- testFailureCounts: per-test failure counts for each test involving the given field
- Returns:
- a port that will output a PMMLSummaryStatisticsModel.
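The three statistics can be illustrated with a small plain-Java computation. This is a sketch under stated assumptions, not the operator's implementation and not the PMML model format: the class `FieldStatsSketch` and the `FieldTest` and `FieldStats` types are hypothetical, and each test here lists the fields it involves explicitly, whereas the operator presumably derives them from the test predicate.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Plain-Java model of the per-field summary statistics; NOT the DataRush API. */
class FieldStatsSketch {

    /** Hypothetical test: name, predicate, and the fields the predicate involves. */
    static final class FieldTest {
        final String name;
        final Predicate<Map<String, Object>> predicate;
        final Set<String> fields;

        FieldTest(String name, Predicate<Map<String, Object>> predicate, String... fields) {
            this.name = name;
            this.predicate = predicate;
            this.fields = new LinkedHashSet<>(Arrays.asList(fields));
        }
    }

    /** Statistics kept for one field. */
    static final class FieldStats {
        long totalFrequency;   // total number of rows
        long invalidFrequency; // rows where at least one test involving the field failed
        final Map<String, Long> testFailureCounts = new LinkedHashMap<>();
    }

    static Map<String, FieldStats> summarize(List<Map<String, Object>> rows, List<FieldTest> tests) {
        Map<String, FieldStats> stats = new LinkedHashMap<>();
        // Register every field and start each of its tests at zero failures.
        for (FieldTest test : tests) {
            for (String field : test.fields) {
                stats.computeIfAbsent(field, k -> new FieldStats())
                     .testFailureCounts.put(test.name, 0L);
            }
        }
        for (Map<String, Object> row : rows) {
            Set<String> invalidFields = new LinkedHashSet<>();
            for (FieldTest test : tests) {
                if (!test.predicate.test(row)) {
                    for (String field : test.fields) {
                        stats.get(field).testFailureCounts.merge(test.name, 1L, Long::sum);
                        invalidFields.add(field);
                    }
                }
            }
            for (FieldStats s : stats.values()) {
                s.totalFrequency++;
            }
            // A row counts once per field, however many of that field's tests failed.
            for (String field : invalidFields) {
                stats.get(field).invalidFrequency++;
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        Map<String, Object> r1 = new LinkedHashMap<>();
        r1.put("age", 30);
        Map<String, Object> r2 = new LinkedHashMap<>();
        r2.put("age", -5);
        List<FieldTest> tests = Arrays.asList(
                new FieldTest("ageNonNegative", r -> (Integer) r.get("age") >= 0, "age"));

        FieldStats age = summarize(Arrays.asList(r1, r2), tests).get("age");
        System.out.println(age.totalFrequency + " rows, " + age.invalidFrequency
                + " invalid, failures=" + age.testFailureCounts);
    }
}
```

Note that in this model invalidFrequency counts rows, while testFailureCounts counts individual test failures, so the per-test counts for a field can sum to more than its invalidFrequency.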
-
compose
protected void compose(CompositionContext ctx)
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
- Specified by:
compose in class CompositeOperator
- Parameters:
ctx
- the context
-
-