java.lang.Object
com.pervasive.datarush.operators.AbstractLogicalOperator
com.pervasive.datarush.operators.CompositeOperator
com.pervasive.datarush.analytics.stats.SummaryStatistics
All Implemented Interfaces:
LogicalOperator, RecordSinkOperator, SinkOperator<RecordPort>
Discovers various metrics of an input dataset, based on the configured
detail level. The types of the fields, combined with the DetailLevel,
determine the set of metrics that are calculated.
Constructor Summary
SummaryStatistics() - Discover summary statistics.
Method Summary
protected void compose - Compose the body of this operator.
DetailLevel getDetailLevel() - Returns the detail level that we use to compute statistics.
List<String> getIncludedFields() - Gets the fields from the input dataset for which we are collecting statistics.
RecordPort getInput() - Returns an input port for the input dataset.
getOutput() - Returns an output port that will produce a PMMLSummaryStatisticsModel.
List<BigDecimal> getQuantilesToCalculate() - Gets the quantiles to calculate for each numeric field.
int getRangeCount() - Returns the number of intervalCounts to calculate for each numeric field.
int getShowTopHowMany() - Provides a cap on the number of valueCounts to calculate.
boolean isFewDistinctValuesHint() - Returns a hint as to whether there are expected to be a small number of distinct values.
void setDetailLevel(DetailLevel detailLevel) - Sets the detail level that we use to compute statistics.
void setFewDistinctValuesHint(boolean fewDistinctValuesHint) - Sets a hint as to whether there are expected to be a small number of distinct values.
void setIncludedFields(List<String> includedFields) - Sets the fields from the input dataset for which we are collecting statistics.
void setQuantilesToCalculate(List<BigDecimal> quantilesToCalculate) - Sets the quantiles to calculate for each numeric field.
void setRangeCount(int rangeCount) - Sets the number of intervalCounts to calculate for each numeric field.
void setShowTopHowMany(int showTopHowMany) - Sets a cap on the number of valueCounts to calculate.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts
-
Constructor Details
-
SummaryStatistics
public SummaryStatistics()
Discover summary statistics. By default we discover singlePass statistics; configure detailLevel to provide more or less detail.
-
-
Method Details
-
getInput
Returns an input port for the input dataset. This dataset is used to build the summary model.
Specified by:
getInput in interface RecordSinkOperator
Specified by:
getInput in interface SinkOperator<RecordPort>
Returns:
an input port for the input dataset
-
getOutput
Returns an output port that will produce a PMMLSummaryStatisticsModel.
Returns:
an output port that will produce a PMMLSummaryStatisticsModel.
-
getDetailLevel
Returns the detail level that we use to compute statistics. The default value is DetailLevel.SINGLE_PASS_ONLY.
Returns:
the detail level
-
setDetailLevel
Sets the detail level that we use to compute statistics. The default value is DetailLevel.SINGLE_PASS_ONLY.
Parameters:
detailLevel - the detail level
-
getShowTopHowMany
public int getShowTopHowMany()
Provides a cap on the number of valueCounts to calculate. The default is 25. Memory usage is proportional to the number of distinct values; thus only the top n values are calculated, to avoid excessive memory consumption when a field has a large number of distinct values. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
Returns:
the cap on the number of valueCounts to calculate.
-
setShowTopHowMany
public void setShowTopHowMany(int showTopHowMany)
Sets a cap on the number of valueCounts to calculate. The default is 25. Memory usage is proportional to the number of distinct values; thus only the top n values are calculated, to avoid excessive memory consumption when a field has a large number of distinct values. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
Parameters:
showTopHowMany - the cap on the number of valueCounts to calculate.
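To make the effect of the cap concrete, here is a minimal sketch in plain Java of keeping only the top n value counts. The class TopValueCounts and its method topN are hypothetical names for illustration and are not part of the DataRush API. The sketch still counts every distinct value in memory, so it shows only how the cap limits the result, not how the operator limits memory use.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the DataRush API.
class TopValueCounts {

    // Counts every distinct value, then keeps only the `cap` most frequent ones.
    // Ties are broken by the natural order of the value so the result is deterministic.
    static List<Map.Entry<String, Long>> topN(List<String> values, int cap) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.comparingByKey()));
        // Truncate to the cap (or fewer, if there are fewer distinct values).
        return new ArrayList<>(entries.subList(0, Math.min(cap, entries.size())));
    }
}
```

Calling TopValueCounts.topN(values, 25) mirrors the documented default cap of 25.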
-
getRangeCount
public int getRangeCount()
Returns the number of intervalCounts to calculate for each numeric field. The default value is 10. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
Returns:
the number of intervalCounts to calculate for each numeric field.
-
setRangeCount
public void setRangeCount(int rangeCount)
Sets the number of intervalCounts to calculate for each numeric field. The default value is 10. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
Parameters:
rangeCount - the number of intervalCounts to calculate for each numeric field.
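As an illustration of what a set of interval counts could look like, the following plain-Java sketch splits the observed range of a numeric field into rangeCount intervals and counts the values in each. It assumes equal-width intervals between the minimum and maximum; this page does not say how the operator chooses interval boundaries, and the class IntervalCounts is a hypothetical name, not part of the DataRush API.

```java
// Hypothetical helper for illustration only; not part of the DataRush API.
class IntervalCounts {

    // Splits [min, max] into rangeCount equal-width intervals (an assumption of
    // this sketch) and counts how many values fall into each interval.
    static long[] count(double[] values, int rangeCount) {
        if (rangeCount <= 0) {
            throw new IllegalArgumentException("rangeCount must be positive");
        }
        long[] counts = new long[rangeCount];
        if (values.length == 0) {
            return counts;
        }
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / rangeCount;
        for (double v : values) {
            // If all values are equal, everything lands in the first interval.
            int idx = width == 0.0 ? 0 : (int) ((v - min) / width);
            // The maximum value would index one past the end; clamp it into the last interval.
            counts[Math.min(idx, rangeCount - 1)]++;
        }
        return counts;
    }
}
```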
-
getQuantilesToCalculate
Gets the quantiles to calculate for each numeric field. By default this is 0.25, 0.50, and 0.75 (the 25th, 50th, and 75th percentiles). This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
Returns:
the quantiles to calculate for each numeric field.
-
setQuantilesToCalculate
Sets the quantiles to calculate for each numeric field. By default this is 0.25, 0.50, and 0.75 (the 25th, 50th, and 75th percentiles). This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
Parameters:
quantilesToCalculate - the quantiles to calculate for each numeric field.
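The sketch below shows how a list of quantiles expressed as BigDecimal fractions maps to values in a numeric field. It uses the nearest-rank definition, chosen here for simplicity; this page does not state which quantile definition the operator uses, and the class NearestRankQuantiles is a hypothetical name, not part of the DataRush API.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the DataRush API.
class NearestRankQuantiles {

    // For each quantile q in [0, 1], returns the value at the smallest 1-based
    // rank r such that r >= q * n in the sorted data (nearest-rank method).
    static List<Double> compute(double[] values, List<BigDecimal> quantiles) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        List<Double> result = new ArrayList<>();
        for (BigDecimal q : quantiles) {
            // Exact decimal arithmetic avoids floating-point error in q * n.
            int rank = q.multiply(BigDecimal.valueOf(sorted.length))
                    .setScale(0, RoundingMode.CEILING)
                    .intValue();
            // Clamp so that q = 0 maps to the minimum and q = 1 to the maximum.
            rank = Math.max(1, Math.min(rank, sorted.length));
            result.add(sorted[rank - 1]);
        }
        return result;
    }
}
```

Passing the quantiles as strings, for example new BigDecimal("0.25"), keeps them exact; the documented defaults correspond to 0.25, 0.50, and 0.75.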
-
getIncludedFields
Gets the fields from the input dataset for which we are collecting statistics. The default value of "empty list" implies "all fields".
Returns:
the fields from the input dataset for which we are collecting statistics.
-
setIncludedFields
Sets the fields from the input dataset for which we are collecting statistics. The default value of "empty list" implies "all fields".
Parameters:
includedFields - the fields from the input dataset for which we are collecting statistics.
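The "empty list implies all fields" convention can be sketched as a small resolution step. The class IncludedFieldsResolver and its method resolve are hypothetical names for illustration, not part of the DataRush API, and the sketch says nothing about how the operator treats names that do not exist in the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the DataRush API.
class IncludedFieldsResolver {

    // Returns the fields to analyze: every input field when includedFields is
    // empty, otherwise exactly the fields that were listed.
    static List<String> resolve(List<String> allFields, List<String> includedFields) {
        if (includedFields.isEmpty()) {
            return new ArrayList<>(allFields);
        }
        return new ArrayList<>(includedFields);
    }
}
```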
-
isFewDistinctValuesHint
public boolean isFewDistinctValuesHint()
Returns a hint as to whether there are expected to be a small number of distinct values. If not, we eagerly sort each column up-front and perform a parallelized computation of quantiles and frequent items. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
Returns:
whether few distinct values are expected
-
setFewDistinctValuesHint
public void setFewDistinctValuesHint(boolean fewDistinctValuesHint)
Sets a hint as to whether there are expected to be a small number of distinct values. If not, we eagerly sort each column up-front and perform a parallelized computation of quantiles and frequent items. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
Parameters:
fewDistinctValuesHint - whether few distinct values are expected
-
compose
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
Specified by:
compose in class CompositeOperator
Parameters:
ctx - the context
-