Class SummaryStatistics

All Implemented Interfaces:
LogicalOperator, RecordSinkOperator, SinkOperator<RecordPort>

public final class SummaryStatistics extends CompositeOperator implements RecordSinkOperator
Discovers various metrics of an input dataset, based on the configured detail level. The types of the fields, combined with the DetailLevel, determine the set of metrics that are calculated.
  • Constructor Details

    • SummaryStatistics

      public SummaryStatistics()
      Creates a summary statistics operator. By default, single-pass statistics are computed (DetailLevel.SINGLE_PASS_ONLY); configure detailLevel to provide more or less detail.
  • Method Details

    • getInput

      public RecordPort getInput()
      Returns an input port for the input dataset. This dataset is used to build the summary model.
      Specified by:
      getInput in interface RecordSinkOperator
      Specified by:
      getInput in interface SinkOperator<RecordPort>
      Returns:
      an input port for the input dataset
    • getOutput

      public PMMLPort getOutput()
      Returns an output port that will produce a PMMLSummaryStatisticsModel.
      Returns:
      an output port that will produce a PMMLSummaryStatisticsModel.
    • getDetailLevel

      public DetailLevel getDetailLevel()
      Returns the detail level that we use to compute statistics. The default value is DetailLevel.SINGLE_PASS_ONLY.
      Returns:
      the detail level
    • setDetailLevel

      public void setDetailLevel(DetailLevel detailLevel)
      Sets the detail level that we use to compute statistics. The default value is DetailLevel.SINGLE_PASS_ONLY.
      Parameters:
      detailLevel - the detail level
    • getShowTopHowMany

      public int getShowTopHowMany()
      Returns the cap on the number of valueCounts to calculate. The default is 25. Memory usage is proportional to the number of distinct values, so only the top n values are calculated to avoid excessive memory consumption when a field has a large number of distinct values. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
      Returns:
      the cap on the number of valueCounts to calculate.
    • setShowTopHowMany

      public void setShowTopHowMany(int showTopHowMany)
      Sets a cap on the number of valueCounts to calculate. The default is 25. Memory usage is proportional to the number of distinct values, so only the top n values are calculated to avoid excessive memory consumption when a field has a large number of distinct values. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
      Parameters:
      showTopHowMany - the cap on the number of valueCounts to calculate.
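To make the effect of the cap concrete, the following is a minimal plain-Java sketch of "keep only the top n value counts". It does not use or reproduce this library's implementation: the class TopValueCounts and the method topN are hypothetical names, and the sketch counts every distinct value before trimming, so it illustrates only what the capped result looks like, not how memory is bounded.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the documented API.
public class TopValueCounts {

    /** Counts each distinct value, then keeps only the n most frequent entries. */
    public static List<Map.Entry<String, Long>> topN(List<String> values, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken by value so the order is deterministic.
        entries.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        // The cap: at most n entries survive.
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> column = List.of("red", "blue", "red", "green", "blue", "red");
        System.out.println(topN(column, 2));
    }
}
```

With a cap of 2, the third distinct value ("green") is dropped from the result even though it occurs in the input.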
    • getRangeCount

      public int getRangeCount()
      Returns the number of intervalCounts to calculate for each numeric field. The default value is 10. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
      Returns:
      the number of intervalCounts to calculate for each numeric field.
    • setRangeCount

      public void setRangeCount(int rangeCount)
      Sets the number of intervalCounts to calculate for each numeric field. The default value is 10. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
      Parameters:
      rangeCount - the number of intervalCounts to calculate for each numeric field.
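As an illustration of what a set of interval counts for one numeric field might look like, here is a small plain-Java sketch. It assumes equal-width intervals spanning the observed minimum and maximum; the source does not state how this operator chooses interval boundaries, so that choice, along with the names IntervalCounts and count, is an assumption made for the example.

```java
import java.util.Arrays;

// Hypothetical illustration only; the binning rule is an assumption.
public class IntervalCounts {

    /** Splits [min, max] into rangeCount equal-width intervals and counts the values in each. */
    public static long[] count(double[] values, int rangeCount) {
        if (values.length == 0 || rangeCount <= 0) {
            throw new IllegalArgumentException("need at least one value and a positive rangeCount");
        }
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double width = (max - min) / rangeCount;
        long[] counts = new long[rangeCount];
        for (double v : values) {
            // If all values are equal (width == 0), everything goes in the first interval.
            int bin = width == 0 ? 0 : (int) ((v - min) / width);
            // The maximum value would index one past the end, so clamp it into the last interval.
            counts[Math.min(bin, rangeCount - 1)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] field = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(Arrays.toString(count(field, 5)));
    }
}
```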
    • getQuantilesToCalculate

      public List<BigDecimal> getQuantilesToCalculate()
      Gets the quantiles to calculate for each numeric field. By default this is 0.25, 0.50, and 0.75 (the 25th, 50th, and 75th percentiles). This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
      Returns:
      the quantiles to calculate for each numeric field.
    • setQuantilesToCalculate

      public void setQuantilesToCalculate(List<BigDecimal> quantilesToCalculate)
      Sets the quantiles to calculate for each numeric field. By default this is 0.25, 0.50, and 0.75 (the 25th, 50th, and 75th percentiles). This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
      Parameters:
      quantilesToCalculate - the quantiles to calculate for each numeric field.
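The sketch below shows how a List<BigDecimal> of quantiles such as the default 0.25, 0.50, and 0.75 maps to values of a numeric field. It uses the nearest-rank definition on a fully sorted copy of the data, which is an assumption chosen for simplicity; the source does not say which quantile definition or algorithm this operator uses. NearestRankQuantiles and quantiles are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; nearest-rank is an assumed quantile definition.
public class NearestRankQuantiles {

    /** Returns one value per requested quantile, using rank = ceil(q * n) on the sorted data. */
    public static List<Double> quantiles(List<Double> values, List<BigDecimal> quantilesToCalculate) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("need at least one value");
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        List<Double> result = new ArrayList<>();
        for (BigDecimal q : quantilesToCalculate) {
            int rank = q.multiply(BigDecimal.valueOf(n))
                    .setScale(0, RoundingMode.CEILING)
                    .intValue();
            // Clamp so that q = 0 maps to the smallest value and q = 1 to the largest.
            rank = Math.max(1, Math.min(rank, n));
            result.add(sorted.get(rank - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Double> field = List.of(8.0, 1.0, 5.0, 3.0, 7.0, 2.0, 6.0, 4.0);
        List<BigDecimal> qs = List.of(
                new BigDecimal("0.25"), new BigDecimal("0.50"), new BigDecimal("0.75"));
        System.out.println(quantiles(field, qs));
    }
}
```

Passing the quantiles as BigDecimal, as the setter's signature does, keeps values such as 0.25 exact, so the rank computation is not affected by binary floating-point rounding.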
    • getIncludedFields

      public List<String> getIncludedFields()
      Gets the fields from the input dataset for which we are collecting statistics. The default value of "empty list" implies "all fields".
      Returns:
      the fields from the input dataset for which we are collecting statistics.
    • setIncludedFields

      public void setIncludedFields(List<String> includedFields)
      Sets the fields from the input dataset for which we are collecting statistics. The default value of "empty list" implies "all fields".
      Parameters:
      includedFields - the fields from the input dataset for which we are collecting statistics.
    • isFewDistinctValuesHint

      public boolean isFewDistinctValuesHint()
      Returns a hint as to whether there are expected to be a small number of distinct values. If not, we eagerly sort each column up-front and perform a parallelized computation of quantiles and frequent items. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
      Returns:
      whether few distinct values are expected
    • setFewDistinctValuesHint

      public void setFewDistinctValuesHint(boolean fewDistinctValuesHint)
      Sets a hint as to whether there are expected to be a small number of distinct values. If not, we eagerly sort each column up-front and perform a parallelized computation of quantiles and frequent items. This setting is ignored if detail level is not DetailLevel.MULTI_PASS.
      Parameters:
      fewDistinctValuesHint - whether few distinct values are expected
    • compose

      protected void compose(CompositionContext ctx)
      Description copied from class: CompositeOperator
      Compose the body of this operator. Implementations should do the following:
      1. Perform any validation of configuration, input types, etc.
      2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
      3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
      Specified by:
      compose in class CompositeOperator
      Parameters:
      ctx - the context