Class DecisionTreeLearner

  • All Implemented Interfaces:
    LogicalOperator, Serializable

    public final class DecisionTreeLearner
    extends IterativeOperator
    implements Serializable
    Operator responsible for constructing a Decision Tree. The implementation is based primarily on Ross Quinlan's book "C4.5: Programs for Machine Learning". Quinlan's C4.5 implementation (and this implementation) have the following key features/limitations:
    1. Support for both numerical and categorical attributes.
    2. Supports only categorical predictions.
    3. Uses Information Gain/Gain Ratio as the measure of quality.
    4. Missing value handling is done using fractional cases. This corresponds to the PMML "aggregateNodes" missing value strategy.
    The following are differences between this implementation and Quinlan's C4.5 implementation:
    1. Parallel/distributed implementation
    2. Scales to data sets that are too large for memory
    3. Does not support C4.5 rule generation. C4.5 is a software distribution that includes several executables. Our primary focus is the decision tree itself.
    4. Does not support "subtree raising" as part of the pruning strategy. This adds substantial processing time and is of arguable benefit.
    5. Currently limited to single-tree, no support for automatic cross-validation and tree selection.
    The implementation, from a scaling and parallelism standpoint, is based on the following papers:

    Memory requirements

    At a minimum, we require 13 bytes of RAM per row of data to support the row-mapping data structure. This structure is distributed throughout the cluster, so a cluster with, say, 10 nodes and 100 million rows of data requires 13*100 million/10 = 130 MB of RAM per node.

    If the dataset contains null values, this minimum memory requirement may be larger, as we require an extra n*12+12 bytes of bookkeeping for each row that must be split between child nodes, where n is the number of children of the split.

    If the inMemoryDataset option is used, the memory requirements are much larger, as the attribute tables must be kept in memory. Attribute tables require roughly 32 bytes per row, per attribute. In addition, whenever attributes are split, we require working space for the split, so memory should be estimated for the number of attributes plus one. Finally, unknown (null) values may increase memory requirements, since splitting on an unknown value requires adding the row in question to both child nodes. Note, though, that attribute tables are distributed throughout the cluster, so memory requirements for attributes scale out in the same way as the row-mapping structure mentioned above.
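
    As an illustration, the figures above translate into a rough per-node estimate. The constants come from this section; the scenario values (row count, cluster size, attribute count) are arbitrary:

        // Rough per-node memory estimate using the constants quoted above:
        // 13 bytes/row for the row mapping, ~32 bytes/row/attribute for
        // in-memory attribute tables, plus working space for one extra attribute.
        public final class MemoryEstimate {
            public static void main(String[] args) {
                long rows = 100_000_000L; // rows in the training set
                int clusterNodes = 10;    // cluster nodes sharing the structures
                int attributes = 20;      // attributes used for learning

                long rowMappingBytes = 13L * rows / clusterNodes;
                long attributeTableBytes = 32L * rows * (attributes + 1) / clusterNodes;

                System.out.printf("row mapping: ~%d MB per node%n",
                        rowMappingBytes / 1_000_000);
                System.out.printf("attribute tables (inMemoryDataset=true): ~%d MB per node%n",
                        attributeTableBytes / 1_000_000);
            }
        }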

    See Also:
    Serialized Form
    • Constructor Detail

      • DecisionTreeLearner

        public DecisionTreeLearner()
        The default constructor. Prior to graph compilation, the following required property must be specified or an exception will be raised: targetColumn (see setTargetColumn(String)).
      • DecisionTreeLearner

        public DecisionTreeLearner​(String targetColumn)
        Creates a new instance of DecisionTreeLearner, specifying the minimal set of required parameters.
        Parameters:
        targetColumn - the target column to predict. Must be of type StringValued.
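
        For illustration, a minimal sketch of training a tree with this operator. Only the DecisionTreeLearner calls are documented on this page; the LogicalGraphFactory, ReadDelimitedText, and graph wiring calls are assumptions about the surrounding DataRush API, and the file and column names are hypothetical:

            import com.pervasive.datarush.analytics.decisiontree.learner.DecisionTreeLearner;
            import com.pervasive.datarush.graphs.LogicalGraph;
            import com.pervasive.datarush.graphs.LogicalGraphFactory;
            import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;

            public class TrainDecisionTree {
                public static void main(String[] args) {
                    LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("trainDecisionTree");

                    // Training data: string fields are treated as categorical,
                    // double fields as numerical (see getInput())
                    ReadDelimitedText reader = graph.add(new ReadDelimitedText("iris.csv"));
                    reader.setHeader(true);

                    // "class" must name a string field of the input
                    DecisionTreeLearner learner = graph.add(new DecisionTreeLearner("class"));
                    learner.setMinRecordsPerNode(5);      // require larger leaves
                    learner.setBinaryNominalSplits(true); // split nominals into subsets

                    graph.connect(reader.getOutput(), learner.getInput());
                    // learner.getModel() emits the trained model as PMML; connecting it
                    // to a PMML writer or scorer is omitted here.

                    graph.run();
                }
            }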
    • Method Detail

      • getInput

        public RecordPort getInput()
        Returns the input port for the training data that is used to build the model. Fields of type string represent categorical data. Fields of type double represent numerical data.
        Returns:
        the input port for the training data that is used to build the model
      • getModel

        public PMMLPort getModel()
        Returns the output port that will output the model that is built from the training data.
        Returns:
        the output model port
      • getTargetColumn

        public String getTargetColumn()
        Returns the name of the column to predict. Must match the name of one of the string fields in input.
        Returns:
        the name of the column to predict
      • setTargetColumn

        public void setTargetColumn​(String targetColumn)
        Sets the name of the column to predict. Must match the name of one of the string fields in input.
        Parameters:
        targetColumn - the name of the column to predict
      • getQualityMeasure

        public QualityMeasure getQualityMeasure()
        Returns the measure of quality to be used when determining the best split. Defaults to GainRatio.
        Returns:
        the measure of quality to be used when determining the best split
      • setQualityMeasure

        public void setQualityMeasure​(QualityMeasure qualityMeasure)
        Sets the measure of quality to be used when determining the best split. Defaults to GainRatio.
        Parameters:
        qualityMeasure - the measure of quality to be used when determining the best split
      • getMinRecordsPerNode

        public int getMinRecordsPerNode()
        Returns the minimum number of records per node to allow in the tree. A split will not occur unless at least two children have this minimum number of records. Defaults to 2.
        Returns:
        the minimum number of records per node to allow in the tree.
      • setMinRecordsPerNode

        public void setMinRecordsPerNode​(int minRecordsPerNode)
        Sets the minimum number of records per node to allow in the tree. A split will not occur unless at least two children have this minimum number of records. Defaults to 2.
        Parameters:
        minRecordsPerNode - the minimum number of records per node to allow in the tree.
      • getMaxTreeNodes

        public int getMaxTreeNodes()
        Returns the maximum total nodes to allow in the tree. Once exceeded, tree growth stops. Guards against unbounded memory growth.
        Returns:
        the maximum total nodes to allow in the tree.
      • setMaxTreeNodes

        public void setMaxTreeNodes​(int maxTreeNodes)
        Sets the maximum total nodes to allow in the tree. Once exceeded, tree growth stops. Guards against unbounded memory growth.
        Parameters:
        maxTreeNodes - the maximum total nodes to allow in the tree.
      • isBinaryNominalSplits

        public boolean isBinaryNominalSplits()
        Returns whether we use subsets of nominal values for splitting. The number of subsets is determined by the quality measure. If Gain is selected as the splitting criterion, we always choose two subsets. If GainRatio is selected, we choose the number of subsets that maximizes the gain ratio. More children increase both the gain and the splitInfo. Gain ratio is gain/splitInfo, so it serves to balance between the two. Defaults to false.
        Returns:
        whether we use subsets of nominal values for splitting.
      • setBinaryNominalSplits

        public void setBinaryNominalSplits​(boolean binaryNominalSplits)
        Sets whether we use subsets of nominal values for splitting. The number of subsets is determined by the quality measure. If Gain is selected as the splitting criterion, we always choose two subsets. If GainRatio is selected, we choose the number of subsets that maximizes the gain ratio. More children increase both the gain and the splitInfo. Gain ratio is gain/splitInfo, so it serves to balance between the two (see the sketch below). Defaults to false.
        Parameters:
        binaryNominalSplits - whether we use subsets of nominal values for splitting.
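
        The interplay described above (more children increase both gain and splitInfo) can be made concrete with the standard C4.5 formulas. The following is a sketch for illustration only, not the operator's internal code; counts are per-class record frequencies:

            public final class QualityMeasureSketch {
                static double log2(double x) { return Math.log(x) / Math.log(2); }

                // Entropy H = -sum(p * log2(p)) over class proportions p
                static double entropy(double[] classCounts) {
                    double total = 0, h = 0;
                    for (double c : classCounts) total += c;
                    for (double c : classCounts) {
                        if (c > 0) h -= (c / total) * log2(c / total);
                    }
                    return h;
                }

                // gain = H(parent) - weighted H(children);
                // splitInfo = -sum(w * log2(w)) grows with the number of children,
                // so gainRatio = gain / splitInfo penalizes many-way splits.
                static double gainRatio(double[] parent, double[][] children) {
                    double parentTotal = 0;
                    for (double c : parent) parentTotal += c;
                    double gain = entropy(parent), splitInfo = 0;
                    for (double[] child : children) {
                        double childTotal = 0;
                        for (double c : child) childTotal += c;
                        double w = childTotal / parentTotal;
                        gain -= w * entropy(child);
                        if (w > 0) splitInfo -= w * log2(w);
                    }
                    return splitInfo == 0 ? 0 : gain / splitInfo;
                }
            }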
      • getIncludedFields

        public List<String> getIncludedFields()
        Returns the list of columns to include for learning. An empty list means all columns of the appropriate types.
        Returns:
        The list of columns to include
      • setIncludedFields

        public void setIncludedFields​(List<String> includedFields)
        Sets the list of columns to include for learning. An empty list means all columns of the appropriate types.
        Parameters:
        includedFields - The list of columns to include
      • isInMemoryDataset

        public boolean isInMemoryDataset()
        Returns whether the dataset is to be kept in memory while the decision tree is being built. False by default.
        Returns:
        whether the dataset is to be kept in memory
      • setInMemoryDataset

        public void setInMemoryDataset​(boolean inMemoryDataset)
        Sets whether the dataset is to be kept in memory while the decision tree is being built. False by default.
        Parameters:
        inMemoryDataset - whether the dataset is to be kept in memory
      • getStagingBlockSize

        public int getStagingBlockSize()
        Returns the blockSize that is used when staging the original dataset prior to redistributing the data for training. Because we redistribute the training dataset one column at a time, we use a columnar (DatasetStorageFormat.COLUMNAR) staging format. The default value is 1000. Larger values consume more memory whereas smaller values result in more IO.
        Returns:
        the block size that is used when staging the original dataset.
      • setStagingBlockSize

        public void setStagingBlockSize​(int stagingBlockSize)
        Configures the blockSize that is used when staging the original dataset prior to redistributing the data for training. Because we redistribute the training dataset one column at a time, we use a columnar (DatasetStorageFormat.COLUMNAR) staging format. The default value is 1000. Larger values consume more memory whereas smaller values result in more IO, as the example below illustrates.
        Parameters:
        stagingBlockSize - the block size that is used when staging the original dataset.
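
        For example, using only the setters documented here, trading memory for IO when the dataset is streamed from staged storage might look like this (the target column name is hypothetical):

            DecisionTreeLearner learner = new DecisionTreeLearner("class");
            learner.setInMemoryDataset(false); // stream from staged storage (the default)
            learner.setStagingBlockSize(500);  // smaller blocks: less memory, more IO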
      • getMaxDistinctNominalValues

        public int getMaxDistinctNominalValues()
        Returns the maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. The default is 20.
        Returns:
        the maximum number of distinct nominal values to allow.
      • setMaxDistinctNominalValues

        public void setMaxDistinctNominalValues​(int maxDistinctNominalValues)
        Sets the maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. The default is 20.
        Parameters:
        maxDistinctNominalValues - the maximum number of distinct nominal values to allow.
      • computeMetadata

        protected void computeMetadata​(IterativeMetadataContext ctx)
        Description copied from class: IterativeOperator
        Implementations must adhere to the following contracts:

        General

        Regardless of input ports/output port types, all implementations must do the following:

        1. Validation. Validation of configuration should always be performed first.
        2. Declare operator parallelizability. Implementations must declare this by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
        3. Declare output port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
        4. Declare input port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
        NOTE: There is a convenience method that performs steps 2-4 for the common case where all record ports are parallelizable and parallelism is determined based on the source:
        • MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords

        Input record ports

        Implementations with input record ports must declare the following:
        1. Required data ordering: Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering; otherwise iteration will proceed on an input dataset whose order is undefined.
        2. Required data distribution (only applies to parallelizable input ports): Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution; otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
        Note that if the upstream operator's output distribution/ordering is compatible with those required, we avoid a re-sort/re-distribution which is generally a very large savings from a performance standpoint.

        Output record ports (static metadata)

        Implementations with output record ports must declare the following:
        1. Type: Implementations must declare their output type by calling RecordPort#setType.
        Implementations with output record ports may declare the following:
        1. Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering
        2. Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution
        Note that both of these properties are optional; if unspecified, performance may suffer since the framework may unnecessarily re-sort/re-distribute the data.

        Input model ports

        In general, iterative operators tend not to have model input ports, but if they do, there is nothing special to declare: models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.

        Output model ports (static metadata)

        SimpleModelPorts have no associated metadata, so there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
        1. pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.

        Output ports with dynamic metadata

        If an output port has dynamic metadata, implementations can declare this by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). In the case that metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc. are not allowed, and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).
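
        The sketch below pulls these contract steps together. It is schematic only, not DecisionTreeLearner's actual implementation; the exact metadata-setter signatures are assumptions, so the uncertain calls are left as comments:

            @Override
            protected void computeMetadata(IterativeMetadataContext ctx) {
                // 1. Validation first
                if (getTargetColumn() == null || getTargetColumn().isEmpty()) {
                    throw new IllegalStateException("targetColumn must be specified");
                }

                // 2-4. Declare operator and port parallelizability; this convenience
                // method covers the all-record-ports-parallelizable case (see NOTE above)
                MetadataUtil.negotiateParallelismBasedOnSourceAssumingParallelizableRecords(ctx);

                // Input record port: declare ordering/distribution only if required, e.g.
                //   getInput().setRequiredDataDistribution(ctx, ...);
                // otherwise iteration proceeds on data with unspecified order/distribution.

                // Output model port (static metadata): every PMMLPort must declare its spec
                //   getModel().setPMMLModelSpec(ctx, /* spec describing the tree model */);
            }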

        Specified by:
        computeMetadata in class IterativeOperator
        Parameters:
        ctx - the context