DecisionTreeLearner (Dataflow Library Distribution Project 6.6.0-111 API)

java.lang.Object
- com.pervasive.datarush.operators.AbstractLogicalOperator
- - com.pervasive.datarush.operators.IterativeOperator
  - - com.pervasive.datarush.analytics.decisiontree.learner.DecisionTreeLearner

All Implemented Interfaces:

LogicalOperator, Serializable
```
public final class DecisionTreeLearner
extends IterativeOperator
implements Serializable
```
Operator responsible for constructing a Decision Tree. The implementation is based primarily on Ross Quinlan's book "C4.5: Programs for Machine Learning". Quinlan's C4.5 implementation ( and this implementation ) have the following key features/limitations:
1. Support for both numerical and categorical attributes.
2. Supports only categorical predictions.
3. Uses Information Gain/Gain Ratio as the measure of quality.
4. Missing value handling is done using fractional cases. This corresponds to the PMML "aggregateNodes" missing value strategy.
The following are differences between this implementation and Quinlan's C4.5 implementation:
1. Parallel/distributed implementation
2. Scales to data sets that are too large for memory
3. Does not support C4.5 rule generation. C4.5 is a software distribution that includes several executables. Our primary focus is the decision tree itself.
4. Does not support "subtree raising" as part of the pruning strategy. This adds substantial processing time and is of arguable benefit.
5. Currently limited to single-tree, no support for automatic cross-validation and tree selection.
Implementation, from a scaling and parallelism standpoint, is based on the following papers:
- SPRINT
- ScalParC
Memory requirements
At a minimum, we require 13 bytes of RAM per row of data in order to support the row mapping datastructure. This is distributed throughout the cluster, so if I have, say, 10 nodes in the cluster and 100 million rows of data, we require 13*100millon/10=130 MB of ram per-node.
If the dataset contains null values, this minimum memory requirement may be larger as we require an extra n*12+12 bytes of book-keeping for each row that must be split between children nodes where n is the number of children of the split.
If the inMemoryDataset option is used, the memory requirements are much larger as the attribute tables must be kept in memory. Attribute tables require roughly 32 bytes per row, per-attribute. In addition, whenever attributes are split, we require working space for the split, so need to calculate for number of attributes+1. Finally, unknown(null) values may impact the memory sizes since splitting on an unknown value requires adding the row in question to both of the children nodes. Note though, that attribute tables are distributed throughout the cluster so memory requirements for attributes do scale out in the same was as for the row mapping structure mentioned above.
See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description
`DecisionTreeLearner()` The default constructor.
`DecisionTreeLearner(String targetColumn)` Creates a new instance of `DecisionTreeLearner`, specifying the minimal set of required parameters.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`protected void`	`computeMetadata(IterativeMetadataContext ctx)` Implementations must adhere to the following contracts
`protected CompositionIterator`	`createIterator(MetadataContext ctx)` Invoked at the start of execution.
`List<String>`	`getIncludedFields()` Returns the list of columns to include for learning.
`RecordPort`	`getInput()` Returns the input port for the training data that is used to build the model.
`int`	`getMaxDistinctNominalValues()` Returns the maximum number of distinct nominal values to allow.
`int`	`getMaxTreeNodes()` Returns the maximum total nodes to allow in the tree.
`int`	`getMinRecordsPerNode()` Returns the minimum number of records per-node to allow in the tree.
`PMMLPort`	`getModel()` Returns the output port that will output the model that is built from the training data.
`QualityMeasure`	`getQualityMeasure()` Returns the measure of quality to be used when determining the best split.
`int`	`getStagingBlockSize()` Returns the `blockSize` that is used when staging the original dataset prior to redistributing the data for training.
`String`	`getTargetColumn()` Returns the name of the column to predict.
`DecisionTreeTraceLevel`	`getTraceLevel()` For debugging tree construction.
`boolean`	`isBinaryNominalSplits()` Returns whether we use subsets of nominal values for splitting.
`boolean`	`isInMemoryDataset()` Returns whether the dataset is to be kept in memory while the decision tree is being build.
`void`	`setBinaryNominalSplits(boolean binaryNominalSplits)` Sets whether we use subsets of nominal values for splitting.
`void`	`setIncludedFields(List<String> includedFields)` Sets the list of columns to include for learning.
`void`	`setInMemoryDataset(boolean inMemoryDataset)` Sets whether the dataset is to be kept in memory while the decision tree is being build.
`void`	`setMaxDistinctNominalValues(int maxDistinctNominalValues)` Sets the maximum number of distinct nominal values to allow.
`void`	`setMaxTreeNodes(int maxTreeNodes)` Sets the maximum total nodes to allow in the tree.
`void`	`setMinRecordsPerNode(int minRecordsPerNode)` Sets the minimum number of records per-node to allow in the tree.
`void`	`setQualityMeasure(QualityMeasure qualityMeasure)` Sets the measure of quality to be used when determining the best split.
`void`	`setStagingBlockSize(int stagingBlockSize)` Configures the `blockSize` that is used when staging the original dataset prior to redistributing the data for training.
`void`	`setTargetColumn(String targetColumn)` Sets the name of the column to predict.
`void`	`setTraceLevel(DecisionTreeTraceLevel traceLevel)` For debugging tree construction.

Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - DecisionTreeLearner
```
public DecisionTreeLearner()
```
    The default constructor. Prior to graph compilation the following required properties must be specified or an exception will be raised:
    - targetColumn
  - DecisionTreeLearner
```
public DecisionTreeLearner(String targetColumn)
```
    Creates a new instance of DecisionTreeLearner, specifying the minimal set of required parameters.
    
    Parameters:
    
    targetColumn - the target column to predict. Must be of type StringValued.
- Method Detail
  - getInput
```
public RecordPort getInput()
```
    Returns the input port for the training data that is used to build the model. Fields of type string represent categorical data. Fields of type double represent numerical data.
    
    Returns:
    
    the input port for the training data that is used to build the model
  - getModel
```
public PMMLPort getModel()
```
    Returns the output port that will output the model that is built from the training data.
    
    Returns:
    
    the output model port
  - getTargetColumn
```
public String getTargetColumn()
```
    Returns the name of the column to predict. Must match the name of one of the string fields in input.
    
    Returns:
    
    the name of the column to predict
  - setTargetColumn
```
public void setTargetColumn(String targetColumn)
```
    Sets the name of the column to predict. Must match the name of one of the string fields in input.
    
    Parameters:
    
    targetColumn - the name of the column to predict
  - getQualityMeasure
```
public QualityMeasure getQualityMeasure()
```
    Returns the measure of quality to be used when determining the best split. Defaults to GainRatio
    
    Returns:
    
    the measure of quality to be used when determining the best split
  - setQualityMeasure
```
public void setQualityMeasure(QualityMeasure qualityMeasure)
```
    Sets the measure of quality to be used when determining the best split. Defaults to GainRatio.
    
    Parameters:
    
    qualityMeasure - the measure of quality to be used when determining the best split
  - getMinRecordsPerNode
```
public int getMinRecordsPerNode()
```
    Returns the minimum number of records per-node to allow in the tree. A split will not occur unless at least two children have this minimum number of records. Defaults to 2.
    
    Returns:
    
    the minimum number of records per-node to allow in the tree.
  - setMinRecordsPerNode
```
public void setMinRecordsPerNode(int minRecordsPerNode)
```
    Sets the minimum number of records per-node to allow in the tree. A split will not occur unless at least two children have this minimum number of records. Defaults to 2.
    
    Parameters:
    
    minRecordsPerNode - the minimum number of records per-node to allow in the tree.
  - getMaxTreeNodes
```
public int getMaxTreeNodes()
```
    Returns the maximum total nodes to allow in the tree. Once exceeded, tree growth stops. Guards against unbounded memory growth.
    
    Returns:
    
    the maximum total nodes to allow in the tree.
  - setMaxTreeNodes
```
public void setMaxTreeNodes(int maxTreeNodes)
```
    Sets the maximum total nodes to allow in the tree. Once exceeded, tree growth stops. Guards against unbounded memory growth.
    
    Parameters:
    
    maxTreeNodes - the maximum total nodes to allow in the tree.
  - isBinaryNominalSplits
```
public boolean isBinaryNominalSplits()
```
    Returns whether we use subsets of nominal values for splitting. The number of subsets is determined by the quality measure. If Gain is selected as the splitting criteria, will always choose two subsets. If GainRatio is selected, will choose the number of subsets that maximizes the gain ratio. More children increase both the gain and the splitInfo. Gain ratio is gain/splitInfo, so it serves to balance between the two. Defaults to false.
    
    Returns:
    
    whether we use subsets of nominal values for splitting.
  - setBinaryNominalSplits
```
public void setBinaryNominalSplits(boolean binaryNominalSplits)
```
    Sets whether we use subsets of nominal values for splitting. The number of subsets is determined by the quality measure. If Gain is selected as the splitting criteria, will always choose two subsets. If GainRatio is selected, will choose the number of subsets that maximizes the gain ratio. More children increase both the gain and the splitInfo. Gain ratio is gain/splitInfo, so it serves to balance between the two. Defaults to false.
    
    Parameters:
    
    binaryNominalSplits - whether we use subsets of nominal values for splitting.
  - getTraceLevel
```
public DecisionTreeTraceLevel getTraceLevel()
```
    For debugging tree construction. By default is DecisionTreeTraceLevel.OFF
    
    Returns:
    
    the level to use for debugging.
  - setTraceLevel
```
public void setTraceLevel(DecisionTreeTraceLevel traceLevel)
```
    For debugging tree construction. By default is DecisionTreeTraceLevel.OFF
    
    Parameters:
    
    traceLevel - the level to use for debugging.
  - getIncludedFields
```
public List<String> getIncludedFields()
```
    Returns the list of columns to include for learning. An empty list means all columns of the appropriate types.
    
    Returns:
    
    The list of columns to include
  - setIncludedFields
```
public void setIncludedFields(List<String> includedFields)
```
    Sets the list of columns to include for learning. An empty list means all columns of the appropriate types.
    
    Parameters:
    
    includedFields - The list of columns to include
  - isInMemoryDataset
```
public boolean isInMemoryDataset()
```
    Returns whether the dataset is to be kept in memory while the decision tree is being build. False by default.
    
    Returns:
    
    whether the dataset is to be kept in memory
  - setInMemoryDataset
```
public void setInMemoryDataset(boolean inMemoryDataset)
```
    Sets whether the dataset is to be kept in memory while the decision tree is being build. False by default.
    
    Parameters:
    
    inMemoryDataset - whether the dataset is to be kept in memory
  - getStagingBlockSize
```
public int getStagingBlockSize()
```
    Returns the blockSize that is used when staging the original dataset prior to redistributing the data for training. Because we redistribute the training dataset one column at a time, we use a {#link DatasetStorageFormat.COLUMNAR columnar} staging format. The default value is 1000. Larger values consume more memory whereas smaller values result in more IO.
    
    Returns:
    
    the block size that is used when staging the original dataset.
  - setStagingBlockSize
```
public void setStagingBlockSize(int stagingBlockSize)
```
    Configures the blockSize that is used when staging the original dataset prior to redistributing the data for training. Because we redistribute the training dataset one column at a time, we use a {#link DatasetStorageFormat.COLUMNAR columnar} staging format. The default value is 1000. Larger values consume more memory whereas smaller values result in more IO.
    
    Parameters:
    
    stagingBlockSize - the block size that is used when staging the original dataset.
  - getMaxDistinctNominalValues
```
public int getMaxDistinctNominalValues()
```
    Returns the maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. The default is 20.
    
    Returns:
    
    the maximum number of distinct nominal values to allow.
  - setMaxDistinctNominalValues
```
public void setMaxDistinctNominalValues(int maxDistinctNominalValues)
```
    Sets the maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. The default is 20.
  - computeMetadata
```
protected void computeMetadata(IterativeMetadataContext ctx)
```
    Description copied from class: IterativeOperator
    Implementations must adhere to the following contracts
    General
    Regardless of input ports/output port types, all implementations must do the following:
    1. Validation. Validation of configuration should always be performed first.
    2. Declare operator parallelizability. Implementations must declare by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
    3. Declare output port parallelizablility. Implementations must declare by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean)
    4. Declare input port parallelizablility. Implementations must declare by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
    NOTE: There is a convenience method for performing steps 2-4 for the case where all record ports are parallelizable and where we are to determine parallelism based on source:
    - MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords
    Input record ports
    Implementations with input record ports must declare the following:
    1. Required data ordering:
    2. Required data distribution (only applies to parallelizable input ports):
    Note that if the upstream operator's output distribution/ordering is compatible with those required, we avoid a re-sort/re-distribution which is generally a very large savings from a performance standpoint.
    Output record ports (static metadata)
    Implementations with output record ports must declare the following:
    1. Type: Implementations must declare their output type by calling RecordPort#setType.
    Implementations with output record ports may declare the following:
    1. Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering
    2. Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution
    Note that both of these properties are optional; if unspecified, performance may suffer since the framework may unnecessarily re-sort/re-distributed the data.
    Input model ports
    In general, iterative operators will tend not to have model input ports, but if so, there is nothing special to declare for input model ports. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.
    Output model ports (static metadata)
    SimpleModelPort's have no associated metadata and therefore there is never any output metadata to declare. PMMLPort's, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
    1. pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.
    Output ports with dynamic metadata
    If an output port has dynamic metadata, implementations can declare by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). In the case that metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc are not allowed and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).
    Specified by:
    
    computeMetadata in class IterativeOperator
    
    Parameters:
    
    ctx - the context
  - createIterator
```
protected CompositionIterator createIterator(MetadataContext ctx)
```
    Description copied from class: IterativeOperator
    
    Invoked at the start of execution. The iterator is expected to return a handle that is then used for execution.
    
    Specified by:
    
    createIterator in class IterativeOperator
    
    Parameters:
    
    ctx - a context in which the iterative operator can find input port metadata, etc. this information was available in the previous call to IterativeOperator.computeMetadata(IterativeMetadataContext), but is available here as well so that the iterative operator need not cache any metadata in its instance variables.
    
    Returns:
    
    a handle that is used for iteration

Class DecisionTreeLearner

Memory requirements

Constructor Summary

Method Summary

Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator

Methods inherited from class java.lang.Object

Constructor Detail

DecisionTreeLearner

DecisionTreeLearner

Method Detail

getInput

getModel

getTargetColumn

setTargetColumn

getQualityMeasure

setQualityMeasure

getMinRecordsPerNode

setMinRecordsPerNode

getMaxTreeNodes

setMaxTreeNodes

isBinaryNominalSplits

setBinaryNominalSplits

getTraceLevel

setTraceLevel

getIncludedFields

setIncludedFields

isInMemoryDataset

setInMemoryDataset

getStagingBlockSize

setStagingBlockSize

getMaxDistinctNominalValues

setMaxDistinctNominalValues

computeMetadata

General

Input record ports

Output record ports (static metadata)

Input model ports

Output model ports (static metadata)

Output ports with dynamic metadata

createIterator