Class DecisionTreeLearner

java.lang.Object
  com.pervasive.datarush.operators.AbstractLogicalOperator
    com.pervasive.datarush.operators.IterativeOperator
      com.pervasive.datarush.analytics.decisiontree.learner.DecisionTreeLearner

All Implemented Interfaces:
LogicalOperator, Serializable

public final class DecisionTreeLearner extends IterativeOperator implements Serializable
Operator responsible for constructing a decision tree. The implementation is based primarily on Ross Quinlan's book "C4.5: Programs for Machine Learning". Quinlan's C4.5 implementation (and this implementation) have the following key features/limitations:
- Supports both numerical and categorical attributes.
- Supports only categorical predictions.
- Uses Information Gain/Gain Ratio as the measure of quality.
- Handles missing values using fractional cases. This corresponds to the PMML "aggregateNodes" missing value strategy.
- Parallel/distributed implementation.
- Scales to data sets that are too large for memory.
- Does not support C4.5 rule generation. C4.5 is a software distribution that includes several executables; our primary focus is the decision tree itself.
- Does not support "subtree raising" as part of the pruning strategy. Subtree raising adds substantial processing time and is of arguable benefit.
- Currently limited to a single tree; no support for automatic cross-validation and tree selection.

For related work on parallel, out-of-memory decision tree construction, see the SPRINT and ScalParC algorithms.
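For illustration, here is a minimal composition sketch. The reader and writer operators, file paths, column names, and property values are illustrative assumptions; consult the operator library documentation for exact signatures.

import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.analytics.pmml.WritePMML;
import com.pervasive.datarush.analytics.decisiontree.learner.DecisionTreeLearner;

public class DecisionTreeExample {
    public static void main(String[] args) {
        LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("buildDecisionTree");

        // Hypothetical delimited-text source: string fields are treated as
        // categorical attributes, double fields as numerical attributes.
        ReadDelimitedText reader = graph.add(new ReadDelimitedText("train.csv"));
        reader.setHeader(true);

        // Learn a tree predicting the (hypothetical) "class" column.
        DecisionTreeLearner learner = graph.add(new DecisionTreeLearner("class"));
        learner.setMinRecordsPerNode(5);          // require 5 records per node
        learner.setMaxDistinctNominalValues(50);  // filter high-cardinality attributes

        // Persist the resulting PMML model (writer operator and its port accessor assumed).
        WritePMML writer = graph.add(new WritePMML("tree.pmml"));

        graph.connect(reader.getOutput(), learner.getInput());
        graph.connect(learner.getModel(), writer.getModel());
        graph.run();
    }
}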
Memory requirements

At a minimum, the operator requires 13 bytes of RAM per row of data to support the row-mapping data structure. This structure is distributed throughout the cluster, so a 10-node cluster processing 100 million rows of data requires 13*100 million/10 = 130 MB of RAM per node. If the dataset contains null values, this minimum may be larger: an extra n*12+12 bytes of book-keeping is required for each row that must be split between children nodes, where n is the number of children of the split.

If the inMemoryDataset option is used, the memory requirements are much larger, as the attribute tables must be kept in memory. Attribute tables require roughly 32 bytes per row, per attribute. In addition, whenever attributes are split, working space is required for the split, so calculate for the number of attributes plus one. Finally, unknown (null) values may increase memory sizes, since splitting on an unknown value requires adding the row in question to both children nodes. Note, though, that attribute tables are distributed throughout the cluster, so memory requirements for attributes scale out in the same way as the row-mapping structure mentioned above.
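A back-of-the-envelope sketch of these estimates (row count, cluster size, and attribute count are hypothetical):

public class DecisionTreeMemoryEstimate {
    public static void main(String[] args) {
        long rows = 100_000_000L; // rows in the training dataset
        long clusterNodes = 10;   // nodes sharing the distributed structures
        long attributes = 8;      // columns used for learning

        // Minimum requirement: 13 bytes per row for the row mapping,
        // spread evenly across the cluster.
        long rowMappingBytes = rows * 13 / clusterNodes; // = 130 MB per node

        // With inMemoryDataset enabled: ~32 bytes per row per attribute,
        // plus one extra attribute's worth of working space for splits.
        long attributeTableBytes = rows * 32 * (attributes + 1) / clusterNodes;

        System.out.println("row mapping:      " + rowMappingBytes / 1_000_000 + " MB per node");
        System.out.println("attribute tables: " + attributeTableBytes / 1_000_000 + " MB per node");
    }
}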
See Also:
- Serialized Form
-
-
Constructor Summary
- DecisionTreeLearner()
  The default constructor.
- DecisionTreeLearner(String targetColumn)
  Creates a new instance of DecisionTreeLearner, specifying the minimal set of required parameters.
-
Method Summary
- protected void computeMetadata(IterativeMetadataContext ctx)
  Implementations must adhere to the metadata contracts described below.
- protected CompositionIterator createIterator(MetadataContext ctx)
  Invoked at the start of execution.
- List<String> getIncludedFields()
  Returns the list of columns to include for learning.
- RecordPort getInput()
  Returns the input port for the training data that is used to build the model.
- int getMaxDistinctNominalValues()
  Returns the maximum number of distinct nominal values to allow.
- int getMaxTreeNodes()
  Returns the maximum total nodes to allow in the tree.
- int getMinRecordsPerNode()
  Returns the minimum number of records per node to allow in the tree.
- PMMLPort getModel()
  Returns the output port that will output the model that is built from the training data.
- QualityMeasure getQualityMeasure()
  Returns the measure of quality to be used when determining the best split.
- int getStagingBlockSize()
  Returns the blockSize that is used when staging the original dataset prior to redistributing the data for training.
- String getTargetColumn()
  Returns the name of the column to predict.
- DecisionTreeTraceLevel getTraceLevel()
  Returns the trace level used for debugging tree construction.
- boolean isBinaryNominalSplits()
  Returns whether subsets of nominal values are used for splitting.
- boolean isInMemoryDataset()
  Returns whether the dataset is to be kept in memory while the decision tree is being built.
- void setBinaryNominalSplits(boolean binaryNominalSplits)
  Sets whether subsets of nominal values are used for splitting.
- void setIncludedFields(List<String> includedFields)
  Sets the list of columns to include for learning.
- void setInMemoryDataset(boolean inMemoryDataset)
  Sets whether the dataset is to be kept in memory while the decision tree is being built.
- void setMaxDistinctNominalValues(int maxDistinctNominalValues)
  Sets the maximum number of distinct nominal values to allow.
- void setMaxTreeNodes(int maxTreeNodes)
  Sets the maximum total nodes to allow in the tree.
- void setMinRecordsPerNode(int minRecordsPerNode)
  Sets the minimum number of records per node to allow in the tree.
- void setQualityMeasure(QualityMeasure qualityMeasure)
  Sets the measure of quality to be used when determining the best split.
- void setStagingBlockSize(int stagingBlockSize)
  Configures the blockSize that is used when staging the original dataset prior to redistributing the data for training.
- void setTargetColumn(String targetColumn)
  Sets the name of the column to predict.
- void setTraceLevel(DecisionTreeTraceLevel traceLevel)
  Sets the trace level used for debugging tree construction.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
-
-
-
Constructor Detail
-
DecisionTreeLearner
public DecisionTreeLearner()
The default constructor. Prior to graph compilation, the following required properties must be specified or an exception will be raised:
- targetColumn
-
DecisionTreeLearner
public DecisionTreeLearner(String targetColumn)
Creates a new instance of DecisionTreeLearner, specifying the minimal set of required parameters.
Parameters:
targetColumn - the target column to predict. Must be of type StringValued.
-
-
Method Detail
-
getInput
public RecordPort getInput()
Returns the input port for the training data that is used to build the model. Fields of type string represent categorical data. Fields of type double represent numerical data.
Returns:
the input port for the training data that is used to build the model
-
getModel
public PMMLPort getModel()
Returns the output port that will output the model that is built from the training data.
Returns:
the output model port
-
getTargetColumn
public String getTargetColumn()
Returns the name of the column to predict. Must match the name of one of the string fields in input.
Returns:
the name of the column to predict
-
setTargetColumn
public void setTargetColumn(String targetColumn)
Sets the name of the column to predict. Must match the name of one of the string fields in input.
Parameters:
targetColumn - the name of the column to predict
-
getQualityMeasure
public QualityMeasure getQualityMeasure()
Returns the measure of quality to be used when determining the best split. Defaults to GainRatio.
Returns:
the measure of quality to be used when determining the best split
-
setQualityMeasure
public void setQualityMeasure(QualityMeasure qualityMeasure)
Sets the measure of quality to be used when determining the best split. Defaults to GainRatio.
Parameters:
qualityMeasure - the measure of quality to be used when determining the best split
-
getMinRecordsPerNode
public int getMinRecordsPerNode()
Returns the minimum number of records per node to allow in the tree. A split will not occur unless at least two children have this minimum number of records. Defaults to 2.
Returns:
the minimum number of records per node to allow in the tree
-
setMinRecordsPerNode
public void setMinRecordsPerNode(int minRecordsPerNode)
Sets the minimum number of records per node to allow in the tree. A split will not occur unless at least two children have this minimum number of records. Defaults to 2.
Parameters:
minRecordsPerNode - the minimum number of records per node to allow in the tree
-
getMaxTreeNodes
public int getMaxTreeNodes()
Returns the maximum total nodes to allow in the tree. Once exceeded, tree growth stops. Guards against unbounded memory growth.
Returns:
the maximum total nodes to allow in the tree
-
setMaxTreeNodes
public void setMaxTreeNodes(int maxTreeNodes)
Sets the maximum total nodes to allow in the tree. Once exceeded, tree growth stops. Guards against unbounded memory growth.
Parameters:
maxTreeNodes - the maximum total nodes to allow in the tree
-
isBinaryNominalSplits
public boolean isBinaryNominalSplits()
Returns whether subsets of nominal values are used for splitting. The number of subsets is determined by the quality measure. If Gain is selected as the splitting criterion, two subsets are always chosen. If GainRatio is selected, the number of subsets that maximizes the gain ratio is chosen. More children increase both the gain and the splitInfo; gain ratio is gain/splitInfo, so it serves to balance the two. Defaults to false.
Returns:
whether subsets of nominal values are used for splitting
-
setBinaryNominalSplits
public void setBinaryNominalSplits(boolean binaryNominalSplits)
Sets whether subsets of nominal values are used for splitting. The number of subsets is determined by the quality measure. If Gain is selected as the splitting criterion, two subsets are always chosen. If GainRatio is selected, the number of subsets that maximizes the gain ratio is chosen. More children increase both the gain and the splitInfo; gain ratio is gain/splitInfo, so it serves to balance the two. Defaults to false.
Parameters:
binaryNominalSplits - whether subsets of nominal values are used for splitting
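For reference, the standard C4.5 definitions of the quantities mentioned above, for a split that partitions a dataset $T$ into children $T_1, \dots, T_n$:

$$\mathrm{SplitInfo} = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}, \qquad \mathrm{GainRatio} = \frac{\mathrm{Gain}}{\mathrm{SplitInfo}}$$

Adding children tends to increase Gain (purer partitions) but also increases SplitInfo, so maximizing the ratio penalizes overly fine splits.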
-
getTraceLevel
public DecisionTreeTraceLevel getTraceLevel()
Returns the trace level used for debugging tree construction. The default is DecisionTreeTraceLevel.OFF.
Returns:
the level to use for debugging
-
setTraceLevel
public void setTraceLevel(DecisionTreeTraceLevel traceLevel)
Sets the trace level used for debugging tree construction. The default is DecisionTreeTraceLevel.OFF.
Parameters:
traceLevel - the level to use for debugging
-
getIncludedFields
public List<String> getIncludedFields()
Returns the list of columns to include for learning. An empty list means all columns of the appropriate types.
Returns:
the list of columns to include
-
setIncludedFields
public void setIncludedFields(List<String> includedFields)
Sets the list of columns to include for learning. An empty list means all columns of the appropriate types.
Parameters:
includedFields - the list of columns to include
-
isInMemoryDataset
public boolean isInMemoryDataset()
Returns whether the dataset is to be kept in memory while the decision tree is being built. False by default.
Returns:
whether the dataset is to be kept in memory
-
setInMemoryDataset
public void setInMemoryDataset(boolean inMemoryDataset)
Sets whether the dataset is to be kept in memory while the decision tree is being built. False by default.
Parameters:
inMemoryDataset - whether the dataset is to be kept in memory
-
getStagingBlockSize
public int getStagingBlockSize()
Returns the blockSize that is used when staging the original dataset prior to redistributing the data for training. Because the training dataset is redistributed one column at a time, a columnar staging format (DatasetStorageFormat.COLUMNAR) is used. The default value is 1000. Larger values consume more memory whereas smaller values result in more IO.
Returns:
the block size that is used when staging the original dataset
-
setStagingBlockSize
public void setStagingBlockSize(int stagingBlockSize)
Configures the blockSize that is used when staging the original dataset prior to redistributing the data for training. Because the training dataset is redistributed one column at a time, a columnar staging format (DatasetStorageFormat.COLUMNAR) is used. The default value is 1000. Larger values consume more memory whereas smaller values result in more IO.
Parameters:
stagingBlockSize - the block size that is used when staging the original dataset
-
getMaxDistinctNominalValues
public int getMaxDistinctNominalValues()
Returns the maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. The default is 20.
Returns:
the maximum number of distinct nominal values to allow
-
setMaxDistinctNominalValues
public void setMaxDistinctNominalValues(int maxDistinctNominalValues)
Sets the maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. The default is 20.
Parameters:
maxDistinctNominalValues - the maximum number of distinct nominal values to allow
-
computeMetadata
protected void computeMetadata(IterativeMetadataContext ctx)
Description copied from class: IterativeOperator
Implementations must adhere to the following contracts:
General
Regardless of input port/output port types, all implementations must do the following:
- Validation. Validation of configuration should always be performed first.
- Declare operator parallelizability. Implementations must declare this by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
- Declare output port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
- Declare input port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
In the common case, the convenience method MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords can be used to perform these declarations.
Input record ports
Implementations with input record ports must declare the following:
- Required data ordering: Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering; otherwise iteration will proceed on an input dataset whose order is undefined.
- Required data distribution (only applies to parallelizable input ports): Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution; otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
Output record ports (static metadata)
Implementations with output record ports must declare the following:
- Type: Implementations must declare their output type by calling RecordPort#setType.
- Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering.
- Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution.
Input model ports
In general, iterative operators tend not to have model input ports, but if they do, there is nothing special to declare for input model ports. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.
Output model ports (static metadata)
SimpleModelPorts have no associated metadata, and therefore there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
- pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.
Output ports with dynamic metadata
If an output port has dynamic metadata, implementations can declare this by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). In the case that metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc. are not allowed, and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).
Specified by:
computeMetadata in class IterativeOperator
Parameters:
ctx - the context
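As an illustration of this contract, a skeletal IterativeOperator subclass is sketched below (a hypothetical operator, not DecisionTreeLearner, which is final; package locations and exact method signatures are assumptions to verify against the javadoc):

import com.pervasive.datarush.operators.*;
import com.pervasive.datarush.ports.record.RecordPort;

public final class PassThroughIterativeOp extends IterativeOperator {
    private final RecordPort input = newRecordInput("input");
    private final RecordPort output = newRecordOutput("output");

    @Override
    protected void computeMetadata(IterativeMetadataContext ctx) {
        // 1. Validation first (nothing to validate in this sketch).

        // 2-4. Declare operator and port parallelizability; the convenience
        // utility referenced above covers the common record-port case.
        MetadataUtil.negotiateParallelismBasedOnSourceAssumingParallelizableRecords(ctx);

        // Input record ports: no ordering/distribution requirements declared,
        // so iteration proceeds on data of undefined order and distribution.

        // Output record ports (static metadata): declare the output type.
        output.setType(ctx, input.getType(ctx)); // pass-through type (assumed signature)
    }

    @Override
    protected CompositionIterator createIterator(MetadataContext ctx) {
        // Iteration logic omitted from this sketch.
        throw new UnsupportedOperationException("sketch only");
    }
}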
-
createIterator
protected CompositionIterator createIterator(MetadataContext ctx)
Description copied from class: IterativeOperator
Invoked at the start of execution. The implementation is expected to return a handle that is then used for execution.
Specified by:
createIterator in class IterativeOperator
Parameters:
ctx - a context in which the iterative operator can find input port metadata, etc. This information was available in the previous call to IterativeOperator.computeMetadata(IterativeMetadataContext), but is available here as well so that the iterative operator need not cache any metadata in its instance variables.
Returns:
a handle that is used for iteration
-
-