public final class DecisionTreeLearner extends IterativeOperator implements Serializable
If the dataset contains null values, this minimum memory requirement may be larger, as
we require an extra n*12+12 bytes of bookkeeping for each row that must be split between
child nodes, where n is the number of children of the split.
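As a rough illustration of the null-value overhead described above, the following sketch multiplies the per-row cost of n*12+12 bytes by the number of rows that must be split between children. The class and method names are hypothetical and are not part of the library; only the n*12+12 figure comes from the text.

```java
// Hypothetical helper, not part of the library: estimates the extra
// bookkeeping memory needed when rows with null values are split.
final class NullBookkeepingEstimate {

    // Per-row overhead quoted in the text: n*12+12 bytes, where n is
    // the number of children of the split.
    static long bytesPerRow(int children) {
        return 12L * children + 12L;
    }

    // Total overhead for all rows that must be split between children.
    static long totalBytes(long splitRows, int children) {
        return splitRows * bytesPerRow(children);
    }

    public static void main(String[] args) {
        // Example: a binary split (n = 2) affecting 1,000,000 rows.
        System.out.println(totalBytes(1_000_000L, 2) + " bytes"); // prints "36000000 bytes"
    }
}
```

For a binary split this works out to 36 bytes per affected row, so the overhead only becomes significant when a large share of rows have nulls in the split attribute.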
If the inMemoryDataset option is used, the memory requirements are much larger, as the
attribute tables must be kept in memory. Attribute tables require roughly 32 bytes per row,
per attribute. In addition, whenever attributes are split, working space is required for the
split, so the estimate should be calculated for the number of attributes + 1. Finally,
unknown (null) values may increase memory usage, since splitting on an unknown value requires
adding the row in question to both of the child nodes.
Note, though, that attribute tables are distributed throughout the cluster, so
memory requirements for attributes scale out in the same
way as for the row mapping structure mentioned above.
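The sizing rule above can be sketched as a back-of-the-envelope calculation. The class, method names, and the even-distribution assumption are hypothetical; only the figure of roughly 32 bytes per row per attribute and the attributes + 1 working-space rule come from the text. Extra rows created by splitting on unknown values are not modeled.

```java
// Hypothetical helper, not part of the library: rough sizing of the
// in-memory attribute tables used when inMemoryDataset is enabled.
final class AttributeTableEstimate {

    // Figure quoted in the text: roughly 32 bytes per row, per attribute.
    static final long BYTES_PER_ROW_PER_ATTRIBUTE = 32L;

    // Cluster-wide estimate; the +1 covers working space for a split.
    static long totalBytes(long rows, int attributes) {
        return BYTES_PER_ROW_PER_ATTRIBUTE * rows * (attributes + 1L);
    }

    // Assumes the tables are spread evenly across nodes (a simplification).
    static long bytesPerNode(long rows, int attributes, int nodes) {
        return totalBytes(rows, attributes) / nodes;
    }

    public static void main(String[] args) {
        // Example: 1,000,000 rows, 9 attributes, 4 cluster nodes.
        System.out.println(totalBytes(1_000_000L, 9));      // prints 320000000
        System.out.println(bytesPerNode(1_000_000L, 9, 4)); // prints 80000000
    }
}
```

Treat the result as an order-of-magnitude estimate, since the text itself gives the per-row cost only approximately.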
Constructor and Description |
---|
DecisionTreeLearner()
The default constructor.
|
DecisionTreeLearner(String targetColumn)
Creates a new instance of
DecisionTreeLearner , specifying
the minimal set of required parameters. |
Modifier and Type | Method and Description |
---|---|
protected void |
computeMetadata(IterativeMetadataContext ctx)
Implementations must adhere to the following contracts
|
protected CompositionIterator |
createIterator(MetadataContext ctx)
Invoked at the start of execution.
|
List<String> |
getIncludedFields()
Returns the list of columns to include for learning.
|
RecordPort |
getInput()
Returns the input port for the training data that is used to build the model.
|
int |
getMaxDistinctNominalValues()
Returns the maximum number of distinct nominal values to allow.
|
int |
getMaxTreeNodes()
Returns the maximum total nodes to allow in the tree.
|
int |
getMinRecordsPerNode()
Returns the minimum number of records per node to allow in the tree.
|
PMMLPort |
getModel()
Returns the output port that will output the model that is built from the
training data.
|
QualityMeasure |
getQualityMeasure()
Returns the measure of quality to be used when determining the best split.
|
int |
getStagingBlockSize()
Returns the
blockSize that is used when staging
the original dataset prior to redistributing the data for training. |
String |
getTargetColumn()
Returns the name of the column to predict.
|
DecisionTreeTraceLevel |
getTraceLevel()
For debugging tree construction.
|
boolean |
isBinaryNominalSplits()
Returns whether we use subsets of nominal values for splitting.
|
boolean |
isInMemoryDataset()
Returns whether the dataset is to be kept in memory while the decision tree is
being built.
|
void |
setBinaryNominalSplits(boolean binaryNominalSplits)
Sets whether we use subsets of nominal values for splitting.
|
void |
setIncludedFields(List<String> includedFields)
Sets the list of columns to include for learning.
|
void |
setInMemoryDataset(boolean inMemoryDataset)
Sets whether the dataset is to be kept in memory while the decision tree is
being built.
|
void |
setMaxDistinctNominalValues(int maxDistinctNominalValues)
Sets the maximum number of distinct nominal values to allow.
|
void |
setMaxTreeNodes(int maxTreeNodes)
Sets the maximum total nodes to allow in the tree.
|
void |
setMinRecordsPerNode(int minRecordsPerNode)
Sets the minimum number of records per node to allow in the tree.
|
void |
setQualityMeasure(QualityMeasure qualityMeasure)
Sets the measure of quality to be used when determining the best split.
|
void |
setStagingBlockSize(int stagingBlockSize)
Configures the
blockSize that is used when staging
the original dataset prior to redistributing the data for training. |
void |
setTargetColumn(String targetColumn)
Sets the name of the column to predict.
|
void |
setTraceLevel(DecisionTreeTraceLevel traceLevel)
For debugging tree construction.
|
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
public DecisionTreeLearner()
public DecisionTreeLearner(String targetColumn)
Creates a new instance of DecisionTreeLearner, specifying the minimal set of required parameters.
targetColumn - the target column to predict. Must be of type StringValued.
public RecordPort getInput()
public PMMLPort getModel()
public String getTargetColumn()
input.
public void setTargetColumn(String targetColumn)
input.
targetColumn - the name of the column to predict
public QualityMeasure getQualityMeasure()
GainRatio
public void setQualityMeasure(QualityMeasure qualityMeasure)
GainRatio.
qualityMeasure - the measure of quality to be used when determining the best split
public int getMinRecordsPerNode()
public void setMinRecordsPerNode(int minRecordsPerNode)
minRecordsPerNode - the minimum number of records per node to allow in the tree.
public int getMaxTreeNodes()
public void setMaxTreeNodes(int maxTreeNodes)
maxTreeNodes - the maximum total nodes to allow in the tree.
public boolean isBinaryNominalSplits()
Returns whether we use subsets of nominal values for splitting. If Gain is selected as the splitting criterion, two subsets are always chosen. If GainRatio is selected, the number of subsets that maximizes the gain ratio is chosen. More children increase both the gain and the splitInfo. Gain ratio is gain/splitInfo, so it serves to balance between the two. Defaults to false.
public void setBinaryNominalSplits(boolean binaryNominalSplits)
Sets whether we use subsets of nominal values for splitting. If Gain is selected as the splitting criterion, two subsets are always chosen. If GainRatio is selected, the number of subsets that maximizes the gain ratio is chosen. More children increase both the gain and the splitInfo. Gain ratio is gain/splitInfo, so it serves to balance between the two. Defaults to false.
binaryNominalSplits - whether we use subsets of nominal values for splitting.
public DecisionTreeTraceLevel getTraceLevel()
DecisionTreeTraceLevel.OFF
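The balance between gain and splitInfo described under isBinaryNominalSplits and setBinaryNominalSplits can be illustrated with a small calculation. This is a minimal sketch using the textbook entropy-based definitions of gain and splitInfo, which are an assumption here; the documentation states only that gain ratio is gain/splitInfo, and the library's actual computation may differ. The class and method names are hypothetical.

```java
// Illustrative only: textbook entropy-based gain and gain ratio,
// not the library's actual implementation.
final class GainRatioSketch {

    // Shannon entropy (base 2) of a distribution given as counts.
    static double entropy(int[] counts) {
        long total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // children[i] holds the class counts of the i-th child of the split.
    static double gain(int[] parent, int[][] children) {
        long total = 0;
        for (int c : parent) total += c;
        double remainder = 0.0;
        for (int[] child : children) {
            long size = 0;
            for (int c : child) size += c;
            remainder += ((double) size / total) * entropy(child);
        }
        return entropy(parent) - remainder;
    }

    // splitInfo is the entropy of the child sizes themselves.
    static double splitInfo(int[][] children) {
        int[] sizes = new int[children.length];
        for (int i = 0; i < children.length; i++) {
            for (int c : children[i]) sizes[i] += c;
        }
        return entropy(sizes);
    }

    // Undefined (NaN or infinite) when all rows fall into one child.
    static double gainRatio(int[] parent, int[][] children) {
        return gain(parent, children) / splitInfo(children);
    }

    public static void main(String[] args) {
        int[] parent = {4, 4};
        int[][] twoWay = {{4, 0}, {0, 4}};
        int[][] fourWay = {{2, 0}, {2, 0}, {0, 2}, {0, 2}};
        System.out.println("two-way gain ratio:  " + gainRatio(parent, twoWay));
        System.out.println("four-way gain ratio: " + gainRatio(parent, fourWay));
    }
}
```

In this example both splits separate the classes perfectly and so have the same gain, but the four-way split has twice the splitInfo and therefore half the gain ratio, which is the penalty on extra children that the text refers to.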
public void setTraceLevel(DecisionTreeTraceLevel traceLevel)
DecisionTreeTraceLevel.OFF
traceLevel - the level to use for debugging.
public List<String> getIncludedFields()
public void setIncludedFields(List<String> includedFields)
includedFields - the list of columns to include
public boolean isInMemoryDataset()
public void setInMemoryDataset(boolean inMemoryDataset)
inMemoryDataset - whether the dataset is to be kept in memory
public int getStagingBlockSize()
Returns the blockSize that is used when staging the original dataset prior to redistributing the data for training. Because we redistribute the training dataset one column at a time, we use a columnar (DatasetStorageFormat.COLUMNAR) staging format. The default value is 1000. Larger values consume more memory, whereas smaller values result in more I/O.
public void setStagingBlockSize(int stagingBlockSize)
Configures the blockSize that is used when staging the original dataset prior to redistributing the data for training. Because we redistribute the training dataset one column at a time, we use a columnar (DatasetStorageFormat.COLUMNAR) staging format. The default value is 1000. Larger values consume more memory, whereas smaller values result in more I/O.
stagingBlockSize - the block size that is used when staging the original dataset.
public int getMaxDistinctNominalValues()
public void setMaxDistinctNominalValues(int maxDistinctNominalValues)
protected void computeMetadata(IterativeMetadataContext ctx)
IterativeOperator
Implementations must adhere to the following contracts:
IterativeMetadataContext.parallelize(ParallelismStrategy).
IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean)
IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords
RecordPort#setRequiredDataOrdering, otherwise iteration will proceed on an input dataset whose order is undefined.
RecordPort#setRequiredDataDistribution, otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
RecordPort#setType.
RecordPort#setOutputDataOrdering
RecordPort#setOutputDataDistribution
SimpleModelPorts have no associated metadata and therefore there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
PMMLPort.setPMMLModelSpec.
IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). In the case that metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc., are not allowed, and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).
computeMetadata in class IterativeOperator
ctx - the context
protected CompositionIterator createIterator(MetadataContext ctx)
IterativeOperator
Invoked at the start of execution.
createIterator in class IterativeOperator
ctx - a context in which the iterative operator can find input port metadata, etc. This information was available in the previous call to IterativeOperator.computeMetadata(IterativeMetadataContext), but is available here as well so that the iterative operator need not cache any metadata in its instance variables.
Copyright © 2016 Actian Corporation. All rights reserved.