java.lang.Object
com.pervasive.datarush.operators.AbstractLogicalOperator
com.pervasive.datarush.operators.IterativeOperator
- All Implemented Interfaces:
LogicalOperator
- Direct Known Subclasses:
DecisionTreeLearner, KMeans, LinearRegressionLearner
To be implemented by operations that must make multiple passes over the input
data. The framework will ensure that the inputs are staged such that the operation can
iterate on them.
The lifecycle for iterators is as follows:
- computeMetadata is called once during graph compilation. This gives operators a chance to validate and declare parallelizability, input metadata, and output metadata.
- createIterator is called once during graph execution.
- execute is invoked once during graph execution to perform the "body" of the iteration.
- finalComposition is invoked once during graph execution to give the operator a chance to output final results.
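The call order described above can be sketched as follows. This is a minimal illustration, not the DataRush implementation: SketchIterator, SketchIterativeOperator, CountingOperator, and LifecycleSketch are hypothetical stand-ins, and the List<String> log argument replaces the real context objects so that the sequence is observable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the handle returned by createIterator.
// It is NOT the real DataRush CompositionIterator; signatures are simplified.
interface SketchIterator {
    void execute(List<String> log);          // the "body" of the iteration
    void finalComposition(List<String> log); // chance to output final results
}

// Hypothetical stand-in for an iterative operator.
abstract class SketchIterativeOperator {
    abstract void computeMetadata(List<String> log);
    abstract SketchIterator createIterator(List<String> log);
}

// A trivial operator that records each lifecycle call.
class CountingOperator extends SketchIterativeOperator {
    @Override
    void computeMetadata(List<String> log) {
        log.add("computeMetadata");
    }

    @Override
    SketchIterator createIterator(List<String> log) {
        log.add("createIterator");
        return new SketchIterator() {
            @Override
            public void execute(List<String> l) {
                l.add("execute");
            }

            @Override
            public void finalComposition(List<String> l) {
                l.add("finalComposition");
            }
        };
    }
}

// Hypothetical driver playing the role of the framework.
class LifecycleSketch {
    static List<String> run(SketchIterativeOperator op) {
        List<String> log = new ArrayList<>();
        op.computeMetadata(log);                    // once, at graph compilation
        SketchIterator it = op.createIterator(log); // once, at graph execution
        it.execute(log);                            // once, iteration body
        it.finalComposition(log);                   // once, final results
        return log;
    }
}
```

Calling LifecycleSketch.run(new CountingOperator()) returns the four phase names in the order listed above.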
Constructor Summary
Constructors
IterativeOperator()
Method Summary
Modifier and Type | Method | Description
protected abstract void | computeMetadata | Implementations must adhere to the following contracts
protected abstract CompositionIterator | createIterator | Invoked at the start of execution.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
Constructor Details
IterativeOperator
public IterativeOperator()
Method Details
computeMetadata
Implementations must adhere to the following contracts.
General
Regardless of input or output port types, all implementations must do the following:
- Validation. Validation of configuration should always be performed first.
- Declare operator parallelizability. Implementations must declare this by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
- Declare output port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
- Declare input port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
See also: MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords
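The ordering of the general contract (validate first, then declare parallelizability) can be sketched as below. This is an illustration under stated assumptions: RecordingContext is a hypothetical recorder that borrows the method names of IterativeMetadataContext but takes plain strings instead of ParallelismStrategy and LogicalPort, and ClusterSketch with its property k, the port names, and the strategy label are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical recorder standing in for IterativeMetadataContext.
// It only logs which declarations were made; the string arguments replace
// the real ParallelismStrategy and LogicalPort parameters.
class RecordingContext {
    final List<String> calls = new ArrayList<>();

    void parallelize(String strategy) {
        calls.add("parallelize:" + strategy);
    }

    void setOutputParallelizable(String port, boolean parallelizable) {
        calls.add("output:" + port + "=" + parallelizable);
    }

    void setIterationParallelizable(String port, boolean parallelizable) {
        calls.add("iteration:" + port + "=" + parallelizable);
    }
}

// Hypothetical operator with one configuration property.
class ClusterSketch {
    int k; // hypothetical setting; must be positive

    void computeMetadata(RecordingContext ctx) {
        // 1. Validation comes first, before anything is declared.
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive, got " + k);
        }
        // 2. Declare operator parallelizability ("source-based" is a made-up label).
        ctx.parallelize("source-based");
        // 3. Declare output port parallelizability.
        ctx.setOutputParallelizable("model", false);
        // 4. Declare input port parallelizability.
        ctx.setIterationParallelizable("input", true);
    }
}
```

Because validation runs before any declaration, an invalid configuration fails without leaving partial declarations on the context.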
Input record ports
Implementations with input record ports must declare the following:
- Required data ordering: Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering, otherwise iteration will proceed on an input dataset whose order is undefined.
- Required data distribution (only applies to parallelizable input ports): Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution, otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
Output record ports (static metadata)
Implementations with output record ports must declare the following:
- Type: Implementations must declare their output type by calling RecordPort#setType.
- Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering.
- Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution.
Input model ports
In general, iterative operators tend not to have model input ports; if they do, there is nothing special to declare for them. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.
Output model ports (static metadata)
SimpleModelPorts have no associated metadata, so there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
- pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.
Output ports with dynamic metadata
If an output port has dynamic metadata, implementations can declare this by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). When metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc. are not allowed, and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).
Parameters:
ctx - the context
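The static-versus-dynamic rule for output ports can be illustrated with the sketch below. SketchOutputPort and DynamicMetadataSketch are hypothetical stand-ins, not DataRush types. The IllegalStateException is an assumption made to model "not allowed"; the documentation above does not say how the framework reacts to a disallowed call.

```java
// Hypothetical stand-in for an output record port; not a DataRush type.
class SketchOutputPort {
    private boolean dynamic;
    private String type;

    void setMetadataDynamic(boolean dynamic) {
        this.dynamic = dynamic;
    }

    void setType(String type) {
        // Assumption: the sketch models "not allowed" by throwing.
        if (dynamic) {
            throw new IllegalStateException(
                "static metadata may not be declared on a dynamic port");
        }
        this.type = type;
    }

    String getType() {
        return type;
    }
}

class DynamicMetadataSketch {
    // Declares static metadata only when the port is not dynamic,
    // mirroring the "must be skipped" rule for dynamic ports.
    static void declare(SketchOutputPort port, boolean dynamic, String type) {
        port.setMetadataDynamic(dynamic);
        if (!dynamic) {
            port.setType(type);
        }
    }
}
```

Guarding the static declarations behind one check keeps the two paths from being mixed on the same port.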
createIterator
Invoked at the start of execution. The implementation is expected to return a handle that is then used for execution.
Parameters:
ctx - a context in which the iterative operator can find input port metadata, etc. This information was available in the previous call to computeMetadata(IterativeMetadataContext), but is available here as well so that the iterative operator need not cache any metadata in its instance variables.
Returns:
a handle that is used for iteration
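The point about not caching metadata can be sketched as follows. SketchContext, SketchHandle, and StatelessOperatorSketch are hypothetical stand-ins with invented method names. The sketch only shows an operator that reads input metadata from the context in both calls rather than storing it in a field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in context exposing input port metadata by port name.
class SketchContext {
    private final Map<String, String> inputTypes = new HashMap<>();

    SketchContext withInput(String port, String type) {
        inputTypes.put(port, type);
        return this;
    }

    String getInputType(String port) {
        return inputTypes.get(port);
    }
}

// Hypothetical handle returned to the caller for iteration.
class SketchHandle {
    final String inputType;

    SketchHandle(String inputType) {
        this.inputType = inputType;
    }
}

class StatelessOperatorSketch {
    // Note: no metadata fields on the operator itself.

    void computeMetadata(SketchContext ctx) {
        if (ctx.getInputType("input") == null) {
            throw new IllegalStateException("input port has no declared type");
        }
    }

    SketchHandle createIterator(SketchContext ctx) {
        // The same metadata is read again from the context, so nothing
        // had to be remembered from computeMetadata.
        return new SketchHandle(ctx.getInputType("input"));
    }
}
```

Keeping the operator free of cached metadata means the handle depends only on what the context supplies at execution time.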