com.pervasive.datarush.analytics.regression.LinearRegressionLearner

All Implemented Interfaces: LogicalOperator

public class LinearRegressionLearner extends IterativeOperator

Performs a multivariate linear regression on the given training data. The output is a PMML model describing the resultant regression model. The model consists of the y-intercept and the coefficients for each of the given independent variables.

A dependent variable must be specified. This is a field in the input that is the target of the linear regression model. One or more independent variables are required from the input data.

This operator supports numeric as well as categorical data as input. The linear regression is performed using an Ordinary Least Squares (OLS) fit. Dummy Coding is used to handle categorical variables.

Dummy coding requires that, for each categorical variable, one value from its domain be chosen as the reference for all other values in that domain when the model is computed. Specifying reference values through the operator's API is optional. If no reference value is specified for a categorical variable, one is chosen at random from its domain.
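As a rough illustration of dummy coding with a reference value, the following self-contained sketch encodes a single categorical variable into 0/1 indicator columns. The class and method names (DummyCodingSketch, indicatorColumns, encode) are hypothetical and are not part of this operator's API; the operator's internal encoding may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; not part of the LinearRegressionLearner API. */
final class DummyCodingSketch {

    /**
     * Returns the non-reference values of a domain, in domain order.
     * Each of these gets its own 0/1 indicator column; the reference gets none.
     */
    static List<String> indicatorColumns(List<String> domain, String reference) {
        if (!domain.contains(reference)) {
            throw new IllegalArgumentException("reference not in domain: " + reference);
        }
        List<String> columns = new ArrayList<>(domain);
        columns.remove(reference);
        return columns;
    }

    /** Encodes one categorical value as a 0/1 vector over the indicator columns. */
    static double[] encode(List<String> columns, String value) {
        double[] row = new double[columns.size()];
        int index = columns.indexOf(value);
        if (index >= 0) {
            row[index] = 1.0; // non-reference value: exactly one indicator is set
        }
        // The reference value matches no column, so all indicators stay 0.
        return row;
    }

    public static void main(String[] args) {
        List<String> domain = List.of("red", "green", "blue");
        List<String> columns = indicatorColumns(domain, "red");
        System.out.println(columns);                                   // [green, blue]
        System.out.println(Arrays.toString(encode(columns, "blue")));  // [0.0, 1.0]
        System.out.println(Arrays.toString(encode(columns, "red")));   // [0.0, 0.0]
    }
}
```

Because the reference value maps to the all-zero vector, its effect is absorbed into the intercept, and each remaining coefficient is read relative to the reference.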

The output is an estimate of coefficients for the model:

Y = a + (b_1*x_1 + ... + b_n*x_n) + (0*w_{1_ref} + c_{1,1}*w_{1,1} + ... + c_{1,k_1}*w_{1,k_1} + ... + 0*w_{m_ref} + c_{m,1}*w_{m,1} + ... + c_{m,k_m}*w_{m,k_m})

where

a is the constant term (the intercept)
n is the number of numeric input variables
b_i, 0 < i ≤ n, is the coefficient for numeric input variable x_i
m is the number of categorical input variables
w_{i_ref}, 0 < i ≤ m, is the reference value of the categorical variable w_i
k_i, 0 < i ≤ m, is the domain size of the categorical variable w_i
c_{i,j}, 0 < i ≤ m, 0 < j ≤ k_i, is the coefficient for the jth non-reference value w_{i,j} of the ith categorical input variable w_i
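One way to read the equation is as an evaluation procedure: start from the intercept, add each numeric coefficient times its value, then add the coefficient of whichever non-reference value each categorical variable takes (the reference value contributes 0). The sketch below follows that reading. It assumes the coefficients have already been obtained from the model by some other means; the class name RegressionEquationSketch and the map-based layout of the categorical coefficients are hypothetical choices for illustration.

```java
import java.util.Map;

/** Illustrative only; evaluates Y = a + sum(b_i * x_i) + sum(c_{i,j} * w_{i,j}). */
final class RegressionEquationSketch {

    /**
     * @param a intercept
     * @param b coefficients of the numeric variables
     * @param x values of the numeric variables, in the same order as b
     * @param c per categorical variable: coefficient of each non-reference value
     * @param w per categorical variable: the value observed in this record
     */
    static double predict(double a, double[] b, double[] x,
                          Map<String, Map<String, Double>> c,
                          Map<String, String> w) {
        double y = a;
        for (int i = 0; i < b.length; i++) {
            y += b[i] * x[i];
        }
        for (Map.Entry<String, String> entry : w.entrySet()) {
            Map<String, Double> coefficients = c.getOrDefault(entry.getKey(), Map.of());
            // A reference value has no entry, so it contributes 0 to the sum.
            y += coefficients.getOrDefault(entry.getValue(), 0.0);
        }
        return y;
    }

    public static void main(String[] args) {
        // Hypothetical coefficients: "red" is the reference value of "color".
        Map<String, Map<String, Double>> c =
                Map.of("color", Map.of("green", 1.5, "blue", -0.5));
        double[] b = {2.0, 0.5};
        double[] x = {3.0, 4.0};
        System.out.println(predict(1.0, b, x, c, Map.of("color", "blue"))); // 8.5
        System.out.println(predict(1.0, b, x, c, Map.of("color", "red")));  // 9.0
    }
}
```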

The following assumptions are made about the nature of input data:

Independent variables must be linearly independent from each other.
Dependent variable must be numerical (i.e. continuous and not discrete).
All variables loosely follow the normal distribution.
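To make the OLS fit and the linear-independence assumption concrete, here is a minimal textbook sketch that solves the normal equations (X^T X) beta = X^T y by Gauss-Jordan elimination. It is not the operator's implementation, which is not described here; the class name OlsSketch and the fixed pivot tolerance of 1e-12 are assumptions made for the example. When two input columns are linearly dependent, X^T X becomes singular and no unique solution exists.

```java
/** Illustrative only; a textbook OLS fit via the normal equations. */
final class OlsSketch {

    /**
     * Fits y ~ X * beta. Include a leading column of 1s in x to get an intercept.
     * Throws ArithmeticException if the columns of x are linearly dependent.
     */
    static double[] fit(double[][] x, double[] y) {
        int rows = x.length;
        int p = x[0].length;
        // Build the augmented matrix [X^T X | X^T y].
        double[][] a = new double[p][p + 1];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                for (int r = 0; r < rows; r++) {
                    a[i][j] += x[r][i] * x[r][j];
                }
            }
            for (int r = 0; r < rows; r++) {
                a[i][p] += x[r][i] * y[r];
            }
        }
        // Gauss-Jordan elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(a[pivot][col]) < 1e-12) { // assumed tolerance for this sketch
                throw new ArithmeticException("X^T X is singular: columns are linearly dependent");
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = 0; r < p; r++) {
                if (r == col) {
                    continue;
                }
                double factor = a[r][col] / a[col][col];
                for (int k = col; k <= p; k++) {
                    a[r][k] -= factor * a[col][k];
                }
            }
        }
        double[] beta = new double[p];
        for (int i = 0; i < p; i++) {
            beta[i] = a[i][p] / a[i][i];
        }
        return beta;
    }

    public static void main(String[] args) {
        // Data generated from y = 1 + 2x; the first column of 1s models the intercept.
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 3, 5, 7};
        double[] beta = fit(x, y);
        System.out.printf("intercept=%.3f slope=%.3f%n", beta[0], beta[1]);
    }
}
```

Solving the normal equations directly is the simplest formulation to show; numerically sturdier approaches exist, and nothing here implies which one the operator uses.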

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected static final int | MAX_DOMAIN_SIZE | |
| protected static final int | MIN_DOMAIN_SIZE | |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| LinearRegressionLearner() | Default constructor. |
| LinearRegressionLearner(String dependentVariable, String... independentVariables) | Constructor specifying the dependent variable and independent variables. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected void | computeMetadata(IterativeMetadataContext context) | Implementations must adhere to the following contracts |
| protected CompositionIterator | createIterator(MetadataContext context) | Invoked at the start of execution. |
| String | getDependentVariable() | Get the field name of the dependent variable. |
| String[] | getIndependentVariables() | Get the field names of the independent variables. |
| RecordPort | getInput() | Get the input port of this operator. |
| PMMLPort | getOutput() | Get the output port of this operator. |
| Map<String,String> | getReferenceValues() | Get the reference values for the independent categorical variables as they were set using the corresponding setter method. |
| Double | getSingularityThreshold() | Get the singularityThreshold value. |
| void | setDependentVariable(String dependentVariable) | Set the field name of the dependent variable. |
| void | setIndependentVariables(String... independentVariables) | Set the field names of the independent variables. |
| void | setReferenceValues(Map<String,String> referenceValues) | Set reference values for the independent categorical variables. |
| void | setSingularityThreshold(Double singularityThresholdValue) | Set the singularityThreshold value against which a matrix is considered singular or non-singular. |

Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- MAX_DOMAIN_SIZE
  
  protected static final int MAX_DOMAIN_SIZE
  See Also:
  
  Constant Field Values
- MIN_DOMAIN_SIZE
  
  protected static final int MIN_DOMAIN_SIZE
  See Also:
  
  Constant Field Values
Constructor Details
- LinearRegressionLearner
  
  public LinearRegressionLearner()
  
  Default constructor. Use setDependentVariable(String) and setIndependentVariables(String...) to set the dependent and independent variables.
- LinearRegressionLearner
  
  public LinearRegressionLearner(String dependentVariable, String... independentVariables)
  
  Constructor specifying the dependent variable and independent variables.
  
  Parameters:
  
  dependentVariable - name of the dependent variable field
  
  independentVariables - names of the independent variable fields
Method Details
- getDependentVariable
  
  public String getDependentVariable()
  
  Get the field name of the dependent variable.
  
  Returns:
  
  dependent variable field name
- setDependentVariable
  
  public void setDependentVariable(String dependentVariable)
  
  Set the field name of the dependent variable.
  
  Parameters:
  
  dependentVariable - dependent variable field name
- getIndependentVariables
  
  public String[] getIndependentVariables()
  
  Get the field names of the independent variables.
  
  Returns:
  
  independent variable field names
- setIndependentVariables
  
  public void setIndependentVariables(String... independentVariables)
  
  Set the field names of the independent variables.
  
  Parameters:
  
  independentVariables - independent variable field names
- setReferenceValues
  
  public void setReferenceValues(Map<String,String> referenceValues)
  
  Set reference values for the independent categorical variables. If no reference value is provided for a certain variable, one randomly chosen value from its domain will be picked as reference.
  
  Parameters:
  
  referenceValues - mapping from independent categorical variable names to their reference values
- getReferenceValues
  
  public Map<String,String> getReferenceValues()
  
  Get the reference values for the independent categorical variables as they were set using the corresponding setter method.
  
  Returns:
  
  mapping from independent categorical variable names to their reference values
- setSingularityThreshold
  
  public void setSingularityThreshold(Double singularityThresholdValue)
  
  Set the singularityThreshold value against which a matrix is considered singular or non-singular.
  
  Parameters:
  
  singularityThresholdValue - bound used to determine effective singularity in LU decomposition
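A common way such a threshold is applied during LU-style elimination is to treat the matrix as singular as soon as the largest available pivot falls below the bound. The sketch below shows that idea in isolation. It is an assumption about the general technique, not a description of this operator's internals, and the class name SingularityCheckSketch is hypothetical.

```java
/** Illustrative only; shows how a pivot threshold can decide "effective" singularity. */
final class SingularityCheckSketch {

    /**
     * Runs Gaussian elimination with partial pivoting on a copy of the matrix
     * and reports singular as soon as the best pivot is smaller than the threshold.
     */
    static boolean isSingular(double[][] matrix, double threshold) {
        int n = matrix.length;
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) {
            a[i] = matrix[i].clone(); // leave the caller's matrix untouched
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(a[pivot][col]) < threshold) {
                return true; // no usable pivot in this column
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[][] dependent = {{1, 2}, {2, 4}};       // second row is twice the first
        double[][] regular = {{2, 1}, {1, 3}};
        double[][] nearly = {{1, 1}, {1, 1 + 1e-12}};  // almost dependent rows
        System.out.println(isSingular(dependent, 1e-11)); // true
        System.out.println(isSingular(regular, 1e-11));   // false
        System.out.println(isSingular(nearly, 1e-11));    // true
        System.out.println(isSingular(nearly, 1e-15));    // false
    }
}
```

The last two lines show the trade-off: a larger threshold rejects nearly dependent inputs early, while a smaller one lets them through at the cost of numerically fragile coefficients.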
- getSingularityThreshold
  
  public Double getSingularityThreshold()
  
  Get the singularityThreshold value.
  
  Returns:
  
  the singularityThreshold value
- getInput
  
  public RecordPort getInput()
  
  Get the input port of this operator.
  
  Returns:
  
  input port
- getOutput
  
  public PMMLPort getOutput()
  
  Get the output port of this operator. The port provides the linear regression PMML model generated for the input data.
  
  Returns:
  
  output PMML port
- computeMetadata
  
  protected void computeMetadata(IterativeMetadataContext context)
  
  Description copied from class: IterativeOperator
  Implementations must adhere to the following contracts:
  General
  Regardless of input/output port types, all implementations must do the following:
  
  1. Validation. Validation of configuration should always be performed first.
  2. Declare operator parallelizability. Implementations must declare this by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
  3. Declare output port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
  4. Declare input port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
  
  NOTE: There is a convenience method for performing steps 2-4 for the case where all record ports are parallelizable and parallelism is to be determined based on the source:
  
  MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords
  
  Input record ports
  Implementations with input record ports must declare the following:
  
  Required data ordering:
  Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering, otherwise iteration will proceed on an input dataset whose order is undefined.
  Required data distribution (only applies to parallelizable input ports):
  Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution, otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
  Note that if the upstream operator's output distribution/ordering is compatible with those required, we avoid a re-sort/re-distribution which is generally a very large savings from a performance standpoint.
  Output record ports (static metadata)
  Implementations with output record ports must declare the following:
  
  Type: Implementations must declare their output type by calling RecordPort#setType.
  
  Implementations with output record ports may declare the following:
  
  Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering
  
  Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution
  
  Note that both of these properties are optional; if unspecified, performance may suffer since the framework may unnecessarily re-sort/re-distribute the data.
  Input model ports
  In general, iterative operators will tend not to have model input ports, but if so, there is nothing special to declare for input model ports. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.
  Output model ports (static metadata)
  SimpleModelPort's have no associated metadata and therefore there is never any output metadata to declare. PMMLPort's, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
  
  pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.
  
  Output ports with dynamic metadata
  If an output port has dynamic metadata, implementations can declare by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). In the case that metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc are not allowed and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).
  Specified by:
  
  computeMetadata in class IterativeOperator
  
  Parameters:
  
  context - the context
- createIterator
  
  protected CompositionIterator createIterator(MetadataContext context)
  
  Description copied from class: IterativeOperator
  
  Invoked at the start of execution. The iterator is expected to return a handle that is then used for execution.
  
  Specified by:
  
  createIterator in class IterativeOperator
  
  Parameters:
  
  context - a context in which the iterative operator can find input port metadata, etc. This information was available in the previous call to IterativeOperator.computeMetadata(IterativeMetadataContext), but is available here as well so that the iterative operator need not cache any metadata in its instance variables.
  
  Returns:
  
  a handle that is used for iteration
