Class LinearRegressionLearner
- java.lang.Object
  - com.pervasive.datarush.operators.AbstractLogicalOperator
    - com.pervasive.datarush.operators.IterativeOperator
      - com.pervasive.datarush.analytics.regression.LinearRegressionLearner
- All Implemented Interfaces: LogicalOperator
public class LinearRegressionLearner extends IterativeOperator
Performs a multivariate linear regression on the given training data. The output is a PMML model describing the resultant regression model. The model consists of the y-intercept and the coefficients for each of the given independent variables.
A dependent variable must be specified. This is a field in the input that is the target of the linear regression model. One or more independent variables are required from the input data.
This operator supports numeric as well as categorical data as input. The linear regression is performed using an Ordinary Least Squares (OLS) fit. Dummy Coding is used to handle categorical variables.
This approach requires that, for each categorical variable, one value from its domain be chosen to serve as the reference for all other values in that domain when the model is computed. Specifying reference values through the operator's API is optional. If the user specifies no reference value for a categorical variable, one is chosen at random.
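The dummy coding described above can be sketched in plain Java. This is an illustration of the general technique, not the operator's internal code: the class name `DummyCodingSketch`, the method `encode`, and the color domain are all hypothetical. Each non-reference value gets its own 0/1 indicator column, and the reference value maps to all zeros.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of dummy coding; not part of the DataRush API. */
final class DummyCodingSketch {

    /**
     * Encodes a categorical value as (domain size - 1) indicator columns.
     * The reference value is encoded as the all-zero vector.
     */
    static double[] encode(List<String> domain, String reference, String value) {
        if (!domain.contains(reference) || !domain.contains(value)) {
            throw new IllegalArgumentException("reference and value must be in the domain");
        }
        double[] indicators = new double[domain.size() - 1];
        int column = 0;
        for (String candidate : domain) {
            if (candidate.equals(reference)) {
                continue; // the reference value gets no column of its own
            }
            indicators[column++] = candidate.equals(value) ? 1.0 : 0.0;
        }
        return indicators;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "green", "blue");
        // With "green" as reference, the columns are [red, blue].
        System.out.println(Arrays.toString(encode(colors, "green", "blue")));  // [0.0, 1.0]
        System.out.println(Arrays.toString(encode(colors, "green", "green"))); // [0.0, 0.0]
    }
}
```

Because the reference value is the all-zero row, the coefficient fitted for every other value measures its effect relative to the reference.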
The output is an estimate of coefficients for the model:
Y = a + (b_1*x_1 + ... + b_n*x_n) + (0*w_{1,ref} + c_{1,1}*w_{1,1} + ... + c_{1,k_1}*w_{1,k_1} + ... + 0*w_{m,ref} + c_{m,1}*w_{m,1} + ... + c_{m,k_m}*w_{m,k_m})
where
- a is the constant term (the intercept)
- n is the number of numeric input variables
- b_i, 0 < i ≤ n, is the coefficient for numeric input variable x_i
- m is the number of categorical input variables
- w_{i,ref}, 0 < i ≤ m, is the reference value of the categorical variable w_i
- k_i, 0 < i ≤ m, is the domain size of the categorical variable w_i
- c_{i,j}, 0 < i ≤ m, 0 < j ≤ k_i, is the coefficient for the jth non-reference value w_{i,j} of the ith categorical input variable w_i
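To make the formula concrete, the following sketch evaluates Y for one record from a set of already fitted coefficients. The class `RegressionModelSketch`, its `predict` method, and the sample coefficients are hypothetical and only mirror the symbols defined above; the operator itself emits a PMML model rather than exposing such a method. As a simplifying assumption, a categorical value that has no coefficient is treated as the reference value and contributes 0.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch that evaluates the model formula; not part of the DataRush API. */
final class RegressionModelSketch {

    /**
     * @param a intercept
     * @param b coefficients of the numeric variables
     * @param x values of the numeric variables, in the same order as b
     * @param c per categorical variable, coefficient of each non-reference value
     * @param w value taken by each categorical variable in this record
     */
    static double predict(double a, double[] b, double[] x,
                          Map<String, Map<String, Double>> c, Map<String, String> w) {
        if (b.length != x.length) {
            throw new IllegalArgumentException("b and x must have the same length");
        }
        double y = a;
        for (int i = 0; i < b.length; i++) {
            y += b[i] * x[i];
        }
        for (Map.Entry<String, String> entry : w.entrySet()) {
            Map<String, Double> coefficients = c.get(entry.getKey());
            if (coefficients == null) {
                throw new IllegalArgumentException("unknown variable: " + entry.getKey());
            }
            // Assumption: a value without a coefficient is the reference value (weight 0).
            y += coefficients.getOrDefault(entry.getValue(), 0.0);
        }
        return y;
    }

    public static void main(String[] args) {
        Map<String, Double> color = new HashMap<>();
        color.put("red", 0.5);
        color.put("blue", -1.0); // "green" is the reference value, so it has no entry
        Map<String, Map<String, Double>> c = new HashMap<>();
        c.put("color", color);

        Map<String, String> record = new HashMap<>();
        record.put("color", "blue");
        System.out.println(predict(1.0, new double[] {2.0}, new double[] {3.0}, c, record)); // 6.0
        record.put("color", "green");
        System.out.println(predict(1.0, new double[] {2.0}, new double[] {3.0}, c, record)); // 7.0
    }
}
```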
The following assumptions are made about the nature of input data:
- Independent variables must be linearly independent from each other.
- Dependent variable must be numerical (i.e. continuous and not discrete).
- All variables loosely follow the normal distribution.
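As a rough illustration of an OLS fit and of why the independent variables must be linearly independent, the sketch below solves the normal equations (X'X)β = X'y with Gaussian elimination. It is a textbook method chosen for brevity, not necessarily what the operator does internally, and the class `OlsSketch` and its `fit` method are hypothetical. When columns are linearly dependent, X'X is singular and the system has no unique solution. The exact-zero pivot test here is a simplification, since rounding usually leaves a tiny non-zero pivot instead.

```java
/** Hypothetical sketch of an OLS fit via the normal equations; not part of the DataRush API. */
final class OlsSketch {

    /** Returns [a, b_1, ..., b_n] for rows of numeric inputs x and targets y. */
    static double[] fit(double[][] x, double[] y) {
        int cols = x[0].length + 1;                   // +1 for the intercept column
        double[][] m = new double[cols][cols + 1];    // augmented matrix [X'X | X'y]
        for (int r = 0; r < x.length; r++) {
            double[] row = new double[cols];
            row[0] = 1.0;                             // intercept term
            System.arraycopy(x[r], 0, row, 1, cols - 1);
            for (int i = 0; i < cols; i++) {
                for (int j = 0; j < cols; j++) {
                    m[i][j] += row[i] * row[j];
                }
                m[i][cols] += row[i] * y[r];
            }
        }
        // Forward elimination with partial pivoting.
        for (int p = 0; p < cols; p++) {
            int best = p;
            for (int r = p + 1; r < cols; r++) {
                if (Math.abs(m[r][p]) > Math.abs(m[best][p])) {
                    best = r;
                }
            }
            double[] tmp = m[p];
            m[p] = m[best];
            m[best] = tmp;
            if (m[p][p] == 0.0) {
                throw new ArithmeticException("X'X is singular: inputs are linearly dependent");
            }
            for (int r = p + 1; r < cols; r++) {
                double factor = m[r][p] / m[p][p];
                for (int c = p; c <= cols; c++) {
                    m[r][c] -= factor * m[p][c];
                }
            }
        }
        // Back substitution.
        double[] beta = new double[cols];
        for (int i = cols - 1; i >= 0; i--) {
            double sum = m[i][cols];
            for (int j = i + 1; j < cols; j++) {
                sum -= m[i][j] * beta[j];
            }
            beta[i] = sum / m[i][i];
        }
        return beta;
    }

    public static void main(String[] args) {
        // Points lying exactly on y = 1 + 2x.
        double[][] x = {{0.0}, {1.0}, {2.0}, {3.0}};
        double[] y = {1.0, 3.0, 5.0, 7.0};
        double[] beta = fit(x, y);
        System.out.println("intercept=" + beta[0] + ", slope=" + beta[1]);
    }
}
```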
Field Summary
Fields
- protected static int MAX_DOMAIN_SIZE
- protected static int MIN_DOMAIN_SIZE
Constructor Summary
Constructors
- LinearRegressionLearner(): Default constructor.
- LinearRegressionLearner(String dependentVariable, String... independentVariables): Constructor specifying the dependent variable and independent variables.
Method Summary
Methods
- protected void computeMetadata(IterativeMetadataContext context): Implementations must adhere to the following contracts
- protected CompositionIterator createIterator(MetadataContext context): Invoked at the start of execution.
- String getDependentVariable(): Get the field name of the dependent variable.
- String[] getIndependentVariables(): Get the field names of the independent variables.
- RecordPort getInput(): Get the input port of this operator.
- PMMLPort getOutput(): Get the output port of this operator.
- Map<String,String> getReferenceValues(): Get the reference values for the independent categorical variables as they were set using the corresponding setter method.
- Double getSingularityThreshold(): Get the singularityThreshold value.
- void setDependentVariable(String dependentVariable): Set the field name of the dependent variable.
- void setIndependentVariables(String... independentVariables): Set the field names of the independent variables.
- void setReferenceValues(Map<String,String> referenceValues): Set reference values for the independent categorical variables.
- void setSingularityThreshold(Double singularityThresholdValue): Set the singularityThreshold value against which a matrix is considered singular or non-singular.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
Field Detail
-
MAX_DOMAIN_SIZE
protected static final int MAX_DOMAIN_SIZE
- See Also:
- Constant Field Values
-
MIN_DOMAIN_SIZE
protected static final int MIN_DOMAIN_SIZE
- See Also:
- Constant Field Values
Constructor Detail
-
LinearRegressionLearner
public LinearRegressionLearner()
Default constructor. Use setDependentVariable(String) and setIndependentVariables(String...) to set the dependent and independent variables.
-
LinearRegressionLearner
public LinearRegressionLearner(String dependentVariable, String... independentVariables)
Constructor specifying the dependent variable and independent variables.
- Parameters:
- dependentVariable - name of the dependent variable field
- independentVariables - names of the independent variable fields
Method Detail
-
getDependentVariable
public String getDependentVariable()
Get the field name of the dependent variable.
- Returns:
- dependent variable field name
-
setDependentVariable
public void setDependentVariable(String dependentVariable)
Set the field name of the dependent variable.
- Parameters:
- dependentVariable - dependent variable field name
-
getIndependentVariables
public String[] getIndependentVariables()
Get the field names of the independent variables.
- Returns:
- independent variable field names
-
setIndependentVariables
public void setIndependentVariables(String... independentVariables)
Set the field names of the independent variables.
- Parameters:
- independentVariables - independent variable field names
-
setReferenceValues
public void setReferenceValues(Map<String,String> referenceValues)
Set reference values for the independent categorical variables. If no reference value is provided for a variable, a value chosen at random from its domain is used as the reference.
- Parameters:
- referenceValues - mapping from independent categorical variable names to their reference values
-
getReferenceValues
public Map<String,String> getReferenceValues()
Get the reference values for the independent categorical variables as they were set using the corresponding setter method.
- Returns:
- mapping from independent categorical variable names to their reference values
-
setSingularityThreshold
public void setSingularityThreshold(Double singularityThresholdValue)
Set the singularityThreshold value against which a matrix is considered singular or non-singular.
- Parameters:
- singularityThresholdValue - bound used to determine effective singularity in LU decomposition
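How the operator applies the threshold internally is not documented here. A common convention, assumed in the sketch below, is to treat a matrix as effectively singular when a pivot found during LU decomposition has a magnitude below the threshold. The class `SingularityCheckSketch`, its `isSingular` method, and the threshold values are hypothetical.

```java
/** Hypothetical sketch of a pivot-based singularity test; not part of the DataRush API. */
final class SingularityCheckSketch {

    /** Returns true if any pivot of the partially pivoted elimination falls below threshold. */
    static boolean isSingular(double[][] matrix, double threshold) {
        int n = matrix.length;
        double[][] lu = new double[n][];
        for (int i = 0; i < n; i++) {
            lu[i] = matrix[i].clone(); // work on a copy, leave the input untouched
        }
        for (int p = 0; p < n; p++) {
            // Pick the row with the largest magnitude in column p as the pivot row.
            int best = p;
            for (int r = p + 1; r < n; r++) {
                if (Math.abs(lu[r][p]) > Math.abs(lu[best][p])) {
                    best = r;
                }
            }
            if (Math.abs(lu[best][p]) < threshold) {
                return true; // even the best pivot is too small: effectively singular
            }
            double[] tmp = lu[p];
            lu[p] = lu[best];
            lu[best] = tmp;
            // Eliminate column p from the rows below the pivot.
            for (int r = p + 1; r < n; r++) {
                double factor = lu[r][p] / lu[p][p];
                for (int c = p; c < n; c++) {
                    lu[r][c] -= factor * lu[p][c];
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[][] dependent = {{1.0, 2.0}, {2.0, 4.0}};   // second row is twice the first
        double[][] independent = {{2.0, 0.0}, {0.0, 3.0}};
        System.out.println(isSingular(dependent, 1e-11));   // true
        System.out.println(isSingular(independent, 1e-11)); // false
    }
}
```

A larger threshold rejects more nearly collinear inputs, while a smaller one accepts them at the cost of numerically unstable coefficients.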
-
getSingularityThreshold
public Double getSingularityThreshold()
Get the singularityThreshold value.
- Returns:
- the singularityThreshold value
-
getInput
public RecordPort getInput()
Get the input port of this operator.
- Returns:
- input port
-
getOutput
public PMMLPort getOutput()
Get the output port of this operator. The port provides the linear regression PMML model generated for the input data.
- Returns:
- output PMML port
-
computeMetadata
protected void computeMetadata(IterativeMetadataContext context)
Description copied from class: IterativeOperator
Implementations must adhere to the following contracts.
General
Regardless of input/output port types, all implementations must do the following:
- Validation. Validation of configuration should always be performed first.
- Declare operator parallelizability. Implementations must declare this by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
- Declare output port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
- Declare input port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords
Input record ports
Implementations with input record ports must declare the following:
- Required data ordering: Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering, otherwise iteration will proceed on an input dataset whose order is undefined.
- Required data distribution (only applies to parallelizable input ports): Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution, otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
Output record ports (static metadata)
Implementations with output record ports must declare the following:
- Type: Implementations must declare their output type by calling RecordPort#setType.
- Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering.
- Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution.
Input model ports
In general, iterative operators will tend not to have model input ports, but if so, there is nothing special to declare for input model ports. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.
Output model ports (static metadata)
SimpleModelPorts have no associated metadata and therefore there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
- pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.
Output ports with dynamic metadata
If an output port has dynamic metadata, implementations can declare this by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). In the case that metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc. are not allowed, and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).
- Specified by:
- computeMetadata in class IterativeOperator
- Parameters:
- context - the context
-
createIterator
protected CompositionIterator createIterator(MetadataContext context)
Description copied from class: IterativeOperator
Invoked at the start of execution. The iterator is expected to return a handle that is then used for execution.
- Specified by:
- createIterator in class IterativeOperator
- Parameters:
- context - a context in which the iterative operator can find input port metadata, etc. This information was available in the previous call to IterativeOperator.computeMetadata(IterativeMetadataContext), but is available here as well so that the iterative operator need not cache any metadata in its instance variables.
- Returns:
- a handle that is used for iteration