Class LinearRegressionLearner

All Implemented Interfaces:
LogicalOperator

public class LinearRegressionLearner extends IterativeOperator
Performs a multivariate linear regression on the given training data. The output is a PMML model describing the resultant regression model. The model consists of the y-intercept and the coefficients for each of the given independent variables.

A dependent variable must be specified. This is a field in the input that is the target of the linear regression model. One or more independent variables are required from the input data.

This operator supports numeric as well as categorical data as input. The linear regression is performed using an Ordinary Least Squares (OLS) fit. Dummy Coding is used to handle categorical variables.

This approach requires that, for each categorical variable, one value from its domain be chosen as the reference for all other values in that domain when the model is computed. Specifying reference values through the operator's API is optional. If no reference value is specified for a categorical variable, one is chosen at random.
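To make the role of the reference value concrete, here is a minimal, self-contained sketch of dummy coding for a single categorical value. The class and method names (`DummyCodingSketch`, `encode`) are hypothetical and are not part of this operator's API; the sketch only illustrates the general technique, not how the operator implements it internally.

```java
import java.util.Arrays;
import java.util.List;

final class DummyCodingSketch {
    /**
     * Encodes one categorical value as indicator columns, one column per
     * non-reference value of the domain (in domain order). The reference
     * value itself is encoded as all zeros. Assumes the domain holds
     * distinct values.
     */
    static double[] encode(List<String> domain, String reference, String value) {
        if (!domain.contains(reference) || !domain.contains(value)) {
            throw new IllegalArgumentException("value and reference must belong to the domain");
        }
        double[] indicators = new double[domain.size() - 1];
        int column = 0;
        for (String candidate : domain) {
            if (candidate.equals(reference)) {
                continue; // the reference value gets no column of its own
            }
            indicators[column++] = candidate.equals(value) ? 1.0 : 0.0;
        }
        return indicators;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "green", "blue");
        // With "red" as the reference, the columns are [green, blue].
        System.out.println(Arrays.toString(encode(colors, "red", "green")));
        System.out.println(Arrays.toString(encode(colors, "red", "red")));
    }
}
```

Because the reference value maps to all zeros, its effect is absorbed by the intercept, and each remaining coefficient is read relative to the reference.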

The output is an estimate of coefficients for the model:

$$Y = a + (b_1 x_1 + \dots + b_n x_n) + (0 \cdot w_{1,\mathrm{ref}} + c_{1,1} w_{1,1} + \dots + c_{1,k_1} w_{1,k_1} + \dots + 0 \cdot w_{m,\mathrm{ref}} + c_{m,1} w_{m,1} + \dots + c_{m,k_m} w_{m,k_m})$$

where

  • $a$ is the constant term (also known as the intercept)
  • $n$ is the number of numeric input variables
  • $b_i$, $0 < i \le n$, is the coefficient for numerical input variable $x_i$
  • $m$ is the number of categorical input variables
  • $w_{i,\mathrm{ref}}$, $0 < i \le m$, is the reference value of the categorical variable $w_i$
  • $k_i$, $0 < i \le m$, is the domain size of the categorical variable $w_i$
  • $c_{i,j}$, $0 < i \le m$, $0 < j \le k_i$, is the coefficient for the $j$th non-reference value $w_{i,j}$ of the $i$th categorical input variable $w_i$
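As an illustration of how the estimated coefficients combine into a prediction, the following sketch evaluates the equation above for one record. It is an assumption-laden example: the class `RegressionEquationSketch`, the method `predict`, and the nested-map layout for the categorical coefficients are all hypothetical and do not reflect how the PMML model is stored or scored.

```java
import java.util.HashMap;
import java.util.Map;

final class RegressionEquationSketch {
    /**
     * Computes intercept + sum(b_i * x_i) + sum of the categorical
     * coefficients selected by the observed categorical values.
     */
    static double predict(double intercept,
                          double[] numericCoefficients,
                          double[] numericValues,
                          Map<String, Map<String, Double>> categoricalCoefficients,
                          Map<String, String> categoricalValues) {
        double y = intercept;
        for (int i = 0; i < numericCoefficients.length; i++) {
            y += numericCoefficients[i] * numericValues[i];
        }
        for (Map.Entry<String, String> observed : categoricalValues.entrySet()) {
            Map<String, Double> byValue = categoricalCoefficients.get(observed.getKey());
            // A value with no coefficient entry (the reference value in this
            // sketch) contributes 0 to the sum.
            Double coefficient = (byValue == null) ? null : byValue.get(observed.getValue());
            if (coefficient != null) {
                y += coefficient;
            }
        }
        return y;
    }

    public static void main(String[] args) {
        // Hypothetical model: "red" is the reference value for "color".
        Map<String, Double> color = new HashMap<>();
        color.put("green", 1.5);
        color.put("blue", -1.0);
        Map<String, Map<String, Double>> categorical = new HashMap<>();
        categorical.put("color", color);

        Map<String, String> record = new HashMap<>();
        record.put("color", "green");
        System.out.println(predict(1.0, new double[] {2.0, -0.5},
                new double[] {3.0, 4.0}, categorical, record));
    }
}
```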

The following assumptions are made about the nature of input data:

  • Independent variables must be linearly independent from each other.
  • Dependent variable must be numerical (i.e. continuous and not discrete).
  • All variables loosely follow the normal distribution.
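
The linear-independence assumption matters because an OLS fit solves the normal equations (XᵀX)β = Xᵀy, which have a unique solution only when the columns of X are linearly independent. The sketch below fits a small model this way using Gauss-Jordan elimination. It is an illustration of the general technique only: the class `OlsSketch`, the method `fit`, and the hard-coded pivot tolerance are assumptions, and the operator's actual solver may differ.

```java
final class OlsSketch {
    /**
     * Fits y = beta[0] + beta[1]*x_1 + ... + beta[p-1]*x_(p-1) by solving
     * the normal equations. Throws if the design matrix is rank deficient.
     */
    static double[] fit(double[][] x, double[] y) {
        int p = x[0].length + 1; // +1 for the intercept column
        double[][] a = new double[p][p + 1]; // augmented matrix [X^T X | X^T y]
        for (int r = 0; r < x.length; r++) {
            double[] row = new double[p];
            row[0] = 1.0;
            System.arraycopy(x[r], 0, row, 1, p - 1);
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    a[i][j] += row[i] * row[j];
                }
                a[i][p] += row[i] * y[r];
            }
        }
        // Gauss-Jordan elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(a[pivot][col]) < 1e-12) { // tolerance is an arbitrary choice
                throw new ArithmeticException("independent variables are linearly dependent");
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = 0; r < p; r++) {
                if (r == col) {
                    continue;
                }
                double factor = a[r][col] / a[col][col];
                for (int k = col; k <= p; k++) {
                    a[r][k] -= factor * a[col][k];
                }
            }
        }
        double[] beta = new double[p];
        for (int i = 0; i < p; i++) {
            beta[i] = a[i][p] / a[i][i];
        }
        return beta;
    }
}
```

When one independent variable is an exact multiple of another, the elimination runs out of usable pivots and the sketch throws instead of returning arbitrary coefficients.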

  • Constructor Details

    • LinearRegressionLearner

      public LinearRegressionLearner()
      Default constructor. Use setDependentVariable(String) and setIndependentVariables(String...) to set the dependent and independent variables.
    • LinearRegressionLearner

      public LinearRegressionLearner(String dependentVariable, String... independentVariables)
      Constructor specifying the dependent variable and independent variables.
      Parameters:
      dependentVariable - name of the dependent variable field
      independentVariables - names of the independent variable fields
  • Method Details

    • getDependentVariable

      public String getDependentVariable()
      Get the field name of the dependent variable.
      Returns:
      dependent variable field name
    • setDependentVariable

      public void setDependentVariable(String dependentVariable)
      Set the field name of the dependent variable.
      Parameters:
      dependentVariable - dependent variable field name
    • getIndependentVariables

      public String[] getIndependentVariables()
      Get the field names of the independent variables.
      Returns:
      independent variable field names
    • setIndependentVariables

      public void setIndependentVariables(String... independentVariables)
      Set the field names of the independent variables.
      Parameters:
      independentVariables - independent variable field names
    • setReferenceValues

      public void setReferenceValues(Map<String,String> referenceValues)
      Set reference values for the independent categorical variables. If no reference value is provided for a certain variable, one randomly chosen value from its domain will be picked as reference.
      Parameters:
      referenceValues - mapping from independent categorical variable names to their reference values
    • getReferenceValues

      public Map<String,String> getReferenceValues()
      Get the reference values for the independent categorical variables as they were set using the corresponding setter method.
      Returns:
      mapping from independent categorical variable names to their reference values
    • setSingularityThreshold

      public void setSingularityThreshold(Double singularityThresholdValue)
      Set the singularity threshold, the value against which a matrix is judged to be singular or non-singular.
      Parameters:
      singularityThresholdValue - bound used to determine effective singularity in the LU decomposition
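
A threshold of this kind is commonly applied to pivot magnitudes during elimination. The sketch below shows that idea in isolation; it assumes a pivot-magnitude criterion, which this page does not confirm, and the names `SingularityCheckSketch` and `isEffectivelySingular` are hypothetical.

```java
final class SingularityCheckSketch {
    /**
     * Runs Gaussian elimination with partial pivoting on a copy of a square
     * matrix and reports whether any pivot magnitude falls below the threshold.
     */
    static boolean isEffectivelySingular(double[][] matrix, double threshold) {
        int n = matrix.length;
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) {
            a[i] = matrix[i].clone(); // leave the caller's matrix untouched
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int row = col + 1; row < n; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) {
                    pivot = row;
                }
            }
            if (Math.abs(a[pivot][col]) < threshold) {
                return true; // no usable pivot in this column
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int row = col + 1; row < n; row++) {
                double factor = a[row][col] / a[col][col];
                for (int k = col; k < n; k++) {
                    a[row][k] -= factor * a[col][k];
                }
            }
        }
        return false;
    }
}
```

Under this criterion, a larger threshold rejects nearly collinear inputs earlier, while a smaller one accepts them at the cost of numerically unstable coefficients.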
    • getSingularityThreshold

      public Double getSingularityThreshold()
      Get the singularity threshold value.
      Returns:
      the singularity threshold value
    • getInput

      public RecordPort getInput()
      Get the input port of this operator.
      Returns:
      input port
    • getOutput

      public PMMLPort getOutput()
      Get the output port of this operator. The port provides the linear regression PMML model generated for the input data.
      Returns:
      output PMML port
    • computeMetadata

      protected void computeMetadata(IterativeMetadataContext context)
      Description copied from class: IterativeOperator
      Implementations must adhere to the following contracts:

      General

      Regardless of input ports/output port types, all implementations must do the following:

      1. Validation. Validation of configuration should always be performed first.
      2. Declare operator parallelizability. Implementations must declare this by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
      3. Declare output port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
      4. Declare input port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
      NOTE: There is a convenience method for performing steps 2-4 for the case where all record ports are parallelizable and where we are to determine parallelism based on source:
      • MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords

      Input record ports

      Implementations with input record ports must declare the following:
      1. Required data ordering: Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering, otherwise iteration will proceed on an input dataset whose order is undefined.
      2. Required data distribution (only applies to parallelizable input ports): Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution, otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
      Note that if the upstream operator's output distribution/ordering is compatible with those required, we avoid a re-sort/re-distribution which is generally a very large savings from a performance standpoint.

      Output record ports (static metadata)

      Implementations with output record ports must declare the following:
      1. Type: Implementations must declare their output type by calling RecordPort#setType.
      Implementations with output record ports may declare the following:
      1. Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering
      2. Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution
      Note that both of these properties are optional; if unspecified, performance may suffer since the framework may unnecessarily re-sort/re-distribute the data.

      Input model ports

      In general, iterative operators will tend not to have model input ports, but if so, there is nothing special to declare for input model ports. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.

      Output model ports (static metadata)

      SimpleModelPorts have no associated metadata and therefore there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
      1. pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.

      Output ports with dynamic metadata

      If an output port has dynamic metadata, implementations can declare this by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). When metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc. are not allowed, and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).

      Specified by:
      computeMetadata in class IterativeOperator
      Parameters:
      context - the context
    • createIterator

      protected CompositionIterator createIterator(MetadataContext context)
      Description copied from class: IterativeOperator
      Invoked at the start of execution. The iterator is expected to return a handle that is then used for execution.
      Specified by:
      createIterator in class IterativeOperator
      Parameters:
      context - a context in which the iterative operator can find input port metadata, etc. This information was available in the previous call to IterativeOperator.computeMetadata(IterativeMetadataContext), but is available here as well so that the iterative operator need not cache any metadata in its instance variables.
      Returns:
      a handle that is used for iteration