Class LinearRegressionLearner

  • All Implemented Interfaces:
    LogicalOperator

    public class LinearRegressionLearner
    extends IterativeOperator
    Performs a multivariate linear regression on the given training data. The output is a PMML model describing the resultant regression model. The model consists of the y-intercept and the coefficients for each of the given independent variables.

    A dependent variable must be specified. This is a field in the input that is the target of the linear regression model. One or more independent variables are required from the input data.

    This operator supports numeric as well as categorical data as input. The linear regression is performed using an Ordinary Least Squares (OLS) fit. Dummy Coding is used to handle categorical variables.
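As an illustration of what an Ordinary Least Squares fit computes, the following minimal sketch solves the normal equations (X^T X) beta = X^T y for purely numeric inputs. The class name `OlsSketch` and the method `fit` are hypothetical and not part of this operator's API; the source does not state how the operator solves the system internally, so this is only one textbook way of doing it.

```java
class OlsSketch {
    /**
     * Fits y = beta[0] + beta[1]*x1 + ... + beta[p-1]*x(p-1) by solving the
     * normal equations (X^T X) beta = X^T y with Gaussian elimination.
     * Assumes the columns of x are linearly independent.
     */
    static double[] fit(double[][] x, double[] y) {
        int p = x[0].length + 1;               // +1 for the intercept column
        double[][] a = new double[p][p + 1];   // augmented matrix [X^T X | X^T y]
        for (int r = 0; r < x.length; r++) {
            double[] row = new double[p];
            row[0] = 1.0;                      // constant term
            System.arraycopy(x[r], 0, row, 1, p - 1);
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    a[i][j] += row[i] * row[j];
                }
                a[i][p] += row[i] * y[r];
            }
        }
        // Forward elimination with partial pivoting.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < p; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }
        // Back substitution.
        double[] beta = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            double sum = a[i][p];
            for (int j = i + 1; j < p; j++) {
                sum -= a[i][j] * beta[j];
            }
            beta[i] = sum / a[i][i];
        }
        return beta;
    }
}
```

The first element of the returned array plays the role of the intercept; the remaining elements are the coefficients of the numeric variables in column order.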

    This approach requires that, for each categorical variable, one value from its domain be chosen to serve as the reference for all other values in that domain when the model is computed. Specifying reference values through the operator's API is optional. If the user does not specify a reference value for a categorical variable, one is chosen at random.
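To make the role of the reference value concrete, here is a minimal sketch of dummy coding for a single categorical variable. The class `DummyCodingSketch` and its `encode` method are hypothetical helpers written for illustration only; they are not the operator's implementation.

```java
import java.util.ArrayList;
import java.util.List;

class DummyCodingSketch {
    /**
     * Encodes one categorical value as indicator columns, one column per
     * non-reference value of the domain (in domain order). The reference
     * value itself is encoded as all zeros.
     */
    static double[] encode(String value, List<String> domain, String reference) {
        if (!domain.contains(value) || !domain.contains(reference)) {
            throw new IllegalArgumentException("value and reference must be in the domain");
        }
        List<String> nonReference = new ArrayList<>(domain);
        nonReference.remove(reference);
        double[] indicators = new double[nonReference.size()];
        int index = nonReference.indexOf(value);
        if (index >= 0) {
            indicators[index] = 1.0;   // stays all zeros for the reference value
        }
        return indicators;
    }
}
```

Because the reference value gets no column of its own, the indicator columns do not sum to the intercept column, which keeps the design matrix free of that particular linear dependence.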

    The output is an estimate of coefficients for the model:

    Y = a + (b_1*x_1 + ... + b_n*x_n) + (0*w_{1,ref} + c_{1,1}*w_{1,1} + ... + c_{1,k_1}*w_{1,k_1} + ... + 0*w_{m,ref} + c_{m,1}*w_{m,1} + ... + c_{m,k_m}*w_{m,k_m})

    where

    • a is the constant term (the intercept)
    • n is the number of numeric input variables
    • b_i, 0 < i ≤ n, is the coefficient for the numeric input variable x_i
    • m is the number of categorical input variables
    • w_{i,ref}, 0 < i ≤ m, is the reference value of the categorical variable w_i
    • k_i, 0 < i ≤ m, is the domain size of the categorical variable w_i
    • c_{i,j}, 0 < i ≤ m, 0 < j ≤ k_i, is the coefficient for the jth non-reference value w_{i,j} of the ith categorical input variable w_i
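The model equation above can be evaluated for a single record as sketched below. This is a hedged illustration: the class `RegressionModelSketch`, the method `predict`, and the map-based layout of the categorical coefficients are assumptions made for the example and do not reflect how the PMML model is stored or scored.

```java
import java.util.Map;

class RegressionModelSketch {
    /**
     * Computes Y = intercept + sum(b_i * x_i) + sum of the coefficient of the
     * observed value of each categorical variable. A reference value has no
     * entry in its coefficient map and therefore contributes 0.
     */
    static double predict(double intercept,
                          double[] numericCoefficients,
                          double[] numericValues,
                          Map<String, Map<String, Double>> categoricalCoefficients,
                          Map<String, String> categoricalValues) {
        double y = intercept;
        for (int i = 0; i < numericCoefficients.length; i++) {
            y += numericCoefficients[i] * numericValues[i];
        }
        for (Map.Entry<String, String> entry : categoricalValues.entrySet()) {
            Map<String, Double> byValue = categoricalCoefficients.get(entry.getKey());
            if (byValue == null) {
                throw new IllegalArgumentException("unknown variable: " + entry.getKey());
            }
            // Missing key means the observed value is the reference value.
            y += byValue.getOrDefault(entry.getValue(), 0.0);
        }
        return y;
    }
}
```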

    The following assumptions are made about the nature of input data:

    • Independent variables must be linearly independent from each other.
    • Dependent variable must be numerical (i.e. continuous and not discrete).
    • All variables loosely follow the normal distribution.
    • Constructor Detail

      • LinearRegressionLearner

        public LinearRegressionLearner​(String dependentVariable,
                                       String... independentVariables)
        Constructor specifying the dependent variable and independent variables.
        Parameters:
        dependentVariable - name of the dependent variable field
        independentVariables - names of the independent variable fields
    • Method Detail

      • getDependentVariable

        public String getDependentVariable()
        Get the field name of the dependent variable.
        Returns:
        dependent variable field name
      • setDependentVariable

        public void setDependentVariable​(String dependentVariable)
        Set the field name of the dependent variable.
        Parameters:
        dependentVariable - dependent variable field name
      • getIndependentVariables

        public String[] getIndependentVariables()
        Get the field names of the independent variables.
        Returns:
        independent variable field names
      • setIndependentVariables

        public void setIndependentVariables​(String... independentVariables)
        Set the field names of the independent variables.
        Parameters:
        independentVariables - independent variable field names
      • setReferenceValues

        public void setReferenceValues​(Map<String,​String> referenceValues)
        Set reference values for the independent categorical variables. If no reference value is provided for a certain variable, one randomly chosen value from its domain will be picked as reference.
        Parameters:
        referenceValues - mapping from independent categorical variable names to their reference values
      • getReferenceValues

        public Map<String,​String> getReferenceValues()
        Get the reference values for the independent categorical variables as they were set using the corresponding setter method.
        Returns:
        mapping from independent categorical variable names to their reference values
      • setSingularityThreshold

        public void setSingularityThreshold​(Double singularityThresholdValue)
        Set the singularity threshold, the value used to decide whether a matrix is considered singular.
        Parameters:
        singularityThresholdValue - Default bound to determine effective singularity in LU decomposition
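The following sketch shows one common way such a threshold is applied: during the elimination phase of an LU decomposition with partial pivoting, the matrix is treated as effectively singular as soon as the largest available pivot has an absolute value below the threshold. The class `SingularityCheckSketch` and the method `isSingular` are hypothetical, and whether the operator applies its threshold in exactly this way is an assumption.

```java
class SingularityCheckSketch {
    /**
     * Runs the elimination phase of LU decomposition with partial pivoting on
     * a copy of the square matrix and reports whether any pivot falls below
     * the given threshold in absolute value.
     */
    static boolean isSingular(double[][] matrix, double threshold) {
        int n = matrix.length;
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) {
            a[i] = matrix[i].clone();          // do not modify the caller's data
        }
        for (int col = 0; col < n; col++) {
            // Pick the row with the largest entry in this column as the pivot.
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(a[pivot][col]) < threshold) {
                return true;                   // effectively singular
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }
        return false;
    }
}
```

A larger threshold rejects more nearly dependent inputs; a smaller one accepts them at the cost of numerically unstable coefficients.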
      • getSingularityThreshold

        public Double getSingularityThreshold()
        Get the singularity threshold value.
        Returns:
        the singularity threshold value
      • getInput

        public RecordPort getInput()
        Get the input port of this operator.
        Returns:
        input port
      • getOutput

        public PMMLPort getOutput()
        Get the output port of this operator. The port provides the linear regression PMML model generated for the input data.
        Returns:
        output PMML port
      • computeMetadata

        protected void computeMetadata​(IterativeMetadataContext context)
        Description copied from class: IterativeOperator
        Implementations must adhere to the following contracts:

        General

        Regardless of input ports/output port types, all implementations must do the following:

        1. Validation. Validation of configuration should always be performed first.
        2. Declare operator parallelizability. Implementations must declare by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
        3. Declare output port parallelizability. Implementations must declare by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
        4. Declare input port parallelizability. Implementations must declare by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
        NOTE: There is a convenience method for performing steps 2-4 for the case where all record ports are parallelizable and where we are to determine parallelism based on source:
        • MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords

        Input record ports

        Implementations with input record ports must declare the following:
        1. Required data ordering: Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering, otherwise iteration will proceed on an input dataset whose order is undefined.
        2. Required data distribution (only applies to parallelizable input ports): Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution, otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
        Note that if the upstream operator's output distribution/ordering is compatible with those required, we avoid a re-sort/re-distribution which is generally a very large savings from a performance standpoint.

        Output record ports (static metadata)

        Implementations with output record ports must declare the following:
        1. Type: Implementations must declare their output type by calling RecordPort#setType.
        Implementations with output record ports may declare the following:
        1. Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering
        2. Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution
        Note that both of these properties are optional; if unspecified, performance may suffer since the framework may unnecessarily re-sort/re-distribute the data.

        Input model ports

        In general, iterative operators will tend not to have model input ports, but if so, there is nothing special to declare for input model ports. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.

        Output model ports (static metadata)

        SimpleModelPorts have no associated metadata and therefore there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
        1. pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.

        Output ports with dynamic metadata

        If an output port has dynamic metadata, implementations can declare this by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). When metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc. are not allowed, and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).

        Specified by:
        computeMetadata in class IterativeOperator
        Parameters:
        context - the context