Class RunRScript

  • All Implemented Interfaces:
    LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator

    public class RunRScript
    extends CompositeOperator
    implements RecordPipelineOperator
    Executes an R script as part of a flow. The input data is presented to the R script as a data frame named "R", and the data frame of the same name ("R") is written to the output. Data is imported into R and exported from R in CSV format. Data typing relies on R's type auto-discovery when reading CSV data.

    This operator is parallelized by default: multiple R instances are initialized, and each instance executes the given script against a different segment of the input data. Disable parallelism for the operator if the R functions used in the script introduce data dependencies across segments.
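
    The caution about data dependencies can be illustrated without R at all. The following is a minimal plain-Java sketch, not part of this API; the class name and the round-robin split are assumptions made only for illustration. It shows that an aggregate computed per segment differs from the same aggregate over the whole input.

```java
class PartitionedMeanSketch {
    // Mean of each partition when values are dealt round-robin to
    // partitionCount partitions (an assumed distribution scheme).
    // An empty partition yields NaN.
    static double[] partitionMeans(double[] values, int partitionCount) {
        double[] sums = new double[partitionCount];
        int[] counts = new int[partitionCount];
        for (int i = 0; i < values.length; i++) {
            int p = i % partitionCount;
            sums[p] += values[i];
            counts[p]++;
        }
        double[] means = new double[partitionCount];
        for (int p = 0; p < partitionCount; p++) {
            means[p] = sums[p] / counts[p];
        }
        return means;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6};
        double[] means = partitionMeans(data, 2);
        for (int p = 0; p < means.length; p++) {
            System.out.println("partition " + p + " mean = " + means[p]);
        }
    }
}
```

    With two partitions the segments hold {1, 3, 5} and {2, 4, 6}, so the per-partition means are 3 and 4, while the mean of the whole input is 3.5. A script that calls a whole-column function such as mean() would see the same effect in each R instance.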

    Two variables are initially set within the R environment prior to execution within each operator instance:

    • partitionID: zero-based identifier of the instance partition
    • partitionCount: the total number of data partitions created to execute the current graph
    A script can use these variables to identify which partition the current instance is processing and how many partitions exist overall, and adapt its behavior to the execution environment accordingly.
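
    As a hedged illustration, the sketch below assembles an R snippet in Java that reads both variables. The class name and the source_partition column are hypothetical; only the data frame name "R" and the two variable names come from the description above. If a column is added this way, the configured output type would presumably need to declare it.

```java
class PartitionAwareScriptSketch {
    // Builds the text of an R script. "source_partition" is a
    // hypothetical column name chosen for this example.
    static String buildScript() {
        return String.join("\n",
            "# R is the input data frame supplied by the operator",
            "R$source_partition <- rep(partitionID, nrow(R))",
            "if (partitionID == 0) {",
            "  message(\"running on \", partitionCount, \" partitions\")",
            "}");
    }

    public static void main(String[] args) {
        System.out.println(buildScript());
    }
}
```

    The resulting string is the kind of value that would be passed to setScriptSnippet(String).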

    Use the setPathToRScript(String) method to set the path to the Rscript executable. Rscript is used instead of the R executable because it is designed to run a single R script specified on the command line.

    • Constructor Detail

      • RunRScript

        public RunRScript()
    • Method Detail

      • getOutputType

        public RecordTokenType getOutputType()
        Get the configured output type.
        Returns:
        output record token type
      • setOutputType

        public void setOutputType(RecordTokenType outputType)
        Configure the output token type. This is a required property. The type cannot be discovered since the output is determined by the code of the given script.
        Parameters:
        outputType - output record token type
      • getPathToRScript

        public String getPathToRScript()
        Get the path to the Rscript binary within the local R installation.
        Returns:
        path to the Rscript executable
      • setPathToRScript

        public void setPathToRScript(String pathToRScript)
        Set the path to the Rscript binary within the local R installation. This executable is normally found in the bin directory of the R installation. If this operator is executed on a cluster, R must be installed on each machine within the cluster in a consistent location.
        Parameters:
        pathToRScript - path to the Rscript executable
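
        Because a wrong path would only surface when the flow runs, it can help to check the value before configuring the operator. The following is a minimal sketch using only the Java standard library; the class name, the method name, and the default location /usr/bin/Rscript are assumptions, and the check covers the local machine only.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class RscriptPathCheck {
    // True when the path names an existing regular file that the
    // current user may execute. This does not verify that the file
    // really is Rscript.
    static boolean looksUsable(String pathToRScript) {
        if (pathToRScript == null || pathToRScript.isEmpty()) {
            return false;
        }
        Path p = Paths.get(pathToRScript);
        return Files.isRegularFile(p) && Files.isExecutable(p);
    }

    public static void main(String[] args) {
        // Assumed default; pass the real location as the first argument.
        String candidate = args.length > 0 ? args[0] : "/usr/bin/Rscript";
        System.out.println(candidate + " usable: " + looksUsable(candidate));
    }
}
```

        On a cluster the same check would have to succeed on every machine, since R must be installed in a consistent location on each one.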
      • getScriptSnippet

        public String getScriptSnippet()
        Get the R script to execute.
        Returns:
        R script
      • setScriptSnippet

        public void setScriptSnippet(String scriptSnippet)
        Set the R script to execute.
        Parameters:
        scriptSnippet - the R script to execute
      • setRequiredDataDistribution

        public void setRequiredDataDistribution(DataDistribution dataDistribution)
        Set the data distribution requirements for the input port of this operator. Because the operator cannot determine what a script will do, this property lets the user force the distribution the input port requires.
        Parameters:
        dataDistribution - input data distribution
      • getRequiredDataDistribution

        public DataDistribution getRequiredDataDistribution()
        Get the value set for the required data distribution of the input port of this operator.
        Returns:
        required data distribution property (may be null)
      • setCharset

        public void setCharset(String charsetName)
        Set the character set used to encode the input and output data exchanged with R. The default is UTF-8.
        Parameters:
        charsetName - character set name for encoding
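
        The source does not say how the operator reacts to an unknown character set name, so validating the name up front is a reasonable precaution. The following is a minimal sketch with the standard java.nio.charset API; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetNameCheck {
    // True when the JVM recognizes the name and can encode with it.
    static boolean isUsable(String charsetName) {
        if (charsetName == null) {
            return false;
        }
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 usable: " + isUsable("UTF-8"));
    }
}
```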
      • getCharset

        public String getCharset()
        Get the character set used for input and output data.
        Returns:
        character set
      • compose

        protected void compose(CompositionContext ctx)
        Description copied from class: CompositeOperator
        Compose the body of this operator. Implementations should do the following:
        1. Perform any validation of configuration, input types, etc.
        2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
        3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
        Specified by:
        compose in class CompositeOperator
        Parameters:
        ctx - the context
      • validateFieldName

        public static String validateFieldName(int position,
                                               String fldName)
        Ensures that an input/output field name conforms to R variable identifier requirements. If the field name is not valid, it is standardized to produce a valid identifier. This is provided as a convenience for proper identifier formatting within scripts. Input and output schemas set on this operator undergo the same name validation, so this method can be used to keep identifier references in a script consistent with the names the operator produces.
        Parameters:
        position - the ordinal position of the field in the source/target schema - may be needed to disambiguate field names after character substitution takes place.
        fldName - the name of the field being validated
        Returns:
        an R compliant variable identifier
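
        The exact standardization rules are not listed here, so the following is only a hypothetical sketch of how character substitution combined with the ordinal position could produce unique identifiers. The class, the replacement character, the "X" prefix, and the position suffix are all assumptions and may differ from what validateFieldName actually returns. R reserved words are not handled.

```java
class FieldNameSanitizerSketch {
    // Hypothetical rules: replace characters outside [A-Za-z0-9._] with '_',
    // prefix "X" when the result does not start with a letter (conservative:
    // this also prefixes names that start with '.'), and append the position
    // whenever the name had to be changed.
    static String sanitize(int position, String fldName) {
        String cleaned = fldName.replaceAll("[^A-Za-z0-9._]", "_");
        boolean changed = !cleaned.equals(fldName);
        if (!cleaned.matches("[A-Za-z].*")) {
            cleaned = "X" + cleaned;
            changed = true;
        }
        return changed ? cleaned + "_" + position : cleaned;
    }

    public static void main(String[] args) {
        String[] fields = {"total", "unit price", "unit-price", "2020"};
        for (int i = 0; i < fields.length; i++) {
            System.out.println(fields[i] + " -> " + sanitize(i, fields[i]));
        }
    }
}
```

        In this sketch "unit price" and "unit-price" both collapse to unit_price after substitution, and the position suffix is what keeps them distinct.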
      • getIdentifierRegex

        public static String getIdentifierRegex()
        Returns a regular expression string that matches a valid R identifier. The returned string is used internally by the validateFieldName function.
        Returns:
        the Regular Expression string used to identify an R compliant identifier
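
        The pattern returned by getIdentifierRegex() is not reproduced on this page. As a rough stand-in, the sketch below uses an assumed pattern for ASCII-only R names: a letter, or a dot not followed by a digit, then letters, digits, dots, or underscores. It does not exclude R reserved words, and the class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

class RIdentifierRegexSketch {
    // Assumed approximation of an R syntactic name (ASCII only).
    static final Pattern R_NAME =
        Pattern.compile("(?:[A-Za-z]|\\.(?![0-9]))[A-Za-z0-9._]*");

    // True when the whole string matches the assumed pattern.
    static boolean isValid(String name) {
        return name != null && R_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"my.var", "x_1", "_x", ".2x", "unit price"}) {
            System.out.println(name + " valid: " + isValid(name));
        }
    }
}
```

        In real use, the string returned by getIdentifierRegex() would be compiled in place of the assumed pattern.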