Class RunRScript

All Implemented Interfaces:
LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator

public class RunRScript extends CompositeOperator implements RecordPipelineOperator
Executes an R script in the flow. The input data is presented to the R script as a data frame named "R". The output is read from the data frame of the same name ("R"). Data is imported into R and exported from R in CSV format, and data typing relies on R's automatic type discovery when it reads CSV data.
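
As a rough illustration of the data frame convention described above, the following sketch holds a small R snippet in a Java string, in the form it might be passed to setScriptSnippet(String). The class name SampleRScript and the column names price and quantity are hypothetical; they are not part of this API.

```java
/** Holds a sample R snippet as a Java string (illustrative only). */
final class SampleRScript {
    // The operator presents the input as the data frame "R" and reads the
    // output from "R". The columns "price" and "quantity" are hypothetical.
    static final String SNIPPET = String.join("\n",
        "# Derive a new column from two hypothetical input columns",
        "R$total <- R$price * R$quantity",
        "# Keep only rows with a positive total; the result stays in R",
        "R <- subset(R, total > 0)");

    public static void main(String[] args) {
        // Print the snippet so it can be inspected or tested in an R session.
        System.out.println(SNIPPET);
    }
}
```

Because the snippet adds a column, the output type configured with setOutputType(RecordTokenType) would need to describe the resulting data frame, including the new total field.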

This operator is parallelized by default. This means that multiple R instances are initialized and the given script is executed within each instance on a different segment of the input data. Disable parallelism for the operator if the R functions used in the script introduce data dependencies.

Two variables are initially set within the R environment prior to execution within each operator instance:

  • partitionID: zero-based identifier of the instance partition
  • partitionCount: the total number of data partitions created to execute the current graph
A script can use these variables to identify its own partition and the overall partition count, and to adapt its behavior to the execution environment.
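
One possible use of the two variables is sketched below: each instance seeds R's random number generator differently and records which partition produced each row. The class name PartitionAwareScript and the added column names are hypothetical, and the seeding scheme is only an assumption about what a script author might want.

```java
/** Holds a partition-aware R snippet as a Java string (illustrative only). */
final class PartitionAwareScript {
    // partitionID and partitionCount are set in the R environment by the operator.
    static final String SNIPPET = String.join("\n",
        "# Use a different random seed in each partition (hypothetical scheme)",
        "set.seed(1000 + partitionID)",
        "R$sample <- runif(nrow(R))",
        "# Record where each row was processed",
        "R$partition <- partitionID",
        "R$partitions <- partitionCount");

    public static void main(String[] args) {
        // Print the snippet so it can be inspected or tested in an R session.
        System.out.println(SNIPPET);
    }
}
```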

Use the setPathToRScript(String) method to set the path to the Rscript executable. Rscript is used instead of the R executable because it is meant to run a single R script specified on the command line.

  • Constructor Details

    • RunRScript

      public RunRScript()
  • Method Details

    • getInput

      public RecordPort getInput()
      Description copied from interface: PipelineOperator
      Returns the input port
      Specified by:
      getInput in interface PipelineOperator<RecordPort>
      Returns:
      the input port
    • getOutput

      public RecordPort getOutput()
      Description copied from interface: PipelineOperator
      Returns the output port
      Specified by:
      getOutput in interface PipelineOperator<RecordPort>
      Returns:
      the output port
    • getOutputType

      public RecordTokenType getOutputType()
      Get the configured output type.
      Returns:
      output record token type
    • setOutputType

      public void setOutputType(RecordTokenType outputType)
      Configure the output token type. This is a required property. The type cannot be discovered since the output is determined by the code of the given script.
      Parameters:
      outputType - output record token type
    • getPathToRScript

      public String getPathToRScript()
      Get the path to the Rscript binary within the local R installation.
      Returns:
      path to Rscript executable
    • setPathToRScript

      public void setPathToRScript(String pathToRScript)
      Set the path to the Rscript binary within the local R installation. This executable is normally found in the bin directory of the R installation. If this operator is executed on a cluster, R must be installed on each machine within the cluster in a consistent location.
      Parameters:
      pathToRScript - path to Rscript executable
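
Since the path must point at a real executable, it can help to check a candidate path before passing it to setPathToRScript(String). The sketch below uses only the JDK; the class name RscriptPathCheck and the default path /usr/bin/Rscript are assumptions, and whether the operator performs a similar check itself is not stated here.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Pre-flight check for a candidate Rscript path (illustrative only). */
final class RscriptPathCheck {
    /** Returns true if the path names an existing, executable regular file. */
    static boolean isUsableExecutable(String pathToRScript) {
        if (pathToRScript == null || pathToRScript.isEmpty()) {
            return false;
        }
        try {
            Path path = Paths.get(pathToRScript);
            return Files.isRegularFile(path) && Files.isExecutable(path);
        } catch (InvalidPathException e) {
            // The string cannot be converted to a file system path at all.
            return false;
        }
    }

    public static void main(String[] args) {
        // "/usr/bin/Rscript" is only a common location, not a guaranteed one.
        String candidate = args.length > 0 ? args[0] : "/usr/bin/Rscript";
        System.out.println(candidate + " usable: " + isUsableExecutable(candidate));
    }
}
```

This only checks the machine it runs on. On a cluster, each node needs R installed at the same location, so a local check does not prove the path is valid everywhere.
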
    • getScriptSnippet

      public String getScriptSnippet()
      Get the R script to execute.
      Returns:
      R script
    • setScriptSnippet

      public void setScriptSnippet(String scriptSnippet)
      Set the R script to execute.
      Parameters:
      scriptSnippet - the R script to execute
    • setRequiredDataDistribution

      public void setRequiredDataDistribution(DataDistribution dataDistribution)
      Set the data distribution requirements for the input port of this operator. Since this scripting operator has no way to determine what a script may do, specifying the required data distribution allows the operator user to force the distribution requirements of the input port.
      Parameters:
      dataDistribution - input data distribution
    • getRequiredDataDistribution

      public DataDistribution getRequiredDataDistribution()
      Get the value set for the required data distribution of the input port of this operator.
      Returns:
      required data distribution property (may be null)
    • setCharset

      public void setCharset(String charsetName)
      Set the character set used to encode input and output data for R. The default is UTF-8.
      Parameters:
      charsetName - character set name for encoding
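
A character set name can be checked against the local JVM before it is passed to setCharset(String). The following is a minimal sketch using the standard java.nio.charset API; the class name CharsetCheck is hypothetical, and how the operator itself reacts to an unknown name is not specified here.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

/** Checks whether a charset name is usable on this JVM (illustrative only). */
final class CharsetCheck {
    /** Returns true if the name is legal and supported by the running JVM. */
    static boolean isSupportedCharset(String charsetName) {
        if (charsetName == null) {
            return false; // Charset.isSupported rejects null with an exception
        }
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "UTF-8";
        System.out.println(name + " supported: " + isSupportedCharset(name));
    }
}
```
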
    • getCharset

      public String getCharset()
      Get character set value for input and output data.
      Returns:
      character set
    • compose

      protected void compose(CompositionContext ctx)
      Description copied from class: CompositeOperator
      Compose the body of this operator. Implementations should do the following:
      1. Perform any validation of configuration, input types, etc.
      2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
      3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
      Specified by:
      compose in class CompositeOperator
      Parameters:
      ctx - the context
    • validateFieldName

      public static String validateFieldName(int position, String fldName)
      Ensures that an input/output field name conforms to R variable identifier requirements. If the field name is not valid, it is standardized to produce a valid identifier. This method is provided as a convenience for formatting identifiers correctly within scripts. Input and output schemas set on this operator undergo name validation, so this method can be used to keep identifier references consistent.
      Parameters:
      position - the ordinal position of the field in the source/target schema - may be needed to disambiguate field names after character substitution takes place.
      fldName - the name of the field being validated
      Returns:
      an R-compliant variable identifier
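
The exact standardization rules applied by validateFieldName are not listed here. To show the general idea only, the sketch below implements one hypothetical scheme: it tests a name against an approximate ASCII pattern for R syntactic names, replaces disallowed characters with a dot, prefixes an X when the result still cannot start an identifier, and appends the field position to avoid collisions. The class name, the pattern, and every rule in it are assumptions and should not be read as this operator's actual behavior.

```java
import java.util.regex.Pattern;

/** Hypothetical R identifier sanitizer; NOT the operator's actual algorithm. */
final class RIdentifierSketch {
    // Approximation of an R syntactic name (ASCII only, reserved words ignored):
    // starts with a letter, or with a dot not followed by a digit, then letters,
    // digits, dots, or underscores.
    private static final Pattern VALID =
        Pattern.compile("^(?:[A-Za-z]|\\.(?![0-9]))[A-Za-z0-9._]*$");

    static boolean isValid(String name) {
        return name != null && VALID.matcher(name).matches();
    }

    /** Returns the name unchanged if valid, otherwise a standardized form. */
    static String sanitize(int position, String fldName) {
        String name = fldName == null ? "" : fldName;
        if (isValid(name)) {
            return name;
        }
        // Replace every disallowed character with a dot.
        String cleaned = name.replaceAll("[^A-Za-z0-9._]", ".");
        // Prefix when the cleaned name still cannot start an identifier.
        if (!isValid(cleaned)) {
            cleaned = "X" + cleaned;
        }
        // Append the position so names that collapse to the same text stay distinct.
        return cleaned + "_" + position;
    }

    public static void main(String[] args) {
        String[] fields = {"amount", "my field", "1st"};
        for (int i = 0; i < fields.length; i++) {
            System.out.println(fields[i] + " -> " + sanitize(i, fields[i]));
        }
    }
}
```

In real use, calling validateFieldName itself is the safer choice, since scripts must reference the identifiers that the operator actually generates.
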
    • getIdentifierRegex

      public static String getIdentifierRegex()
      Returns a regular expression string that matches a valid R identifier. The returned string is used internally by the validateFieldName function.
      Returns:
      the regular expression string used to identify an R-compliant identifier