- java.lang.Object
- com.pervasive.datarush.operators.AbstractLogicalOperator
- com.pervasive.datarush.operators.CompositeOperator
- com.pervasive.datarush.analytics.r.RunRScript
All Implemented Interfaces: LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator
public class RunRScript extends CompositeOperator implements RecordPipelineOperator
Executes an R script in flow. The input data is presented to the R script as a data frame named "R". The operator reads its output from the data frame with the same name ("R") once the script has run. Data is imported into R and exported from R in CSV format, and the operator relies on R's auto-discovery of types when reading CSV data to handle data typing.
This operator is parallelized by default: multiple R instances are initialized, and each instance executes the given script on a different segment of the input data. Disable parallelism for the operator if the R functions used in the script introduce data dependencies across segments.
Two variables are initially set within the R environment prior to execution within each operator instance:
- partitionID: zero-based identifier of the instance partition
- partitionCount: the total number of data partitions created to execute the current graph
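As an illustration of how a script might use these two variables, the following minimal sketch builds an R snippet as a Java string. The class name, method name, and the column names `partId` and `partCount` are hypothetical and chosen only for this example. The resulting string is the kind of value one would pass to `setScriptSnippet(String)`; the output type given to `setOutputType(RecordTokenType)` would then need to declare the two added fields.

```java
class PartitionScriptSketch {

    /**
     * Builds an R snippet that tags every row of the data frame "R"
     * with the partition that processed it. Column names are hypothetical.
     */
    static String buildScript() {
        return String.join("\n",
            // "R" is the input data frame; partitionID and partitionCount
            // are the variables the operator sets before the script runs.
            "R$partId <- rep(partitionID, nrow(R))",
            "R$partCount <- rep(partitionCount, nrow(R))");
    }

    public static void main(String[] args) {
        System.out.println(buildScript());
    }
}
```

Using `rep(..., nrow(R))` rather than a bare scalar assignment keeps the snippet valid when a partition receives no rows.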
Use the setPathToRScript(String) method to set the path to the Rscript executable. This executable is used instead of the R executable because it is meant to run a single R script specified on the command line.
Constructor Summary
- RunRScript()
Method Summary
- protected void compose(CompositionContext ctx): Compose the body of this operator.
- String getCharset(): Get character set value for input and output data.
- static String getIdentifierRegex(): Function to return a Regular Expression string that can be used to identify a valid R identifier.
- RecordPort getInput(): Returns the input port.
- RecordPort getOutput(): Returns the output port.
- RecordTokenType getOutputType(): Get the configured output type.
- String getPathToRScript(): Get the path to the Rscript binary within the local R installation.
- DataDistribution getRequiredDataDistribution(): Get the value set for the required data distribution of the input port of this operator.
- String getScriptSnippet(): Get the R script to execute.
- void setCharset(String charsetName): Set character set value which is used to format input and output data for R.
- void setOutputType(RecordTokenType outputType): Configure the output token type.
- void setPathToRScript(String pathToRScript): Set the path to the Rscript binary within the local R installation.
- void setRequiredDataDistribution(DataDistribution dataDistribution): Set the data distribution requirements for the input port of this operator.
- void setScriptSnippet(String scriptSnippet): Set the R script to execute.
- static String validateFieldName(int position, String fldName): Function to ensure an input/output field name conforms to R variable identifier requirements.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts
Method Detail
getInput
public RecordPort getInput()
Description copied from interface: PipelineOperator
Returns the input port.
- Specified by: getInput in interface PipelineOperator<RecordPort>
- Returns: the input port
getOutput
public RecordPort getOutput()
Description copied from interface: PipelineOperator
Returns the output port.
- Specified by: getOutput in interface PipelineOperator<RecordPort>
- Returns: the output port
getOutputType
public RecordTokenType getOutputType()
Get the configured output type.
- Returns: output record token type
setOutputType
public void setOutputType(RecordTokenType outputType)
Configure the output token type. This is a required property. The type cannot be discovered since the output is determined by the code of the given script.
- Parameters: outputType - output record token type
getPathToRScript
public String getPathToRScript()
Get the path to the Rscript binary within the local R installation.
- Returns: path to the Rscript executable
setPathToRScript
public void setPathToRScript(String pathToRScript)
Set the path to the Rscript binary within the local R installation. This executable is normally found in the bin directory of the R installation. If this operator is executed on a cluster, R must be installed on each machine within the cluster in a consistent location.
- Parameters: pathToRScript - path to the Rscript executable
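Because the path must resolve on every machine that runs the operator, it can help to check a candidate path before configuring the operator with it. The following is a minimal sketch using only the standard `java.nio.file` API; the class name, the helper `looksRunnable`, and the default path `/usr/bin/Rscript` are assumptions for illustration and are not part of this operator's API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class RscriptPathSketch {

    /** Returns true if the given path points at an existing, executable regular file. */
    static boolean looksRunnable(String pathToRScript) {
        if (pathToRScript == null || pathToRScript.isEmpty()) {
            return false;
        }
        Path p = Paths.get(pathToRScript);
        return Files.isRegularFile(p) && Files.isExecutable(p);
    }

    public static void main(String[] args) {
        // Hypothetical default location; adjust to the local R installation.
        String candidate = args.length > 0 ? args[0] : "/usr/bin/Rscript";
        System.out.println(candidate + " runnable: " + looksRunnable(candidate));
    }
}
```

This check only covers the local machine; on a cluster the same path would still need to exist on each node.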
getScriptSnippet
public String getScriptSnippet()
Get the R script to execute.
- Returns: R script
setScriptSnippet
public void setScriptSnippet(String scriptSnippet)
Set the R script to execute.
- Parameters: scriptSnippet - the R script to execute
setRequiredDataDistribution
public void setRequiredDataDistribution(DataDistribution dataDistribution)
Set the data distribution requirements for the input port of this operator. Since this scripting operator has no way to determine what a script may do, specifying the required data distribution allows the operator user to force the distribution requirements of the input port.
- Parameters: dataDistribution - input data distribution
getRequiredDataDistribution
public DataDistribution getRequiredDataDistribution()
Get the value set for the required data distribution of the input port of this operator.
- Returns: required data distribution property (may be null)
setCharset
public void setCharset(String charsetName)
Set the character set used to format input and output data for R. The default value is UTF-8.
- Parameters: charsetName - character set name for encoding
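Since the character set is supplied by name, a caller may want to normalize and verify the name before passing it to `setCharset(String)`. The sketch below uses the standard `java.nio.charset.Charset` API; the class and method names are hypothetical, and treating `null` as UTF-8 simply mirrors the default stated above rather than any documented behavior of the operator.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetNameSketch {

    /**
     * Returns the canonical name for the requested charset, or "UTF-8" when
     * no name is given. Unknown or malformed names raise an
     * IllegalArgumentException (thrown by Charset.forName).
     */
    static String canonicalCharsetName(String requested) {
        if (requested == null) {
            return StandardCharsets.UTF_8.name();
        }
        return Charset.forName(requested).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalCharsetName("utf8"));
        System.out.println(canonicalCharsetName(null));
    }
}
```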
getCharset
public String getCharset()
Get character set value for input and output data.
- Returns: character set
compose
protected void compose(CompositionContext ctx)
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
- Specified by: compose in class CompositeOperator
- Parameters: ctx - the context
validateFieldName
public static String validateFieldName(int position, String fldName)
Function to ensure an input/output field name conforms to R variable identifier requirements. If the field name is not valid, it is standardized to produce a valid identifier. This is provided as a convenience for proper identifier formatting within scripts. Input and output schemas set on this operator undergo name validation, so this mechanism can be used to ensure identifier referencing is consistent.
- Parameters:
position - the ordinal position of the field in the source/target schema; may be needed to disambiguate field names after character substitution takes place
fldName - the name of the field being validated
- Returns: an R compliant variable identifier
getIdentifierRegex
public static String getIdentifierRegex()
Function to return a Regular Expression string that can be used to identify a valid R identifier. The string returned here is used internally for the validateFieldName function.
- Returns: the Regular Expression string used to identify an R compliant identifier
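To give a sense of what such a pattern checks, here is a minimal sketch of an identifier test written against R's general rule for syntactic names: letters, digits, periods, and underscores, starting with a letter or with a period that is not followed by a digit. The pattern below is an assumption for illustration and is not the string returned by `getIdentifierRegex()`; it is ASCII-only and does not exclude R reserved words.

```java
import java.util.regex.Pattern;

class RIdentifierSketch {

    // Hypothetical pattern: a letter, or a period not followed by a digit,
    // then any run of letters, digits, periods, and underscores.
    private static final Pattern R_NAME =
        Pattern.compile("(?:[A-Za-z]|\\.(?![0-9]))[A-Za-z0-9._]*");

    /** Returns true if the whole string matches the sketch pattern. */
    static boolean isValidRIdentifier(String name) {
        return name != null && R_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String n : new String[] {"total.sales", ".hidden", "2fast", ".5x", "_tmp"}) {
            System.out.println(n + " -> " + isValidRIdentifier(n));
        }
    }
}
```

In practice, the pattern from `getIdentifierRegex()` or the result of `validateFieldName(int, String)` should be preferred, since those reflect the rules the operator actually applies to schema field names.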