java.lang.Object
com.pervasive.datarush.operators.AbstractLogicalOperator
com.pervasive.datarush.operators.CompositeOperator
com.pervasive.datarush.analytics.r.RunRScript
- All Implemented Interfaces:
LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator
Executes an R script within a dataflow. The input data is presented to the R script as the data frame named "R".
The data frame with the same name ("R") is read back to produce the output. CSV format is used to
import data into R and export data from R. Data typing relies on R's automatic type discovery when
reading CSV data.
This operator is parallelized by default. This implies that multiple R instances are initialized and the given script is executed within each instance with a different segment of the input data. Disable parallelism for the operator if any data dependencies exist due to R functions used in the script.
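To illustrate the kind of data dependency that calls for disabling parallelism, the following plain-Java sketch compares an aggregate computed over the whole input with the same aggregate computed per partition. It does not use the operator at all: the class name, the round-robin split, and the sample values are assumptions made for illustration, and the real partitioning scheme may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionedMeanSketch {
    // Mean of a list of values; stands in for an R aggregate such as mean().
    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Round-robin split into partitionCount segments (assumed scheme, illustration only).
    static List<List<Double>> split(List<Double> values, int partitionCount) {
        List<List<Double>> parts = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < values.size(); i++) {
            parts.get(i % partitionCount).add(values.get(i));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, 2.0, 3.0, 10.0);
        System.out.println("global mean: " + mean(data));
        // Each partition sees only its own segment, so each computes a different mean.
        for (List<Double> part : split(data, 2)) {
            System.out.println("partition mean: " + mean(part));
        }
    }
}
```

A script that computes a whole-data aggregate would see only one segment per R instance, which is the situation in which parallelism should be disabled.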
Two variables are initially set within the R environment prior to execution within each operator instance:
- partitionID: zero-based identifier of the instance partition
- partitionCount: the total number of data partitions created to execute the current graph
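As a minimal sketch of how a script might use these two variables, the helper below assembles an R snippet as a Java string. The class name and the column names `partition` and `partitions` are hypothetical; only the data frame name "R" and the variables `partitionID` and `partitionCount` come from the description above.

```java
class PartitionTagScriptSketch {
    // Builds an R snippet that tags every row with the partition that processed it.
    static String buildScript() {
        return String.join("\n",
            "# 'R' is the input data frame; the same name is read back as the output.",
            "R$partition <- rep(partitionID, nrow(R))",
            "R$partitions <- rep(partitionCount, nrow(R))");
    }

    public static void main(String[] args) {
        System.out.println(buildScript());
    }
}
```

A string like this would be passed to setScriptSnippet(String). Because the script adds two columns, the type given to setOutputType(RecordTokenType) would presumably need to declare them as well.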
Use the setPathToRScript(String) method to set the path to the Rscript executable. This executable
is used instead of the R executable as it is meant to run a single R script specified on the
command line.
-
Constructor Summary
Constructors
RunRScript()
-
Method Summary
Modifier and Type, Method, Description
- protected void compose(ctx): Compose the body of this operator.
- getCharset(): Get character set value for input and output data.
- static String getIdentifierRegex(): Function to return a Regular Expression string that can be used to identify a valid R identifier.
- getInput(): Returns the input port
- getOutput(): Returns the output port
- getOutputType(): Get the configured output type.
- getPathToRScript(): Get the path to the Rscript binary within the local R installation.
- getRequiredDataDistribution(): Get the value set for the required data distribution of the input port of this operator.
- getScriptSnippet(): Get the R script to execute.
- void setCharset(String charsetName): Set character set value which is used to format input and output data for R.
- void setOutputType(RecordTokenType outputType): Configure the output token type.
- void setPathToRScript(String pathToRScript): Set the path to the Rscript binary within the local R installation.
- void setRequiredDataDistribution(DataDistribution dataDistribution): Set the data distribution requirements for the input port of this operator.
- void setScriptSnippet(String scriptSnippet): Set the R script to execute.
- static String validateFieldName(int position, String fldName): Function to ensure an input/output field name conforms to R variable identifier requirements.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts
-
Constructor Details
-
RunRScript
public RunRScript()
-
-
Method Details
-
getInput
Description copied from interface: PipelineOperator
Returns the input port
- Specified by:
getInput in interface PipelineOperator<RecordPort>
- Returns:
- the input port
-
getOutput
Description copied from interface: PipelineOperator
Returns the output port
- Specified by:
getOutput in interface PipelineOperator<RecordPort>
- Returns:
- the output port
-
getOutputType
Get the configured output type.
- Returns:
- output record token type
-
setOutputType
Configure the output token type. This is a required property. The type cannot be discovered since the output is determined by the code of the given script.
- Parameters:
outputType - output record token type
-
getPathToRScript
Get the path to the Rscript binary within the local R installation.
- Returns:
- path to the Rscript executable
-
setPathToRScript
Set the path to the Rscript binary within the local R installation. This executable is normally found in the bin directory of the R installation. If this operator is executed on a cluster, R must be installed in the same location on every machine in the cluster.
- Parameters:
pathToRScript - path to the Rscript executable
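Since the path must be valid on every machine that runs the operator, it can help to derive and check it before configuring the operator. The following is a minimal sketch using only the Java standard library. The class and method names are hypothetical, and the installation directory `/opt/R` is a made-up example. On Windows the binary would typically carry an `.exe` suffix, which this sketch does not handle.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class RscriptPathSketch {
    // Derives the Rscript location from an R installation directory (bin/Rscript).
    static String rscriptPath(String rHome) {
        return Paths.get(rHome, "bin", "Rscript").toString();
    }

    // Returns true only if the path points at an existing, executable regular file.
    static boolean looksUsable(String path) {
        Path p = Paths.get(path);
        return Files.isRegularFile(p) && Files.isExecutable(p);
    }

    public static void main(String[] args) {
        String candidate = rscriptPath("/opt/R"); // hypothetical install location
        System.out.println(candidate + " usable: " + looksUsable(candidate));
    }
}
```

The resulting string is the kind of value that setPathToRScript(String) expects. The check only covers the local machine, so it does not replace verifying the installation on each cluster node.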
-
getScriptSnippet
Get the R script to execute.
- Returns:
- R script
-
setScriptSnippet
Set the R script to execute.
- Parameters:
scriptSnippet - the R script to execute
-
setRequiredDataDistribution
Set the data distribution requirements for the input port of this operator. Because this scripting operator cannot determine what a script will do, specifying the required data distribution lets the user force the distribution requirements of the input port.
- Parameters:
dataDistribution - input data distribution
-
getRequiredDataDistribution
Get the value set for the required data distribution of the input port of this operator.
- Returns:
- required data distribution property (may be null)
-
setCharset
Set the character set used to format input and output data for R. The default value is UTF-8.
- Parameters:
charsetName - character set name for encoding
-
getCharset
Get character set value for input and output data.
- Returns:
- character set
-
compose
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
- Specified by:
compose in class CompositeOperator
- Parameters:
ctx - the context
-
validateFieldName
Function to ensure an input/output field name conforms to R variable identifier requirements. If the field name is not valid, it is standardized to produce a valid identifier. This is provided as a convenience for proper identifier formatting within scripts. Input and output schemas set on this operator undergo name validation, so this method can be used to keep identifier references consistent.
- Parameters:
position - the ordinal position of the field in the source/target schema; may be needed to disambiguate field names after character substitution takes place
fldName - the name of the field being validated
- Returns:
- an R compliant variable identifier
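The exact standardization rules are not spelled out here, so the following is only a hypothetical sketch of the general idea, not the library's algorithm. It assumes an ASCII-only approximation of R's syntactic-name rules and ignores reserved words. It replaces disallowed characters with `_`, prepends `X` when the result still cannot start an identifier (the convention used by R's own make.names), and appends the position to disambiguate.

```java
import java.util.regex.Pattern;

class RFieldNameSketch {
    // ASCII-only approximation: starts with a letter, or a dot not followed by a digit;
    // continues with letters, digits, dots, or underscores. Reserved words are not checked.
    static final Pattern VALID =
        Pattern.compile("(?:[A-Za-z]|\\.(?![0-9]))[A-Za-z0-9._]*");

    // Hypothetical stand-in for validateFieldName(int, String); the rules are assumptions.
    static String sanitize(int position, String name) {
        if (VALID.matcher(name).matches()) {
            return name; // already a valid identifier, returned unchanged
        }
        String cleaned = name.replaceAll("[^A-Za-z0-9._]", "_");
        if (!VALID.matcher(cleaned).matches()) {
            cleaned = "X" + cleaned; // fix an invalid leading character
        }
        return cleaned + "_" + position; // position keeps substituted names distinct
    }

    public static void main(String[] args) {
        System.out.println(sanitize(0, "total"));
        System.out.println(sanitize(2, "unit price"));
        System.out.println(sanitize(5, "2nd"));
    }
}
```

Appending the position is one way to keep names such as `unit price` and `unit-price` distinct after substitution. The actual method may disambiguate differently.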
-
getIdentifierRegex
Function to return a regular expression string that can be used to identify a valid R identifier. The string returned here is used internally by the validateFieldName function.
- Returns:
- the Regular Expression string used to identify an R compliant identifier
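A regular expression of this kind can be used to screen a schema's field names before they are referenced in a script. The sketch below shows that usage with java.util.regex. The pattern in it is a stand-in written for illustration (ASCII-only, reserved words ignored) because the actual string returned by getIdentifierRegex() is not shown on this page. In real use that returned string would presumably be supplied in its place. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RIdentifierCheckSketch {
    // Stand-in pattern, NOT the value returned by getIdentifierRegex().
    static final String IDENTIFIER_REGEX =
        "(?:[A-Za-z]|\\.(?![0-9]))[A-Za-z0-9._]*";

    // Returns the field names that do not fully match the identifier pattern.
    static List<String> invalidNames(List<String> fieldNames) {
        Pattern pattern = Pattern.compile(IDENTIFIER_REGEX);
        List<String> invalid = new ArrayList<>();
        for (String name : fieldNames) {
            if (!pattern.matcher(name).matches()) {
                invalid.add(name);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "unit price", ".hidden", "2nd");
        System.out.println("invalid: " + invalidNames(names));
    }
}
```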
-