Class SampleRandomRows

  • All Implemented Interfaces:
    LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator

    public class SampleRandomRows
    extends AbstractRecordCompositeOperator
    Apply random sampling to the input data. The schema of the output data matches that of the input. The output data usually contains fewer rows than the input. The number of rows output varies depending on the value of the percent or the sampleSize property.

    The sampling can be executed in one of two modes:

    • BY_PERCENT: the specified percentage of rows will be output
    • BY_SIZE: the rows output depend on the given sample size and the total number of rows of the input data

    For example, using BY_PERCENT mode with 10000 input rows and percent set to 0.25, you can expect approximately 2500 rows of output. This value is not exact. It will vary with different settings of the seed property.
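
    The arithmetic above can be illustrated with a minimal, standalone sketch. It assumes each row is kept independently with probability percent (Bernoulli sampling) and uses java.util.Random; neither detail is confirmed by this documentation. The class PercentSamplingSketch and its method are hypothetical and not part of the library.

```java
import java.util.Random;

// Hypothetical illustration only; this is NOT the SampleRandomRows implementation.
class PercentSamplingSketch {

    // Assumption: each row is kept independently with probability `percent`.
    static long countSampledRows(long inputRows, double percent, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        for (long i = 0; i < inputRows; i++) {
            if (rng.nextDouble() < percent) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Different seeds give slightly different counts, all close to 2500.
        for (long seed = 1; seed <= 3; seed++) {
            long kept = countSampledRows(10_000, 0.25, seed);
            System.out.println("seed " + seed + ": " + kept + " rows kept");
        }
    }
}
```

    Under this assumption the row count follows a binomial distribution centered on inputRows * percent, which is why the output size is approximate rather than exact.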

    In contrast, using BY_SIZE mode with sampleSize set to 2500, you can expect approximately 2500 rows of output regardless of the input data size, provided the input contains at least that many rows. This value is not exact; it will vary with different settings of the seed property. Use BY_SIZE when you want a specific number of rows output. The sampleSize property sets an upper limit on the number of rows that will be output.
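
    One way to picture BY_SIZE behavior is the standalone sketch below. It assumes the operator derives a per-row keep probability of sampleSize / totalRows and stops once sampleSize rows have been kept; this documentation does not confirm that strategy. The class SizeSamplingSketch and its method are hypothetical.

```java
import java.util.Random;

// Hypothetical illustration only; this is NOT the SampleRandomRows implementation.
class SizeSamplingSketch {

    // Assumption: keep probability is sampleSize / inputRows, capped at sampleSize rows.
    static long countSampledRows(long inputRows, long sampleSize, long seed) {
        if (inputRows <= 0 || sampleSize <= 0) {
            return 0;
        }
        double keepProbability = Math.min(1.0, (double) sampleSize / inputRows);
        Random rng = new Random(seed);
        long kept = 0;
        // Stop early once the upper limit is reached.
        for (long i = 0; i < inputRows && kept < sampleSize; i++) {
            if (rng.nextDouble() < keepProbability) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The same sampleSize yields roughly the same output for different input sizes.
        for (long inputRows : new long[] {10_000, 100_000, 1_000_000}) {
            long kept = countSampledRows(inputRows, 2_500, 42L);
            System.out.println(inputRows + " input rows -> " + kept + " rows kept");
        }
    }
}
```

    In this sketch the output never exceeds sampleSize, and an input smaller than sampleSize is passed through in full because the keep probability is clamped to 1.0.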

    The seed property is set to the current time (System.currentTimeMillis()) by default. Override this value to specify the random seed to use.
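
    The effect of fixing the seed can be sketched with plain java.util.Random. Whether SampleRandomRows uses that generator internally is an assumption; the class SeedSketch and its method are hypothetical and only show why a fixed seed makes a sample repeatable while a time-based seed generally does not.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only; this is NOT the SampleRandomRows implementation.
class SeedSketch {

    // Returns one keep/drop decision per row, driven entirely by the seed.
    static boolean[] selectionMask(int rows, double percent, long seed) {
        Random rng = new Random(seed);
        boolean[] keep = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            keep[i] = rng.nextDouble() < percent;
        }
        return keep;
    }

    public static void main(String[] args) {
        // A fixed seed reproduces exactly the same row selection on every run.
        boolean[] first = selectionMask(1_000, 0.25, 12345L);
        boolean[] second = selectionMask(1_000, 0.25, 12345L);
        System.out.println("fixed seed repeatable: " + Arrays.equals(first, second));

        // A time-based seed, like the documented default, changes between runs.
        long timeSeed = System.currentTimeMillis();
        boolean[] timed = selectionMask(1_000, 0.25, timeSeed);
        System.out.println("time-based seed used: " + timeSeed + ", mask length " + timed.length);
    }
}
```

    Setting an explicit seed is useful for tests and for comparing runs, since the selected rows then depend only on the configuration and the input.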

    • Constructor Detail

      • SampleRandomRows

        public SampleRandomRows()
        Performs default random sampling on the data. By default, sampling will select a fixed percentage of the input.
      • SampleRandomRows

        public SampleRandomRows(double percent,
                                long seed)
        Perform random sampling selecting a fixed percentage of the input.
        Parameters:
        percent - percentage of the input data wanted
        seed - seed value for the random number generator
      • SampleRandomRows

        public SampleRandomRows(long sampleSize,
                                long seed)
        Perform random sampling selecting a fixed number of records from the input data.
        Parameters:
        sampleSize - the wanted output sample size (in rows)
        seed - seed value for the random number generator
    • Method Detail

      • getSeed

        public Long getSeed()
        Get the random number generator seed value.
        Returns:
        random number generator seed value
      • setSeed

        public void setSeed(Long seed)
        Set the random number generator seed value.
        Parameters:
        seed - random number generator seed value
      • getPercent

        public double getPercent()
        Get the percentage of input data to output.
        Returns:
        percentage of input data
      • setPercent

        public void setPercent(double percent)
        Set the percentage of input data wanted, expressed as a fraction. This value must be in the range: 0 < percent < 1.0. This value is only used if the sample mode is BY_PERCENT.
        Parameters:
        percent - percentage of input data
      • getSampleSize

        public long getSampleSize()
        Get the sample size (in rows) of data wanted.
        Returns:
        sample size
      • setSampleSize

        public void setSampleSize(long sampleSize)
        Set the wanted sample size in rows. Set this value when using the BY_SIZE sample mode. The operator will output approximately this number of rows.
        Parameters:
        sampleSize - wanted sample size
      • getMode

        public SampleMode getMode()
        Get the sample mode.
        Returns:
        sample mode
      • setMode

        public void setMode(SampleMode mode)
        Set the sample mode.
        Parameters:
        mode - sample mode
      • compose

        protected void compose(CompositionContext ctx)
        Description copied from class: CompositeOperator
        Compose the body of this operator. Implementations should do the following:
        1. Perform any validation of configuration, input types, etc.
        2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
        3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
        Specified by:
        compose in class CompositeOperator
        Parameters:
        ctx - the context