Class SampleRandomRows

All Implemented Interfaces:
LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator

public class SampleRandomRows extends AbstractRecordCompositeOperator
Apply random sampling to the input data. The schema of the output data matches that of the input. The output data usually contains fewer rows than the input. The number of rows output varies depending on the value of the percent or the sampleSize property.

The sampling can be executed in one of two modes:

  • BY_PERCENT: the specified percentage of rows will be output
  • BY_SIZE: the rows output depend on the given sample size and the total number of rows of the input data

For example, using BY_PERCENT mode with 10000 input rows and percent set to 0.25, you can expect approximately 2500 rows of output. This value is not exact. It will vary with different settings of the seed property.
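The BY_PERCENT arithmetic above can be illustrated with a small standalone sketch. It does not use the operator itself: it assumes, purely for illustration, that each row is kept independently with probability percent, drawn from java.util.Random. The class name PercentSampleSketch and the method countSampled are hypothetical and are not part of this API.

```java
import java.util.Random;

public class PercentSampleSketch {
    /**
     * Hypothetical model of BY_PERCENT sampling (an assumption, not the
     * operator's documented algorithm): each row is kept independently
     * with probability {@code percent}.
     */
    public static long countSampled(long totalRows, double percent, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        for (long i = 0; i < totalRows; i++) {
            if (rng.nextDouble() < percent) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // 10000 input rows at percent = 0.25: each count lands near 2500,
        // and the exact value changes with the seed.
        for (long seed = 1; seed <= 3; seed++) {
            System.out.println("seed " + seed + ": "
                    + countSampled(10_000, 0.25, seed) + " rows");
        }
    }
}
```

Under this assumption the row count is close to 2500 for every seed, without being exactly 2500.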

In contrast, using BY_SIZE mode with any input data size and sampleSize set to 2500, you can expect approximately 2500 rows of output. As in BY_PERCENT mode, this value is not exact and varies with the seed property. Use BY_SIZE when you want a specific number of rows output. The sampleSize property sets an upper limit on the number of rows that will be output.
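One way to picture how BY_SIZE can depend on both the sample size and the total input row count is the sketch below. It is an assumption made for illustration only, not the operator's documented algorithm: each row is kept with probability sampleSize / totalRows, and sampling stops once sampleSize rows have been kept so that the upper limit holds. The class name SizeSampleSketch and the method countSampled are hypothetical.

```java
import java.util.Random;

public class SizeSampleSketch {
    /**
     * Hypothetical model of BY_SIZE sampling (an assumption): keep each row
     * with probability sampleSize / totalRows, never keeping more than
     * sampleSize rows in total.
     */
    public static long countSampled(long totalRows, long sampleSize, long seed) {
        Random rng = new Random(seed);
        // When the input is smaller than the requested sample, keep every row.
        double keepProbability = Math.min(1.0, (double) sampleSize / totalRows);
        long kept = 0;
        for (long i = 0; i < totalRows && kept < sampleSize; i++) {
            if (rng.nextDouble() < keepProbability) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The same sampleSize applied to inputs of different sizes.
        long[] inputSizes = {1_000, 10_000, 1_000_000};
        for (long totalRows : inputSizes) {
            System.out.println(totalRows + " input rows -> "
                    + countSampled(totalRows, 2_500, 42L) + " output rows");
        }
    }
}
```

In this model the output never exceeds sampleSize, stays close to it for large inputs, and equals the input size when the input has fewer rows than sampleSize.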

The seed property is set to the current time (System.currentTimeMillis()) by default. Override this value to specify the random seed to use.
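The effect of fixing the seed can be sketched with java.util.Random alone. The example assumes, for illustration only, that row selection is driven by a seeded generator of this kind; the class name SeedSketch and the method sampledRowIndexes are hypothetical and not part of this API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SeedSketch {
    /**
     * Returns the indexes of the rows kept by a hypothetical sampler that
     * keeps each row with probability {@code percent}.
     */
    public static List<Integer> sampledRowIndexes(int totalRows, double percent, long seed) {
        Random rng = new Random(seed);
        List<Integer> kept = new ArrayList<>();
        for (int row = 0; row < totalRows; row++) {
            if (rng.nextDouble() < percent) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A fixed seed selects the same rows on every run.
        List<Integer> first = sampledRowIndexes(1_000, 0.25, 12345L);
        List<Integer> second = sampledRowIndexes(1_000, 0.25, 12345L);
        System.out.println("same rows with fixed seed: " + first.equals(second));

        // A time-based seed, like the default described above, generally
        // selects different rows from one run to the next.
        List<Integer> timeSeeded = sampledRowIndexes(1_000, 0.25, System.currentTimeMillis());
        System.out.println("rows kept with time-based seed: " + timeSeeded.size());
    }
}
```

Setting an explicit seed is the usual choice when a sample must be reproducible, for example in tests.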

  • Constructor Details

    • SampleRandomRows

      public SampleRandomRows()
      Performs default random sampling on the data. By default, sampling will select a fixed percentage of the input.
    • SampleRandomRows

      public SampleRandomRows(double percent, long seed)
      Perform random sampling selecting a fixed percentage of the input.
      Parameters:
      percent - percentage of the input data wanted, expressed as a fraction (for example, 0.25 for 25%)
      seed - seed value for the random number generator
    • SampleRandomRows

      public SampleRandomRows(long sampleSize, long seed)
      Perform random sampling selecting a fixed number of records from the input data.
      Parameters:
      sampleSize - the wanted output sample size (in rows)
      seed - seed value for the random number generator
  • Method Details

    • getInput

      public RecordPort getInput()
      Description copied from interface: PipelineOperator
      Returns the input port
      Specified by:
      getInput in interface PipelineOperator<RecordPort>
      Overrides:
      getInput in class AbstractRecordCompositeOperator
      Returns:
      the input port
    • getOutput

      public RecordPort getOutput()
      Description copied from interface: PipelineOperator
      Returns the output port
      Specified by:
      getOutput in interface PipelineOperator<RecordPort>
      Overrides:
      getOutput in class AbstractRecordCompositeOperator
      Returns:
      the output port
    • getSeed

      public Long getSeed()
      Get the random number generator seed value.
      Returns:
      random number generator seed value
    • setSeed

      public void setSeed(Long seed)
      Set the random number generator seed value.
      Parameters:
      seed - random number generator seed value
    • getPercent

      public double getPercent()
      Get the percentage of input data to output.
      Returns:
      percentage of input data
    • setPercent

      public void setPercent(double percent)
      Set the percentage of input data wanted. This value must be in the range: 0 < percent < 1.0. This value is only used if the sample mode is BY_PERCENT.
      Parameters:
      percent - percentage of input data
    • getSampleSize

      public long getSampleSize()
      Get the sample size (in rows) of data wanted.
      Returns:
      sample size
    • setSampleSize

      public void setSampleSize(long sampleSize)
      Set the wanted sample size in rows. Set this value when using the BY_SIZE sample mode. The operator will output approximately sampleSize rows.
      Parameters:
      sampleSize - wanted sample size
    • getMode

      public SampleMode getMode()
      Get the sample mode.
      Returns:
      sample mode
    • setMode

      public void setMode(SampleMode mode)
      Set the sample mode.
      Parameters:
      mode - sample mode
    • compose

      protected void compose(CompositionContext ctx)
      Description copied from class: CompositeOperator
      Compose the body of this operator. Implementations should do the following:
      1. Perform any validation of configuration, input types, etc.
      2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
      3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
      Specified by:
      compose in class CompositeOperator
      Parameters:
      ctx - the context