Class DataDistribution

java.lang.Object
com.pervasive.datarush.ports.record.DataDistribution
Direct Known Subclasses:
FullDataDistribution, PartialDataDistribution

public abstract class DataDistribution extends Object
DataDistribution is the component of RecordMetadata that describes how the data is distributed. Distributions are usually partial, meaning the data is partitioned across the nodes of the cluster or, in pseudo-distributed operation, across threads. In rare cases a full distribution is required. Use it only when the data is "small", since the data must be replicated to every node in the cluster.

Operators may declare a required distribution by calling RecordPort.setRequiredDataDistribution(com.pervasive.datarush.operators.MetadataCalculationContext, com.pervasive.datarush.ports.record.DataDistribution). It is the responsibility of the framework to ensure that this requirement is met. Operators may also declare their output distribution by calling RecordPort.setOutputDataDistribution(com.pervasive.datarush.operators.MetadataCalculationContext, com.pervasive.datarush.ports.record.DataDistribution). This lets the framework know how data is distributed on the operator's output. If there is a mismatch between the required and provided distributions, the framework will automatically redistribute as needed.

The following table lists several combinations of source and required distributions along with the action taken by the framework for each combination.

| Source Distribution | Required Distribution | Framework Action |
|---|---|---|
| FullDataDistribution | FullDataDistribution | None required |
| FullDataDistribution | Any Partial | Error (unsupported) |
| Any Partial | FullDataDistribution | Redistribute |
| Any Partial | UnspecifiedPartialDistribution | None required |
| BalancedDistribution | BalancedDistribution | None required |
| Any Partial (other than balanced) | BalancedDistribution | Redistribute evenly |
| KeyDrivenDataDistribution hashed on keys [a,b] | KeyDrivenDataDistribution hashed on keys [a,b] | None required |
| KeyDrivenDataDistribution hashed on keys [a,b] | KeyDrivenDataDistribution hashed on keys [a,b,c] | Redistribute on keys [a,b,c] |
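
The rules in the table can be read as a small decision function. The following is a minimal, self-contained sketch of that reading, not the framework's implementation. The names DistributionActionSketch, Kind, Dist and action are hypothetical stand-ins and are not part of the DataRush API. The handling of a non-key-driven partial source against a key-driven requirement is an assumption, since the table does not list that combination.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the table above; not the DataRush implementation. */
final class DistributionActionSketch {

    /** Simplified stand-ins for the distribution kinds named in the table. */
    enum Kind { FULL, UNSPECIFIED_PARTIAL, BALANCED, KEY_DRIVEN }

    /** A kind plus its hash keys (empty unless the kind is KEY_DRIVEN). */
    static final class Dist {
        final Kind kind;
        final List<String> keys;

        Dist(Kind kind, String... keys) {
            this.kind = kind;
            this.keys = Collections.unmodifiableList(Arrays.asList(keys));
        }
    }

    /** Returns the action the table lists for a source/required pair. */
    static String action(Dist source, Dist required) {
        if (source.kind == Kind.FULL) {
            // Per the table, a full source only satisfies a full requirement.
            return required.kind == Kind.FULL ? "None required" : "Error (unsupported)";
        }
        // From here on the source is some partial distribution.
        switch (required.kind) {
            case FULL:
                return "Redistribute";
            case UNSPECIFIED_PARTIAL:
                return "None required";
            case BALANCED:
                return source.kind == Kind.BALANCED ? "None required" : "Redistribute evenly";
            case KEY_DRIVEN:
                boolean sameKeys = source.kind == Kind.KEY_DRIVEN
                        && source.keys.equals(required.keys);
                // Assumption: any partial source not already hashed on exactly the
                // required keys is rehashed; the table only lists the key-driven case.
                return sameKeys ? "None required"
                        : "Redistribute on keys [" + String.join(",", required.keys) + "]";
            default:
                throw new IllegalArgumentException("Unknown kind: " + required.kind);
        }
    }

    public static void main(String[] args) {
        Dist ab = new Dist(Kind.KEY_DRIVEN, "a", "b");
        Dist abc = new Dist(Kind.KEY_DRIVEN, "a", "b", "c");
        System.out.println(action(ab, ab));
        System.out.println(action(ab, abc));
        System.out.println(action(new Dist(Kind.FULL), new Dist(Kind.BALANCED)));
    }
}
```

In this sketch key lists are compared for exact equality, which matches the two key-driven rows of the table: a source hashed on [a,b] satisfies a requirement on [a,b] but not one on [a,b,c].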
See Also:
  • StreamingOperator#computeMetadata
  • IterativeOperator#computeMetadata
  • Constructor Details

    • DataDistribution

      public DataDistribution()
  • Method Details

    • toString

      public abstract String toString()
      Overrides:
      toString in class Object
    • remap

      public abstract DataDistribution remap(FieldRemapping mapping)
      Applies the given field remapping to this distribution, changing names as required. Distributions that reference keys must have their key names remapped.
      Parameters:
      mapping - the field remapping.
      Returns:
      this distribution, remapped to the new names.
    • getAliases

      public abstract AliasSet[] getAliases()
      Returns the fields that are referenced by this distribution. Note that it is valid for a distribution to reference no fields, in which case it should return an empty array. This method is used by the framework to validate that the distribution is consistent with the type of the record.
      Returns:
      the fields that are referenced by this distribution
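
The contracts of remap and getAliases can be illustrated with a small stand-alone model of a key-based distribution. This is only a sketch under stated assumptions: KeyedDistributionSketch is a hypothetical class that does not extend DataDistribution, a Map<String, String> of old to new names stands in for FieldRemapping, and a String[] stands in for AliasSet[]. The consistency check is one plausible reading of how the referenced fields might be validated against a record type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a distribution that references key fields. */
final class KeyedDistributionSketch {
    private final List<String> keys;

    KeyedDistributionSketch(String... keys) {
        this.keys = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(keys)));
    }

    /**
     * Mirrors the remap contract: returns a distribution whose key names
     * follow the renames. Keys without an entry keep their original name.
     */
    KeyedDistributionSketch remap(Map<String, String> renames) {
        List<String> renamed = new ArrayList<>();
        for (String key : keys) {
            renamed.add(renames.getOrDefault(key, key));
        }
        return new KeyedDistributionSketch(renamed.toArray(new String[0]));
    }

    /** Mirrors the getAliases contract: an empty array when no fields are referenced. */
    String[] referencedFields() {
        return keys.toArray(new String[0]);
    }

    /** Assumed validation: every referenced field must exist in the record type. */
    boolean isConsistentWith(Set<String> recordFields) {
        return recordFields.containsAll(keys);
    }

    @Override
    public String toString() {
        return "hash" + keys;
    }

    public static void main(String[] args) {
        KeyedDistributionSketch original = new KeyedDistributionSketch("a", "b");
        KeyedDistributionSketch remapped = original.remap(Collections.singletonMap("a", "x"));
        System.out.println(original + " -> " + remapped);
    }
}
```

The sketch returns a new object from remap rather than mutating the receiver, so a distribution that has already been published on one port is not affected when a downstream rename is applied.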