Class KeyDrivenDataDistribution


  • public final class KeyDrivenDataDistribution
    extends PartialStaticDataDistribution
    A DataDistribution based on a set of keys from the input data. A KeyDrivenDataDistribution selects a set of key fields and applies the specified base partitioning function to them. The function must be a function of input. This guarantees that rows with equivalent keys are located in the same partition.
    • Method Detail

      • hashed

        public static KeyDrivenDataDistribution hashed(String... keys)
        Returns a distribution that distributes based on a hashcode of the named key fields.
        Parameters:
        keys - the names of the key fields
        Returns:
        a hash distribution
      • hashed

        public static KeyDrivenDataDistribution hashed(List<String> keys)
        Returns a distribution that distributes based on a hashcode of the named key fields.
        Parameters:
        keys - the names of the key fields
        Returns:
        a hash distribution
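The guarantee behind a hash distribution can be illustrated with a small standalone sketch. It does not call the library: the class `HashPartitionSketch`, the method `partitionFor`, the map-based row representation, and the field name `customerId` are all hypothetical, and the hash function the library actually uses is not stated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Standalone illustration only; none of these names belong to the library.
final class HashPartitionSketch {

    // Picks a partition from the hash code of the selected key values.
    static int partitionFor(Map<String, Object> row, List<String> keys, int partitionCount) {
        List<Object> keyValues = new ArrayList<>();
        for (String key : keys) {
            keyValues.add(row.get(key));
        }
        // floorMod keeps the result in [0, partitionCount) even for negative hash codes.
        return Math.floorMod(keyValues.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("customerId");
        Map<String, Object> a = Map.of("customerId", 42, "amount", 10.0);
        Map<String, Object> b = Map.of("customerId", 42, "amount", 99.5);
        // Rows with equal key values always map to the same partition.
        System.out.println(partitionFor(a, keys, 4) == partitionFor(b, keys, 4)); // prints "true"
    }
}
```

Because only the key fields feed the hash, non-key fields such as `amount` have no influence on where a row lands.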
      • ranges

        public static KeyDrivenDataDistribution ranges(List<RecordToken> boundaries)
        Returns a distribution that distributes based on the specified range boundaries.
        Parameters:
        boundaries - the range boundaries
        Returns:
        a range distribution
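As a rough illustration of range-based distribution, the sketch below maps a key to a partition by locating it among sorted boundaries. It is a standalone example, not the library's implementation: it uses plain `int` keys instead of `RecordToken`, the names `RangePartitionSketch` and `partitionFor` are hypothetical, and sending a key equal to a boundary to the upper partition is an assumed convention.

```java
import java.util.Arrays;

// Standalone illustration only; none of these names belong to the library.
final class RangePartitionSketch {

    // boundaries must be sorted ascending and distinct; n boundaries define n + 1 partitions.
    static int partitionFor(int key, int[] boundaries) {
        int idx = Arrays.binarySearch(boundaries, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        // Assumed convention: a key equal to a boundary goes to the upper partition.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    public static void main(String[] args) {
        int[] boundaries = {10, 20};
        for (int key : new int[] {5, 10, 15, 20, 25}) {
            System.out.println(key + " -> partition " + partitionFor(key, boundaries));
        }
    }
}
```

With boundaries 10 and 20, the keys 5, 10, 15, 20, and 25 fall into partitions 0, 1, 1, 2, and 2 respectively.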
      • localKeyed

        public static KeyDrivenDataDistribution localKeyed(PartitioningFunction basePartitionFunction,
                                                           List<String> keys)
                                                    throws IllegalArgumentException
        For advanced use only; this redistributes data within each JVM but not across the cluster. It is not a general-purpose partitioning strategy, but it can be used in certain cases where locally repartitioning data yields a performance benefit.

        Returns a distribution that locally distributes based on the given function of the named key fields.

        Parameters:
        basePartitionFunction - the partition function
        keys - the names of the key fields
        Returns:
        a key driven distribution
        Throws:
        IllegalArgumentException - if the provided function is not a function of input
      • getBasePartitionFunction

        public PartitioningFunction getBasePartitionFunction()
        Returns the base partitioning function. This function is applied to a record consisting of the selected key fields.
        Returns:
        the base partitioning function
      • isLocal

        public boolean isLocal()
      • isGroupedBy

        public boolean isGroupedBy(String[] keys)
        Returns true if this distribution guarantees groupings on the specified list of keys will not be split across partitions.
        Parameters:
        keys - the group keys
        Returns:
        whether this distribution preserves groupings across partitions
      • isGroupingCompatible

        public boolean isGroupingCompatible(String[] keys)
        Returns true if this distribution guarantees groupings on the specified list of keys will not be split across partitions.
        Parameters:
        keys - the group keys
        Returns:
        whether this distribution preserves groupings across partitions
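One way to reason about when a key-driven distribution keeps groups together: if every partitioning key is also a group key, rows that agree on all group keys also agree on all partitioning keys, so they land in the same partition. The sketch below encodes that sufficient condition. It is an assumption made for illustration, not the documented logic of `isGroupedBy` or `isGroupingCompatible`, and the names `GroupingSketch` and `keepsGroupsTogether` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Standalone illustration only; none of these names belong to the library.
final class GroupingSketch {

    // True when every partitioning key is also a group key, which means
    // equal group-key values imply equal partitioning-key values.
    static boolean keepsGroupsTogether(String[] partitionKeys, String[] groupKeys) {
        Set<String> group = new HashSet<>(Arrays.asList(groupKeys));
        return group.containsAll(Arrays.asList(partitionKeys));
    }

    public static void main(String[] args) {
        // Partitioned by "a": grouping by ("a", "b") is never split.
        System.out.println(keepsGroupsTogether(new String[] {"a"}, new String[] {"a", "b"}));
        // Partitioned by ("a", "b"): rows sharing only "a" may land in different partitions.
        System.out.println(keepsGroupsTogether(new String[] {"a", "b"}, new String[] {"a"}));
    }
}
```

The first call prints `true` and the second prints `false`.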
      • requiresRepartitionFrom

        protected boolean requiresRepartitionFrom(PartialDataDistribution source)
        Description copied from class: PartialStaticDataDistribution
        Subclasses must override this method to declare whether a repartition is required given the source distribution. Implementations may err on the side of caution by always returning true, but doing so may hurt performance.
        Specified by:
        requiresRepartitionFrom in class PartialStaticDataDistribution
        Parameters:
        source - the source distribution
        Returns:
        true if a repartition is required, false if this data distribution matches what was specified.
      • getKeys

        public String[] getKeys()
        Returns the keys by which the data is partitioned.
        Returns:
        the keys by which the data is partitioned
      • getAliases

        public AliasSet[] getAliases()
        Description copied from class: DataDistribution
        Returns the fields that are referenced by this distribution. Note that it is valid for a distribution to reference no fields, in which case it should return an empty array. This method is used by the framework to validate that the distribution is consistent with the type of the record.
        Specified by:
        getAliases in class DataDistribution
        Returns:
        the fields that are referenced by this distribution