Class RowsToColumns

  • All Implemented Interfaces:
    LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator

    public class RowsToColumns
    extends CompositeOperator
    implements RecordPipelineOperator
    The RowsToColumns operator is used to pivot data from a narrow representation (rows) into a wider representation (columns). The data is first segmented into groups using a defined set of group keys. The ordering of the group keys is important as it defines how the data is partitioned and ordered for the pivot operation. A pivot key field contains the distinct values that will be used as the pivot point. This field must be typed as a string. A column is created in the output for each distinct value of the pivot key. An aggregation is defined which is performed on each data grouping defined by the group field(s) and the pivot field. The result of the aggregation for each unique value of the pivot field appears in the appropriate output column.

    The table below provides an example set of data that we want to pivot by Region. There are only four regions: North, South, East and West. For each item, we'd like to compute the total sales per region. Items can show up multiple times in a region, as the data is also organized by store.

    ItemIDStoreIDRegionSales
    110North1000
    115North800
    120South500
    130East700
    240West1200
    210North500
    215North200

    To accomplish this pivot, the ItemID will be used as the group key. The Region will be used as the pivot key. And the Sales column will be the pivot value, aggregating by summing the values. The pivot key values are: "North", "South", "East" and "West". The result of the pivot is shown in the table below. Note that the sales total for the West region for item 1 is empty. Scanning the input data shows that no sales were present in the West region for item 1. Item 1 did have two sales values for the North region. Those values (1000 and 800) are summed and the total (1800) appears in the North region column for item 1. Column values of ? indicate a null (non-existing) value.

    ItemIDNorthSouthEastWest
    11800500700?
    2700??1200

    The key concepts to understand usage of the RowsToColumns operator are:

    • The use of a set of columns to segment the data into groups for pivoting. This set of column is called the group keys. The example used the ItemID as the group key.
    • A categorical valued column whose distinct values are used as columns in the output. This is the pivot key. The example used the Region as the pivot key.
    • A column that can be aggregated for each data grouping and within each pivot key. These are the pivot values. The example used the Sales column as the pivot value.
    • Constructor Detail

      • RowsToColumns

        public RowsToColumns()
    • Method Detail

      • getGroupKeys

        public List<String> getGroupKeys()
        Get the list of fields used to segment the input data into groups.
        Returns:
        ordered list of grouping field names
      • setGroupKeys

        public void setGroupKeys​(List<String> groupKeys)
        Set the list of fields used to segment the input data into groups for pivoting.
        Parameters:
        groupKeys - ordered list of grouping field names
      • setGroupKeys

        public void setGroupKeys​(String... groupKeys)
        Set the list of fields used to segment the input data into groups for pivoting.
        Parameters:
        groupKeys - ordered list of grouping field names
      • getPivotKey

        public String getPivotKey()
        Get the name of the field used as the pivot key.
        Returns:
        name of pivot key field
      • setPivotKey

        public void setPivotKey​(String pivotKey)
        Set the name of the field used as the pivot key. This field is used as the pivot point. Distinct values from this field are used to determine the output fields of the pivot.
        Parameters:
        pivotKey - name of the pivot key field
      • getAggregations

        public List<Aggregation> getAggregations()
        Get the aggregation(s) applied to each pivot group.
        Returns:
        aggregations applied to pivot groups
      • setAggregations

        public void setAggregations​(List<Aggregation> aggregations)
        Set the aggregation(s) to apply to each pivot grouping. The aggregation(s) define the input field(s) that make up the pivot values. These values are aggregated per pivot group and the results placed in a field within the output. A set of output fields will be produced per aggregation defined.
        Parameters:
        aggregations - aggregation(s) to apply to each pivot data group
      • setAggregations

        public void setAggregations​(Aggregation... aggregations)
        Set the aggregation(s) to apply to each pivot grouping. The aggregation(s) define the input field(s) that make up the pivot values. These values are aggregated per pivot group and the results placed in a field within the output. A set of output fields will be produced per aggregation defined.
        Parameters:
        aggregations - aggregation(s) to apply to each pivot data group
      • setAggregations

        public void setAggregations​(String aggregationExpression)
        Set the aggregation(s) to apply to each pivot grouping using an aggregation expression. The aggregation(s) define the input field(s) that make up the pivot values. These values are aggregated per pivot group and the results placed in a field within the output. A set of output fields will be produced per aggregation defined.
        Parameters:
        aggregations - aggregation(s) to apply to each pivot data group
      • getPivotKeyValues

        public List<String> getPivotKeyValues()
        Get the list of distinct values of the pivot key field.
        Returns:
        distinct pivot key values
      • setPivotKeyValues

        public void setPivotKeyValues​(List<String> pivotValues)
        Set the list of distinct pivot key values. An output field will be created for each value per aggregation defined. If the input pivot key field contains values not present in this list, those input values will be ignored.
        Parameters:
        pivotValues - list of distinct pivot key values
      • setPivotKeyValues

        public void setPivotKeyValues​(String... pivotValues)
        Set the list of distinct pivot key values. An output field will be created for each value per aggregation defined. If the input pivot key field contains values not present in this list, those input values will be ignored.
        Parameters:
        pivotValues - list of distinct pivot key values
      • setPivotColumnPattern

        public void setPivotColumnPattern​(String pattern)
        Sets the naming pattern used for new pivot columns. This will be used to determine the actual names of the pivot columns.
        Parameters:
        pattern - name pattern for pivot columns
      • compose

        protected void compose​(CompositionContext ctx)
        Description copied from class: CompositeOperator
        Compose the body of this operator. Implementations should do the following:
        1. Perform any validation of configuration, input types, etc
        2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
        3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators output ports to the composite's output ports
        Specified by:
        compose in class CompositeOperator
        Parameters:
        ctx - the context