Class RowsToColumns

All Implemented Interfaces:
LogicalOperator, PipelineOperator<RecordPort>, RecordPipelineOperator

public class RowsToColumns extends CompositeOperator implements RecordPipelineOperator
The RowsToColumns operator is used to pivot data from a narrow representation (rows) into a wider representation (columns). The data is first segmented into groups using a defined set of group keys. The ordering of the group keys is important as it defines how the data is partitioned and ordered for the pivot operation. A pivot key field contains the distinct values that will be used as the pivot point. This field must be typed as a string. A column is created in the output for each distinct value of the pivot key. An aggregation is defined which is performed on each data grouping defined by the group field(s) and the pivot field. The result of the aggregation for each unique value of the pivot field appears in the appropriate output column.

The table below provides an example set of data that we want to pivot by Region. There are only four regions: North, South, East and West. For each item, we'd like to compute the total sales per region. Items can show up multiple times in a region, as the data is also organized by store.

ItemIDStoreIDRegionSales
110North1000
115North800
120South500
130East700
240West1200
210North500
215North200

To accomplish this pivot, the ItemID will be used as the group key. The Region will be used as the pivot key. And the Sales column will be the pivot value, aggregating by summing the values. The pivot key values are: "North", "South", "East" and "West". The result of the pivot is shown in the table below. Note that the sales total for the West region for item 1 is empty. Scanning the input data shows that no sales were present in the West region for item 1. Item 1 did have two sales values for the North region. Those values (1000 and 800) are summed and the total (1800) appears in the North region column for item 1. Column values of ? indicate a null (non-existing) value.

ItemIDNorthSouthEastWest
11800500700?
2700??1200

The key concepts to understand usage of the RowsToColumns operator are:

  • The use of a set of columns to segment the data into groups for pivoting. This set of column is called the group keys. The example used the ItemID as the group key.
  • A categorical valued column whose distinct values are used as columns in the output. This is the pivot key. The example used the Region as the pivot key.
  • A column that can be aggregated for each data grouping and within each pivot key. These are the pivot values. The example used the Sales column as the pivot value.
  • Field Details

  • Constructor Details

    • RowsToColumns

      public RowsToColumns()
  • Method Details

    • getInput

      public RecordPort getInput()
      Description copied from interface: PipelineOperator
      Returns the input port
      Specified by:
      getInput in interface PipelineOperator<RecordPort>
      Returns:
      the input port
    • getOutput

      public RecordPort getOutput()
      Description copied from interface: PipelineOperator
      Returns the output port
      Specified by:
      getOutput in interface PipelineOperator<RecordPort>
      Returns:
      the output port
    • getGroupKeys

      public List<String> getGroupKeys()
      Get the list of fields used to segment the input data into groups.
      Returns:
      ordered list of grouping field names
    • setGroupKeys

      public void setGroupKeys(List<String> groupKeys)
      Set the list of fields used to segment the input data into groups for pivoting.
      Parameters:
      groupKeys - ordered list of grouping field names
    • setGroupKeys

      public void setGroupKeys(String... groupKeys)
      Set the list of fields used to segment the input data into groups for pivoting.
      Parameters:
      groupKeys - ordered list of grouping field names
    • getPivotKey

      public String getPivotKey()
      Get the name of the field used as the pivot key.
      Returns:
      name of pivot key field
    • setPivotKey

      public void setPivotKey(String pivotKey)
      Set the name of the field used as the pivot key. This field is used as the pivot point. Distinct values from this field are used to determine the output fields of the pivot.
      Parameters:
      pivotKey - name of the pivot key field
    • getAggregations

      public List<Aggregation> getAggregations()
      Get the aggregation(s) applied to each pivot group.
      Returns:
      aggregations applied to pivot groups
    • setAggregations

      public void setAggregations(List<Aggregation> aggregations)
      Set the aggregation(s) to apply to each pivot grouping. The aggregation(s) define the input field(s) that make up the pivot values. These values are aggregated per pivot group and the results placed in a field within the output. A set of output fields will be produced per aggregation defined.
      Parameters:
      aggregations - aggregation(s) to apply to each pivot data group
    • setAggregations

      public void setAggregations(Aggregation... aggregations)
      Set the aggregation(s) to apply to each pivot grouping. The aggregation(s) define the input field(s) that make up the pivot values. These values are aggregated per pivot group and the results placed in a field within the output. A set of output fields will be produced per aggregation defined.
      Parameters:
      aggregations - aggregation(s) to apply to each pivot data group
    • setAggregations

      public void setAggregations(String aggregationExpression)
      Set the aggregation(s) to apply to each pivot grouping using an aggregation expression. The aggregation(s) define the input field(s) that make up the pivot values. These values are aggregated per pivot group and the results placed in a field within the output. A set of output fields will be produced per aggregation defined.
      Parameters:
      aggregations - aggregation(s) to apply to each pivot data group
    • getPivotKeyValues

      public List<String> getPivotKeyValues()
      Get the list of distinct values of the pivot key field.
      Returns:
      distinct pivot key values
    • setPivotKeyValues

      public void setPivotKeyValues(List<String> pivotValues)
      Set the list of distinct pivot key values. An output field will be created for each value per aggregation defined. If the input pivot key field contains values not present in this list, those input values will be ignored.
      Parameters:
      pivotValues - list of distinct pivot key values
    • setPivotKeyValues

      public void setPivotKeyValues(String... pivotValues)
      Set the list of distinct pivot key values. An output field will be created for each value per aggregation defined. If the input pivot key field contains values not present in this list, those input values will be ignored.
      Parameters:
      pivotValues - list of distinct pivot key values
    • setPivotColumnPattern

      public void setPivotColumnPattern(String pattern)
      Sets the naming pattern used for new pivot columns. This will be used to determine the actual names of the pivot columns.
      Parameters:
      pattern - name pattern for pivot columns
    • compose

      protected void compose(CompositionContext ctx)
      Description copied from class: CompositeOperator
      Compose the body of this operator. Implementations should do the following:
      1. Perform any validation of configuration, input types, etc
      2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
      3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators output ports to the composite's output ports
      Specified by:
      compose in class CompositeOperator
      Parameters:
      ctx - the context