Class UnionAll

  • All Implemented Interfaces:
    LogicalOperator, SourceOperator<RecordPort>

    public class UnionAll
    extends DeferredCompositeOperator
    implements SourceOperator<RecordPort>
    Provides a union of two data sources. The input data is usually consumed as it becomes available. If the data order specified in the metadata of both inputs match, then the operator will perform a sorted merge preserving the sort order of the input. Otherwise the output will be ordered non-deterministically.

    The type of the output is determined by setting the union mode. If the outputMapping setting is set to MAPBYSCHEMA then a schema must be provided which will define the output type. Otherwise the output type can be automatically determined by setting MAPBYPOSITION or MAPBYNAME, which will determine an appropriate output type by mapping the two inputs by position or name respectively.

    In the case where a target schema is provided the input fields are mapped to output fields based on field name. If a field in the output is not contained in the target schema, the field is dropped. If a field is contained in the output schema, but not in the input, the output field will contain NULL values. Input values are converted into the specified output field type where possible.

    For example, if the left input contains fields {a:int, b:int, c:string} and the right input contains fields {a:long, b:double, d:string} and the target schema specifies fields {a:double, c:string, d:date, e:string} then the following is true:

    • The output port schema will contain the fields {a:double, c:string, d:date, e:string}.
    • Field b is dropped from both inputs.
    • When obtaining data from the left input port, the output fields {d, e} will contain null values.
    • When obtaining data from the right input port, the output fields {c, e} will contain null values.
    • Field a from the left input will be converted from an integer type to a double type.
    • Field a from the right input will be converted from a long type to a double type.
    • Field d from the right input will be converted from a string type to a date type. The format pattern specified in the target schema will be used for the conversion, if specified.

    In the case where a target schema is not provided, the two input ports must have compatible types. In this case the operator will try to determine valid output fields based on the left and right input and two settings. The outputMapping setting can be set to MAPBYNAME if the left and right side should be matched by field name. Otherwise the operator can use MAPBYPOSITION and they will be matched by position. Additionally the allowExtraFields setting can be set to true if fields that are only present on one side of the input should be retained. If the field is not present in one of the inputs it will contain NULL values. If extra fields are present in one of the inputs and allowExtraFields is set to false an error will be thrown.

    • Constructor Detail

      • UnionAll

        public UnionAll()
    • Method Detail

      • getLeft

        public final RecordPort getLeft()
        Returns the left input port.
        Returns:
        the left input port.
      • getRight

        public final RecordPort getRight()
        Returns the right input port.
        Returns:
        the right input port.
      • getSchema

        public RecordTextSchema<?> getSchema()
        Gets the target record schema defining the output type.
        Returns:
        the record schema defining the output type
      • setSchema

        public void setSchema​(RecordTextSchema<?> schema)
        Sets the optional target output schema. The input data sources will be coerced into this type.
        Parameters:
        schema - the record schema defining the output type
      • getOutputMapping

        public UnionAll.UnionMode getOutputMapping()
        Get how the output type should be determined. Can be set to MAPBYPOSITION, MAPBYNAME, or MAPBYSCHEMA. If the first two options are used the output type will be automatically determined by mapping the fields in the two inputs positionally or by name respectively. Otherwise if MAPBYSCHEMA is used the schema must be set. Defaults to MAPBYPOSITION.
        Returns:
        the union mapping mode
      • setOutputMapping

        public void setOutputMapping​(UnionAll.UnionMode outputMapping)
        Set how the output type should be determined. Can be set to MAPBYPOSITION, MAPBYNAME, or MAPBYSCHEMA. If the first two options are used the output type will be automatically determined by mapping the fields in the two inputs positionally or by name respectively. Otherwise if MAPBYSCHEMA is used the schema must be set. Defaults to MAPBYPOSITION.
        Parameters:
        outputMapping - the union mapping mode
      • getIncludeExtraFields

        public boolean getIncludeExtraFields()
        Will be true if the generated schema will include unmapped fields from either side. Only applies if MAPBYPOSITION or MAPBYNAME mode are used. Defaults to false.
        Returns:
        true if unmapped fields will be included in the generated schema
      • setIncludeExtraFields

        public void setIncludeExtraFields​(boolean includeExtraFields)
        Set to true if the generated schema should include unmapped field from either side. Only applies if MAPBYPOSITION or MAPBYNAME mode are used. Defaults to false.
        Parameters:
        includeExtraFields - if unmapped fields will be included in the generated schema
      • computeMetadata

        protected void computeMetadata​(StreamingMetadataContext ctx)
        This operator can execute in parallel. It does not guarantee order so order meta-data for the output is not set and resolves to the default. Also the distribution cannot be maintained as a change in data types for fields used for distribution invalidates any current settings.
        Specified by:
        computeMetadata in class DeferredCompositeOperator
        Parameters:
        ctx - the context
      • generateSchema

        public static RecordTokenType generateSchema​(RecordTokenType leftType,
                                                     RecordTokenType rightType,
                                                     UnionAll.UnionMode mode,
                                                     boolean keepExtraFields)
        Generate a schema for the union of two records.
        Parameters:
        leftType - the left record type
        rightType - the right record type
        mode - the union mode
        keepExtraFields - if true fields only present on one side of the input will be retained
        Returns:
        the combined record type