Class AvroSchemaUtils


  • public class AvroSchemaUtils
    extends Object
    Utilities for working with Avro schemas. Contains methods that map between DataRush and Avro types, as well as methods for extracting information about Avro encodings.
    • Method Detail

      • cleanseName

        public static String cleanseName​(String fieldName)
        Cleanses the specified name so it is a valid field name in Avro. Valid field names:
        • Start with an underscore or alphabetic character.
        • Contain only underscores and alphanumeric characters.
        Cleansing is done by replacing invalid characters with an underscore. Note that this can map different field names in DataRush to the same name in Avro.
        Parameters:
        fieldName - the name to cleanse
        Returns:
        a name valid for use in Avro
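        The cleansing rule and the collision it can cause can be sketched in plain Java. This is an illustrative stand-in, not the DataRush implementation: the class NameCleansingSketch and its methods are hypothetical, and it assumes that only ASCII letters and digits count as alphanumeric and that a leading digit is replaced by an underscore rather than prefixed with one.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the cleansing rule; not the DataRush implementation. */
final class NameCleansingSketch {

    /**
     * Replaces every character that is invalid at its position with '_'.
     * Assumption: only ASCII letters and digits are treated as alphanumeric,
     * and a leading digit is replaced rather than prefixed.
     */
    static String cleanse(String fieldName) {
        StringBuilder out = new StringBuilder(fieldName.length());
        for (int i = 0; i < fieldName.length(); i++) {
            char c = fieldName.charAt(i);
            boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            boolean digit = c >= '0' && c <= '9';
            // Digits are valid everywhere except the first position.
            boolean valid = c == '_' || letter || (digit && i > 0);
            out.append(valid ? c : '_');
        }
        return out.toString();
    }

    /**
     * Returns the first cleansed name that two different source names map to,
     * or null if every cleansed name is unique.
     */
    static String firstCollision(List<String> fieldNames) {
        Map<String, String> seen = new HashMap<>();
        for (String name : fieldNames) {
            String cleansed = cleanse(name);
            String previous = seen.putIfAbsent(cleansed, name);
            if (previous != null && !previous.equals(name)) {
                return cleansed;
            }
        }
        return null;
    }
}
```

        Under these assumptions, "unit-price" and "unit price" both cleanse to "unit_price", which is the kind of collision the note above warns about.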
      • generateSchema

        public static org.apache.avro.Schema generateSchema​(RecordTokenType type)
        Creates an Avro schema from the given DataRush record type.

        The generated schema is an Avro RECORD consisting of fields in the same order as the record, having the same names. Field names are cleansed to be valid Avro field names using cleanseName(String). If this cleansing results in a name collision, an error is raised. Each field in the generated schema will have a UNION type including NULL and the appropriate Avro schema type based on the input type as listed below:

        • BOOLEAN, DOUBLE, FLOAT, LONG, and INT are mapped to the Avro primitive type of the same name.
        • STRING is mapped differently based on the presence of a domain on the source field. If no domain is specified, it is mapped to the STRING primitive type. If a domain is specified, it is mapped to an ENUM having the same set of symbols as the domain.
        • BINARY is mapped to the BYTES primitive type.
        • NUMERIC is mapped to the DOUBLE primitive type; this may result in loss of precision.
        • CHAR is mapped to the STRING primitive type.
        • DATE is mapped to a nested RECORD having one field epochDays of type LONG. The value of this field is the same as DateValued#asEpochDays().
        • TIME is mapped to a nested RECORD having one field dayMillis of type INT. The value of this field is the same as TimeValued#asDayMillis().
        • TIMESTAMP is mapped to a nested RECORD having three fields: epochSecs of type LONG, subsecNanos of type INT, and offsetSecs of type INT. The values of these fields are the same as the values with the same names in TimestampValued.
        Parameters:
        type - the record type for which to generate a schema
        Returns:
        an Avro schema describing the given type
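        The type list above can be read as a lookup from the DataRush field type to the Avro type that is paired with NULL in the field's UNION. The following is a minimal sketch of that lookup only. The class GenerateSchemaMappingSketch, its enum, and its methods are hypothetical names introduced for illustration. The sketch does not call the DataRush or Avro libraries and does not build an actual schema.

```java
import java.util.List;

/** Hypothetical lookup mirroring the documented mapping; not the DataRush implementation. */
final class GenerateSchemaMappingSketch {

    /** DataRush scalar type names as they appear in the documented list. */
    enum SourceType {
        BOOLEAN, DOUBLE, FLOAT, LONG, INT, STRING, BINARY, NUMERIC, CHAR, DATE, TIME, TIMESTAMP
    }

    /**
     * Returns the name of the Avro schema type paired with NULL in the field's union.
     * The hasDomain flag only matters for STRING.
     */
    static String avroTypeFor(SourceType type, boolean hasDomain) {
        switch (type) {
            case BOOLEAN:
            case DOUBLE:
            case FLOAT:
            case LONG:
            case INT:
                return type.name();          // same-named Avro primitive
            case STRING:
                return hasDomain ? "ENUM" : "STRING";
            case BINARY:
                return "BYTES";
            case NUMERIC:
                return "DOUBLE";             // may lose precision
            case CHAR:
                return "STRING";
            case DATE:
            case TIME:
            case TIMESTAMP:
                return "RECORD";             // nested record, see nestedFields
            default:
                throw new IllegalArgumentException("Unmapped type: " + type);
        }
    }

    /** Field names of the nested RECORD for temporal types; empty for all other types. */
    static List<String> nestedFields(SourceType type) {
        switch (type) {
            case DATE:
                return List.of("epochDays");
            case TIME:
                return List.of("dayMillis");
            case TIMESTAMP:
                return List.of("epochSecs", "subsecNanos", "offsetSecs");
            default:
                return List.of();
        }
    }
}
```

        For example, a NUMERIC field resolves to DOUBLE, and a STRING field resolves to ENUM only when a domain is present.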
      • isWritable

        public static boolean isWritable​(ScalarTokenType type,
                                         org.apache.avro.Schema schema)
        Indicates whether the specified DataRush type can be encoded in the given schema.
        Parameters:
        type - the field type to check
        schema - the target schema for the field
        Returns:
        true if the target schema permits values of the specified type to be written (excluding consideration of null values), false otherwise.
      • determineType

        public static RecordTokenType determineType​(org.apache.avro.Schema schema)
        Maps an Avro schema to a DataRush record type.

        The provided schema will be converted to a record type having fields of the same name and appearing in the same order. If the schema is not of RECORD type, it will be treated as if it were a record containing a single field named "field0".

        Fields with primitive Avro types are mapped to DataRush as indicated in the table below:

        Source Avro Type    Target DataRush Type
        BOOLEAN             BOOLEAN
        BYTES               BINARY
        DOUBLE              DOUBLE
        FIXED               BINARY
        FLOAT               FLOAT
        LONG                LONG
        INT                 INT
        STRING              STRING

        For complex Avro datatypes, the mapping to DataRush is as follows:

        • RECORD data in Avro will, in general, be mapped to a DataRush record type as long as each field can be mapped to a scalar type. Nested records are not currently allowed except for the Avro RECORD representations of DataRush DATE, TIME, and TIMESTAMP types as described in the WriteAvro operator.
        • ENUM data in Avro will be mapped to the DataRush string type, setting the domain to the enumerated list of symbols.
        • UNION data in Avro can be mapped only if it is a union of NULL and exactly one other type which can be mapped to a scalar type.
        • ARRAY and MAP data in Avro is not currently supported.
        Parameters:
        schema - the schema for which to determine the equivalent record type
        Returns:
        a record type describing the data represented by the schema
        Throws:
        DRException - if the schema cannot be converted to a record type
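        The primitive table and the UNION rule above can be sketched together in plain Java. This is a simplified model, not the DataRush implementation. The class DetermineTypeSketch and its methods are hypothetical, Avro types are represented by their names as strings rather than by org.apache.avro.Schema objects, IllegalArgumentException stands in for DRException, and the nested RECORD forms of DATE, TIME, and TIMESTAMP are left out.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical model of the documented mapping rules; not the DataRush implementation. */
final class DetermineTypeSketch {

    /** Primitive Avro type name to DataRush type name, as listed in the table. */
    private static final Map<String, String> PRIMITIVES = Map.of(
            "BOOLEAN", "BOOLEAN",
            "BYTES", "BINARY",
            "DOUBLE", "DOUBLE",
            "FIXED", "BINARY",
            "FLOAT", "FLOAT",
            "LONG", "LONG",
            "INT", "INT",
            "STRING", "STRING");

    /** Maps a non-union Avro type name to a DataRush scalar type name. */
    static String scalarTypeFor(String avroType) {
        if ("ENUM".equals(avroType)) {
            return "STRING"; // the real mapping also sets the domain to the enum symbols
        }
        String target = PRIMITIVES.get(avroType);
        if (target == null) {
            // Covers ARRAY, MAP, NULL, and anything else without a scalar mapping.
            throw new IllegalArgumentException("No scalar mapping for Avro type " + avroType);
        }
        return target;
    }

    /** Applies the UNION rule: exactly NULL plus one other mappable type. */
    static String unionTypeFor(List<String> branches) {
        if (branches.size() != 2 || !branches.contains("NULL")) {
            throw new IllegalArgumentException("Union must be NULL plus exactly one other type: " + branches);
        }
        String other = branches.get(0).equals("NULL") ? branches.get(1) : branches.get(0);
        return scalarTypeFor(other);
    }
}
```

        In this model a union of NULL and FIXED resolves to BINARY, while a union of NULL, INT, and STRING, or a union of NULL and ARRAY, is rejected.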
      • isNullable

        public static boolean isNullable​(org.apache.avro.Schema schema)
        Indicates whether the specified Avro schema supports setting null values. That is, whether the schema is a UNION and has one branch of type NULL.
        Parameters:
        schema - the schema to check
        Returns:
        true if a null value can be written to the schema, false otherwise.