public class WriteAvro extends AbstractWriter
If appending to an existing target, data will be encoded using the schema stored with the target. If writing to a new file (including an append to a non-existent target) or overwriting an existing target, the supplied schema, if any, will be used to encode the data. If none was provided, a schema will be generated from the input data type, as described later.
Input fields are mapped to the output schema by name. Any input fields not mapped in the schema are dropped. Any schema fields not present in the input are implicitly null-valued; if the schema does not allow NULL for such a field, an error will be raised.
Input fields must map to a supported Avro type. An error will be raised if no valid mapping exists. Furthermore, during execution, if an input field is null-valued but the target field is not nullable - that is, not a UNION including a NULL type - the graph will fail. Valid source field types for primitive Avro types are shown in the table below.
| Target Avro Type | Source DataRush Types |
|---|---|
| BOOLEAN | BOOLEAN |
| BYTES | BINARY |
| DOUBLE | DOUBLE, FLOAT, LONG, INT |
| FIXED | BINARY |
| FLOAT | FLOAT, LONG, INT |
| LONG | LONG, INT |
| INT | INT |
| STRING | STRING |
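The table can be read as a widening rule: an integral or floating-point source may map to any Avro numeric type at least as wide as itself. The lookup below sketches this; type names are plain strings for illustration, not the actual DataRush or Avro type APIs.

```java
import java.util.List;
import java.util.Map;

public class PrimitiveMappings {
    // The table above as a lookup: target Avro type -> valid DataRush
    // source types. Names are plain strings here, not real API types.
    public static final Map<String, List<String>> VALID_SOURCES = Map.of(
            "BOOLEAN", List.of("BOOLEAN"),
            "BYTES",   List.of("BINARY"),
            "DOUBLE",  List.of("DOUBLE", "FLOAT", "LONG", "INT"),
            "FIXED",   List.of("BINARY"),
            "FLOAT",   List.of("FLOAT", "LONG", "INT"),
            "LONG",    List.of("LONG", "INT"),
            "INT",     List.of("INT"),
            "STRING",  List.of("STRING"));

    // Mirrors the documented behavior: an error is raised when no
    // valid mapping exists.
    public static void checkMapping(String targetAvroType, String sourceType) {
        List<String> allowed = VALID_SOURCES.get(targetAvroType);
        if (allowed == null || !allowed.contains(sourceType)) {
            throw new IllegalArgumentException(
                    "no valid mapping from " + sourceType + " to " + targetAvroType);
        }
    }
}
```

Note that the widening is one-way: `checkMapping("DOUBLE", "INT")` succeeds, while `checkMapping("INT", "LONG")` raises the error.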
Mapping to complex Avro datatypes is described below.
As noted previously, a schema will be generated if necessary. The generated schema is an Avro RECORD consisting of fields in the same order as the input record, having the same names. Field names are cleansed to be valid Avro field names using AvroSchemaUtils.cleanseName(String).
If this cleansing results in a name collision, an error is raised.
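The exact rules of AvroSchemaUtils.cleanseName(String) are not shown here; the sketch below illustrates the idea under the assumption that characters outside Avro's name grammar ([A-Za-z_][A-Za-z0-9_]*) are replaced with underscores. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class NameCleanser {
    // Hypothetical sketch: replace any character outside Avro's name
    // grammar ([A-Za-z_][A-Za-z0-9_]*) with '_', prefixing '_' if the
    // result would start with a digit (or be empty).
    public static String cleanse(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (char c : name.toCharArray()) {
            boolean valid = c == '_'
                    || (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            sb.append(valid ? c : '_');
        }
        if (sb.length() == 0 || Character.isDigit(sb.charAt(0))) {
            sb.insert(0, '_');
        }
        return sb.toString();
    }

    // Two distinct input names may cleanse to the same Avro name; the
    // writer raises an error rather than silently merging fields.
    public static void requireUnique(String... inputNames) {
        Set<String> seen = new HashSet<>();
        for (String name : inputNames) {
            if (!seen.add(cleanse(name))) {
                throw new IllegalArgumentException(
                        "field name collision after cleansing: " + name);
            }
        }
    }
}
```

Under these assumed rules, `"order date"` and `"order-date"` both cleanse to `order_date`, so an input record containing both would trigger the collision error.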
Each field in the generated schema will have a UNION type including NULL and the appropriate Avro schema type based on the input type, as listed below:

- A STRING field is mapped based on the domain on the source field. If no domain is specified, it is mapped to the STRING primitive type. If a domain is specified, it is mapped to an ENUM having the same set of symbols as the domain.
- A date field is encoded using DateValued#asEpochDays().
- A time field is encoded using TimeValued#asDayMillis().
- A timestamp field is encoded as described by TimestampValued.
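For instance, given an input record with an INT field, a DOUBLE field, and a STRING field carrying a domain, the generated schema would look roughly like the following. All field names, the record name, and the enum symbols here are hypothetical; only the shape (nullable UNION per field, ENUM from the domain) reflects the rules above.

```json
{
  "type": "record",
  "name": "output",
  "fields": [
    {"name": "id",     "type": ["null", "int"]},
    {"name": "amount", "type": ["null", "double"]},
    {"name": "status", "type": ["null",
      {"type": "enum", "name": "status_values",
       "symbols": ["NEW", "SHIPPED", "CLOSED"]}]}
  ]
}
```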
| Constructor | Description |
|---|---|
| WriteAvro() | Writes an empty target with default settings. |
| WriteAvro(boolean includeDoneSignal) | Writes an empty target with default settings, optionally providing a port for signaling completion of the write. |
| WriteAvro(ByteSink target, WriteMode mode) | Writes to the specified target sink in the given mode. |
| WriteAvro(Path path, WriteMode mode) | Writes to the specified path in the given mode. |
| WriteAvro(String path, WriteMode mode) | Writes to the specified path in the given mode. |
| Modifier and Type | Method and Description |
|---|---|
| protected DataFormat | computeFormat(CompositionContext ctx) - Determines the data format for the target. |
| Compression | getCompression() - Gets the currently specified compression method for writing data blocks. |
| org.apache.avro.Schema | getSchema() - Gets the currently specified schema for encoding the output. |
| void | setCompression(Compression codec) - Sets the compression method to use on data blocks. |
| void | setSchema(org.apache.avro.Schema schema) - Sets the schema to use to encode the output. |
| void | setSchema(String schemaFile) - Sets the schema to use to encode the output with the JSON-serialized schema in the specified file. |
Methods inherited from class AbstractWriter:
compose, getFormatOptions, getInput, getMode, getSaveMetadata, getTarget, getWriteBuffer, getWriteOnClient, getWriteSingleSink, isIgnoreSortOrder, setFormatOptions, setIgnoreSortOrder, setMode, setSaveMetadata, setTarget, setTarget, setTarget, setWriteBuffer, setWriteOnClient, setWriteSingleSink

Methods inherited from parent operator classes:
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError

Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from implemented interfaces:
disableParallelism, getInputPorts, getOutputPorts
public WriteAvro()
Writes an empty target with default settings.
See Also: AbstractWriter.setTarget(ByteSink)
public WriteAvro(boolean includeDoneSignal)
Writes an empty target with default settings, optionally providing a port for signaling completion of the write.
Parameters:
includeDoneSignal - indicates whether a done signal port should be created
See Also: AbstractWriter.setTarget(ByteSink)
public WriteAvro(String path, WriteMode mode)
Writes to the specified path in the given mode. If the writer is parallelized, this is interpreted as a directory in which each partition will write a fragment of the entire input stream. Otherwise, it is interpreted as the file to write.
Parameters:
path - the path to which to write
mode - how to handle existing files

public WriteAvro(Path path, WriteMode mode)
Writes to the specified path in the given mode. If the writer is parallelized, this is interpreted as a directory in which each partition will write a fragment of the entire input stream. Otherwise, it is interpreted as the file to write.
Parameters:
path - the path to which to write
mode - how to handle existing files

public WriteAvro(ByteSink target, WriteMode mode)
Writes to the specified target sink in the given mode. The writer can only be parallelized if the sink is fragmentable. In this case, each partition will be written as an independent sink. Otherwise, the writer will run non-parallel.
Parameters:
target - the sink to which to write
mode - how to handle an existing sink

public org.apache.avro.Schema getSchema()
Gets the currently specified schema for encoding the output.
public void setSchema(org.apache.avro.Schema schema)
Sets the schema to use to encode the output. A schema does not have to be provided, in which case one will automatically be generated from the input data.
Parameters:
schema - the Avro schema to use when writing

public void setSchema(String schemaFile) throws IOException
Sets the schema to use to encode the output with the JSON-serialized schema in the specified file.
Parameters:
schemaFile - the name of the file containing the serialized Avro schema
Throws:
IOException
public Compression getCompression()
Gets the currently specified compression method for writing data blocks.

public void setCompression(Compression codec)
Sets the compression method to use on data blocks.
Parameters:
codec - the compression to use in the output file

protected DataFormat computeFormat(CompositionContext ctx)
Description copied from class: AbstractWriter
Determines the data format for the target. This format is used by the WriteSink operator. If an implementation supports schema discovery, it must be performed in this method.
Specified by:
computeFormat in class AbstractWriter
Parameters:
ctx - the composition context for the current invocation of AbstractWriter.compose(CompositionContext)
Copyright © 2020 Actian Corporation. All rights reserved.