public class ReadParquet extends AbstractReader
Parquet is a columnar file format for storing tabular data. It supports efficient compression and encoding schemes, and the compression scheme can be specified per column. Supported compression codecs include SNAPPY, GZIP, and LZO. Parquet is supported by open source projects such as Apache Hadoop (MapReduce), Apache Hive, and Impala.
DataFlow automatically determines the equivalent data types from Parquet, and the result becomes the output type of the reader. However, because Parquet and DataFlow support different data types, not all data in Parquet format can be read. Attempting to read data that cannot be represented in DataFlow raises an error.
Primitive Parquet types are mapped to DataFlow as indicated in the table below.
Parquet Type | DataFlow Type |
---|---|
BOOLEAN | BOOLEAN |
DOUBLE | DOUBLE |
FLOAT | FLOAT |
INT32 | INT |
INT64 | LONG |
BINARY | STRING |
Complex Parquet data types are not currently supported.
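The mapping in the table above can be sketched as a simple lookup. This is an illustration only, not DataFlow's implementation: the class name `ParquetTypeMappingSketch`, the method `toDataFlowType`, and the use of `IllegalArgumentException` for unmapped types are all assumptions made for the example. The reader's actual error type is not stated here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper mirroring the documented type table; not part of DataFlow.
final class ParquetTypeMappingSketch {
    private static final Map<String, String> PRIMITIVE_TYPES = new LinkedHashMap<>();

    static {
        // Parquet primitive type -> DataFlow type, copied from the table above.
        PRIMITIVE_TYPES.put("BOOLEAN", "BOOLEAN");
        PRIMITIVE_TYPES.put("DOUBLE", "DOUBLE");
        PRIMITIVE_TYPES.put("FLOAT", "FLOAT");
        PRIMITIVE_TYPES.put("INT32", "INT");
        PRIMITIVE_TYPES.put("INT64", "LONG");
        PRIMITIVE_TYPES.put("BINARY", "STRING");
    }

    // Returns the DataFlow type name for a Parquet primitive type name.
    // Throws for any type missing from the table (the exception type is an assumption).
    static String toDataFlowType(String parquetType) {
        String mapped = PRIMITIVE_TYPES.get(parquetType);
        if (mapped == null) {
            throw new IllegalArgumentException(
                "Parquet type cannot be represented in DataFlow: " + parquetType);
        }
        return mapped;
    }

    public static void main(String[] args) {
        System.out.println(toDataFlowType("INT64")); // prints "LONG"
        try {
            toDataFlowType("INT96"); // a Parquet type that is absent from the table
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Rejecting unknown types up front, rather than guessing a fallback, matches the documented behavior that unreadable data raises an error.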
options, output
Constructor | Description |
---|---|
ReadParquet() | Reads an empty source with default settings. |
ReadParquet(ByteSource source) | Reads the specified data source using default options. |
ReadParquet(Path path) | Reads the file specified by the path. |
ReadParquet(String pattern) | Reads all paths matching the specified pattern using default options. |
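The pattern syntax accepted by the `ReadParquet(String pattern)` constructor is not described in this section. As an illustration only, and assuming glob-style wildcards, the standard `java.nio.file.PathMatcher` shows how one pattern can select several Parquet files. The class `PatternSketch`, its `matching` method, and the file names are hypothetical. Note that `java.nio.file.Path` is unrelated to the `Path` type used by this reader.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of glob-style path matching using only the JDK.
final class PatternSketch {
    // Returns the candidates whose path matches the glob, in their original order.
    static List<String> matching(String glob, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                selected.add(candidate);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "data/part-0001.parquet",
            "data/part-0002.parquet",
            "data/readme.txt");
        // In JDK glob syntax, '*' matches within a single directory level.
        System.out.println(matching("data/part-*.parquet", files));
    }
}
```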
Modifier and Type | Method | Description |
---|---|---|
protected DataFormat | computeFormat(CompositionContext ctx) | Determines the data format for the source. |
ParquetMetadata | discoverMetadata(FileClient client) | Gets the metadata for the currently configured data source. |
compose, getExtraFieldAction, getFieldErrorAction, getFieldLengthThreshold, getIncludeSourceInfo, getMissingFieldAction, getOutput, getParseOptions, getPessimisticSplitting, getReadBuffer, getReadOnClient, getRecordWarningThreshold, getSelectedFields, getSource, getSplitOptions, getUseMetadata, setExtraFieldAction, setFieldErrorAction, setFieldLengthThreshold, setIncludeSourceInfo, setMissingFieldAction, setParseErrorAction, setParseOptions, setPessimisticSplitting, setReadBuffer, setReadOnClient, setRecordWarningThreshold, setSelectedFields, setSelectedFields, setSource, setSource, setSource, setSplitOptions, setUseMetadata
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
disableParallelism, getInputPorts, getOutputPorts
public ReadParquet()
See Also: AbstractReader.setSource(ByteSource)

public ReadParquet(String pattern)
Parameters:
pattern - a path-matching pattern
See Also: FileClient.matchPaths(String)

public ReadParquet(Path path)
Parameters:
path - the path to read

public ReadParquet(ByteSource source)
Parameters:
source - the data source to read

protected DataFormat computeFormat(CompositionContext ctx)
Description copied from class: AbstractReader
Determines the data format for the source, for use with the ReadSource operator. If an implementation supports schema discovery, it must be performed in this method.
This method implements or overrides computeFormat in class AbstractReader.
Parameters:
ctx - the composition context for the current invocation of AbstractReader.compose(CompositionContext)

public ParquetMetadata discoverMetadata(FileClient client)
Parameters:
client - the file client

Copyright © 2024 Actian Corporation. All rights reserved.