com.actian.dataflow.operators.io.parquet.ReadParquet

All Implemented Interfaces: LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>

public class ReadParquet extends AbstractReader

Reads data previously written in the Apache Parquet format by Apache Hive.

Parquet is a columnar file format for storing tabular data. It supports efficient compression and encoding schemes, and the compression scheme can be specified per column. Supported compression codecs include SNAPPY, GZIP, and LZO. Parquet is supported by open source projects such as Apache Hadoop (MapReduce), Apache Hive, and Impala.

DataFlow automatically determines the equivalent data types from Parquet; these become the output type of the reader. However, because Parquet and DataFlow support different data types, not all data in Parquet format can be read. Attempting to read data that cannot be represented in DataFlow raises an error.

Primitive Parquet types are mapped to DataFlow as indicated in the table below.

Parquet Type	DataFlow Type
BOOLEAN	BOOLEAN
DOUBLE	DOUBLE
FLOAT	FLOAT
INT32	INT
INT64	LONG
BINARY	STRING

Complex Parquet data types are not currently supported.
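
The mapping in the table above can be sketched as a simple lookup. This is a standalone illustration only, not DataFlow's implementation: the class ParquetTypeMappingSketch, its two enums, and the method toDataFlow are hypothetical names invented here. The sketch assumes that any Parquet type missing from the table (INT96 is used as the example) is rejected with an error, mirroring the behavior described above for data that cannot be represented in DataFlow.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-in types; these are NOT classes from DataFlow or the Parquet library.
class ParquetTypeMappingSketch {

    // Primitive Parquet types from the table, plus INT96 as an example of an unlisted type.
    enum ParquetPrimitive { BOOLEAN, DOUBLE, FLOAT, INT32, INT64, BINARY, INT96 }

    // DataFlow types named in the table.
    enum DataFlowType { BOOLEAN, DOUBLE, FLOAT, INT, LONG, STRING }

    private static final Map<ParquetPrimitive, DataFlowType> MAPPING =
            new EnumMap<>(ParquetPrimitive.class);

    static {
        MAPPING.put(ParquetPrimitive.BOOLEAN, DataFlowType.BOOLEAN);
        MAPPING.put(ParquetPrimitive.DOUBLE, DataFlowType.DOUBLE);
        MAPPING.put(ParquetPrimitive.FLOAT, DataFlowType.FLOAT);
        MAPPING.put(ParquetPrimitive.INT32, DataFlowType.INT);
        MAPPING.put(ParquetPrimitive.INT64, DataFlowType.LONG);
        MAPPING.put(ParquetPrimitive.BINARY, DataFlowType.STRING);
    }

    // Returns the mapped type, or throws for a type with no entry in the table
    // (assumption: unlisted types are treated as unsupported).
    static DataFlowType toDataFlow(ParquetPrimitive type) {
        DataFlowType mapped = MAPPING.get(type);
        if (mapped == null) {
            throw new IllegalArgumentException("No DataFlow equivalent for Parquet type " + type);
        }
        return mapped;
    }

    public static void main(String[] args) {
        System.out.println(toDataFlow(ParquetPrimitive.INT64)); // prints LONG
    }
}
```

Failing fast on an unmapped type keeps the output schema well defined: every field the reader emits has a known DataFlow type.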

Field Summary

Fields inherited from class com.pervasive.datarush.operators.io.AbstractReader
options, output
Constructor Summary

Constructors

Constructor	Description
ReadParquet()	Reads an empty source with default settings.
ReadParquet(Path path)	Reads the file specified by the path.
ReadParquet(ByteSource source)	Reads the specified data source using default options.
ReadParquet(String pattern)	Reads all paths matching the specified pattern using default options.

Method Summary

Modifier and Type	Method	Description
protected DataFormat	computeFormat(CompositionContext ctx)	Determines the data format for the source.
ParquetMetadata	discoverMetadata(FileClient client)	Gets the metadata for the currently configured data source.

Methods inherited from class com.pervasive.datarush.operators.io.AbstractReader
compose, getExtraFieldAction, getFieldErrorAction, getFieldLengthThreshold, getIncludeSourceInfo, getMissingFieldAction, getOutput, getParseOptions, getPessimisticSplitting, getReadBuffer, getReadOnClient, getRecordWarningThreshold, getSelectedFields, getSource, getSplitOptions, getUseMetadata, setExtraFieldAction, setFieldErrorAction, setFieldLengthThreshold, setIncludeSourceInfo, setMissingFieldAction, setParseErrorAction, setParseOptions, setPessimisticSplitting, setReadBuffer, setReadOnClient, setRecordWarningThreshold, setSelectedFields, setSelectedFields, setSource, setSource, setSource, setSplitOptions, setUseMetadata

Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts

Constructor Details
- ReadParquet
  
  public ReadParquet()
  
  Reads an empty source with default settings. The source must be set before execution or an error will be raised.
  See Also:
  
  AbstractReader.setSource(ByteSource)
- ReadParquet
  
  public ReadParquet(String pattern)
  
  Reads all paths matching the specified pattern using default options. Any matching path which is a directory is replaced with all files in the directory; this expansion is not recursive.
  Parameters:
  
  pattern - a path-matching pattern
  
  See Also:
  
  FileClient.matchPaths(String)
- ReadParquet
  
  public ReadParquet(Path path)
  
  Reads the file specified by the path. If the path refers to a directory, all files in the directory are read; subdirectories are not traversed.
  
  Parameters:
  
  path - the path to read
- ReadParquet
  
  public ReadParquet(ByteSource source)
  
  Reads the specified data source using default options.
  
  Parameters:
  
  source - the data source to read
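
The non-recursive directory expansion described for the pattern and path constructors can be illustrated with the JDK alone. This is a sketch of the semantics as documented above, not the code DataFlow runs: the class NonRecursiveExpansionSketch and the method expand are hypothetical names, and java.nio.file.Path is used here in place of DataFlow's own Path class.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; illustrates "directory is replaced with its files, not recursively".
class NonRecursiveExpansionSketch {

    // A directory expands to the regular files directly inside it;
    // any other path is returned as a single-element list.
    static List<Path> expand(Path path) throws IOException {
        List<Path> files = new ArrayList<>();
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(path)) {
                for (Path entry : entries) {
                    // Subdirectories are skipped, so nested files are never visited.
                    if (Files.isRegularFile(entry)) {
                        files.add(entry);
                    }
                }
            }
        } else {
            files.add(path);
        }
        Collections.sort(files); // stable order for display; directory listing order is unspecified
        return files;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("parquet-demo");
        Files.createFile(root.resolve("a.parquet"));
        Path nested = Files.createDirectory(root.resolve("nested"));
        Files.createFile(nested.resolve("b.parquet"));
        // Only a.parquet is listed; nested/b.parquet is not reached.
        System.out.println(expand(root));
    }
}
```

Under these semantics, files in nested directories are read only if the nested directory itself is named by the path or matched by the pattern.
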
Method Details
- computeFormat
  
  protected DataFormat computeFormat(CompositionContext ctx)
  
  Description copied from class: AbstractReader
  
  Determines the data format for the source. The returned format is used during composition to construct a ReadSource operator. If an implementation supports schema discovery, it must be performed in this method.
  
  Specified by:
  
  computeFormat in class AbstractReader
  
  Parameters:
  
  ctx - the composition context for the current invocation of AbstractReader.compose(CompositionContext)
  
  Returns:
  
  the source format to use
- discoverMetadata
  
  public ParquetMetadata discoverMetadata(FileClient client)
  
  Gets the metadata for the currently configured data source.
  
  Parameters:
  
  client - the file client
  
  Returns:
  
  the metadata of the source
