Class ReadParquet

  • All Implemented Interfaces:
    LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>

    public class ReadParquet
    extends AbstractReader
    Reads data previously written in Apache Parquet format by Apache Hive.

    Parquet is a columnar file format for storing tabular data. It supports efficient compression and encoding schemes, and the compression scheme can be specified per column. Supported compression codecs include SNAPPY, GZIP, and LZO. Parquet is supported by open source projects such as Apache Hadoop (MapReduce), Apache Hive, and Impala.

    DataFlow automatically maps Parquet data types to their DataFlow equivalents; the result is the output type of the reader. However, because Parquet and DataFlow support different data types, not all data in Parquet format can be read. Attempting to read data that cannot be represented in DataFlow raises an error.

    Primitive Parquet types are mapped to DataFlow as indicated in the table below.

    Parquet Type    DataFlow Type
    BOOLEAN         BOOLEAN
    DOUBLE          DOUBLE
    FLOAT           FLOAT
    INT32           INT
    INT64           LONG
    BINARY          STRING

    Complex Parquet data types are not currently supported.
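
    The primitive type mapping can be illustrated with a small lookup. This is only a sketch of the table, not the reader's actual implementation: the class name ParquetTypeMappingSketch and the method toDataFlowType are hypothetical, types are represented as plain strings, and the choice of IllegalArgumentException for unmapped types is an assumption.

```java
import java.util.Map;

// Hypothetical illustration of the documented type table; not part of DataFlow.
class ParquetTypeMappingSketch {

    // Primitive Parquet type name -> DataFlow type name, copied from the table.
    private static final Map<String, String> PRIMITIVE_TYPES = Map.of(
            "BOOLEAN", "BOOLEAN",
            "DOUBLE", "DOUBLE",
            "FLOAT", "FLOAT",
            "INT32", "INT",
            "INT64", "LONG",
            "BINARY", "STRING");

    // Returns the DataFlow type name for a primitive Parquet type name.
    // Any name not in the table is rejected, mirroring the documented
    // behavior that unrepresentable data raises an error (the exception
    // type used here is an assumption).
    static String toDataFlowType(String parquetType) {
        String mapped = PRIMITIVE_TYPES.get(parquetType);
        if (mapped == null) {
            throw new IllegalArgumentException(
                    "No DataFlow equivalent for Parquet type: " + parquetType);
        }
        return mapped;
    }

    public static void main(String[] args) {
        // Look up one entry from the table.
        System.out.println("INT64 -> " + toDataFlowType("INT64"));
    }
}
```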

    • Constructor Detail

      • ReadParquet

        public ReadParquet()
        Reads an empty source with default settings. The source must be set before execution or an error will be raised.
        See Also:
        AbstractReader.setSource(ByteSource)
      • ReadParquet

        public ReadParquet(String pattern)
        Reads all paths matching the specified pattern using default options. Any matching path which is a directory is replaced with all files in the directory; this expansion is not recursive.
        Parameters:
        pattern - a path-matching pattern
        See Also:
        FileClient.matchPaths(String)
      • ReadParquet

        public ReadParquet(Path path)
        Reads the file specified by the path. If the path refers to a directory, all files in the directory are read; this read is not recursive into sub-directories.
        Parameters:
        path - the path to read
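
        The non-recursive directory expansion described above can be sketched with the standard java.nio.file API. This is an illustration of the behavior only: it uses java.nio.file.Path as a stand-in for the Path type in the constructor signature, and the class name NonRecursiveExpansionSketch and the method expand are hypothetical, not part of DataFlow.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of one-level directory expansion; not part of DataFlow.
class NonRecursiveExpansionSketch {

    // If the path is a directory, returns the regular files directly inside it
    // (sub-directories are skipped, not descended into). Otherwise returns the
    // path itself as a single-element list.
    static List<Path> expand(Path path) {
        if (!Files.isDirectory(path)) {
            return List.of(path);
        }
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(path)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Directory iteration order is unspecified, so sort for stable output.
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) {
        // Expand the path given on the command line, or the current directory.
        Path target = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : expand(target)) {
            System.out.println(file);
        }
    }
}
```

        A file nested in a sub-directory of the given path is not returned by this sketch, which matches the documented statement that the read does not recurse into sub-directories.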
      • ReadParquet

        public ReadParquet(ByteSource source)
        Reads the specified data source using default options.
        Parameters:
        source - the data source to read
    • Method Detail

      • discoverMetadata

        public ParquetMetadata discoverMetadata(FileClient client)
        Gets the metadata for the currently configured data source.
        Parameters:
        client - the file client
        Returns:
        the metadata of the source