Class WriteStagingDataset

  • All Implemented Interfaces:
    LogicalOperator, RecordSinkOperator, SinkOperator<RecordPort>

    public final class WriteStagingDataset
    extends AbstractWriter
    Writes a sequence of records to disk in an internal format for staged data. Staged data sets are more efficient than text files because they are stored in a compact binary format. If a set of data must be read multiple times, converting it into a staged data set first can yield significant savings.

    It is generally best to perform parallel writes, creating multiple files. This allows reads to be parallelized effectively, as the staging format is not splittable.
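
    The reasoning above can be illustrated with a minimal sketch that uses only plain JDK I/O. It does not use the staging format or the operator API: the class name, the `part-N.bin` file naming, and the fixed-width `long` encoding are all assumptions made for illustration. Each fragment file must be read from start to end by a single reader, so the number of fragments bounds how many readers can work at once.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: plain JDK I/O, not the staging format or the operator API.
class FragmentedWriteSketch {

    // Distributes the records round-robin over `fragments` binary files under dir.
    static List<Path> writeFragments(Path dir, List<Long> records, int fragments) {
        try {
            Files.createDirectories(dir);
            List<Path> files = new ArrayList<>();
            for (int p = 0; p < fragments; p++) {
                Path file = dir.resolve("part-" + p + ".bin"); // hypothetical naming
                try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
                    for (int i = p; i < records.size(); i += fragments) {
                        out.writeLong(records.get(i));
                    }
                }
                files.add(file);
            }
            return files;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one fragment sequentially until end of file; one reader per file.
    static List<Long> readFragment(Path file) {
        List<Long> values = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            while (true) {
                values.add(in.readLong());
            }
        } catch (EOFException end) {
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One task per fragment: read parallelism is bounded by the number of files.
    static List<Long> readAll(List<Path> files) {
        return files.parallelStream()
                .flatMap(f -> readFragment(f).stream())
                .sorted()
                .collect(Collectors.toList());
    }

    // Writes the records to a temporary directory and reads them back, sorted.
    static List<Long> roundTrip(List<Long> records, int fragments) {
        try {
            Path dir = Files.createTempDirectory("staging-sketch");
            return readAll(writeFragments(dir, records, fragments));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(List.of(5L, 3L, 9L, 1L, 7L), 3));
    }
}
```

    With a single output file, `readAll` would have only one task to run, which is the situation the recommendation above is meant to avoid.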

    See Also:
    ReadStagingDataset
    • Constructor Detail

      • WriteStagingDataset

        public WriteStagingDataset()
        Writes to an empty target with default settings. The target must be set before execution or an error will be raised.
        See Also:
        AbstractWriter.setTarget(ByteSink)
      • WriteStagingDataset

        public WriteStagingDataset(boolean provideDoneSignal)
        Writes to an empty target with default settings, optionally providing a port for signaling completion of the write. The target must be set before execution or an error will be raised.
        Parameters:
        provideDoneSignal - indicates whether a done signal port should be created
        See Also:
        AbstractWriter.setTarget(ByteSink)
      • WriteStagingDataset

        public WriteStagingDataset(String path,
                                   WriteMode mode)
        Writes to the specified path in the given mode, using default options.

        If the writer is parallelized, this is interpreted as a directory in which each partition will write a fragment of the entire input stream. Otherwise, it is interpreted as the file to write.

        Parameters:
        path - the path to which to write
        mode - how to handle existing files
      • WriteStagingDataset

        public WriteStagingDataset(Path path,
                                   WriteMode mode)
        Writes to the specified path in the given mode, using default options.

        If the writer is parallelized, this is interpreted as a directory in which each partition will write a fragment of the entire input stream. Otherwise, it is interpreted as the file to write.

        Parameters:
        path - the path to which to write
        mode - how to handle existing files
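
        The path interpretation described for the path-based constructors can be sketched as follows. This is not the library's implementation: the class name, the `part-N` fragment naming, and the use of `partitions > 1` as a stand-in for "the writer is parallelized" are assumptions for illustration only.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustration only: how one configured path could map to one or many targets.
class TargetPathSketch {

    // Returns the files that would be written for the given partition count.
    static List<Path> resolveTargets(Path path, int partitions) {
        if (partitions < 1) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        List<Path> targets = new ArrayList<>();
        if (partitions == 1) {
            // Non-parallel: the path itself is the file to write.
            targets.add(path);
            return targets;
        }
        // Parallel: the path is a directory holding one fragment per partition.
        for (int p = 0; p < partitions; p++) {
            targets.add(path.resolve("part-" + p)); // hypothetical fragment name
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(resolveTargets(Paths.get("staged"), 1));
        System.out.println(resolveTargets(Paths.get("staged"), 3));
    }
}
```

        A practical consequence is that the same path can name a file in one run and a directory in another, depending on whether the writer runs in parallel.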
      • WriteStagingDataset

        public WriteStagingDataset(ByteSink target,
                                   WriteMode mode)
        Writes to the specified target sink using default options.

        The writer can only be parallelized if the sink is fragmentable. In this case, each partition will be written as an independent sink. Otherwise, the writer will run non-parallel.

        Parameters:
        target - the sink to which to write
        mode - how to handle an existing sink
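
        The rule above (parallel only when the sink is fragmentable) amounts to a simple decision, sketched here. The `Sink` interface and `isFragmentable()` method are hypothetical stand-ins, not the library's `ByteSink` API.

```java
// Illustration only: Sink is a hypothetical stand-in, not the library's ByteSink.
class SinkParallelismSketch {

    interface Sink {
        boolean isFragmentable();
    }

    // A fragmentable sink gets one independent sink per partition;
    // any other sink forces the writer to run with a single partition.
    static int effectivePartitions(Sink sink, int requested) {
        if (requested < 1) {
            throw new IllegalArgumentException("requested must be positive");
        }
        return sink.isFragmentable() ? requested : 1;
    }

    public static void main(String[] args) {
        System.out.println(effectivePartitions(() -> true, 4));
        System.out.println(effectivePartitions(() -> false, 4));
    }
}
```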
    • Method Detail

      • getFormat

        public DatasetStorageFormat getFormat()
        Gets the data set format used to store data.
        Returns:
        the format to use when writing
      • getBlockSize

        public int getBlockSize()
        Gets the block size, in rows, used for encoding data.
        Returns:
        the size of encoded data blocks
      • setBlockSize

        public void setBlockSize(int blockSize)
        Sets the block size, in rows, used for encoding data. By default, this is 64 rows. This setting is of most importance for DatasetStorageFormat.COLUMNAR.

        Using larger values may increase efficiency, but at a cost of using more memory.

        Parameters:
        blockSize - the size of encoded data blocks
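
        To see why the block size matters most for a columnar layout, and why larger blocks cost more memory, consider this minimal sketch. It is not the staging encoder: the class name and the `long`-only rows are assumptions for illustration. Rows are buffered in groups of at most `blockSize` and each group is transposed so that values of the same column sit together.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a toy row-to-column block encoder, not the staging format.
class ColumnarBlockSketch {

    // Groups rows into blocks of at most blockSize rows and transposes each
    // block, so every block holds one array per column.
    static List<long[][]> toColumnBlocks(long[][] rows, int columns, int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[][]> blocks = new ArrayList<>();
        for (int start = 0; start < rows.length; start += blockSize) {
            int count = Math.min(blockSize, rows.length - start);
            // The buffer for one block grows linearly with blockSize.
            long[][] block = new long[columns][count];
            for (int r = 0; r < count; r++) {
                for (int c = 0; c < columns; c++) {
                    block[c][r] = rows[start + r][c];
                }
            }
            blocks.add(block);
        }
        return blocks;
    }

    public static void main(String[] args) {
        long[][] rows = new long[150][2];
        for (int i = 0; i < rows.length; i++) {
            rows[i][0] = i;
            rows[i][1] = i * 10L;
        }
        List<long[][]> blocks = toColumnBlocks(rows, 2, 64);
        int lastRows = blocks.get(blocks.size() - 1)[0].length;
        System.out.println(blocks.size() + " blocks, last holds " + lastRows + " rows");
    }
}
```

        With 150 rows and the default block size of 64, the sketch produces three blocks of 64, 64, and 22 rows. A larger block size yields fewer, longer column runs per block at the cost of a larger in-memory buffer.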
      • discoverMetadata

        public DatasetMetadata discoverMetadata(FileClient client)
        Gets the metadata for the currently configured data target.
        Parameters:
        client - the file client
        Returns:
        the metadata of the target