Class ReadStagingDataset

  • All Implemented Interfaces:
    LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>

    public class ReadStagingDataset
    extends AbstractReader
    Reads a sequence of records previously staged to disk. Staged data sets are stored in a compact binary format, which makes them more efficient to read than text files. If the same data must be read multiple times, converting it to a staged data set first can yield significant savings.

    The staged data format is not splittable, so a single file cannot be divided among parallel readers. To obtain parallelism, perform parallel writes to create a set of files; reads of the multiple files will be fully parallel.
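
The idea that the file, not a byte range within it, is the unit of parallelism can be illustrated with a minimal standard-library sketch. It does not use the staged data format or any DataFlow API; `PerFileParallelism`, `readWhole`, and `totalBytes` are hypothetical names, and plain byte arrays stand in for staged records.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PerFileParallelism {
    // Hypothetical stand-in for reading one unsplittable file:
    // a single task consumes the whole file from start to finish.
    static long readWhole(Path file) {
        try {
            return Files.readAllBytes(file).length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Each file is handled by exactly one task, so the achievable
    // parallelism can never exceed the number of files.
    static long totalBytes(List<Path> files) {
        return files.parallelStream()
                .mapToLong(PerFileParallelism::readWhole)
                .sum();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("staged-parts");
        List<Path> parts = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            // Each "writer" produces its own part file (10 bytes each here).
            parts.add(Files.write(dir.resolve("part-" + i), new byte[10]));
        }
        System.out.println(totalBytes(parts));
    }
}
```

With a single output file, the same code would degrade to one sequential read, which is why writing several files up front matters.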

    See Also:
    WriteStagingDataset
    • Constructor Detail

      • ReadStagingDataset

        public ReadStagingDataset()
        Reads an empty source with default settings. The source must be set before execution or an error will be raised.
        See Also:
        AbstractReader.setSource(ByteSource)
      • ReadStagingDataset

        public ReadStagingDataset(String pattern)
        Reads all paths matching the specified pattern as staged data using default options. Any matching path which is a directory is replaced with all files in the directory; this expansion is not applied recursively.
        Parameters:
        pattern - a path-matching pattern
        See Also:
        FileClient.matchPaths(String)
      • ReadStagingDataset

        public ReadStagingDataset(Path path)
        Reads the file specified by the path as staged data using default options. If the path refers to a directory, all files in the directory are read; this expansion is not applied recursively.
        Parameters:
        path - the path to read
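
The one-level directory expansion described for the pattern and path constructors can be pictured with the following sketch. It is an illustration of the documented rule using only `java.nio.file`, not the library's implementation; `ShallowExpansion` and `expand` are hypothetical names, and filtering to regular files is an assumption about how subdirectories are treated.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ShallowExpansion {
    // Hypothetical helper: a plain file maps to itself, while a directory
    // maps to the files directly inside it. Nested directories are not
    // descended into (assumption: they are simply skipped).
    static List<Path> expand(Path path) throws IOException {
        if (!Files.isDirectory(path)) {
            return List.of(path);
        }
        // Files.list only returns direct children; it does not recurse.
        try (Stream<Path> children = Files.list(path)) {
            return children
                    .filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("staged");
        Files.createFile(dir.resolve("part-0"));
        Files.createFile(dir.resolve("part-1"));
        Path nested = Files.createDirectory(dir.resolve("nested"));
        Files.createFile(nested.resolve("part-2"));

        // Lists part-0 and part-1 only; nested/part-2 is not picked up.
        System.out.println(expand(dir));
    }
}
```

Under this rule, data staged into nested subdirectories needs either a pattern that matches the nested files directly or one reader per directory.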
      • ReadStagingDataset

        public ReadStagingDataset(ByteSource source)
        Reads the specified data source using default options.
        Parameters:
        source - the data source to read
    • Method Detail

      • discoverMetadata

        public DatasetMetadata discoverMetadata(FileClient client)
        Gets the metadata for the currently configured data source.
        Parameters:
        client - the file client
        Returns:
        the metadata of the source