Interface DataFormat

  • All Known Implementing Classes:
    ARFFDataFormat, AvroFormat, BinaryFormat, DelimitedTextFormat, FixedTextFormat, JSONFormat, LogDataFormat, MDFFormat, ORCFormat, ParquetFormat

    public interface DataFormat
    Describes the record format of external data, such as in a file. A DataFormat object provides the necessary information for reading and writing external data, converting it to and from records in a dataflow graph.

    Many formats are predefined in the library; an implementation is only required if a new format needs to be defined. Normally, it is not necessary to work directly with formats. Instead, operators are provided which hide the DataFormat object and present a view more appropriate to the specific format. Examples of this technique are the ReadDelimitedText and WriteDelimitedText operators.
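The role a format object plays can be illustrated with a small, self-contained sketch. Every type below (`MiniFormat`, `MiniDelimitedFormat`, `roundTrip`) is a hypothetical stand-in invented for this example; none of them is part of the library, and the real DataFormat, DataParser, and DataFormatter types are richer than this. The sketch only shows the idea of one object that knows the record layout and can convert external text to a record and back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-ins for illustration only.
// These are NOT the library's DataFormat, DataParser, or DataFormatter types.
public class FormatSketch {

    /** A toy format: knows its field names and how to parse and format one record. */
    interface MiniFormat {
        List<String> fieldNames();
        List<String> parse(String externalLine);
        String format(List<String> record);
    }

    /** A toy delimited-text format: one record per line, fields separated by a delimiter. */
    static final class MiniDelimitedFormat implements MiniFormat {
        private final List<String> fieldNames;
        private final String delimiter;

        MiniDelimitedFormat(List<String> fieldNames, String delimiter) {
            this.fieldNames = new ArrayList<>(fieldNames);
            this.delimiter = delimiter;
        }

        @Override
        public List<String> fieldNames() {
            return fieldNames;
        }

        @Override
        public List<String> parse(String externalLine) {
            // Pattern.quote treats the delimiter literally (e.g. "|");
            // the -1 limit keeps trailing empty fields.
            String[] parts = externalLine.split(Pattern.quote(delimiter), -1);
            if (parts.length != fieldNames.size()) {
                throw new IllegalArgumentException(
                    "expected " + fieldNames.size() + " fields but found " + parts.length);
            }
            return Arrays.asList(parts);
        }

        @Override
        public String format(List<String> record) {
            return String.join(delimiter, record);
        }
    }

    /** Parses a line into a record and formats it back to external text. */
    static String roundTrip(MiniFormat format, String line) {
        return format.format(format.parse(line));
    }
}
```

In this toy version the same object serves both directions, which mirrors the description above of a format supplying what is needed for both reading and writing.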

    See Also:
    ReadSource, WriteSink
    • Method Detail

      • getType

        RecordTokenType getType()
        Gets the record type associated with the format. Records produced by the associated parser or consumed by the associated formatter will be of this type.

        For many formats, this may be derived from a schema object describing the format layout.

        Returns:
        the format's record type
      • getMetadata

        FileMetadata getMetadata()
        Gets the metadata associated with the format. Records produced by the associated parser or consumed by the associated formatter will use this metadata.
        Returns:
        the format's metadata
      • setMetadata

        void setMetadata(FileMetadata metadata)
        Sets the metadata associated with the format.
      • readMetadata

        FileMetadata readMetadata(FileClient fileClient,
                                  ByteSource source)
        Reads the metadata associated with the format.
        Parameters:
        fileClient - client used to read the file
        source - location of the files
        Returns:
        the metadata read from the source
      • writeMetadata

        void writeMetadata(FileMetadata metadata,
                           FileClient fileClient,
                           ByteSink target)
        Writes the provided metadata associated with the format.
        Parameters:
        metadata - the metadata to write
        fileClient - client used to write the file
        target - location of the files
      • createParser

        DataFormat.DataParser createParser(ParsingOptions options)
        Creates a new parser for the format using the specified parsing options.
        Parameters:
        options - parsing options to use
        Returns:
        a new parser for reading external data
      • createWriter

        DataFormat.DataFormatter createWriter(FormattingOptions options)
        Creates a new formatter for the format using the specified formatting options.
        Parameters:
        options - formatting options to use
        Returns:
        a new formatter for writing external data
      • isSplittable

        boolean isSplittable()
        Indicates if the format supports parsing of subsections of a file.

        A format should only return true if it can, at least in some situations, support this sort of parsing. If a format requires reading the entire file, it must return false.

        If a format is not splittable, a single file in that format cannot be divided and parsed in parallel. Separate files can still be parsed in parallel with one another, as when reading the contents of a directory or using a file globbing pattern.

        Returns:
        true if the format supports parsing only a portion of the file, false otherwise
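The effect of splittability on parallel reads can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's scheduling logic: `SplitPlanner`, `Split`, and `planSplits` are hypothetical names, and the `splittable` argument stands in for the value a format would return from `isSplittable()`. A splittable file is cut into byte ranges of roughly the target size, while a non-splittable file always yields a single range covering the whole file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; SplitPlanner and Split are not library types.
public class SplitPlanner {

    /** A byte range [offset, offset + length) within one file. */
    static final class Split {
        final String file;
        final long offset;
        final long length;

        Split(String file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Plans the units of parallel work for a single file.
     *
     * @param splittable stand-in for the result of a format's isSplittable()
     */
    static List<Split> planSplits(String file, long fileSize,
                                  long targetSplitSize, boolean splittable) {
        if (targetSplitSize <= 0) {
            throw new IllegalArgumentException("targetSplitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        // A non-splittable (or small) file is read as one unit.
        if (!splittable || fileSize <= targetSplitSize) {
            splits.add(new Split(file, 0, fileSize));
            return splits;
        }
        // Otherwise cut the file into consecutive ranges; the last may be shorter.
        for (long offset = 0; offset < fileSize; offset += targetSplitSize) {
            long length = Math.min(targetSplitSize, fileSize - offset);
            splits.add(new Split(file, offset, length));
        }
        return splits;
    }
}
```

Under these assumptions, a 250-byte file with a 100-byte target produces three ranges when the format is splittable and one range when it is not. A real record-oriented parser would additionally need to align each range to record boundaries, which this sketch does not attempt.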