Class ReadJSON

  • All Implemented Interfaces:
    LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>

    public class ReadJSON
    extends AbstractTextReader
    The ReadJSON operator reads a JSON file of key-value pairs or an array of objects as record tokens. It supports the JSON Lines format as described at http://jsonlines.org/. Text in JSON Lines format has a single JSON record per line, with records separated by a newline character.
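
    The one-record-per-line layout can be illustrated with a minimal sketch using only the JDK. The class and method names below are hypothetical and are not part of the ReadJSON API; the operator itself delegates parsing to Jackson.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the ReadJSON API.
final class JsonLinesSketch {
    /** Splits JSON Lines text into one raw JSON string per record, skipping blank lines. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        // Records are separated by '\n'; a preceding '\r' is tolerated.
        for (String line : text.split("\r?\n")) {
            if (!line.trim().isEmpty()) {
                records.add(line);
            }
        }
        return records;
    }
}
```

    Each returned string is a complete JSON object that can be parsed independently of its neighbors.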

    In JSON it is expected that all field keys start and end with a delimiter. A double quote (") is typically used as the field delimiter. However, the allowSingleQuotes property may be enabled to avoid parsing errors when single quotes are used instead. This operator uses the Jackson JSON parsing library to parse fields.

    The reader may optionally specify a RecordTextSchema to provide parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed via the API provided. StructuredSchemaReader provides support for reading in Pervasive DataIntegrator structured schema descriptors (.schema files) for use with readers. Because JSON text has explicit field markers, it is also possible to perform automated discovery of the schema based on the contents of the file. The reader provides a pluggable discovery mechanism to support this function. By default, the schema will be automatically discovered, with all fields assumed to be strings initially. Discovered fields are named using the key fields present.

    Normally, the output of the reader includes all parsed records in the file, both those with and without parsing errors. Fields which cannot be parsed are null valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
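
    The difference between keeping failed values as nulls and filtering them out can be sketched as follows. This is a simplified model with one integer field per record, written against the JDK only; the names are hypothetical and do not reflect how the operator is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the ReadJSON API.
final class FieldErrorSketch {
    /** Parses text as an int; returns null instead of throwing when it cannot be parsed. */
    static Integer parseIntOrNull(String text) {
        if (text == null) {
            return null;
        }
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Parses every value; when dropFailed is true, values that failed to parse are omitted. */
    static List<Integer> parseAll(List<String> values, boolean dropFailed) {
        List<Integer> out = new ArrayList<>();
        for (String v : values) {
            Integer parsed = parseIntOrNull(v);
            if (parsed != null || !dropFailed) {
                out.add(parsed);
            }
        }
        return out;
    }
}
```

    With dropFailed set to false the output keeps one entry per input, with null marking each failure; with dropFailed set to true only the successfully parsed entries remain.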

    JSON text does not contain a header row, since the keys in a JSON record define the fields in the resulting output. JSON text files can be parsed in parallel under "optimistic" assumptions: namely, that the data is well formatted in JSON Lines format.
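
    One way to see why the JSON Lines assumption enables parallel parsing: if every newline ends a record, a file can be cut at newline boundaries and each chunk parsed on its own. The sketch below shows that idea with plain JDK code. It is an assumption about the general technique, not a description of the operator's actual splitting logic, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the ReadJSON API.
final class NewlineSplitSketch {
    /**
     * Cuts data into chunks of at least targetSize bytes (except possibly the last),
     * moving each cut forward to just past the next '\n' so no line is divided.
     * Returns boundary offsets: chunk i covers [bounds[i], bounds[i + 1]).
     */
    static List<Integer> boundaries(byte[] data, int targetSize) {
        if (targetSize <= 0) {
            throw new IllegalArgumentException("targetSize must be positive");
        }
        List<Integer> bounds = new ArrayList<>();
        bounds.add(0);
        int pos = 0;
        while (pos < data.length) {
            int cut = Math.min(pos + targetSize, data.length);
            // Optimistic assumption: a newline always marks the end of a record.
            while (cut < data.length && data[cut - 1] != '\n') {
                cut++;
            }
            bounds.add(cut);
            pos = cut;
        }
        return bounds;
    }
}
```

    If records span multiple lines, a newline no longer guarantees a record boundary, and this kind of split would divide records.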

    • Field Detail

      • DEFAULT_ANALYSIS_DEPTH

        public static final int DEFAULT_ANALYSIS_DEPTH
        The default number of lines analyzed when performing schema discovery
        See Also:
        Constant Field Values
    • Constructor Detail

      • ReadJSON

        public ReadJSON()
        Reads an empty source with default settings. The source must be set before execution or an error will be raised.

        A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).

        See Also:
        AbstractReader.setSource(ByteSource)
      • ReadJSON

        public ReadJSON​(String pattern)
        Reads all paths matching the specified pattern using default options. Any matching path which is a directory is replaced with all files in the directory.

        A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).

        Parameters:
        pattern - a path-matching pattern
        See Also:
        FileClient.matchPaths(String)
      • ReadJSON

        public ReadJSON​(Path path)
        Reads the file specified by the path using default options. If the path refers to a directory, all files in the directory are read.

        A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).

        Parameters:
        path - the path to read
      • ReadJSON

        public ReadJSON​(ByteSource source)
        Reads the specified data source using default options.

        A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).

        Parameters:
        source - the data source to read
    • Method Detail

      • getAllowComments

        public boolean getAllowComments()
        Gets whether the parser should allow Java or C++ style comments within the source.
        Returns:
        the allowComments
      • setAllowComments

        public void setAllowComments​(boolean allowComments)
        Sets whether the parser should allow comments. If the JSON file to be parsed contains comments, this property should be set to true. When enabled, the parser allows Java or C++ style comments (both '/*' and '//' types) within parsed content.
        Parameters:
        allowComments - sets whether parser will allow comments or not
      • getAllowUnquotedFieldNames

        public boolean getAllowUnquotedFieldNames()
        Get whether the parser will allow use of unquoted field names.
        Returns:
        the allowUnquotedFieldNames
      • setAllowUnquotedFieldNames

        public void setAllowUnquotedFieldNames​(boolean allowUnquotedFieldNames)
        Sets whether the parser will allow use of unquoted field names. If unquoted field names are used in the source file, this property should be set to true.
        Parameters:
        allowUnquotedFieldNames - sets whether parser will allow use of unquoted field names
      • getAllowSingleQuotes

        public boolean getAllowSingleQuotes()
        Get whether parser will allow use of single quotes (apostrophe, character '\'') for quoting strings.
        Returns:
        the allowSingleQuotes
      • setAllowSingleQuotes

        public void setAllowSingleQuotes​(boolean allowSingleQuotes)
        Sets whether the parser will allow use of single quotes for quoting strings. If single quotes are used in the source, this property should be set to true.
        Parameters:
        allowSingleQuotes - sets whether parser will allow single quotes for quoting strings.
      • getAllowNumericLeadingZeros

        public boolean getAllowNumericLeadingZeros()
        Gets whether the parser will allow numbers to start with additional zeros.
        Returns:
        the allowNumericLeadingZeros
      • setAllowNumericLeadingZeros

        public void setAllowNumericLeadingZeros​(boolean allowNumericLeadingZeros)
        Sets whether the parser will allow numbers to start with additional zeros. If numbers in the source may have leading zeros, this property should be set to true.
        Parameters:
        allowNumericLeadingZeros - sets whether parser will allow leading zeros
      • getAllowUnquotedControlChars

        public boolean getAllowUnquotedControlChars()
        Get whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
        Returns:
        the allowUnquotedControlChars
      • setAllowUnquotedControlChars

        public void setAllowUnquotedControlChars​(boolean allowUnquotedControlChars)
        Sets whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
        Parameters:
        allowUnquotedControlChars - sets whether parser will allow unquoted control characters
      • getAllowBackslashEscapingAny

        public boolean getAllowBackslashEscapingAny()
        Gets whether the parser will allow quoting of all characters using the backslash quoting mechanism. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped.
        Returns:
        the allowBackslashEscapingAny
      • setAllowBackslashEscapingAny

        public void setAllowBackslashEscapingAny​(boolean allowBackslashEscapingAny)
        Sets whether the parser will allow quoting of all characters using the backslash quoting mechanism.
        Parameters:
        allowBackslashEscapingAny - sets whether backslash escaping is allowed.
      • getAllowNonNumericNumbers

        public boolean getAllowNonNumericNumbers()
        Gets whether the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values.
        Returns:
        the allowNonNumericNumbers
      • setAllowNonNumericNumbers

        public void setAllowNonNumericNumbers​(boolean allowNonNumericNumbers)
        Sets whether the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values.
        Parameters:
        allowNonNumericNumbers - sets whether non numeric numbers are allowed
      • getSchema

        public RecordTextSchema<?> getSchema()
        Gets the record schema of the JSON text source. If this returns null then schema discovery will be attempted.
        Returns:
        the record schema of the source
      • setSchema

        public void setSchema​(RecordTextSchema<?> schema)
        Sets the record schema expected in the JSON text source. Output records will have this schema, adjusted accordingly for any configured field selection.

        Setting a schema disables schema discovery.

        Parameters:
        schema - the expected record schema of the source
        See Also:
        AbstractReader.setSelectedFields(java.util.List)
      • getMultilineFormat

        public boolean getMultilineFormat()
        Gets whether or not the parser will allow JSON records which span multiple lines.
        Returns:
        the multilineFormat
      • setMultilineFormat

        public void setMultilineFormat​(boolean multilineFormat)
        Sets whether or not the parser will allow JSON records to span multiple lines.
        Parameters:
        multilineFormat - sets whether multiline JSON records are allowed
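        When records span multiple lines, line breaks no longer mark record boundaries, so boundaries have to be found from the JSON structure itself. The following sketch locates top-level objects by tracking brace depth and string state. It is an illustration using only the JDK, with hypothetical names; it is not the operator's implementation, which relies on Jackson.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the ReadJSON API.
final class MultilineRecordSketch {
    /** Returns each top-level {...} object found in text, in order of appearance. */
    static List<String> splitObjects(String text) {
        List<String> records = new ArrayList<>();
        int depth = 0;            // current brace nesting level
        int start = -1;           // index of the '{' that opened the current record
        boolean inString = false; // true while inside a double-quoted string
        boolean escaped = false;  // true when the previous character was a backslash
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    records.add(text.substring(start, i + 1));
                }
            }
        }
        return records;
    }
}
```

        Braces that appear inside string values are ignored, so a value such as "x}y" does not end a record early.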
      • getAnalysisDepth

        public int getAnalysisDepth()
        Gets the number of characters to read for schema discovery and structural analysis of the file.
        Returns:
        the number of characters which will be analyzed
      • setAnalysisDepth

        public void setAnalysisDepth​(int count)
        Sets the number of characters to read for performing schema discovery and structural analysis. This setting is ignored if no discovery is being performed. The default setting is 1M characters.
        Parameters:
        count - the number of characters to use to determine the schema and/or file structure
      • getSchemaDiscovery

        public TextRecordDiscoverer getSchemaDiscovery()
        Gets the schema discoverer to use on the JSON text source. If schema discovery is disabled, this will return null.
        Returns:
        the configured schema discoverer
      • setSchemaDiscovery

        public void setSchemaDiscovery​(TextRecordDiscoverer discoverer)
        Sets the schema discoverer to use against the JSON text source. Just prior to graph execution the source will be examined using the discoverer to determine the output schema for records. If reading multiple files, the schema is determined using the first file. Output records will have the discovered schema, adjusted accordingly for any configured field selection.

        By default, the schema will be discovered automatically. All fields are assumed to be strings and the field names are taken from the key values.

        Setting schema discovery overrides any previously configured schema.

        Parameters:
        discoverer - the schema discoverer to use.
        See Also:
        setSchema(RecordTextSchema), AbstractReader.setSelectedFields(java.util.List)
      • setSchemaDiscovery

        public void setSchemaDiscovery​(List<TypePattern> patterns)
        Enables schema discovery using the default discoverer extended with additional typing patterns. The additional patterns are in addition to, not in place of, the normal discovery typing patterns. If overriding default rules is desired, use setSchemaDiscovery(TextRecordDiscoverer) with an appropriately configured discoverer instead.
        Parameters:
        patterns - the additional patterns to apply at lower precedence than default patterns
        See Also:
        PatternBasedDiscovery
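        The general idea of ordered typing patterns, where earlier patterns take precedence and unmatched fields fall back to string, can be sketched as below. This is a simplified model using java.util.regex, not the PatternBasedDiscovery implementation; the type names, patterns, and class are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the ReadJSON API.
final class TypeDiscoverySketch {
    // Insertion order defines precedence: earlier entries win over later ones.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("int", Pattern.compile("[+-]?\\d+"));
        PATTERNS.put("double", Pattern.compile("[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?"));
        PATTERNS.put("boolean", Pattern.compile("true|false"));
    }

    /** Returns the first type whose pattern matches every non-empty sample, else "string". */
    static String inferType(List<String> samples) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            boolean sawValue = false;
            boolean allMatch = true;
            for (String s : samples) {
                if (s == null || s.isEmpty()) {
                    continue; // empty values are treated as nulls and do not affect typing
                }
                sawValue = true;
                if (!entry.getValue().matcher(s).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (sawValue && allMatch) {
                return entry.getKey();
            }
        }
        return "string";
    }
}
```

        In this model, patterns added at lower precedence would be appended to the end of the ordered map, so they are consulted only when none of the earlier patterns match.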
      • getDiscoveryNullIndicator

        public String getDiscoveryNullIndicator()
        Gets the text value used to represent null values by default in discovered schemas.
        Returns:
        the string indicating a null value
      • setDiscoveryNullIndicator

        public void setDiscoveryNullIndicator​(String value)
        Sets the text value used to represent null values by default in discovered schemas. By default, this is the empty string. If schema discovery is not enabled, this setting is ignored.
        Parameters:
        value - the string indicating a null value
      • getDiscoveryStringHandling

        public TextTypes.StringConversion getDiscoveryStringHandling()
        Gets the default behavior for processing string-valued types in discovered schemas.
        Returns:
        how string-valued types should be converted from text
      • setDiscoveryStringHandling

        public void setDiscoveryStringHandling​(TextTypes.StringConversion behavior)
        Sets the default behavior for processing string-valued types in discovered schemas. By default, whitespace is not trimmed from values and the empty string is treated as null. If schema discovery is not enabled, this setting is ignored.
        Parameters:
        behavior - indicates how string-valued types should be converted from text
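        The interaction between a null indicator and whitespace trimming can be sketched as follows. This is one plausible reading of the settings described above, written with plain JDK code; the method and its parameters are hypothetical and are not the TextTypes.StringConversion API.

```java
// Hypothetical helper for illustration; not part of the ReadJSON API.
final class NullIndicatorSketch {
    /**
     * Converts raw field text to a string value, optionally trimming whitespace first.
     * Returns null when the resulting text equals the null indicator.
     */
    static String convert(String raw, String nullIndicator, boolean trim) {
        String value = trim ? raw.trim() : raw;
        return value.equals(nullIndicator) ? null : value;
    }
}
```

        With the defaults described above (no trimming, empty string as the null indicator), an empty field becomes null while a field containing only a space is kept as a one-character string.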
      • autoConfigure

        public void autoConfigure​(FileClient ctx)
                           throws IOException
        Performs any configured discovery on the operator using the current source and applies the result to the configuration. After execution, the operator will be configured so as not to require any pre-execution analysis of the source. In particular:
        • A schema will be discovered, if necessary, and set; schema discovery will subsequently be disabled.
        Parameters:
        ctx - the authorization context to use for accessing the source
        Throws:
        IOException - if errors occur during discovery and analysis of the source
      • discoverSchema

        public RecordTextSchema<?> discoverSchema​(FileClient ctx)
        Runs schema discovery using the current configuration.
        Parameters:
        ctx - the authorization context to use for accessing the file
        Returns:
        the predicted schema of the source