- All Implemented Interfaces:
LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>
In JSON it is expected that all field keys start and end with a delimiter. A double quote (") is typically used as the field delimiter. However, the user may enable the allowSingleQuotes property to avoid parsing errors when single quotes are used instead. This operator uses the Jackson JSON parsing library to parse fields.
The reader may optionally be given a RecordTextSchema that provides parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. The schema can be constructed manually via the API provided. StructuredSchemaReader provides support for reading Pervasive DataIntegrator structured schema descriptors (.schema files) for use with readers. Because JSON text has explicit field markers, the schema can also be discovered automatically from the contents of the file. The reader provides a pluggable discovery mechanism to support this function. By default, the schema is discovered automatically, with all fields initially assumed to be strings. Discovered fields are named after the keys present in the records.
Normally, the output of the reader includes all parsed records in the file, both those with and without parsing errors. Fields that cannot be parsed are null-valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
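The error-handling behavior described above can be illustrated with a minimal, library-independent sketch. It does not use the DataRush API: the class `ParseErrorSketch`, the `Parsed` holder, and the `dropFailed` flag are hypothetical names introduced only for this illustration, and every field is assumed to be an integer for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParseErrorSketch {
    /** Hypothetical holder for one parsed record: typed values plus an error flag. */
    static final class Parsed {
        final Map<String, Object> values = new LinkedHashMap<>();
        boolean hadError;
    }

    /** Parses every field as an int; a field that fails to parse becomes null. */
    public static Parsed parseRecord(Map<String, String> raw) {
        Parsed out = new Parsed();
        for (Map.Entry<String, String> field : raw.entrySet()) {
            try {
                out.values.put(field.getKey(), Integer.valueOf(field.getValue().trim()));
            } catch (NumberFormatException e) {
                out.values.put(field.getKey(), null); // unparseable field is null-valued
                out.hadError = true;
            }
        }
        return out;
    }

    /** Keeps all records by default; drops failed records when dropFailed is true. */
    public static List<Parsed> read(List<Map<String, String>> rawRecords, boolean dropFailed) {
        List<Parsed> result = new ArrayList<>();
        for (Map<String, String> raw : rawRecords) {
            Parsed p = parseRecord(raw);
            if (!(dropFailed && p.hadError)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> good = new LinkedHashMap<>();
        good.put("id", "1");
        good.put("qty", "20");
        Map<String, String> bad = new LinkedHashMap<>();
        bad.put("id", "2");
        bad.put("qty", "n/a");
        List<Map<String, String>> raws = Arrays.asList(good, bad);
        System.out.println("kept (default): " + read(raws, false).size());
        System.out.println("kept (filtered): " + read(raws, true).size());
    }
}
```

In the sketch, the record with the unparseable `qty` value is still emitted by default, with `qty` set to null, and is removed only when filtering is requested.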
JSON text does not contain a header row, since the keys in a JSON record define the fields in the resulting output. JSON text files can be parsed in parallel under "optimistic" assumptions: namely, that the data is well formatted in JSON Lines format.
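To see why the JSON Lines assumption makes parallel parsing possible, consider the following sketch. It is not the operator's implementation; it only shows one common convention, assumed here, in which each worker owns the records whose first byte falls inside its byte range. The class `JsonLinesSplitSketch` and the method `readSplit` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class JsonLinesSplitSketch {
    /** Returns the non-empty lines whose first byte lies in [start, end). */
    public static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // A split that does not begin at offset 0 may land in the middle of a
        // record, so advance to the first position that follows a newline.
        if (pos > 0) {
            while (pos < data.length && data[pos - 1] != '\n') {
                pos++;
            }
        }
        List<String> records = new ArrayList<>();
        // Read whole lines; a line started inside the split is read to its end
        // even if it crosses the split boundary.
        while (pos < end && pos < data.length) {
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            String line = new String(data, pos, eol - pos, StandardCharsets.UTF_8).trim();
            if (!line.isEmpty()) {
                records.add(line);
            }
            pos = eol + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "{\"a\":1}\n{\"a\":2}\n{\"a\":3}\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 10));
        System.out.println(readSplit(data, 10, data.length));
    }
}
```

Because every record ends at a newline, two workers given adjacent byte ranges each see a disjoint set of complete records. This breaks down if a record spans multiple lines, which is why the assumption is called optimistic.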
-
Field Summary
Fields
Modifier and Type / Field / Description
static final int DEFAULT_ANALYSIS_DEPTH
    The default number of lines analyzed when performing schema discovery
Fields inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
encodingProps
Fields inherited from class com.pervasive.datarush.operators.io.AbstractReader
options, output
-
Constructor Summary
Constructors
Constructor / Description
ReadJSON()
    Reads an empty source with default settings.
ReadJSON
    Reads the file specified by the path using default options.
ReadJSON(ByteSource source)
    Reads the specified data source using default options.
ReadJSON
    Reads all paths matching the specified pattern using default options.
-
Method Summary
Modifier and Type / Method / Description
void autoConfigure(FileClient ctx)
    Performs any configured discovery on the operator using the current source and applies the result to configuration.
clone()
protected DataFormat computeFormat
    Determines the data format for the source.
discoverSchema(FileClient ctx)
    Run schema discovery using current configuration.
boolean getAllowBackslashEscapingAny()
    Get whether the parser will allow quoting of all characters using backslash quoting mechanism.
boolean getAllowComments()
    Get whether the parser should allow Java or C++ style comments within the source.
boolean getAllowNonNumericNumbers()
    Get whether the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values.
boolean getAllowNumericLeadingZeros()
    Get whether the parser will allow numbers to start with additional zeroes.
boolean getAllowSingleQuotes()
    Get whether the parser will allow use of single quotes (apostrophe, ') for quoting strings.
boolean getAllowUnquotedControlChars()
    Get whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
boolean getAllowUnquotedFieldNames()
    Get whether the parser will allow use of unquoted field names.
int getAnalysisDepth()
    Gets the number of characters to read for schema discovery and structural analysis of the file.
getDiscoveryNullIndicator()
    Gets the text value used to represent null values by default in discovered schemas.
getDiscoveryStringHandling()
    Gets the default behavior for processing string-valued types in discovered schemas.
boolean getMultilineFormat()
    Get whether or not the parser will allow JSON records which span multiple lines.
getSchema()
    Gets the record schema of the JSON text source.
getSchemaDiscovery()
    Gets the schema discoverer to use on the JSON text source.
void setAllowBackslashEscapingAny(boolean allowBackslashEscapingAny)
    Set whether the parser will allow quoting of all characters using backslash quoting mechanism.
void setAllowComments(boolean allowComments)
    Set whether the parser should allow comments or not.
void setAllowNonNumericNumbers(boolean allowNonNumericNumbers)
    Set whether the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values.
void setAllowNumericLeadingZeros(boolean allowNumericLeadingZeros)
    Sets whether the parser will allow numbers to start with additional zeroes.
void setAllowSingleQuotes(boolean allowSingleQuotes)
    Set whether the parser will allow use of single quotes for quoting strings.
void setAllowUnquotedControlChars(boolean allowUnquotedControlChars)
    Set whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
void setAllowUnquotedFieldNames(boolean allowUnquotedFieldNames)
    Set whether the parser will allow use of unquoted field names.
void setAnalysisDepth(int count)
    Sets the number of characters to read for performing schema discovery and structural analysis.
void setDiscoveryNullIndicator(String value)
    Sets the text value used to represent null values by default in discovered schemas.
void setDiscoveryStringHandling
    Sets the default behavior for processing string-valued types in discovered schemas.
void setMultilineFormat(boolean multilineFormat)
    Sets whether or not the parser will allow JSON records to span multiple lines.
void setSchema(RecordTextSchema<?> schema)
    Sets the record schema expected in the JSON text source.
void setSchemaDiscovery(TextRecordDiscoverer discoverer)
    Sets the schema discoverer to use against the JSON text source.
void setSchemaDiscovery(List<TypePattern> patterns)
    Enables schema discovery using the default discoverer extended with additional typing patterns.
Methods inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
getCharset, getCharsetName, getDecodeBuffer, getEncoding, getErrorAction, getReplacement, setCharset, setCharsetName, setDecodeBuffer, setEncoding, setErrorAction, setReplacement
Methods inherited from class com.pervasive.datarush.operators.io.AbstractReader
compose, getExtraFieldAction, getFieldErrorAction, getFieldLengthThreshold, getIncludeSourceInfo, getMissingFieldAction, getOutput, getParseOptions, getPessimisticSplitting, getReadBuffer, getReadOnClient, getRecordWarningThreshold, getSelectedFields, getSource, getSplitOptions, getUseMetadata, setExtraFieldAction, setFieldErrorAction, setFieldLengthThreshold, setIncludeSourceInfo, setMissingFieldAction, setParseErrorAction, setParseOptions, setPessimisticSplitting, setReadBuffer, setReadOnClient, setRecordWarningThreshold, setSelectedFields, setSelectedFields, setSource, setSource, setSource, setSplitOptions, setUseMetadata
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts
-
Field Details
-
DEFAULT_ANALYSIS_DEPTH
public static final int DEFAULT_ANALYSIS_DEPTH
The default number of lines analyzed when performing schema discovery.
-
-
Constructor Details
-
ReadJSON
public ReadJSON()
Reads an empty source with default settings. The source must be set before execution or an error will be raised.
A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).
-
ReadJSON
Reads all paths matching the specified pattern using default options. Any matching path which is a directory is replaced with all files in the directory.
A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).
Parameters:
pattern - a path-matching pattern
-
ReadJSON
Reads the file specified by the path using default options. If the path refers to a directory, all files in the directory are read.
A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).
Parameters:
path - the path to read
-
ReadJSON
Reads the specified data source using default options.
A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).
Parameters:
source - the data source to read
-
-
Method Details
-
clone
-
getAllowComments
public boolean getAllowComments()
Get whether the parser should allow Java or C++ style comments within the source.
Returns:
the allowComments setting
-
setAllowComments
public void setAllowComments(boolean allowComments)
Set whether the parser should allow comments. If the JSON file to be parsed contains comments, set this property to true. When enabled, the parser accepts Java or C++ style comments (both /* ... */ and // forms) within parsed content.
Parameters:
allowComments - sets whether the parser will allow comments
-
getAllowUnquotedFieldNames
public boolean getAllowUnquotedFieldNames()
Get whether the parser will allow use of unquoted field names.
Returns:
the allowUnquotedFieldNames setting
-
setAllowUnquotedFieldNames
public void setAllowUnquotedFieldNames(boolean allowUnquotedFieldNames)
Set whether the parser will allow use of unquoted field names. If unquoted field names are used in the source file, this property should be set to true.
Parameters:
allowUnquotedFieldNames - sets whether the parser will allow use of unquoted field names
-
getAllowSingleQuotes
public boolean getAllowSingleQuotes()
Get whether the parser will allow use of single quotes (apostrophe, ') for quoting strings.
Returns:
the allowSingleQuotes setting
-
setAllowSingleQuotes
public void setAllowSingleQuotes(boolean allowSingleQuotes)
Set whether the parser will allow use of single quotes for quoting strings. If single quotes are used in the source, this property should be set to true.
Parameters:
allowSingleQuotes - sets whether the parser will allow single quotes for quoting strings
-
getAllowNumericLeadingZeros
public boolean getAllowNumericLeadingZeros()
Get whether the parser will allow numbers to start with additional zeroes.
Returns:
the allowNumericLeadingZeros setting
-
setAllowNumericLeadingZeros
public void setAllowNumericLeadingZeros(boolean allowNumericLeadingZeros)
Sets whether the parser will allow numbers to start with additional zeroes. If numbers in the source may have leading zeroes, this property should be set to true.
Parameters:
allowNumericLeadingZeros - sets whether the parser will allow leading zeros
-
getAllowUnquotedControlChars
public boolean getAllowUnquotedControlChars()
Get whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
Returns:
the allowUnquotedControlChars setting
-
setAllowUnquotedControlChars
public void setAllowUnquotedControlChars(boolean allowUnquotedControlChars)
Set whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
Parameters:
allowUnquotedControlChars - sets whether the parser will allow unquoted control characters
-
getAllowBackslashEscapingAny
public boolean getAllowBackslashEscapingAny()
Get whether the parser will allow quoting of all characters using the backslash quoting mechanism. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped.
Returns:
the allowBackslashEscapingAny setting
-
setAllowBackslashEscapingAny
public void setAllowBackslashEscapingAny(boolean allowBackslashEscapingAny)
Set whether the parser will allow quoting of all characters using the backslash quoting mechanism.
Parameters:
allowBackslashEscapingAny - sets whether backslash escaping of any character is allowed
-
getAllowNonNumericNumbers
public boolean getAllowNonNumericNumbers()
Get whether the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values.
Returns:
the allowNonNumericNumbers setting
-
setAllowNonNumericNumbers
public void setAllowNonNumericNumbers(boolean allowNonNumericNumbers)
Set whether the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values.
Parameters:
allowNonNumericNumbers - sets whether non-numeric numbers are allowed
-
getSchema
Gets the record schema of the JSON text source. If this returns null, then schema discovery will be attempted.
Returns:
the record schema of the source
-
setSchema
Sets the record schema expected in the JSON text source. Output records will have this schema, adjusted accordingly for any configured field selection.
Setting a schema disables schema discovery.
Parameters:
schema - the expected record schema of the source
-
getMultilineFormat
public boolean getMultilineFormat()
Get whether or not the parser will allow JSON records which span multiple lines.
Returns:
the multilineFormat setting
-
setMultilineFormat
public void setMultilineFormat(boolean multilineFormat)
Sets whether or not the parser will allow JSON records to span multiple lines.
Parameters:
multilineFormat - sets whether multiline JSON records are allowed
-
getAnalysisDepth
public int getAnalysisDepth()
Gets the number of characters to read for schema discovery and structural analysis of the file.
Returns:
the number of characters which will be analyzed
-
setAnalysisDepth
public void setAnalysisDepth(int count)
Sets the number of characters to read for performing schema discovery and structural analysis. This setting is ignored if no discovery is being performed. The default setting is 1M characters.
Parameters:
count - the number of characters to use to determine the schema and/or file structure
-
getSchemaDiscovery
Gets the schema discoverer to use on the JSON text source. If schema discovery is disabled, this will return null.
Returns:
the configured schema discoverer
-
setSchemaDiscovery
Sets the schema discoverer to use against the JSON text source. Just prior to graph execution the source will be examined using the discoverer to determine the output schema for records. If reading multiple files, the schema is determined using the first file. Output records will have the discovered schema, adjusted accordingly for any configured field selection.
By default, the schema will be discovered automatically. All fields are assumed to be strings and the field names are taken from the key values.
Setting schema discovery overrides any previously configured schema.
Parameters:
discoverer - the schema discoverer to use
-
setSchemaDiscovery
Enables schema discovery using the default discoverer extended with additional typing patterns. The additional patterns supplement, rather than replace, the normal discovery typing patterns. To override the default rules, use setSchemaDiscovery(TextRecordDiscoverer) with an appropriately configured discoverer instead.
Parameters:
patterns - the additional patterns to apply at lower precedence than default patterns
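The precedence rule for additional patterns can be pictured with a small sketch. It does not use the DataRush TypePattern or TextRecordDiscoverer classes. The `Rule` class, the regular expressions, and the "first rule matching all samples wins" strategy are assumptions made purely for illustration; the actual discoverer may work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TypePatternSketch {
    /** Hypothetical stand-in for a typing pattern: a type name plus a regex. */
    static final class Rule {
        final String typeName;
        final Pattern regex;

        Rule(String typeName, String regex) {
            this.typeName = typeName;
            this.regex = Pattern.compile(regex);
        }
    }

    /** Places default rules first so they take precedence over additional ones. */
    public static List<Rule> combine(List<Rule> defaults, List<Rule> additional) {
        List<Rule> all = new ArrayList<>(defaults);
        all.addAll(additional);
        return all;
    }

    /** Returns the type of the first rule matching every sample, else "string". */
    public static String discoverType(List<Rule> rules, List<String> samples) {
        for (Rule rule : rules) {
            boolean allMatch = true;
            for (String sample : samples) {
                if (!rule.regex.matcher(sample).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return rule.typeName;
            }
        }
        return "string";
    }

    public static void main(String[] args) {
        List<Rule> defaults = Arrays.asList(new Rule("int", "-?\\d+"));
        List<Rule> extra = Arrays.asList(
                new Rule("zip", "\\d{5}"),
                new Rule("date", "\\d{4}-\\d{2}-\\d{2}"));
        List<Rule> rules = combine(defaults, extra);
        System.out.println(discoverType(rules, Arrays.asList("12345", "90210")));
        System.out.println(discoverType(rules, Arrays.asList("2020-01-02")));
    }
}
```

In this sketch a value such as "12345" is typed by the default integer rule even though the additional "zip" rule also matches, which mirrors the statement that additional patterns apply at lower precedence. Overriding a default would require replacing the rule list itself, analogous to supplying a custom discoverer.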
-
getDiscoveryNullIndicator
Gets the text value used to represent null values by default in discovered schemas.
Returns:
the string indicating a null value
-
setDiscoveryNullIndicator
Sets the text value used to represent null values by default in discovered schemas. By default, this is the empty string. If schema discovery is not enabled, this setting is ignored.
Parameters:
value - the string indicating a null value
-
getDiscoveryStringHandling
Gets the default behavior for processing string-valued types in discovered schemas.
Returns:
how string-valued types should be converted from text
-
setDiscoveryStringHandling
Sets the default behavior for processing string-valued types in discovered schemas. By default, whitespace is not trimmed from values and the empty string is treated as null. If schema discovery is not enabled, this setting is ignored.
Parameters:
behavior - indicates how string-valued types should be converted from text
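The interaction between trimming and the null indicator can be sketched in a few lines. This is an illustration only and does not use the DataRush string-handling types: `StringHandlingSketch` and `convert` are hypothetical names, and the choice to trim before comparing against the null indicator is an assumption.

```java
final class StringHandlingSketch {
    /**
     * Converts raw field text to a string value.
     * Returns null when the (optionally trimmed) text equals the null indicator.
     */
    public static String convert(String text, boolean trim, String nullIndicator) {
        String value = trim ? text.trim() : text;
        return value.equals(nullIndicator) ? null : value;
    }

    public static void main(String[] args) {
        // Documented defaults: no trimming, empty string treated as null.
        System.out.println(convert("", false, ""));
        System.out.println("[" + convert("  ", false, "") + "]");
        // With trimming enabled, whitespace-only text also becomes null.
        System.out.println(convert("  ", true, ""));
    }
}
```

Under the documented defaults, a whitespace-only value is preserved as a string, while an empty value becomes null. Enabling trimming or choosing a different null indicator (for example "NA") changes which inputs map to null.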
-
computeFormat
Description copied from class: AbstractReader
Determines the data format for the source. The returned format is used during composition to construct a ReadSource operator. If an implementation supports schema discovery, it must be performed in this method.
Specified by:
computeFormat in class AbstractReader
Parameters:
ctx - the composition context for the current invocation of AbstractReader.compose(CompositionContext)
Returns:
the source format to use
-
autoConfigure
Performs any configured discovery on the operator using the current source and applies the result to configuration. After execution, the operator will be configured so as not to require any pre-execution analysis of the source. In particular:
- A schema will be discovered, if necessary, and set; schema discovery will subsequently be disabled.
Parameters:
ctx - the authorization context to use for accessing the source
Throws:
IOException - if errors occur during discovery and analysis of the source
-
discoverSchema
Runs schema discovery using the current configuration.
Parameters:
ctx - the authorization context to use for accessing the file
Returns:
the predicted schema of the source
-