- java.lang.Object
-
- com.pervasive.datarush.operators.AbstractLogicalOperator
-
- com.pervasive.datarush.operators.CompositeOperator
-
- com.pervasive.datarush.operators.io.AbstractReader
-
- com.pervasive.datarush.operators.io.textfile.AbstractTextReader
-
- com.pervasive.datarush.operators.io.textfile.ReadJSON
-
- All Implemented Interfaces:
LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>
public class ReadJSON extends AbstractTextReader
The ReadJSON operator reads a JSON file of key-value pairs or an array of objects as record tokens. It supports the JSON Lines format described at http://jsonlines.org/. JSON Lines text has a single JSON record per line, with records separated by a newline character. In JSON, all field keys are expected to start and end with a delimiter, typically a double quote ("). However, the allowSingleQuotes property can be enabled to avoid parsing errors when single quotes are used instead. This operator uses the Jackson JSON parsing library to parse fields.
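To make the JSON Lines layout concrete, here is a minimal, JDK-only sketch of how such text breaks into one raw record per line. It is an illustration, not the operator's implementation: the class and method names are hypothetical, and skipping blank lines is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class JsonLinesSketch {
    /** Splits JSON Lines text into one raw JSON record per non-blank line. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        // Accept both "\n" and "\r\n" line endings.
        for (String line : text.split("\r?\n")) {
            // Assumption of this sketch: blank lines carry no record and are skipped.
            if (!line.trim().isEmpty()) {
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "{\"id\": 1, \"name\": \"a\"}\n{\"id\": 2, \"name\": \"b\"}\n";
        for (String record : splitRecords(text)) {
            System.out.println(record);
        }
    }
}
```

Each returned string would then be handed to a JSON parser as an independent record.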
A RecordTextSchema may optionally be specified to provide parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. A schema can be constructed manually via the provided API, and StructuredSchemaReader supports reading Pervasive DataIntegrator structured schema descriptors (.schema files) for use with readers. Because JSON text has explicit field markers, the schema can also be discovered automatically from the contents of the file, and the reader provides a pluggable discovery mechanism for this purpose. By default, the schema is discovered automatically, with all fields initially assumed to be strings. Discovered fields are named using the keys present in the records.
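The idea of naming discovered fields after the keys present can be sketched as follows. This is a simplified illustration under stated assumptions: records are taken as already parsed into string-valued maps, the field list is the first-seen-order union of keys across the sample, and KeyDiscoverySketch and discoverFieldNames are hypothetical names rather than part of the library.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class KeyDiscoverySketch {
    /** Returns the field names seen in the sampled records, in first-seen order. */
    static Set<String> discoverFieldNames(List<Map<String, String>> sample) {
        Set<String> names = new LinkedHashSet<>();
        for (Map<String, String> record : sample) {
            // Every key becomes a field; all fields start out string-typed.
            names.addAll(record.keySet());
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "1");
        first.put("name", "a");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "2");
        second.put("city", "Austin");
        // Prints the union of keys across both records.
        System.out.println(discoverFieldNames(List.of(first, second)));
    }
}
```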
Normally, the output of the reader includes all parsed records in the file, both those with and without parsing errors. Fields which cannot be parsed are null-valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
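The two behaviors described here, keeping a failed field as null versus filtering the failed record, can be sketched in plain Java. The sketch reduces each record to a single integer field for brevity; FieldErrorSketch, parseOrNull, and parseAll are hypothetical names, and the code does not reflect how the reader is implemented.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldErrorSketch {
    /** Parses text as an integer, yielding null when the text is not a valid integer. */
    static Integer parseOrNull(String text) {
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null; // the field could not be parsed, so it becomes null
        }
    }

    /** Parses every value; when dropFailed is true, failed values are omitted instead of kept as null. */
    static List<Integer> parseAll(List<String> values, boolean dropFailed) {
        List<Integer> out = new ArrayList<>();
        for (String value : values) {
            Integer parsed = parseOrNull(value);
            if (parsed != null || !dropFailed) {
                out.add(parsed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("1", "oops", "3");
        System.out.println(parseAll(raw, false)); // failed value kept as null
        System.out.println(parseAll(raw, true));  // failed value filtered out
    }
}
```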
JSON text does not contain a header row, since the keys in a JSON record define the fields in the resulting output. JSON text files can be parsed in parallel under "optimistic" assumptions: namely, that the data is well formatted in JSON Lines format.
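One way to see why the JSON Lines assumption enables parallel parsing: if every record starts right after a newline, a worker handed an arbitrary offset can locate the next record boundary on its own. The following JDK-only sketch shows that alignment step. It is an assumption-laden illustration, not the operator's actual splitting logic, and SplitAlignmentSketch and firstRecordStart are hypothetical names.

```java
final class SplitAlignmentSketch {
    /**
     * Returns the offset of the first record that starts at or after the given offset,
     * assuming every record begins at offset 0 or immediately after a newline.
     */
    static int firstRecordStart(String text, int offset) {
        if (offset <= 0) {
            return 0;
        }
        // If the character just before offset is a newline, a record starts exactly at offset;
        // otherwise skip ahead to the character after the next newline.
        int newline = text.indexOf('\n', offset - 1);
        return newline < 0 ? text.length() : newline + 1;
    }

    public static void main(String[] args) {
        String text = "{\"a\":1}\n{\"a\":2}\n{\"a\":3}\n";
        int middle = text.length() / 2; // an arbitrary split point inside the second record
        int boundary = firstRecordStart(text, middle);
        System.out.print("worker 1:\n" + text.substring(0, boundary));
        System.out.print("worker 2:\n" + text.substring(boundary));
    }
}
```

The assumption is "optimistic" because a newline inside a record, as in multiline JSON, would make this boundary search land in the middle of a record.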
-
-
Field Summary
Fields
Modifier and Type, Field, Description
static int
DEFAULT_ANALYSIS_DEPTH
The default number of lines analyzed when performing schema discovery.
-
Fields inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
encodingProps
-
Fields inherited from class com.pervasive.datarush.operators.io.AbstractReader
options, output
-
-
Constructor Summary
Constructors
Constructor, Description
ReadJSON()
Reads an empty source with default settings.
ReadJSON(Path path)
Reads the file specified by the path using default options.
ReadJSON(ByteSource source)
Reads the specified data source using default options.
ReadJSON(String pattern)
Reads all paths matching the specified pattern using default options.
-
Method Summary
Modifier and Type, Method, Description
void
autoConfigure(FileClient ctx)
Performs any configured discovery on the operator using the current source and applies the result to configuration.
ReadJSON
clone()
protected DataFormat
computeFormat(CompositionContext ctx)
Determines the data format for the source.
RecordTextSchema<?>
discoverSchema(FileClient ctx)
Run schema discovery using current configuration.
boolean
getAllowBackslashEscapingAny()
Get whether the parser will allow quoting of all characters using backslash quoting mechanism.
boolean
getAllowComments()
Get whether the parser should allow Java or C++ style comments within the source.
boolean
getAllowNonNumericNumbers()
Get whether the parser recognizes set of "Not a Number" (NaN) tokens as legal floating number values.
boolean
getAllowNumericLeadingZeros()
Get whether the parser will allow numbers to start with additional zeroes.
boolean
getAllowSingleQuotes()
Get whether parser will allow use of single quotes (apostrophe, character '\'') for quoting strings.
boolean
getAllowUnquotedControlChars()
Get whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
boolean
getAllowUnquotedFieldNames()
Get whether the parser will allow use of unquoted field names.
int
getAnalysisDepth()
Gets the number of characters to read for schema discovery and structural analysis of the file.
String
getDiscoveryNullIndicator()
Gets the text value used to represent null values by default in discovered schemas.
TextTypes.StringConversion
getDiscoveryStringHandling()
Gets the default behavior for processing string-valued types in discovered schemas.
boolean
getMultilineFormat()
Get whether or not the parser will allow JSON records which span multiple lines.
RecordTextSchema<?>
getSchema()
Gets the record schema of the JSON text source.
TextRecordDiscoverer
getSchemaDiscovery()
Gets the schema discoverer to use on the JSON text source.
void
setAllowBackslashEscapingAny(boolean allowBackslashEscapingAny)
Set if the parser will allow quoting of all characters using backslash quoting mechanism.
void
setAllowComments(boolean allowComments)
Set whether the parser should allow comments or not.
void
setAllowNonNumericNumbers(boolean allowNonNumericNumbers)
Set if the parser recognizes set of "Not a Number" (NaN) tokens as legal floating number values.
void
setAllowNumericLeadingZeros(boolean allowNumericLeadingZeros)
Sets whether the parser will allow numbers to start with additional zeroes.
void
setAllowSingleQuotes(boolean allowSingleQuotes)
Set whether the parser will allow use of single quotes for quoting strings.
void
setAllowUnquotedControlChars(boolean allowUnquotedControlChars)
Set if the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
void
setAllowUnquotedFieldNames(boolean allowUnquotedFieldNames)
Set whether the parser will allow use of unquoted field names.
void
setAnalysisDepth(int count)
Sets the number of characters to read for performing schema discovery and structural analysis.
void
setDiscoveryNullIndicator(String value)
Sets the text value used to represent null values by default in discovered schemas.
void
setDiscoveryStringHandling(TextTypes.StringConversion behavior)
Sets the default behavior for processing string-valued types in discovered schemas.
void
setMultilineFormat(boolean multilineFormat)
Sets whether or not the parser will allow JSON records to span multiple lines.
void
setSchema(RecordTextSchema<?> schema)
Sets the record schema expected in the JSON text source.
void
setSchemaDiscovery(TextRecordDiscoverer discoverer)
Sets the schema discoverer to use against the JSON text source.
void
setSchemaDiscovery(List<TypePattern> patterns)
Enables schema discovery using the default discoverer extended with additional typing patterns.
-
Methods inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
getCharset, getCharsetName, getDecodeBuffer, getEncoding, getErrorAction, getReplacement, setCharset, setCharsetName, setDecodeBuffer, setEncoding, setErrorAction, setReplacement
-
Methods inherited from class com.pervasive.datarush.operators.io.AbstractReader
compose, getExtraFieldAction, getFieldErrorAction, getFieldLengthThreshold, getIncludeSourceInfo, getMissingFieldAction, getOutput, getParseOptions, getPessimisticSplitting, getReadBuffer, getReadOnClient, getRecordWarningThreshold, getSelectedFields, getSource, getSplitOptions, getUseMetadata, setExtraFieldAction, setFieldErrorAction, setFieldLengthThreshold, setIncludeSourceInfo, setMissingFieldAction, setParseErrorAction, setParseOptions, setPessimisticSplitting, setReadBuffer, setReadOnClient, setRecordWarningThreshold, setSelectedFields, setSelectedFields, setSource, setSource, setSource, setSplitOptions, setUseMetadata
-
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts
-
-
-
-
Field Detail
-
DEFAULT_ANALYSIS_DEPTH
public static final int DEFAULT_ANALYSIS_DEPTH
The default number of lines analyzed when performing schema discovery.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
ReadJSON
public ReadJSON()
Reads an empty source with default settings. The source must be set before execution or an error will be raised.
A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).
- See Also:
AbstractReader.setSource(ByteSource)
-
ReadJSON
public ReadJSON(String pattern)
Reads all paths matching the specified pattern using default options. Any matching path which is a directory is replaced with all files in the directory.
A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).
- Parameters:
pattern
- a path-matching pattern
- See Also:
FileClient.matchPaths(String)
-
ReadJSON
public ReadJSON(Path path)
Reads the file specified by the path using default options. If the path refers to a directory, all files in the directory are read.
A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).
- Parameters:
path
- the path to read
-
ReadJSON
public ReadJSON(ByteSource source)
Reads the specified data source using default options.
A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema).
- Parameters:
source
- the data source to read
-
-
Method Detail
-
getAllowComments
public boolean getAllowComments()
Get whether the parser should allow Java or C++ style comments within the source.
- Returns:
- the allowComments
-
setAllowComments
public void setAllowComments(boolean allowComments)
Set whether the parser should allow comments or not. If the JSON file to be parsed contains comments, this property should be set to true. If enabled, the parser will allow use of Java or C++ style comments (both '/*' and '//' types) within parsed content.
- Parameters:
allowComments
- sets whether parser will allow comments or not
-
getAllowUnquotedFieldNames
public boolean getAllowUnquotedFieldNames()
Get whether the parser will allow use of unquoted field names.
- Returns:
- the allowUnquotedFieldNames
-
setAllowUnquotedFieldNames
public void setAllowUnquotedFieldNames(boolean allowUnquotedFieldNames)
Set whether the parser will allow use of unquoted field names. If unquoted field names are used in source file, this field should be set to true.
- Parameters:
allowUnquotedFieldNames
- sets whether parser will allow use of unquoted field names
-
getAllowSingleQuotes
public boolean getAllowSingleQuotes()
Get whether parser will allow use of single quotes (apostrophe, character '\'') for quoting strings.
- Returns:
- the allowSingleQuotes
-
setAllowSingleQuotes
public void setAllowSingleQuotes(boolean allowSingleQuotes)
Set whether the parser will allow use of single quotes for quoting strings. If single quotes are used in source, this field should be set to true.
- Parameters:
allowSingleQuotes
- sets whether parser will allow single quotes for quoting strings.
-
getAllowNumericLeadingZeros
public boolean getAllowNumericLeadingZeros()
Get whether the parser will allow numbers to start with additional zeroes.
- Returns:
- the allowNumericLeadingZeros
-
setAllowNumericLeadingZeros
public void setAllowNumericLeadingZeros(boolean allowNumericLeadingZeros)
Sets whether the parser will allow numbers to start with additional zeroes. If leading zeroes are allowed for numbers in the source, this field should be set to true.
- Parameters:
allowNumericLeadingZeros
- sets whether parser will allow leading zeros
-
getAllowUnquotedControlChars
public boolean getAllowUnquotedControlChars()
Get whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
- Returns:
- the allowUnquotedControlChars
-
setAllowUnquotedControlChars
public void setAllowUnquotedControlChars(boolean allowUnquotedControlChars)
Set if the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
- Parameters:
allowUnquotedControlChars
- sets whether parser will allow unquoted control characters
-
getAllowBackslashEscapingAny
public boolean getAllowBackslashEscapingAny()
Get whether the parser will allow quoting of all characters using backslash quoting mechanism. If not enabled, only characters that are explicitly listed by JSON specification can be escaped.
- Returns:
- the allowBackslashEscapingAny
-
setAllowBackslashEscapingAny
public void setAllowBackslashEscapingAny(boolean allowBackslashEscapingAny)
Set if the parser will allow quoting of all characters using backslash quoting mechanism.
- Parameters:
allowBackslashEscapingAny
- sets whether backslash escaping is allowed.
-
getAllowNonNumericNumbers
public boolean getAllowNonNumericNumbers()
Get whether the parser recognizes set of "Not a Number" (NaN) tokens as legal floating number values.
- Returns:
- the allowNonNumericNumbers
-
setAllowNonNumericNumbers
public void setAllowNonNumericNumbers(boolean allowNonNumericNumbers)
Set if the parser recognizes set of "Not a Number" (NaN) tokens as legal floating number values.
- Parameters:
allowNonNumericNumbers
- sets whether non numeric numbers are allowed
-
getSchema
public RecordTextSchema<?> getSchema()
Gets the record schema of the JSON text source. If this returns null, then schema discovery will be attempted.
- Returns:
- the record schema of the source
-
setSchema
public void setSchema(RecordTextSchema<?> schema)
Sets the record schema expected in the JSON text source. Output records will have this schema, adjusted accordingly for any configured field selection.
Setting a schema disables schema discovery.
- Parameters:
schema
- the expected record schema of the source
- See Also:
AbstractReader.setSelectedFields(java.util.List)
-
getMultilineFormat
public boolean getMultilineFormat()
Get whether or not the parser will allow JSON records which span multiple lines.
- Returns:
- the multilineFormat
-
setMultilineFormat
public void setMultilineFormat(boolean multilineFormat)
Sets whether or not the parser will allow JSON records to span multiple lines.
- Parameters:
multilineFormat
- sets whether multiline JSON records are allowed
-
getAnalysisDepth
public int getAnalysisDepth()
Gets the number of characters to read for schema discovery and structural analysis of the file.
- Returns:
- the number of characters which will be analyzed
-
setAnalysisDepth
public void setAnalysisDepth(int count)
Sets the number of characters to read for performing schema discovery and structural analysis. This setting is ignored if no discovery is being performed. The default setting is 1M characters.
- Parameters:
count
- the number of characters to use to determine the schema and/or file structure
-
getSchemaDiscovery
public TextRecordDiscoverer getSchemaDiscovery()
Gets the schema discoverer to use on the JSON text source. If schema discovery is disabled, this will return null.
- Returns:
- the configured schema discoverer
-
setSchemaDiscovery
public void setSchemaDiscovery(TextRecordDiscoverer discoverer)
Sets the schema discoverer to use against the JSON text source. Just prior to graph execution the source will be examined using the discoverer to determine the output schema for records. If reading multiple files, the schema is determined using the first file. Output records will have the discovered schema, adjusted accordingly for any configured field selection.
By default, the schema will be discovered automatically. All fields are assumed to be strings and the field names are taken from the key values.
Setting schema discovery overrides any previously configured schema.
- Parameters:
discoverer
- the schema discoverer to use.
- See Also:
setSchema(RecordTextSchema), AbstractReader.setSelectedFields(java.util.List)
-
setSchemaDiscovery
public void setSchemaDiscovery(List<TypePattern> patterns)
Enables schema discovery using the default discoverer extended with additional typing patterns. The additional patterns are in addition to, not in place of, the normal discovery typing patterns. If overriding default rules is desired, use setSchemaDiscovery(TextRecordDiscoverer) with an appropriately configured discoverer instead.
- Parameters:
patterns
- the additional patterns to apply at lower precedence than default patterns
- See Also:
PatternBasedDiscovery
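The precedence rule for additional patterns can be illustrated with a small JDK-only sketch. It types a single value for brevity, and everything in it is hypothetical: the Rule class stands in for a typing pattern, and the regexes are illustrative rather than the library's actual default discovery patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PatternPrecedenceSketch {
    /** Hypothetical stand-in for a typing pattern: a type name and the regex its values match. */
    static final class Rule {
        final String typeName;
        final Pattern regex;

        Rule(String typeName, String regex) {
            this.typeName = typeName;
            this.regex = Pattern.compile(regex);
        }
    }

    // Illustrative defaults only; these are not the library's actual discovery patterns.
    static final List<Rule> DEFAULTS = List.of(
            new Rule("int", "-?\\d+"),
            new Rule("double", "-?\\d+\\.\\d+"));

    // Extra rules supplied by the caller, consulted only after the defaults.
    static final List<Rule> ADDITIONAL = List.of(
            new Rule("zip", "\\d{5}"),
            new Rule("date", "\\d{4}-\\d{2}-\\d{2}"));

    /** Returns the type name of the first matching rule, trying defaults before additional rules. */
    static String inferType(String value, List<Rule> defaults, List<Rule> additional) {
        List<Rule> ordered = new ArrayList<>(defaults);
        ordered.addAll(additional);
        for (Rule rule : ordered) {
            if (rule.regex.matcher(value).matches()) {
                return rule.typeName;
            }
        }
        return "string"; // nothing matched, so keep the value as text
    }

    public static void main(String[] args) {
        System.out.println(inferType("12345", DEFAULTS, ADDITIONAL));      // default "int" wins over "zip"
        System.out.println(inferType("2024-01-15", DEFAULTS, ADDITIONAL)); // only the additional "date" matches
        System.out.println(inferType("hello", DEFAULTS, ADDITIONAL));      // falls back to "string"
    }
}
```

Because the defaults are tried first, an additional pattern only takes effect for values that no default pattern claims, which is why overriding a default requires a custom discoverer instead.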
-
getDiscoveryNullIndicator
public String getDiscoveryNullIndicator()
Gets the text value used to represent null values by default in discovered schemas.
- Returns:
- the string indicating a null value
-
setDiscoveryNullIndicator
public void setDiscoveryNullIndicator(String value)
Sets the text value used to represent null values by default in discovered schemas. By default, this is the empty string. If schema discovery is not enabled, this setting is ignored.
- Parameters:
value
- the string indicating a null value
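As a rough illustration of what a null indicator means, the sketch below maps any field text equal to the indicator to null and passes other text through unchanged. NullIndicatorSketch and fromText are hypothetical names, and exact string equality is an assumption of the sketch.

```java
final class NullIndicatorSketch {
    /** Returns null when the text equals the null indicator, otherwise the text itself. */
    static String fromText(String text, String nullIndicator) {
        return text.equals(nullIndicator) ? null : text;
    }

    public static void main(String[] args) {
        System.out.println(fromText("", ""));     // empty-string indicator, the documented default
        System.out.println(fromText("NA", "NA")); // custom indicator
        System.out.println(fromText("NA", ""));   // "NA" is ordinary text under the default
    }
}
```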
-
getDiscoveryStringHandling
public TextTypes.StringConversion getDiscoveryStringHandling()
Gets the default behavior for processing string-valued types in discovered schemas.
- Returns:
- how string-valued types should be converted from text
-
setDiscoveryStringHandling
public void setDiscoveryStringHandling(TextTypes.StringConversion behavior)
Sets the default behavior for processing string-valued types in discovered schemas. By default, whitespace is not trimmed from values and the empty string is treated as null. If schema discovery is not enabled, this setting is ignored.
- Parameters:
behavior
- indicates how string-valued types should be converted from text
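The default string handling described for this setting (no trimming, empty string treated as null) can be sketched with two flags. This is only an illustration of those two aspects: StringHandlingSketch and convert are hypothetical names and do not show how TextTypes.StringConversion is actually modeled.

```java
final class StringHandlingSketch {
    /** Converts field text to a string value, optionally trimming and mapping empty text to null. */
    static String convert(String text, boolean trim, boolean emptyAsNull) {
        String value = trim ? text.trim() : text;
        return (emptyAsNull && value.isEmpty()) ? null : value;
    }

    public static void main(String[] args) {
        // Documented defaults: whitespace is kept, empty string becomes null.
        System.out.println("[" + convert("  padded  ", false, true) + "]");
        System.out.println(convert("", false, true));
        // With trimming enabled, whitespace-only text collapses to empty and then to null.
        System.out.println(convert("   ", true, true));
    }
}
```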
-
computeFormat
protected DataFormat computeFormat(CompositionContext ctx)
Description copied from class: AbstractReader
Determines the data format for the source. The returned format is used during composition to construct a ReadSource operator. If an implementation supports schema discovery, it must be performed in this method.
- Specified by:
computeFormat in class AbstractReader
- Parameters:
ctx
- the composition context for the current invocation of AbstractReader.compose(CompositionContext)
- Returns:
- the source format to use
-
autoConfigure
public void autoConfigure(FileClient ctx) throws IOException
Performs any configured discovery on the operator using the current source and applies the result to configuration. After execution, the operator will be configured so as not to require any pre-execution analysis of the source. In particular:
- A schema will be discovered, if necessary, and set; schema discovery will subsequently be disabled.
- Parameters:
ctx
- the authorization context to use for accessing the source
- Throws:
IOException
- if errors occur during discovery and analysis of the source
-
discoverSchema
public RecordTextSchema<?> discoverSchema(FileClient ctx)
Run schema discovery using current configuration.
- Parameters:
ctx
- the authorization context to use for accessing the file
- Returns:
- the predicted schema of the source
-
-