com.pervasive.datarush.operators.io.textfile.ReadJSON

All Implemented Interfaces: LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>

public class ReadJSON extends AbstractTextReader

The ReadJSON operator reads a JSON file of key-value pairs or an array of objects as record tokens. It supports the JSON Lines format described at http://jsonlines.org/. JSON Lines text has a single JSON record per line, with records separated by a newline character.
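
To make the JSON Lines layout concrete, here is a minimal sketch using only the Java standard library. It is not how ReadJSON is implemented; the class and method names (JsonLinesSplit, records) are hypothetical, and it only separates the raw record text without parsing the JSON itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the DataRush API.
final class JsonLinesSplit {

    // Returns one raw record string per non-blank line of JSON Lines text.
    static List<String> records(String text) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\r?\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "{\"id\": 1, \"name\": \"a\"}\n"
                    + "{\"id\": 2, \"name\": \"b\"}\n";
        // Each printed line is one complete JSON record.
        for (String record : records(text)) {
            System.out.println(record);
        }
    }
}
```

Because each record ends at a newline, record boundaries can be found without parsing the JSON.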

In JSON, every field key is expected to start and end with a delimiter, typically a double quote ("). If the source uses single quotes instead, enable the allowSingleQuotes property to avoid parsing errors. This operator uses the Jackson JSON parsing library to parse fields.

The reader can optionally be given a RecordTextSchema that provides parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. A schema can be constructed manually via the API provided. StructuredSchemaReader supports reading Pervasive DataIntegrator structured schema descriptors (.schema files) for use with readers. Because JSON text has explicit field markers, the schema can also be discovered automatically from the contents of the file, and the reader provides a pluggable discovery mechanism for this purpose. By default, the schema is discovered automatically, with all fields initially assumed to be strings. Discovered fields are named using the keys present in the records.
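
The default discovery behavior described above (field names taken from the keys, every field typed as a string) can be illustrated with a deliberately naive sketch. This is not the discoverer ReadJSON uses. The names NaiveKeyDiscovery and discover are hypothetical, and the regular expression assumes flat records whose keys contain no escaped characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; a real discoverer would use a JSON parser.
final class NaiveKeyDiscovery {

    // A double-quoted name followed by a colon. Assumes flat records and keys
    // without escaped quotes or backslashes.
    private static final Pattern KEY = Pattern.compile("\"([^\"\\\\]+)\"\\s*:");

    // Scans up to maxLines lines and maps each key, in first-seen order,
    // to the type name "string".
    static Map<String, String> discover(String jsonLines, int maxLines) {
        Map<String, String> schema = new LinkedHashMap<>();
        String[] lines = jsonLines.split("\r?\n");
        for (int i = 0; i < lines.length && i < maxLines; i++) {
            Matcher m = KEY.matcher(lines[i]);
            while (m.find()) {
                schema.putIfAbsent(m.group(1), "string");
            }
        }
        return schema;
    }

    public static void main(String[] args) {
        String text = "{\"id\": 1, \"name\": \"a\"}\n"
                    + "{\"name\": \"b\", \"city\": \"x\"}\n";
        System.out.println(discover(text, 10).keySet());
    }
}
```

Limiting the scan to a prefix of the input mirrors the idea behind the analysis depth setting: discovery looks at a bounded sample rather than the whole file.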

Normally, the output of the reader includes all parsed records in the file, both those with and without parsing errors. Fields that cannot be parsed are null in the resulting record. If desired, the reader can be configured to filter failed records from the output.
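
The two behaviors described above (null for an unparseable field, optional filtering of failed records) can be sketched with plain Java. This is an illustration under simplifying assumptions, not the operator's implementation: every field is treated as an integer, and the names FieldErrorSketch, parseOrNull, and parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of the DataRush API.
final class FieldErrorSketch {

    // Returns the parsed value, or null when the text is not a valid int.
    static Integer parseOrNull(String text) {
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Parses every record. A record with at least one unparseable field is
    // kept with nulls in place, unless dropFailed is true.
    static List<List<Integer>> parse(List<String[]> rawRecords, boolean dropFailed) {
        List<List<Integer>> out = new ArrayList<>();
        for (String[] raw : rawRecords) {
            List<Integer> record = new ArrayList<>();
            boolean failed = false;
            for (String field : raw) {
                Integer value = parseOrNull(field);
                failed |= (value == null);
                record.add(value);
            }
            if (!(failed && dropFailed)) {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "2"},
                new String[] {"3", "x"});
        System.out.println(parse(rows, false)); // keeps the failed record
        System.out.println(parse(rows, true));  // filters it out
    }
}
```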

JSON text does not contain a header row, since the keys in a JSON record define the fields in the resulting output. JSON text files can be parsed in parallel under "optimistic" assumptions: namely, that the data is well formatted in JSON Lines format.
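
The "optimistic" assumption matters because a parallel reader has to cut the input at arbitrary offsets and then find a safe place to start parsing. The sketch below shows one common way to do that, assuming every newline is a record boundary. It is not the splitting logic ReadJSON actually uses, and the names OptimisticSplit and splitStarts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the DataRush API.
final class OptimisticSplit {

    // Divides text into up to `chunks` pieces and returns the start offset of
    // each piece. Every tentative cut is moved forward to the character just
    // after the next newline, so each piece begins at the start of a line.
    static List<Integer> splitStarts(String text, int chunks) {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        int size = Math.max(1, text.length() / chunks);
        for (int i = 1; i < chunks; i++) {
            int guess = i * size;
            // Searching from guess - 1 keeps a cut that already sits on a line start.
            int newline = text.indexOf('\n', guess - 1);
            if (newline < 0 || newline + 1 >= text.length()) {
                break; // no further line starts
            }
            int start = newline + 1;
            if (start > starts.get(starts.size() - 1)) {
                starts.add(start);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "{\"a\":1}\n{\"a\":2}\n{\"a\":3}\n";
        System.out.println(splitStarts(text, 2));
    }
}
```

If a record spans multiple lines, a newline no longer marks a record boundary, and a cut aligned this way could land in the middle of a record. That is why this kind of splitting depends on the data being well-formed JSON Lines.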

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final int | DEFAULT_ANALYSIS_DEPTH | The default number of lines analyzed when performing schema discovery |

Fields inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
encodingProps

Fields inherited from class com.pervasive.datarush.operators.io.AbstractReader
options, output
Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| ReadJSON() | Reads an empty source with default settings. |
| ReadJSON(Path path) | Reads the file specified by the path using default options. |
| ReadJSON(ByteSource source) | Reads the specified data source using default options. |
| ReadJSON(String pattern) | Reads all paths matching the specified pattern using default options. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | autoConfigure(FileClient ctx) | Performs any configured discovery on the operator using the current source and applies the result to configuration. |
| ReadJSON | clone() | |
| protected DataFormat | computeFormat(CompositionContext ctx) | Determines the data format for the source. |
| RecordTextSchema<?> | discoverSchema(FileClient ctx) | Run schema discovery using current configuration. |
| boolean | getAllowBackslashEscapingAny() | Get whether the parser will allow quoting of all characters using backslash quoting mechanism. |
| boolean | getAllowComments() | Get whether the parser should allow Java or C++ style comments within the source. |
| boolean | getAllowNonNumericNumbers() | Get whether the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values. |
| boolean | getAllowNumericLeadingZeros() | Get whether the parser will allow numbers to start with additional zeroes. |
| boolean | getAllowSingleQuotes() | Get whether the parser will allow use of single quotes (apostrophe, character '\'') for quoting strings. |
| boolean | getAllowUnquotedControlChars() | Get whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed). |
| boolean | getAllowUnquotedFieldNames() | Get whether the parser will allow use of unquoted field names. |
| int | getAnalysisDepth() | Gets the number of characters to read for schema discovery and structural analysis of the file. |
| String | getDiscoveryNullIndicator() | Gets the text value used to represent null values by default in discovered schemas. |
| TextTypes.StringConversion | getDiscoveryStringHandling() | Gets the default behavior for processing string-valued types in discovered schemas. |
| boolean | getMultilineFormat() | Get whether or not the parser will allow JSON records which span multiple lines. |
| RecordTextSchema<?> | getSchema() | Gets the record schema of the JSON text source. |
| TextRecordDiscoverer | getSchemaDiscovery() | Gets the schema discoverer to use on the JSON text source. |
| void | setAllowBackslashEscapingAny(boolean allowBackslashEscapingAny) | Set if the parser will allow quoting of all characters using backslash quoting mechanism. |
| void | setAllowComments(boolean allowComments) | Set whether the parser should allow comments or not. |
| void | setAllowNonNumericNumbers(boolean allowNonNumericNumbers) | Set if the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values. |
| void | setAllowNumericLeadingZeros(boolean allowNumericLeadingZeros) | Sets whether the parser will allow numbers to start with additional zeroes. |
| void | setAllowSingleQuotes(boolean allowSingleQuotes) | Set whether the parser will allow use of single quotes for quoting strings. |
| void | setAllowUnquotedControlChars(boolean allowUnquotedControlChars) | Set if the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed). |
| void | setAllowUnquotedFieldNames(boolean allowUnquotedFieldNames) | Set whether the parser will allow use of unquoted field names. |
| void | setAnalysisDepth(int count) | Sets the number of characters to read for performing schema discovery and structural analysis. |
| void | setDiscoveryNullIndicator(String value) | Sets the text value used to represent null values by default in discovered schemas. |
| void | setDiscoveryStringHandling(TextTypes.StringConversion behavior) | Sets the default behavior for processing string-valued types in discovered schemas. |
| void | setMultilineFormat(boolean multilineFormat) | Sets whether or not the parser will allow JSON records to span multiple lines. |
| void | setSchema(RecordTextSchema<?> schema) | Sets the record schema expected in the JSON text source. |
| void | setSchemaDiscovery(TextRecordDiscoverer discoverer) | Sets the schema discoverer to use against the JSON text source. |
| void | setSchemaDiscovery(List<TypePattern> patterns) | Enables schema discovery using the default discoverer extended with additional typing patterns. |

Methods inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
getCharset, getCharsetName, getDecodeBuffer, getEncoding, getErrorAction, getReplacement, setCharset, setCharsetName, setDecodeBuffer, setEncoding, setErrorAction, setReplacement

Methods inherited from class com.pervasive.datarush.operators.io.AbstractReader
compose, getExtraFieldAction, getFieldErrorAction, getFieldLengthThreshold, getIncludeSourceInfo, getMissingFieldAction, getOutput, getParseOptions, getPessimisticSplitting, getReadBuffer, getReadOnClient, getRecordWarningThreshold, getSelectedFields, getSource, getSplitOptions, getUseMetadata, setExtraFieldAction, setFieldErrorAction, setFieldLengthThreshold, setIncludeSourceInfo, setMissingFieldAction, setParseErrorAction, setParseOptions, setPessimisticSplitting, setReadBuffer, setReadOnClient, setRecordWarningThreshold, setSelectedFields, setSelectedFields, setSource, setSource, setSource, setSplitOptions, setUseMetadata

Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts

Field Details
- DEFAULT_ANALYSIS_DEPTH
  
  public static final int DEFAULT_ANALYSIS_DEPTH
  
  The default number of lines analyzed when performing schema discovery
  See Also:
  
  Constant Field Values
Constructor Details
- ReadJSON
  
  public ReadJSON()
  
  Reads an empty source with default settings. The source must be set before execution or an error will be raised.
  A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema)
  See Also:
  
  AbstractReader.setSource(ByteSource)
- ReadJSON
  
  public ReadJSON(String pattern)
  
  Reads all paths matching the specified pattern using default options. Any matching path which is a directory is replaced with all files in the directory.
  A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema)
  Parameters:
  
  pattern - a path-matching pattern
  
  See Also:
  
  FileClient.matchPaths(String)
- ReadJSON
  
  public ReadJSON(Path path)
  
  Reads the file specified by the path using default options. If the path refers to a directory, all files in the directory are read.
  A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema)
  
  Parameters:
  
  path - the path to read
- ReadJSON
  
  public ReadJSON(ByteSource source)
  
  Reads the specified data source using default options.
  A default schema discovery will be run based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema)
  
  Parameters:
  
  source - the data source to read
Method Details
- clone
  
  public ReadJSON clone()
  
  Overrides:
  
  clone in class Object
- getAllowComments
  
  public boolean getAllowComments()
  
  Get whether the parser should allow Java or C++ style comments within the source.
  
  Returns:
  
  the allowComments
- setAllowComments
  
  public void setAllowComments(boolean allowComments)
  
  Set whether the parser should allow comments. If the JSON file to be parsed contains comments, set this to true. When enabled, the parser allows Java or C++ style comments (both '/*' and '//' types) within parsed content.
  
  Parameters:
  
  allowComments - sets whether parser will allow comments or not
- getAllowUnquotedFieldNames
  
  public boolean getAllowUnquotedFieldNames()
  
  Get whether the parser will allow use of unquoted field names.
  
  Returns:
  
  the allowUnquotedFieldNames
- setAllowUnquotedFieldNames
  
  public void setAllowUnquotedFieldNames(boolean allowUnquotedFieldNames)
  
  Set whether the parser will allow use of unquoted field names. If unquoted field names are used in source file, this field should be set to true.
  
  Parameters:
  
  allowUnquotedFieldNames - sets whether parser will allow use of unquoted field names
- getAllowSingleQuotes
  
  public boolean getAllowSingleQuotes()
  
  Get whether the parser will allow use of single quotes (apostrophe, character '\'') for quoting strings.
  
  Returns:
  
  the allowSingleQuotes
- setAllowSingleQuotes
  
  public void setAllowSingleQuotes(boolean allowSingleQuotes)
  
  Set whether the parser will allow use of single quotes for quoting strings. If single quotes are used in source, this field should be set to true.
  
  Parameters:
  
  allowSingleQuotes - sets whether parser will allow single quotes for quoting strings.
- getAllowNumericLeadingZeros
  
  public boolean getAllowNumericLeadingZeros()
  
  Get whether the parser will allow numbers to start with additional zeroes.
  
  Returns:
  
  the allowNumericLeadingZeros
- setAllowNumericLeadingZeros
  
  public void setAllowNumericLeadingZeros(boolean allowNumericLeadingZeros)
  
  Sets whether the parser will allow numbers to start with additional zeroes. If leading zeroes are allowed for numbers in the source, this field should be set to true.
  
  Parameters:
  
  allowNumericLeadingZeros - sets whether parser will allow leading zeros
- getAllowUnquotedControlChars
  
  public boolean getAllowUnquotedControlChars()
  
  Get whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
  
  Returns:
  
  the allowUnquotedControlChars
- setAllowUnquotedControlChars
  
  public void setAllowUnquotedControlChars(boolean allowUnquotedControlChars)
  
  Set if the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed).
  
  Parameters:
  
  allowUnquotedControlChars - sets whether parser will allow unquoted control characters
- getAllowBackslashEscapingAny
  
  public boolean getAllowBackslashEscapingAny()
  
  Get whether the parser will allow quoting of all characters using backslash quoting mechanism. If not enabled, only characters that are explicitly listed by JSON specification can be escaped.
  
  Returns:
  
  the allowBackslashEscapingAny
- setAllowBackslashEscapingAny
  
  public void setAllowBackslashEscapingAny(boolean allowBackslashEscapingAny)
  
  Set if the parser will allow quoting of all characters using backslash quoting mechanism.
  
  Parameters:
  
  allowBackslashEscapingAny - sets whether backslash escaping is allowed.
- getAllowNonNumericNumbers
  
  public boolean getAllowNonNumericNumbers()
  
  Get whether the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values.
  
  Returns:
  
  the allowNonNumericNumbers
- setAllowNonNumericNumbers
  
  public void setAllowNonNumericNumbers(boolean allowNonNumericNumbers)
  
  Set if the parser recognizes the set of "Not a Number" (NaN) tokens as legal floating-point number values.
  
  Parameters:
  
  allowNonNumericNumbers - sets whether non numeric numbers are allowed
- getSchema
  
  public RecordTextSchema<?> getSchema()
  
  Gets the record schema of the JSON text source. If this returns null then schema discovery will be attempted.
  
  Returns:
  
  the record schema of the source
- setSchema
  
  public void setSchema(RecordTextSchema<?> schema)
  
  Sets the record schema expected in the JSON text source. Output records will have this schema, adjusted accordingly for any configured field selection.
  Setting a schema disables schema discovery.
  Parameters:
  
  schema - the expected record schema of the source
  
  See Also:
  
  AbstractReader.setSelectedFields(java.util.List)
- getMultilineFormat
  
  public boolean getMultilineFormat()
  
  Get whether or not the parser will allow JSON records which span multiple lines.
  
  Returns:
  
  the multilineFormat
- setMultilineFormat
  
  public void setMultilineFormat(boolean multilineFormat)
  
  Sets whether or not the parser will allow JSON records to span multiple lines.
  
  Parameters:
  
  multilineFormat - sets whether multiline JSON records are allowed
- getAnalysisDepth
  
  public int getAnalysisDepth()
  
  Gets the number of characters to read for schema discovery and structural analysis of the file.
  
  Returns:
  
  the number of characters which will be analyzed
- setAnalysisDepth
  
  public void setAnalysisDepth(int count)
  
  Sets the number of characters to read for performing schema discovery and structural analysis. This setting is ignored if no discovery is being performed. The default setting is 1M characters.
  
  Parameters:
  
  count - the number of characters to use to determine the schema and/or file structure
- getSchemaDiscovery
  
  public TextRecordDiscoverer getSchemaDiscovery()
  
  Gets the schema discoverer to use on the JSON text source. If schema discovery is disabled, this will return null.
  
  Returns:
  
  the configured schema discoverer
- setSchemaDiscovery
  
  public void setSchemaDiscovery(TextRecordDiscoverer discoverer)
  
  Sets the schema discoverer to use against the JSON text source. Just prior to graph execution the source will be examined using the discoverer to determine the output schema for records. If reading multiple files, the schema is determined using the first file. Output records will have the discovered schema, adjusted accordingly for any configured field selection.
  By default, the schema will be discovered automatically. All fields are assumed to be strings and the field names are taken from the key values.
  Setting schema discovery overrides any previously configured schema.
  Parameters:
  
  discoverer - the schema discoverer to use.
  
  See Also:
  
  setSchema(RecordTextSchema)
  
  AbstractReader.setSelectedFields(java.util.List)
- setSchemaDiscovery
  
  public void setSchemaDiscovery(List<TypePattern> patterns)
  
  Enables schema discovery using the default discoverer extended with additional typing patterns. The additional patterns are in addition to, not in place of, the normal discovery typing patterns. If overriding default rules is desired, use setSchemaDiscovery(TextRecordDiscoverer) with an appropriately configured discoverer instead.
  Parameters:
  
  patterns - the additional patterns to apply at lower precedence than default patterns
  
  See Also:
  
  PatternBasedDiscovery
- getDiscoveryNullIndicator
  
  public String getDiscoveryNullIndicator()
  
  Gets the text value used to represent null values by default in discovered schemas.
  
  Returns:
  
  the string indicating a null value
- setDiscoveryNullIndicator
  
  public void setDiscoveryNullIndicator(String value)
  
  Sets the text value used to represent null values by default in discovered schemas. By default, this is the empty string. If schema discovery is not enabled, this setting is ignored.
  
  Parameters:
  
  value - the string indicating a null value
- getDiscoveryStringHandling
  
  public TextTypes.StringConversion getDiscoveryStringHandling()
  
  Gets the default behavior for processing string-valued types in discovered schemas.
  
  Returns:
  
  how string-valued types should be converted from text
- setDiscoveryStringHandling
  
  public void setDiscoveryStringHandling(TextTypes.StringConversion behavior)
  
  Sets the default behavior for processing string-valued types in discovered schemas. By default, whitespace is not trimmed from values and the empty string is treated as null. If schema discovery is not enabled, this setting is ignored.
  
  Parameters:
  
  behavior - indicates how string-valued types should be converted from text
- computeFormat
  
  protected DataFormat computeFormat(CompositionContext ctx)
  
  Description copied from class: AbstractReader
  
  Determines the data format for the source. The returned format is used during composition to construct a ReadSource operator. If an implementation supports schema discovery, it must be performed in this method.
  
  Specified by:
  
  computeFormat in class AbstractReader
  
  Parameters:
  
  ctx - the composition context for the current invocation of AbstractReader.compose(CompositionContext)
  
  Returns:
  
  the source format to use
- autoConfigure
  
  public void autoConfigure(FileClient ctx) throws IOException
  Performs any configured discovery on the operator using the current source and applies the result to configuration. After execution, the operator will be configured so as not to require any pre-execution analysis of the source. In particular:
  
  A schema will be discovered, if necessary, and set; schema discovery will subsequently be disabled.
  Parameters:
  
  ctx - the authorization context to use for accessing the source
  
  Throws:
  
  IOException - if errors occur during discovery and analysis of the source
- discoverSchema
  
  public RecordTextSchema<?> discoverSchema(FileClient ctx)
  
  Run schema discovery using current configuration.
  
  Parameters:
  
  ctx - the authorization context to use for accessing the file
  
  Returns:
  
  the predicted schema of the source
