- All Implemented Interfaces:
LogicalOperator,RecordSourceOperator,SourceOperator<RecordPort>
- Direct Known Subclasses:
AbstractTextReader,ReadAvro,ReadMDF,ReadORC,ReadParquet,ReadStagingDataset
- a source of bytes, as identified by a
ByteSource. This is most often a file or files, so convenience methods for specifying the source as a file are provided. - common parsing properties, such as record fields to omit in the output and on-error behavior.
- common I/O tunables, such as buffer sizes and splitting behavior
AbstractReader wraps a ReadSource operator.
Implementations provide an interface specific to
an appropriate class of files - delimited text files, as an
example - hiding the more complex model used to describe
data parsing in general. The composition structure is the
same for all readers with only DataFormat used
differing between implementations.-
Field Summary
FieldsModifier and TypeFieldDescriptionprotected final ParsingOptionsContainer for options related to record parsingprotected final RecordPortThe output port of the read operator -
Constructor Summary
ConstructorsModifierConstructorDescriptionprotectedReads an empty source with default settings.protectedAbstractReader(Path path) Reads the file specified by the path.protectedAbstractReader(ByteSource source) Reads the specified data source using default options.protectedAbstractReader(String pattern) Reads all paths matching the specified pattern using default options. -
Method Summary
Modifier and TypeMethodDescriptionprotected final voidComposes a reader for the source, using configured options and a derived format.protected abstract DataFormatDetermines the data format for the source.Gets how fields found when parsing the record, but not declared in the schema are handled.Gets how fields which cannot be parsed are handled.intGets the maximum length allowed for a field value before it is considered an error.booleanIndicates whether the parsed records should be tagged with additional fields indicating their source.Gets how fields declared in the schema, but not found when parsing the record are handled.Gets the record port providing the records read from the data source.Gets the parsing options used by the reader.booleanIndicates whether pessimistic file splitting is used.intGets the size of the I/O buffer, in bytes, to use for reads.booleanIndicates whether the reader should execute on the client.intGets the maximum number of records allowed to have parse warnings.Gets the list of record fields to read.Gets the source being read.Gets the configuration used in determining how to break the source into splits.booleanIndicates whether discovered metadata should be used to override the graph settings.voidsetExtraFieldAction(ParseErrorAction action) Sets how to handle fields found when parsing the record, but not declared in the schema.voidsetFieldErrorAction(ParseErrorAction action) Sets how to handle fields which cannot be parsed.voidsetFieldLengthThreshold(int limit) Configures the maximum length allowed for a field value before it is considered an error.voidsetIncludeSourceInfo(boolean enabled) Controls whether parsed records will be tagged with additional fields indicating how to locate them in their original source.voidSets how to handle fields declared in the schema, but not found when parsing the record.voidsetParseErrorAction(ParseErrorAction action) Sets how to handle all parsing errors.voidsetParseOptions(ParsingOptions options) Sets the parsing options used by the reader.voidsetPessimisticSplitting(boolean enabled) Configures whether pessimistic file splitting must be used.voidsetReadBuffer(int size) Sets the size of the I/O buffer, in bytes, to use for reads.voidsetReadOnClient(boolean enabled) Sets whether reads are performed by the client or in the cluster.voidsetRecordWarningThreshold(int limit) Configures the maximum number of records which can have parse warnings before failing.voidsetSelectedFields(String... fields) Sets the list of record fields to read.voidsetSelectedFields(List<String> fields) Sets the list of record fields to read.voidSets the data source to the specified path.voidsetSource(ByteSource source) Sets the data source to the specified source.voidSets the data source to all paths matching the specified pattern.voidsetSplitOptions(SplitOptions options) Sets the configuration used in determining how to break the source into splits.voidsetUseMetadata(boolean useMetadata) Sets whether discovered metadata should be used to override the graph settings.Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyErrorMethods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, waitMethods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts
-
Field Details
-
output
The output port of the read operator -
options
Container for options related to record parsing
-
-
Constructor Details
-
AbstractReader
protected AbstractReader()Reads an empty source with default settings. The source must be set before execution or an error will be raised.- See Also:
-
AbstractReader
Reads all paths matching the specified pattern using default options. Any matching path which is a directory is replaced with all files in the directory; this expansion is not recursive.- Parameters:
pattern- a path-matching pattern- See Also:
-
FileClient#matchPaths(String)
-
AbstractReader
Reads the file specified by the path. If the path refers to a a directory, all files in the directory are read; this read is not recursive into sub-directories.- Parameters:
path- the path to read
-
AbstractReader
Reads the specified data source using default options.- Parameters:
source- the data source to read
-
-
Method Details
-
getOutput
Gets the record port providing the records read from the data source.- Specified by:
getOutputin interfaceRecordSourceOperator- Specified by:
getOutputin interfaceSourceOperator<RecordPort>- Returns:
- the flow of records produced by a sequential read of the data source
-
getSource
Gets the source being read.- Returns:
- the data source to read
-
setSource
Sets the data source to all paths matching the specified pattern. Any matching path which is a directory is replaced with all files in the directory; this expansion is not recursive.- Parameters:
pattern- a path-matching pattern- See Also:
-
FileClient#matchPaths(String)
-
setSource
Sets the data source to the specified path. If the path refers to a a directory, all files in the directory are read; this read is not recursive into sub-directories.- Parameters:
path- the path to read
-
setSource
Sets the data source to the specified source.- Parameters:
source- the data source to read
-
getSplitOptions
Gets the configuration used in determining how to break the source into splits.- Returns:
- the options used when generating splits for the source
-
setSplitOptions
Sets the configuration used in determining how to break the source into splits.This is an advanced option provided for performance tuning. Normally, it is not necessary to configure these settings.
- Parameters:
options- the options to use when generating splits for the source
-
getPessimisticSplitting
public boolean getPessimisticSplitting()Indicates whether pessimistic file splitting is used.- Returns:
- whether input sources can be broken into splits
-
setPessimisticSplitting
public void setPessimisticSplitting(boolean enabled) Configures whether pessimistic file splitting must be used. By default, this is disabled.- Parameters:
enabled- indicates whether to use pessimistic splitting- See Also:
-
getParseOptions
Gets the parsing options used by the reader.- Returns:
- the parse options
-
setParseOptions
Sets the parsing options used by the reader. This sets all parse options at once.- Parameters:
options- the parse options to use- See Also:
-
DataParser
-
getSelectedFields
Gets the list of record fields to read.- Returns:
- the fields which will be read.
-
setSelectedFields
Sets the list of record fields to read. If only a subset of fields are desired, it can be more efficient to parse only those fields. Only fields in this list will be in the output records. An empty list indicates all fields should be parsed; this is the default setting.- Parameters:
fields- the record fields to read
-
setSelectedFields
Sets the list of record fields to read. If only a subset of fields are desired, it can be more efficient to parse only those fields. Only fields in this list will be in the output records. An empty list indicates all fields should be parsed; this is the default setting.- Parameters:
fields- the record fields to read
-
getIncludeSourceInfo
public boolean getIncludeSourceInfo()Indicates whether the parsed records should be tagged with additional fields indicating their source.- Returns:
trueif additional source information is provided with records- See Also:
-
setIncludeSourceInfo
public void setIncludeSourceInfo(boolean enabled) Controls whether parsed records will be tagged with additional fields indicating how to locate them in their original source. This information can also be used to reconstruct the original source ordering.When enabled, output will have three additional fields which, considered as a triple, uniquely identify and order the output.
- sourcePath, a string naming the original source file from
which the record originated. This is
NULLif this cannot be determined. - splitOffset, a long providing the starting byte offset of the the parse split in the file.
- recordOffset, a long providing the starting offset for the record within the parse split. This value will be in units of bytes or characters as is appropriate for the parsed format. Character offsets are based on the first full character within the split.
setSelectedFields(List).- Parameters:
enabled- indicates whether the output should be tagged with source information
- sourcePath, a string naming the original source file from
which the record originated. This is
-
getMissingFieldAction
Gets how fields declared in the schema, but not found when parsing the record are handled.- Returns:
- the action to take on missing fields
-
setMissingFieldAction
Sets how to handle fields declared in the schema, but not found when parsing the record. If the configured action does not discard the record, the missing fields will be null-valued in the output. By default, this setting isParseErrorAction.WARN.- Parameters:
action- the action to take on missing fields
-
getExtraFieldAction
Gets how fields found when parsing the record, but not declared in the schema are handled.- Returns:
- the action to take on extra fields
-
setExtraFieldAction
Sets how to handle fields found when parsing the record, but not declared in the schema. If the configured action does not discard the record, the missing fields will be null-valued in the output. By default, this setting isParseErrorAction.WARN.- Parameters:
action- the action to take on extra fields
-
getFieldErrorAction
Gets how fields which cannot be parsed are handled.- Returns:
- the action to take on field errors
-
setFieldErrorAction
Sets how to handle fields which cannot be parsed. If the configured action does not discard the record, the missing fields will be null-valued in the output. By default, this setting isParseErrorAction.WARN.- Parameters:
action- the action to take on field errors
-
setParseErrorAction
Sets how to handle all parsing errors. This method is a convenience method for setting all individual classes of errors at once.- Parameters:
action- the action to take on parse error- See Also:
-
getRecordWarningThreshold
public int getRecordWarningThreshold()Gets the maximum number of records allowed to have parse warnings.- Returns:
- the limit on the number of records in error
-
setRecordWarningThreshold
public void setRecordWarningThreshold(int limit) Configures the maximum number of records which can have parse warnings before failing. Only records which have an error whose configured action raises a warning count towards this limit. A record is only counted once towards this limit, regardless of the number of warnings.By default, this limit is
100. Setting the limit to0means there is no restriction on the number of warnings.This limit is applied per-split. Therefore, it is possible that a file in total may be allowed more warnings than the limit, depending on how it is split.
- Parameters:
limit- the number of records with warnings allowed
-
getFieldLengthThreshold
public int getFieldLengthThreshold()Gets the maximum length allowed for a field value before it is considered an error.- Returns:
- the maximum field value length allowed
-
setFieldLengthThreshold
public void setFieldLengthThreshold(int limit) Configures the maximum length allowed for a field value before it is considered an error. Long fields can be a sign of a misconfigured reader. When this limit is reached, the parser will fail the current field and attempt to restart on the next field.By default, this limit is 1M. Setting the limit to
0means there is no restriction on the number of warnings.- Parameters:
limit- the maximum field value length allowed
-
getReadBuffer
public int getReadBuffer()Gets the size of the I/O buffer, in bytes, to use for reads.- Returns:
- the size of the read buffer
-
setReadBuffer
public void setReadBuffer(int size) Sets the size of the I/O buffer, in bytes, to use for reads. The default size is 64K.- Parameters:
size- the size of the read buffer
-
getReadOnClient
public boolean getReadOnClient()Indicates whether the reader should execute on the client.- Returns:
trueif the reader will execute on the client,falseotherwise.
-
setReadOnClient
public void setReadOnClient(boolean enabled) Sets whether reads are performed by the client or in the cluster. By default, reads are performed in the cluster, if executed in a distributed context.This normally only needs to be set if the source is not accessible from the cluster; for example, when the file resides only on the client. When executing in in pseudo-distributed mode, this setting has no effect.
If the intent is to suppress parallel reads (to guarantee record ordering matches the source, for instance), use
AbstractLogicalOperator.disableParallelism()instead.- Parameters:
enabled- whether to read on client
-
getUseMetadata
public boolean getUseMetadata()Indicates whether discovered metadata should be used to override the graph settings.- Returns:
trueif discovered metadata should be used,falseotherwise.
-
setUseMetadata
public void setUseMetadata(boolean useMetadata) Sets whether discovered metadata should be used to override the graph settings. By default this is set to false, since not all readers may actually support metadata.- Parameters:
useMetadata- whether discovered metadata should be used
-
computeFormat
Determines the data format for the source. The returned format is used during composition to construct aReadSourceoperator. If an implementation supports schema discovery, it must be performed in this method.- Parameters:
ctx- the composition context for the current invocation ofcompose(CompositionContext)- Returns:
- the source format to use
-
compose
Composes a reader for the source, using configured options and a derived format.- Specified by:
composein classCompositeOperator- Parameters:
ctx- the context
-