public abstract class AbstractReader extends CompositeOperator implements RecordSourceOperator
ByteSource
.
This is most often a file or files, so convenience methods
for specifying the source as a file are provided.AbstractReader
wraps a ReadSource
operator.
Implementations provide an interface specific to
an appropriate class of files - delimited text files, as an
example - hiding the more complex model used to describe
data parsing in general. The composition structure is the
same for all readers with only DataFormat
used
differing between implementations.Modifier and Type | Field and Description |
---|---|
protected ParsingOptions |
options
Container for options related to record parsing
|
protected RecordPort |
output
The output port of the read operator
|
Modifier | Constructor and Description |
---|---|
protected |
AbstractReader()
Reads an empty source with default settings.
|
protected |
AbstractReader(ByteSource source)
Reads the specified data source using default
options.
|
protected |
AbstractReader(Path path)
Reads the file specified by the path.
|
protected |
AbstractReader(String pattern)
Reads all paths matching the specified pattern
using default options.
|
Modifier and Type | Method and Description |
---|---|
protected void |
compose(CompositionContext ctx)
Composes a reader for the source, using configured options
and a derived format.
|
protected abstract DataFormat |
computeFormat(CompositionContext ctx)
Determines the data format for the source.
|
ParseErrorAction |
getExtraFieldAction()
Gets how fields found when parsing the record,
but not declared in the schema are handled.
|
ParseErrorAction |
getFieldErrorAction()
Gets how fields which cannot be parsed are handled.
|
int |
getFieldLengthThreshold()
Gets the maximum length allowed for a field
value before it is considered an error.
|
boolean |
getIncludeSourceInfo()
Indicates whether the parsed records should be tagged with
additional fields indicating their source.
|
ParseErrorAction |
getMissingFieldAction()
Gets how fields declared in the schema, but
not found when parsing the record are handled.
|
RecordPort |
getOutput()
Gets the record port providing the records read from the data source.
|
ParsingOptions |
getParseOptions()
Gets the parsing options used by the reader.
|
boolean |
getPessimisticSplitting()
Indicates whether pessimistic file splitting
is used.
|
int |
getReadBuffer()
Gets the size of the I/O buffer, in bytes,
to use for reads.
|
boolean |
getReadOnClient()
Indicates whether the reader should execute
on the client.
|
int |
getRecordWarningThreshold()
Gets the maximum number of records allowed to
have parse warnings.
|
List<String> |
getSelectedFields()
Gets the list of record fields to read.
|
ByteSource |
getSource()
Gets the source being read.
|
SplitOptions |
getSplitOptions()
Gets the configuration used in determining how
to break the source into splits.
|
boolean |
getUseMetadata()
Indicates whether discovered metadata should be
used to override the graph settings.
|
void |
setExtraFieldAction(ParseErrorAction action)
Sets how to handle fields found when parsing the record,
but not declared in the schema.
|
void |
setFieldErrorAction(ParseErrorAction action)
Sets how to handle fields which cannot be parsed.
|
void |
setFieldLengthThreshold(int limit)
Configures the maximum length allowed for a field
value before it is considered an error.
|
void |
setIncludeSourceInfo(boolean enabled)
Controls whether parsed records will be tagged with additional
fields indicating how to locate them in their original source.
|
void |
setMissingFieldAction(ParseErrorAction action)
Sets how to handle fields declared in the schema, but
not found when parsing the record.
|
void |
setParseErrorAction(ParseErrorAction action)
Sets how to handle all parsing errors.
|
void |
setParseOptions(ParsingOptions options)
Sets the parsing options used by the reader.
|
void |
setPessimisticSplitting(boolean enabled)
Configures whether pessimistic file splitting
must be used.
|
void |
setReadBuffer(int size)
Sets the size of the I/O buffer, in bytes,
to use for reads.
|
void |
setReadOnClient(boolean enabled)
Sets whether reads are performed by the client or
in the cluster.
|
void |
setRecordWarningThreshold(int limit)
Configures the maximum number of records which can have
parse warnings before failing.
|
void |
setSelectedFields(List<String> fields)
Sets the list of record fields to read.
|
void |
setSelectedFields(String... fields)
Sets the list of record fields to read.
|
void |
setSource(ByteSource source)
Sets the data source to the specified source.
|
void |
setSource(Path path)
Sets the data source to the specified path.
|
void |
setSource(String pattern)
Sets the data source to all paths matching the
specified pattern.
|
void |
setSplitOptions(SplitOptions options)
Sets the configuration used in determining how
to break the source into splits.
|
void |
setUseMetadata(boolean useMetadata)
Sets whether discovered metadata should be
used to override the graph settings.
|
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
disableParallelism, getInputPorts, getOutputPorts
protected final RecordPort output
protected final ParsingOptions options
protected AbstractReader()
setSource(ByteSource)
protected AbstractReader(String pattern)
pattern
- a path-matching patternFileClient#matchPaths(String)
protected AbstractReader(Path path)
path
- the path to readprotected AbstractReader(ByteSource source)
source
- the data source to readpublic RecordPort getOutput()
getOutput
in interface RecordSourceOperator
getOutput
in interface SourceOperator<RecordPort>
public ByteSource getSource()
public void setSource(String pattern)
pattern
- a path-matching patternFileClient#matchPaths(String)
public void setSource(Path path)
path
- the path to readpublic void setSource(ByteSource source)
source
- the data source to readpublic SplitOptions getSplitOptions()
public void setSplitOptions(SplitOptions options)
This is an advanced option provided for performance tuning. Normally, it is not necessary to configure these settings.
options
- the options to use when generating splits
for the sourcepublic boolean getPessimisticSplitting()
public void setPessimisticSplitting(boolean enabled)
enabled
- indicates whether to use
pessimistic splittingReadSource
public ParsingOptions getParseOptions()
public void setParseOptions(ParsingOptions options)
options
- the parse options to useDataParser
public List<String> getSelectedFields()
public void setSelectedFields(String... fields)
fields
- the record fields to readpublic void setSelectedFields(List<String> fields)
fields
- the record fields to readpublic boolean getIncludeSourceInfo()
true
if additional source information is
provided with recordssetIncludeSourceInfo(boolean)
public void setIncludeSourceInfo(boolean enabled)
When enabled, output will have three additional fields which, considered as a triple, uniquely identify and order the output.
NULL
if this cannot
be determined.setSelectedFields(List)
.enabled
- indicates whether the output should be tagged with
source informationpublic ParseErrorAction getMissingFieldAction()
public void setMissingFieldAction(ParseErrorAction action)
ParseErrorAction.WARN
.action
- the action to take on missing fieldspublic ParseErrorAction getExtraFieldAction()
public void setExtraFieldAction(ParseErrorAction action)
ParseErrorAction.WARN
.action
- the action to take on extra fieldspublic ParseErrorAction getFieldErrorAction()
public void setFieldErrorAction(ParseErrorAction action)
ParseErrorAction.WARN
.action
- the action to take on field errorspublic void setParseErrorAction(ParseErrorAction action)
action
- the action to take on parse errorsetMissingFieldAction(ParseErrorAction)
,
setExtraFieldAction(ParseErrorAction)
,
setFieldErrorAction(ParseErrorAction)
public int getRecordWarningThreshold()
public void setRecordWarningThreshold(int limit)
By default, this limit is 100
. Setting the
limit to 0
means there is no restriction
on the number of warnings.
This limit is applied per-split. Therefore, it is possible that a file in total may be allowed more warnings than the limit, depending on how it is split.
limit
- the number of records with warnings
allowedpublic int getFieldLengthThreshold()
public void setFieldLengthThreshold(int limit)
By default, this limit is 1M. Setting the
limit to 0
means there is no restriction
on the number of warnings.
limit
- the maximum field value length
allowedpublic int getReadBuffer()
public void setReadBuffer(int size)
size
- the size of the read bufferpublic boolean getReadOnClient()
true
if the reader will execute
on the client, false
otherwise.public void setReadOnClient(boolean enabled)
This normally only needs to be set if the source is not accessible from the cluster; for example, when the file resides only on the client. When executing in in pseudo-distributed mode, this setting has no effect.
If the intent is to suppress parallel reads (to guarantee
record ordering matches the source, for instance), use
AbstractLogicalOperator.disableParallelism()
instead.
enabled
- whether to read on clientpublic boolean getUseMetadata()
true
if discovered metadata should be
used, false
otherwise.public void setUseMetadata(boolean useMetadata)
useMetadata
- whether discovered metadata should be usedprotected abstract DataFormat computeFormat(CompositionContext ctx)
ReadSource
operator. If an
implementation supports schema discovery, it must be
performed in this method.ctx
- the composition context for the current invocation
of compose(CompositionContext)
protected final void compose(CompositionContext ctx)
compose
in class CompositeOperator
ctx
- the contextCopyright © 2021 Actian Corporation. All rights reserved.