public final class ReadSource extends CompositeOperator implements RecordSourceOperator
This operator is low-level, providing a generalized
model for reading files in a distributed fashion.
Typically, ReadSource
is not directly used in
a graph, instead being indirectly used though a composite
operator such as one derived from AbstractReader
providing a more appropriate interface to the end user.
Parallelized reads are implemented by breaking input files
into independently parsed pieces, a process called splitting.
Splits are then distributed to available partitions and parsed.
When run on a distributed cluster, the reader makes an attempt
to assign splits to machines where the I/O will be local, but
non-local assignment may occur in order to provide work for
all partitions. Distributed execution also makes an assumption that
the specified data source is accessible from any machine. If this
is not the case, the read must be made non-parallel by
using AbstractLogicalOperator.disableParallelism()
.
Not all formats support splitting; this generally requires a way of unambiguously identifying record boundaries. Formats will indicate whether they can support splitting. If not, each input file will be treated as a single split. Even with a non-splittable format, this means reading multiple files can be parallelized. Some formats can partially support splitting, but in a "optimistic" fashion; under most circumstances splits can be handled, but in some edge cases splitting leads to parse failures. For these cases, the reader supports a "pessimistic" mode which can be used to assume a format is non-splittable, regardless of what it reports.
The reader makes a best-effort attempt to validate the data source before execution, but cannot always guarantee correctness, depending on the nature of the data source. This is done to try to prevent misconfigured graphs from executing where the reader may not execute until a late phase when a failure may result in the loss of a significant amount of work being lost.
Constructor and Description |
---|
ReadSource()
Reads an empty source with default settings.
|
ReadSource(ByteSource source,
DataFormat format)
Reads the specified source using the given format.
|
Modifier and Type | Method and Description |
---|---|
protected void |
compose(CompositionContext ctx)
Compose the body of this operator.
|
DataFormat |
getFormat()
Gets the data format for the configured source.
|
boolean |
getIncludeSourceInfo()
Indicates whether the parsed records should be tagged with
additional fields indicating their source.
|
RecordPort |
getOutput()
Gets the record port providing the records read from the data source.
|
ParsingOptions |
getParseOptions()
Gets the parsing options used by the reader.
|
boolean |
getPessimisticSplitting()
Indicates whether pessimistic file splitting
is used.
|
boolean |
getReadOnClient()
Indicates whether the reader should execute
on the client.
|
ByteSource |
getSource()
Gets the data source for the reader.
|
SplitOptions |
getSplitOptions()
Gets the parsing options used by the reader.
|
void |
setFormat(DataFormat format)
Sets the data format for the configured source.
|
void |
setIncludeSourceInfo(boolean enabled)
Controls whether parsed records will be tagged with additional
fields indicating how to locate them in their original source.
|
void |
setParseOptions(ParsingOptions options)
Sets the parsing options used by the reader.
|
void |
setPessimisticSplitting(boolean enabled)
Configures whether pessimistic file splitting
must be used.
|
void |
setReadOnClient(boolean enabled)
Sets whether reads are performed by the client or
in the cluster.
|
void |
setSource(ByteSource source)
Sets the data source for the reader.
|
void |
setSplitOptions(SplitOptions options)
Sets the split options used by the reader.
|
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
disableParallelism, getInputPorts, getOutputPorts
public ReadSource()
setSource(ByteSource)
,
setFormat(DataFormat)
public ReadSource(ByteSource source, DataFormat format)
source
- the source to readformat
- the source data formatpublic RecordPort getOutput()
getOutput
in interface RecordSourceOperator
getOutput
in interface SourceOperator<RecordPort>
public ByteSource getSource()
public void setSource(ByteSource source)
source
- the source to readpublic DataFormat getFormat()
public void setFormat(DataFormat format)
format
- the source data formatpublic boolean getIncludeSourceInfo()
true
if additional source information is
provided with recordssetIncludeSourceInfo(boolean)
public void setIncludeSourceInfo(boolean enabled)
When enabled, output will have three additional fields which, considered as a triple, uniquely identify and order the output.
NULL
if this cannot
be determined.#setSelectedFields(List)
.enabled
- indicates whether the output should be tagged with
source informationpublic ParsingOptions getParseOptions()
public void setParseOptions(ParsingOptions options)
options
- the parse options to useDataParser
public SplitOptions getSplitOptions()
public void setSplitOptions(SplitOptions options)
options
- the split options to useDataParser
public boolean getPessimisticSplitting()
public void setPessimisticSplitting(boolean enabled)
enabled
- indicates whether to use
pessimistic splittingpublic boolean getReadOnClient()
true
if the reader will execute
on the client, false
otherwise.public void setReadOnClient(boolean enabled)
This normally only needs to be set if the source is not accessible from the cluster; for example, when the file resides only on the client.
enabled
- whether to perform reads on the clientprotected void compose(CompositionContext ctx)
CompositeOperator
OperatorComposable.add(O)
OperatorComposable.connect(P, P)
. This includes
connections from the composite's input ports to sub-operators, connections between sub-operators, and
connections from sub-operators output ports to the composite's output portscompose
in class CompositeOperator
ctx
- the contextCopyright © 2020 Actian Corporation. All rights reserved.