public final class ReadDelimitedText extends AbstractTextReader
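Delimited text supports up to three distinct user-defined sequences within a record, used to identify field boundaries:
- a field separator, marking the boundary between fields
- a field start delimiter, marking the beginning of a field value
- a field end delimiter, marking the end of a field value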
The reader supports incomplete specification of the separators and delimiters. By default, it will attempt to automatically discover these values based on analysis of a sample of the file. See DelimitedTextAnalyzer for more information. It is strongly suggested, however, that this discovery ability not be relied upon if these values are already known, as it cannot be guaranteed to produce desirable results in all cases.
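To make the roles of the separator and the two delimiters concrete, the following plain-Java sketch splits a single record. It is not the reader's parser: the class `DelimitedRecordSketch` and the method `splitRecord` are hypothetical names, and the rules it applies (a delimited value runs to the next end delimiter, and all three sequences are non-empty) are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; this is not the DataRush parser.
final class DelimitedRecordSketch {

    // Splits one record into field values. A field that begins with startDelim
    // runs to the next endDelim, so separators inside it are kept as data.
    // Assumption: separator, startDelim and endDelim are all non-empty.
    static List<String> splitRecord(String record, String separator,
                                    String startDelim, String endDelim) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (true) {
            if (record.startsWith(startDelim, pos)) {
                // Delimited value: take everything up to the end delimiter.
                int valueStart = pos + startDelim.length();
                int valueEnd = record.indexOf(endDelim, valueStart);
                if (valueEnd < 0) {
                    throw new IllegalArgumentException("unterminated field at " + pos);
                }
                fields.add(record.substring(valueStart, valueEnd));
                pos = valueEnd + endDelim.length();
            } else {
                // Undelimited value: take everything up to the next separator.
                int next = record.indexOf(separator, pos);
                int valueEnd = (next < 0) ? record.length() : next;
                fields.add(record.substring(pos, valueEnd));
                pos = valueEnd;
            }
            if (pos >= record.length()) {
                return fields;
            }
            if (!record.startsWith(separator, pos)) {
                throw new IllegalArgumentException("expected separator at " + pos);
            }
            pos += separator.length();
        }
    }

    public static void main(String[] args) {
        // The comma inside [ ] is data, not a field boundary.
        System.out.println(splitRecord("1,[a,b],x", ",", "[", "]")); // prints [1, a,b, x]
    }
}
```

Using distinct start and end delimiters here is only for readability; the same sketch works when both are the same sequence, such as a double quote.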
The reader requires a RecordTextSchema to provide parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed via the API provided, although this metadata is often persisted externally. StructuredSchemaReader provides support for reading in Pervasive DataIntegrator structured schema descriptors (.schema files) for use with readers. Because delimited text has explicit field markers, it is also possible to perform automated discovery of the schema based on the contents of the file; the reader provides a pluggable discovery mechanism to support this function. By default, the schema will be automatically discovered, with all fields assumed to be strings. Discovered fields are named using the header row if present. Otherwise, names are sequentially generated.
Normally, the output of the reader includes all records in the file, both those with and those without parsing errors. Fields which cannot be parsed are null-valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
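The two behaviors can be pictured with a small plain-Java sketch. It assumes every field is an integer and treats any null in the parsed record as a failure; `FieldErrorSketch`, `parseRecord`, `parseAll`, and the `dropFailed` flag are hypothetical names used only for illustration and are not part of the reader's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not how the reader is implemented.
final class FieldErrorSketch {

    // Parses each text field as an integer; text that cannot be parsed becomes null.
    static Integer[] parseRecord(String[] textFields) {
        Integer[] out = new Integer[textFields.length];
        for (int i = 0; i < textFields.length; i++) {
            try {
                out[i] = Integer.valueOf(textFields[i].trim());
            } catch (NumberFormatException e) {
                out[i] = null; // the record is kept, the field is null-valued
            }
        }
        return out;
    }

    // Keeps every record by default; drops records with a failed field when dropFailed is set.
    static List<Integer[]> parseAll(List<String[]> rows, boolean dropFailed) {
        List<Integer[]> result = new ArrayList<>();
        for (String[] row : rows) {
            Integer[] parsed = parseRecord(row);
            boolean failed = Arrays.asList(parsed).contains(null);
            if (!(dropFailed && failed)) {
                result.add(parsed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(new String[] {"1", "2"}, new String[] {"3", "x"});
        System.out.println(parseAll(rows, false).size()); // prints 2
        System.out.println(parseAll(rows, true).size());  // prints 1
    }
}
```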
Delimited text data may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. The reader must be told whether a header row exists or not. If it does, the parser will skip the header row; otherwise the first row is treated as a record and will appear in the output.
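The effect of the header setting on the output can be sketched in a few lines of plain Java. `HeaderRowSketch` and `dataRows` are hypothetical names; the sketch only models which rows end up as records and says nothing about how the reader itself is implemented.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only.
final class HeaderRowSketch {

    // Returns the rows that would be treated as data records.
    static List<String> dataRows(List<String> rows, boolean header) {
        if (header && !rows.isEmpty()) {
            return rows.subList(1, rows.size()); // the header row is skipped
        }
        return rows; // the first row is an ordinary record
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("id,name", "1,Ann", "2,Bob");
        System.out.println(dataRows(rows, true));  // prints [1,Ann, 2,Bob]
        System.out.println(dataRows(rows, false)); // prints [id,name, 1,Ann, 2,Bob]
    }
}
```

The second call shows the consequence of a wrong setting: a file that has a header row, read as if it had none, yields the header as its first record.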
Delimited text files can be parsed in parallel under "optimistic" assumptions: namely, that a parse split does not fall in the middle of a delimited field value at a point somewhere before an escaped record separator. This is assumed by default, but the assumption can be disabled, with an accompanying reduction in scalability and performance.
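The following plain-Java sketch illustrates why the assumption matters. It compares two hypothetical ways of locating the first record that begins after a split offset: one takes the next record separator at face value, the other scans from the start of the data and ignores separators inside quoted values. Neither method is taken from the reader; `SplitBoundarySketch` and both method names are assumptions made for the example, and the quote character stands in for a field delimiter.

```java
// Illustrative sketch only; not the reader's splitting logic.
final class SplitBoundarySketch {

    // Optimistic: the first record separator at or after the offset ends the current record.
    static int optimisticNextRecord(String text, int offset, char recordSep) {
        int i = text.indexOf(recordSep, offset);
        return (i < 0) ? text.length() : i + 1;
    }

    // Scanning: walks from the start, tracking whether the position is inside a quoted value,
    // and only accepts a record separator that lies outside quotes.
    static int scannedNextRecord(String text, int offset, char recordSep, char quote) {
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;
            } else if (c == recordSep && !inQuotes && i >= offset) {
                return i + 1;
            }
        }
        return text.length();
    }

    public static void main(String[] args) {
        // Two records; the first contains a newline inside a quoted value.
        String text = "1,\"a\nb\"\n2,\"c\"\n";
        // Offset 3 falls inside the quoted value, before the embedded newline.
        System.out.println(optimisticNextRecord(text, 3, '\n'));    // prints 5
        System.out.println(scannedNextRecord(text, 3, '\n', '"'));  // prints 8
    }
}
```

With the split at offset 3, the optimistic lookup resumes at index 5, in the middle of the first record, whereas the scan finds the true start of the second record at index 8. The scan has to read everything before the split, which is the kind of cost the paragraph above alludes to.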
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_ANALYSIS_DEPTH: The default number of characters analyzed when performing structure and schema discovery |
encodingProps
options, output
Constructor and Description |
---|
ReadDelimitedText(): Reads an empty source with default settings. |
ReadDelimitedText(ByteSource source): Reads the specified data source using default options. |
ReadDelimitedText(Path path): Reads the file specified by the path as delimited text using default options. |
ReadDelimitedText(String pattern): Reads all paths matching the specified pattern as delimited text using default options. |
Modifier and Type | Method and Description |
---|---|
void | autoConfigure(FileClient ctx): Performs any configured discovery on the operator using the current source and applies the result to configuration. |
ReadDelimitedText | clone(): Creates a copy of the reader with identical settings. |
protected DataFormat | computeFormat(CompositionContext ctx): Determines the data format for the source. |
RecordTextSchema<?> | discoverSchema(FileClient ctx): Run schema discovery using current configuration. |
int | getAnalysisDepth(): Gets the number of characters to read for schema discovery and structural analysis of the file. |
boolean | getAutoDiscoverNewline(): Indicates whether the reader should attempt to discover the newline style (UNIX or DOS) used in the source. |
FieldDelimiterSettings | getDelimiters(): Gets the field delimiter settings used by the reader. |
String | getDiscoveryNullIndicator(): Gets the text value used to represent null values by default in discovered schemas. |
TextTypes.StringConversion | getDiscoveryStringHandling(): Gets the default behavior for processing string-valued types in discovered schemas. |
String | getFieldEndDelimiter(): Gets the end of field delimiter. |
String | getFieldSeparator(): Gets the delimiter used to distinguish field boundaries. |
String | getFieldStartDelimiter(): Gets the start of field delimiter. |
boolean | getHeader(): Indicates whether a header row is expected in the source data. |
int | getHeaderSkipCount(): Gets the number of lines to skip at the beginning of the file. |
String | getLineComment(): Gets the character sequence indicating a line comment. |
int | getMaxRowLength(): Gets the limit, in characters, for the first row. |
String | getRecordSeparator(): Gets the value used as a record separator. |
RecordTextSchema<?> | getSchema(): Gets the record schema of the delimited text source. |
TextRecordDiscoverer | getSchemaDiscovery(): Gets the schema discoverer to use on the delimited text source. |
boolean | getValidateRecordSeparator(): Gets whether the configured record separator should be validated. |
void | setAnalysisDepth(int count): Sets the number of characters to read for performing schema discovery and structural analysis. |
void | setAutoDiscoverNewline(boolean enabled): Configures whether the reader attempts to discover the newline style (UNIX or DOS) used in the source. |
void | setDelimiters(FieldDelimiterSettings settings): Sets the field delimiter settings for the reader. |
void | setDiscoveryNullIndicator(String value): Sets the text value used to represent null values by default in discovered schemas. |
void | setDiscoveryStringHandling(TextTypes.StringConversion behavior): Sets the default behavior for processing string-valued types in discovered schemas. |
void | setFieldDelimiter(String delimiter): Sets the delimiter used to denote the boundaries of a data field. |
void | setFieldEndDelimiter(String delimiter): Sets the delimiter used to denote the end of a data field. |
void | setFieldSeparator(String separator): Sets the delimiter used to define the boundary between data fields. |
void | setFieldStartDelimiter(String delimiter): Sets the delimiter used to denote the beginning of a data field. |
void | setHeader(boolean header): Configures whether to expect a header row in the source. |
void | setHeaderSkipCount(int count): Sets the number of lines to skip at the beginning of the file before reading. |
void | setLineComment(String lineComment): Sets the character sequence indicating a line comment. |
void | setMaxRowLength(int maxRowLength): Deprecated. Use setValidateRecordSeparator(boolean) in conjunction with setAnalysisDepth(int) instead. |
void | setRecordSeparator(String separator): Sets the value to use as a record separator. |
void | setSchema(RecordTextSchema<?> schema): Sets the record schema expected in the delimited text source. |
void | setSchemaDiscovery(List<TypePattern> patterns): Enables schema discovery using the default discoverer extended with additional typing patterns. |
void | setSchemaDiscovery(TextRecordDiscoverer discoverer): Sets the schema discoverer to use against the delimited text source. |
void | setValidateRecordSeparator(boolean enabled): Sets whether the configured record separator should be validated. |
getCharset, getCharsetName, getDecodeBuffer, getEncoding, getErrorAction, getReplacement, setCharset, setCharsetName, setDecodeBuffer, setEncoding, setErrorAction, setReplacement
compose, getExtraFieldAction, getFieldErrorAction, getFieldLengthThreshold, getIncludeSourceInfo, getMissingFieldAction, getOutput, getParseOptions, getPessimisticSplitting, getReadBuffer, getReadOnClient, getRecordWarningThreshold, getSelectedFields, getSource, getSplitOptions, getUseMetadata, setExtraFieldAction, setFieldErrorAction, setFieldLengthThreshold, setIncludeSourceInfo, setMissingFieldAction, setParseErrorAction, setParseOptions, setPessimisticSplitting, setReadBuffer, setReadOnClient, setRecordWarningThreshold, setSelectedFields, setSelectedFields, setSource, setSource, setSource, setSplitOptions, setUseMetadata
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
disableParallelism, getInputPorts, getOutputPorts

public static final int DEFAULT_ANALYSIS_DEPTH
The default number of characters analyzed when performing structure and schema discovery.

public ReadDelimitedText()
Reads an empty source with default settings. A default schema of all string-typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
See Also: AbstractReader.setSource(ByteSource)

public ReadDelimitedText(String pattern)
Reads all paths matching the specified pattern as delimited text using default options. A default schema of all string-typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
Parameters: pattern - a path-matching pattern
See Also: FileClient.matchPaths(String)

public ReadDelimitedText(Path path)
Reads the file specified by the path as delimited text using default options. A default schema of all string-typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
Parameters: path - the path to read

public ReadDelimitedText(ByteSource source)
Reads the specified data source using default options. A default schema of all string-typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
Parameters: source - the data source to read

public ReadDelimitedText clone()
Creates a copy of the reader with identical settings.
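
public int getAnalysisDepth()
Gets the number of characters to read for schema discovery and structural analysis of the file.

public void setAnalysisDepth(int count)
Sets the number of characters to read for performing schema discovery and structural analysis.
Parameters: count - the number of characters to use to determine the schema and/or file structure

public RecordTextSchema<?> getSchema()
Gets the record schema of the delimited text source.
Returns: the record schema; this may be null.

public void setSchema(RecordTextSchema<?> schema)
Sets the record schema expected in the delimited text source. Setting a schema overrides any previously configured schema discovery.
Parameters: schema - the expected record schema of the source
See Also: setSchemaDiscovery(TextRecordDiscoverer), AbstractReader.setSelectedFields(java.util.List)

public TextRecordDiscoverer getSchemaDiscovery()
Gets the schema discoverer to use on the delimited text source.
Returns: the schema discoverer; this may be null.

public void setSchemaDiscovery(TextRecordDiscoverer discoverer)
Sets the schema discoverer to use against the delimited text source.
By default, the schema will be discovered automatically. All fields are assumed to be strings and the field names are taken from the header row. If no header is present, field names will be generated in sequence: "field0", "field1", ...
Setting schema discovery overrides any previously configured schema.
Parameters: discoverer - the schema discoverer to use
See Also: setSchema(RecordTextSchema), AbstractReader.setSelectedFields(java.util.List)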
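The default naming rule described for setSchemaDiscovery(TextRecordDiscoverer) can be sketched in plain Java as follows. `FieldNameSketch` and `fieldNames` are hypothetical names, and passing a null header list to mean "no header row" is a convention of this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the default discoverer.
final class FieldNameSketch {

    // Uses the header values as names when a header row exists;
    // otherwise generates "field0", "field1", ... for fieldCount fields.
    static List<String> fieldNames(List<String> headerFields, int fieldCount) {
        if (headerFields != null) {
            return new ArrayList<>(headerFields);
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fieldCount; i++) {
            names.add("field" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames(Arrays.asList("id", "name"), 2)); // prints [id, name]
        System.out.println(fieldNames(null, 3)); // prints [field0, field1, field2]
    }
}
```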
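
public void setSchemaDiscovery(List<TypePattern> patterns)
Enables schema discovery using the default discoverer extended with additional typing patterns. For other discovery behavior, use setSchemaDiscovery(TextRecordDiscoverer) with an appropriately configured discoverer instead.
Parameters: patterns - the additional patterns to apply at lower precedence than default patterns
See Also: PatternBasedDiscovery

public String getDiscoveryNullIndicator()
Gets the text value used to represent null values by default in discovered schemas.

public void setDiscoveryNullIndicator(String value)
Sets the text value used to represent null values by default in discovered schemas.
Parameters: value - the string indicating a null value

public TextTypes.StringConversion getDiscoveryStringHandling()
Gets the default behavior for processing string-valued types in discovered schemas.

public void setDiscoveryStringHandling(TextTypes.StringConversion behavior)
Sets the default behavior for processing string-valued types in discovered schemas.
Parameters: behavior - indicates how string-valued types should be converted from text

public boolean getHeader()
Indicates whether a header row is expected in the source data.

public void setHeader(boolean header)
Configures whether to expect a header row in the source.
Parameters: header - indicates whether the source has a header row

public int getHeaderSkipCount()
Gets the number of lines to skip at the beginning of the file.

public void setHeaderSkipCount(int count)
Sets the number of lines to skip at the beginning of the file before reading.
Parameters: count - the number of lines at the start of the file to skip

public String getLineComment()
Gets the character sequence indicating a line comment.

public void setLineComment(String lineComment)
Sets the character sequence indicating a line comment.
Parameters: lineComment - the character sequence marking the start of a line comment

public int getMaxRowLength()
Gets the limit, in characters, for the first row.

@Deprecated public void setMaxRowLength(int maxRowLength)
Deprecated. Use setValidateRecordSeparator(boolean) in conjunction with setAnalysisDepth(int) instead.
If the limit is 0, no limit is enforced; this is the default.
This setting can be used to catch errors involving a misconfigured record separator prior to graph execution. If a limit is configured, the source will be read just prior to executing the graph. If no record separator is found prior to reaching the limit, a RowTooLongException is thrown.
Parameters: maxRowLength - the limit on the size of the first row
See Also: setAutoDiscoverNewline(boolean), setRecordSeparator(String)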
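The first-row check described for setMaxRowLength(int) can be approximated in plain Java as shown below. This is a sketch under stated assumptions: `FirstRowCheckSketch` and `checkFirstRow` are hypothetical names, the `sample` string stands in for the start of the source, a source with no separator at all is measured by its full length, and `IllegalStateException` is used as a stand-in for the library's RowTooLongException.

```java
// Illustrative sketch only; not the reader's validation code.
final class FirstRowCheckSketch {

    // Fails if the first row is longer than maxRowLength characters.
    // A limit of 0 means no limit is enforced; negative values are treated
    // the same way here, which is an assumption of this sketch.
    static void checkFirstRow(String sample, String recordSeparator, int maxRowLength) {
        if (maxRowLength <= 0) {
            return;
        }
        int sepIndex = sample.indexOf(recordSeparator);
        int firstRowLength = (sepIndex < 0) ? sample.length() : sepIndex;
        if (firstRowLength > maxRowLength) {
            // Stand-in for RowTooLongException.
            throw new IllegalStateException(
                "no record separator within the first " + maxRowLength + " characters");
        }
    }

    public static void main(String[] args) {
        String sample = "id,name\n1,Ann\n";
        checkFirstRow(sample, "\n", 20); // separator found after 7 characters: accepted
        try {
            checkFirstRow(sample, "\r", 5); // wrong separator configured: rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The second call shows the misconfiguration case the setting is meant to catch: with the wrong record separator, the whole sample looks like one very long first row.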

public FieldDelimiterSettings getDelimiters()
Gets the field delimiter settings used by the reader.

public void setDelimiters(FieldDelimiterSettings settings)
Sets the field delimiter settings for the reader.
Parameters: settings - the field delimiter settings to use

public boolean getValidateRecordSeparator()
Gets whether the configured record separator should be validated.
Returns: true if the record separator will be validated during analysis, false otherwise

public void setValidateRecordSeparator(boolean enabled)
Sets whether the configured record separator should be validated. If validation is enabled and the record separator is not found within the number of characters given by getAnalysisDepth(), an error will be raised during file analysis prior to execution.
This setting is only meaningful if a specific record separator has been provided. It is ignored if automated discovery of the record separator is enabled.
Parameters: enabled - indicates whether to enable record separator validation

public String getRecordSeparator()
Gets the value used as a record separator.

public void setRecordSeparator(String separator)
Sets the value to use as a record separator. By default, the record separator is the default record separator for the operating system of the execution environment.
Parameters: separator - the value to use as a record separator
Throws: com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the separator is null or the empty string
See Also: setAutoDiscoverNewline(boolean)
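As a rough picture of what choosing between the UNIX and DOS newline styles involves, the plain-Java sketch below inspects a sample of text and falls back to the platform line separator when the sample contains no line break. It is not the reader's discovery logic; `NewlineStyleSketch` and `detectNewline` are hypothetical names, and looking only at the first line break is a simplification.

```java
// Illustrative sketch only; not the reader's newline discovery.
final class NewlineStyleSketch {

    // Returns "\r\n" (DOS) or "\n" (UNIX) based on the first line break in the sample.
    // With no line break to inspect, falls back to the platform default.
    static String detectNewline(String sample) {
        int lf = sample.indexOf('\n');
        if (lf < 0) {
            return System.lineSeparator();
        }
        boolean precededByCr = lf > 0 && sample.charAt(lf - 1) == '\r';
        return precededByCr ? "\r\n" : "\n";
    }

    public static void main(String[] args) {
        System.out.println(detectNewline("a,b\r\nc,d\r\n").equals("\r\n")); // prints true
        System.out.println(detectNewline("a,b\nc,d\n").equals("\n"));       // prints true
    }
}
```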

public boolean getAutoDiscoverNewline()
Indicates whether the reader should attempt to discover the newline style (UNIX or DOS) used in the source.

public void setAutoDiscoverNewline(boolean enabled)
Configures whether the reader attempts to discover the newline style (UNIX or DOS) used in the source.
Parameters: enabled - indicates whether to enable newline discovery

public String getFieldSeparator()
Gets the delimiter used to distinguish field boundaries.

public void setFieldSeparator(String separator)
Sets the delimiter used to define the boundary between data fields.
Parameters: separator - string used to separate fields
Throws: com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the separator is null or the empty string

public void setFieldDelimiter(String delimiter)
Sets the delimiter used to denote the boundaries of a data field.
This method is generally equivalent to calling setFieldStartDelimiter() and setFieldEndDelimiter() with the same parameter values. However, those methods do not allow the empty string as a parameter.
Parameters: delimiter - string used to optionally mark the start and end of a field value. An empty string indicates field values are not delimited.
Throws: com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null

public String getFieldStartDelimiter()
Gets the start of field delimiter.

public void setFieldStartDelimiter(String delimiter)
Sets the delimiter used to denote the beginning of a data field. Use setFieldDelimiter(String) instead to indicate no delimiters.
Parameters: delimiter - string used to mark the start of a field value
Throws: com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null or the empty string

public String getFieldEndDelimiter()
Gets the end of field delimiter.

public void setFieldEndDelimiter(String delimiter)
Sets the delimiter used to denote the end of a data field. Use setFieldDelimiter(String) instead to indicate no delimiters.
Parameters: delimiter - string used to mark the end of a field value
Throws: com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null or the empty string

protected DataFormat computeFormat(CompositionContext ctx)
Description copied from class: AbstractReader
Determines the data format for the source (see the ReadSource operator). If an implementation supports schema discovery, it must be performed in this method.
computeFormat is declared in class AbstractReader.
Parameters: ctx - the composition context for the current invocation of AbstractReader.compose(CompositionContext)

public void autoConfigure(FileClient ctx) throws IOException
Performs any configured discovery on the operator using the current source and applies the result to configuration.
Parameters: ctx - the authorization context to use for accessing the source
Throws: IOException - if errors occur during discovery and analysis of the source

public RecordTextSchema<?> discoverSchema(FileClient ctx)
Run schema discovery using current configuration.
Parameters: ctx - the authorization context to use for accessing the file

Copyright © 2020 Actian Corporation. All rights reserved.