Class ReadDelimitedText
- java.lang.Object
- com.pervasive.datarush.operators.AbstractLogicalOperator
- com.pervasive.datarush.operators.CompositeOperator
- com.pervasive.datarush.operators.io.AbstractReader
- com.pervasive.datarush.operators.io.textfile.AbstractTextReader
- com.pervasive.datarush.operators.io.textfile.ReadDelimitedText
- All Implemented Interfaces:
LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>
public final class ReadDelimitedText extends AbstractTextReader
Reads a text file of delimited records as record tokens. Records are identified by the presence of a non-empty, user-defined record separator sequence between each individual record. Output records contain the same fields as the input text. The reader can also filter and/or reorder the fields of the output, as requested.
Delimited text supports up to three distinct user-defined sequences within a record, used to identify field boundaries:
- a field separator, found between individual fields; by default, this is "," (comma).
- a field start delimiter, marking the beginning of a field value; by default, this is "\"" (double quote).
- a field end delimiter, marking the end of a field value; by default, this is "\"" (double quote).
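As a rough illustration of how the three sequences interact, the following sketch splits a single record into field values. It is not the DataRush parser: the class DelimitedFieldSketch and the method splitRecord are hypothetical names, escaped delimiters are not handled, and an empty start delimiter is assumed to mean that field values are not delimited.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the parser used by ReadDelimitedText. */
final class DelimitedFieldSketch {

    /**
     * Splits one record into field values. A field that begins with the start
     * delimiter runs to the next end delimiter; any other field runs to the
     * next separator. Escaped delimiters are not handled.
     */
    static List<String> splitRecord(String record, String separator, String start, String end) {
        if (separator == null || separator.isEmpty()) {
            throw new IllegalArgumentException("separator must be non-empty");
        }
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (true) {
            if (!start.isEmpty() && record.startsWith(start, pos)) {
                int valueStart = pos + start.length();
                int close = record.indexOf(end, valueStart);
                if (close < 0) {
                    // Unterminated delimited field: take the rest of the record.
                    fields.add(record.substring(valueStart));
                    return fields;
                }
                fields.add(record.substring(valueStart, close));
                // Text between the end delimiter and the next separator is ignored.
                int next = record.indexOf(separator, close + end.length());
                if (next < 0) {
                    return fields;
                }
                pos = next + separator.length();
            } else {
                int next = record.indexOf(separator, pos);
                if (next < 0) {
                    fields.add(record.substring(pos));
                    return fields;
                }
                fields.add(record.substring(pos, next));
                pos = next + separator.length();
            }
        }
    }

    public static void main(String[] args) {
        // The comma inside the quoted value is not treated as a field boundary.
        System.out.println(splitRecord("a,\"b,c\",d", ",", "\"", "\"")); // prints [a, b,c, d]
    }
}
```

Distinct start and end delimiters work the same way; for example, a record such as <a,b>|c with separator "|", start "<" and end ">" yields the two values a,b and c.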
The reader supports incomplete specification of the separators and delimiters. By default, it will attempt to automatically discover these values based on analysis of a sample of the file. See DelimitedTextAnalyzer for more information. It is strongly suggested, however, that this discovery ability not be relied upon if these values are already known, as it cannot be guaranteed to produce desirable results in all cases.
The reader requires a RecordTextSchema to provide parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed via the API provided, although this metadata is often persisted externally. StructuredSchemaReader provides support for reading in Pervasive DataIntegrator structured schema descriptors (.schema files) for use with readers. Because delimited text has explicit field markers, it is also possible to perform automated discovery of the schema based on the contents of the file; the reader provides a pluggable discovery mechanism to support this function. By default, the schema will be automatically discovered, with all fields assumed to be strings. Discovered fields are named using the header row if present. Otherwise, names are sequentially generated.
Normally, the output of the reader includes all records in the file, both those with and without parsing errors. Fields which cannot be parsed are null valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
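The treatment of parsing errors described above can be pictured with a small standalone sketch. It assumes a single integer-typed column; FieldErrorSketch, parseIntOrNull and parseColumn are hypothetical names and do not belong to the DataRush API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the error handling code of the reader. */
final class FieldErrorSketch {

    /** Parses the text as an int, returning null when it cannot be parsed. */
    static Integer parseIntOrNull(String text) {
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /**
     * Parses every value of a column. Failed values become null unless
     * dropFailed is set, in which case the failed rows are filtered out.
     */
    static List<Integer> parseColumn(List<String> values, boolean dropFailed) {
        List<Integer> out = new ArrayList<>();
        for (String value : values) {
            Integer parsed = parseIntOrNull(value);
            if (parsed == null && dropFailed) {
                continue; // filter the failed record from the output
            }
            out.add(parsed);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> column = List.of("1", "x", "3");
        System.out.println(parseColumn(column, false)); // prints [1, null, 3]
        System.out.println(parseColumn(column, true));  // prints [1, 3]
    }
}
```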
Delimited text data may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. The reader must be told whether a header row exists or not. If it does, the parser will skip the header row; otherwise the first row is treated as a record and will appear in the output.
Delimited text files can be parsed in parallel under "optimistic" assumptions: namely, that parse splits do not occur in the middle of a delimited field value and somewhere before an escaped record separator. This is assumed by default, but can be disabled, with an accompanying reduction of scalability and performance.
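To see why the optimistic assumption matters, consider a naive way of aligning a split to a record boundary: scan forward from the split offset to the next record separator. The sketch below is an assumption about how such alignment could work, not the splitting logic of the reader; SplitBoundarySketch and nextRecordStart are hypothetical names.

```java
/** Illustrative sketch only; not the split handling used by the reader. */
final class SplitBoundarySketch {

    /**
     * Returns the index of the first character after the next record
     * separator found at or after offset, or text.length() if there is none.
     */
    static int nextRecordStart(String text, int offset, String recordSeparator) {
        int hit = text.indexOf(recordSeparator, offset);
        return hit < 0 ? text.length() : hit + recordSeparator.length();
    }

    public static void main(String[] args) {
        // The first record holds a quoted value with an embedded newline.
        String data = "1,\"two\nlines\"\n2,plain\n";

        // A split inside the quoted value stops at the embedded newline,
        // so parsing would resume in the middle of the first record.
        System.out.println(nextRecordStart(data, 4, "\n")); // prints 7

        // A split after the embedded newline finds the true record boundary.
        System.out.println(nextRecordStart(data, 8, "\n")); // prints 14
    }
}
```

When the data may contain record separators inside delimited values, the first case shows the kind of misalignment that optimistic splitting cannot detect.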
-
-
Field Summary
Fields
Modifier and Type | Field | Description
static int | DEFAULT_ANALYSIS_DEPTH | The default number of characters analyzed when performing structure and schema discovery
-
Fields inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
encodingProps
-
Fields inherited from class com.pervasive.datarush.operators.io.AbstractReader
options, output
-
-
Constructor Summary
Constructors
Constructor | Description
ReadDelimitedText() | Reads an empty source with default settings.
ReadDelimitedText(Path path) | Reads the file specified by the path as delimited text using default options.
ReadDelimitedText(ByteSource source) | Reads the specified data source using default options.
ReadDelimitedText(String pattern) | Reads all paths matching the specified pattern as delimited text using default options.
-
Method Summary
Modifier and Type | Method | Description
void | autoConfigure(FileClient ctx) | Performs any configured discovery on the operator using the current source and applies the result to configuration.
ReadDelimitedText | clone() | Creates a copy of the reader with identical settings.
protected DataFormat | computeFormat(CompositionContext ctx) | Determines the data format for the source.
RecordTextSchema<?> | discoverSchema(FileClient ctx) | Run schema discovery using current configuration.
int | getAnalysisDepth() | Gets the number of characters to read for schema discovery and structural analysis of the file.
boolean | getAutoDiscoverNewline() | Indicates whether the reader should attempt to discover the newline style (UNIX or DOS) used in the source.
FieldDelimiterSettings | getDelimiters() | Gets the field delimiter settings used by the reader.
String | getDiscoveryNullIndicator() | Gets the text value used to represent null values by default in discovered schemas.
TextTypes.StringConversion | getDiscoveryStringHandling() | Gets the default behavior for processing string-valued types in discovered schemas.
String | getFieldEndDelimiter() | Gets the end of field delimiter.
String | getFieldSeparator() | Gets the delimiter used to distinguish field boundaries.
String | getFieldStartDelimiter() | Gets the start of field delimiter.
boolean | getHeader() | Indicates whether a header row is expected in the source data.
int | getHeaderSkipCount() | Gets the number of lines to skip at the beginning of the file.
String | getLineComment() | Gets the character sequence indicating a line comment.
int | getMaxRowLength() | Gets the limit, in characters, for the first row.
String | getRecordSeparator() | Gets the value used as a record separator.
RecordTextSchema<?> | getSchema() | Gets the record schema of the delimited text source.
TextRecordDiscoverer | getSchemaDiscovery() | Gets the schema discoverer to use on the delimited text source.
boolean | getValidateRecordSeparator() | Gets whether the configured record separator should be validated.
void | setAnalysisDepth(int count) | Sets the number of characters to read for performing schema discovery and structural analysis.
void | setAutoDiscoverNewline(boolean enabled) | Configures whether the reader attempts to discover the newline style (UNIX or DOS) used in the source.
void | setDelimiters(FieldDelimiterSettings settings) | Sets the field delimiter settings for the reader.
void | setDiscoveryNullIndicator(String value) | Sets the text value used to represent null values by default in discovered schemas.
void | setDiscoveryStringHandling(TextTypes.StringConversion behavior) | Sets the default behavior for processing string-valued types in discovered schemas.
void | setFieldDelimiter(String delimiter) | Sets the delimiter used to denote the boundaries of a data field.
void | setFieldEndDelimiter(String delimiter) | Sets the delimiter used to denote the end of a data field.
void | setFieldSeparator(String separator) | Sets the delimiter used to define the boundary between data fields.
void | setFieldStartDelimiter(String delimiter) | Sets the delimiter used to denote the beginning of a data field.
void | setHeader(boolean header) | Configures whether to expect a header row in the source.
void | setHeaderSkipCount(int count) | Sets the number of lines to skip at the beginning of the file before reading.
void | setLineComment(String lineComment) | Sets the character sequence indicating a line comment.
void | setMaxRowLength(int maxRowLength) | Deprecated. Use setValidateRecordSeparator(boolean) in conjunction with setAnalysisDepth(int) instead.
void | setRecordSeparator(String separator) | Sets the value to use as a record separator.
void | setSchema(RecordTextSchema<?> schema) | Sets the record schema expected in the delimited text source.
void | setSchemaDiscovery(TextRecordDiscoverer discoverer) | Sets the schema discoverer to use against the delimited text source.
void | setSchemaDiscovery(List<TypePattern> patterns) | Enables schema discovery using the default discoverer extended with additional typing patterns.
void | setValidateRecordSeparator(boolean enabled) | Sets whether the configured record separator should be validated.
-
Methods inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
getCharset, getCharsetName, getDecodeBuffer, getEncoding, getErrorAction, getReplacement, setCharset, setCharsetName, setDecodeBuffer, setEncoding, setErrorAction, setReplacement
-
Methods inherited from class com.pervasive.datarush.operators.io.AbstractReader
compose, getExtraFieldAction, getFieldErrorAction, getFieldLengthThreshold, getIncludeSourceInfo, getMissingFieldAction, getOutput, getParseOptions, getPessimisticSplitting, getReadBuffer, getReadOnClient, getRecordWarningThreshold, getSelectedFields, getSource, getSplitOptions, getUseMetadata, setExtraFieldAction, setFieldErrorAction, setFieldLengthThreshold, setIncludeSourceInfo, setMissingFieldAction, setParseErrorAction, setParseOptions, setPessimisticSplitting, setReadBuffer, setReadOnClient, setRecordWarningThreshold, setSelectedFields, setSelectedFields, setSource, setSource, setSource, setSplitOptions, setUseMetadata
-
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts
-
-
-
-
Field Detail
-
DEFAULT_ANALYSIS_DEPTH
public static final int DEFAULT_ANALYSIS_DEPTH
The default number of characters analyzed when performing structure and schema discovery.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
ReadDelimitedText
public ReadDelimitedText()
Reads an empty source with default settings. The source must be set before execution or an error will be raised.
A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
- See Also:
AbstractReader.setSource(ByteSource)
-
ReadDelimitedText
public ReadDelimitedText(String pattern)
Reads all paths matching the specified pattern as delimited text using default options. Any matching path which is a directory is replaced with all files in the directory; this expansion is not applied recursively.
A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
- Parameters:
pattern
- a path-matching pattern
- See Also:
FileClient.matchPaths(String)
-
ReadDelimitedText
public ReadDelimitedText(Path path)
Reads the file specified by the path as delimited text using default options. If the path refers to a directory, all files in the directory are read; this expansion is not applied recursively.
A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
- Parameters:
path
- the path to read
-
ReadDelimitedText
public ReadDelimitedText(ByteSource source)
Reads the specified data source using default options.
A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
- Parameters:
source
- the data source to read
-
-
Method Detail
-
clone
public ReadDelimitedText clone()
Creates a copy of the reader with identical settings. This is a deep copy; subsequent changes in the reader are not reflected in the clone and vice-versa.
-
getAnalysisDepth
public int getAnalysisDepth()
Gets the number of characters to read for schema discovery and structural analysis of the file.
- Returns:
- the number of characters which will be analyzed
-
setAnalysisDepth
public void setAnalysisDepth(int count)
Sets the number of characters to read for performing schema discovery and structural analysis. This setting is ignored if no discovery is being performed; that is, separators and delimiters are fully specified and a schema is provided. The default setting is 1M characters.
- Parameters:
count
- the number of characters to use to determine the schema and/or file structure
-
getSchema
public RecordTextSchema<?> getSchema()
Gets the record schema of the delimited text source. If schema discovery is enabled, this will return null.
- Returns:
- the record schema of the source
-
setSchema
public void setSchema(RecordTextSchema<?> schema)
Sets the record schema expected in the delimited text source. Output records will have this schema, adjusted accordingly for any configured field selection.
Setting a schema overrides any previously configured schema discovery.
- Parameters:
schema
- the expected record schema of the source
- See Also:
setSchemaDiscovery(TextRecordDiscoverer), AbstractReader.setSelectedFields(java.util.List)
-
getSchemaDiscovery
public TextRecordDiscoverer getSchemaDiscovery()
Gets the schema discoverer to use on the delimited text source. If schema discovery is disabled, this will return null.
- Returns:
- the configured schema discoverer
-
setSchemaDiscovery
public void setSchemaDiscovery(TextRecordDiscoverer discoverer)
Sets the schema discoverer to use against the delimited text source. Just prior to graph execution the source will be examined using the discoverer to determine the output schema for records. If reading multiple files, the schema is determined using the first file. Output records will have the discovered schema, adjusted accordingly for any configured field selection.
By default, the schema will be discovered automatically. All fields are assumed to be strings and the field names are taken from the header row. If no header is present, field names will be generated in sequence: "field0", "field1", ...
Setting schema discovery overrides any previously configured schema.
- Parameters:
discoverer
- the schema discoverer to use.
- See Also:
setSchema(RecordTextSchema), AbstractReader.setSelectedFields(java.util.List)
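The default naming rule described above can be sketched in a few lines. This is a standalone illustration under the assumption that the first row has already been split into values; FieldNameSketch and fieldNames are hypothetical names, not part of the DataRush API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the discoverer used by the reader. */
final class FieldNameSketch {

    /**
     * Derives field names from the first row: the header values when a header
     * is present, otherwise generated names "field0", "field1", ...
     */
    static List<String> fieldNames(String[] firstRow, boolean hasHeader) {
        if (hasHeader) {
            return new ArrayList<>(Arrays.asList(firstRow));
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < firstRow.length; i++) {
            names.add("field" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames(new String[] {"id", "name"}, true));      // prints [id, name]
        System.out.println(fieldNames(new String[] {"7", "Ann", "x"}, false)); // prints [field0, field1, field2]
    }
}
```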
-
setSchemaDiscovery
public void setSchemaDiscovery(List<TypePattern> patterns)
Enables schema discovery using the default discoverer extended with additional typing patterns. The supplied patterns supplement, rather than replace, the normal discovery typing patterns. If overriding default rules is desired, use setSchemaDiscovery(TextRecordDiscoverer) with an appropriately configured discoverer instead.
- Parameters:
patterns
- the additional patterns to apply at lower precedence than default patterns
- See Also:
PatternBasedDiscovery
-
getDiscoveryNullIndicator
public String getDiscoveryNullIndicator()
Gets the text value used to represent null values by default in discovered schemas.
- Returns:
- the string indicating a null value
-
setDiscoveryNullIndicator
public void setDiscoveryNullIndicator(String value)
Sets the text value used to represent null values by default in discovered schemas. By default, this is the empty string. If schema discovery is not enabled, this setting is ignored.
- Parameters:
value
- the string indicating a null value
-
getDiscoveryStringHandling
public TextTypes.StringConversion getDiscoveryStringHandling()
Gets the default behavior for processing string-valued types in discovered schemas.
- Returns:
- how string-valued types should be converted from text
-
setDiscoveryStringHandling
public void setDiscoveryStringHandling(TextTypes.StringConversion behavior)
Sets the default behavior for processing string-valued types in discovered schemas. By default, whitespace is not trimmed from values and the empty string is treated as null. If schema discovery is not enabled, this setting is ignored.
- Parameters:
behavior
- indicates how string-valued types should be converted from text
-
getHeader
public boolean getHeader()
Indicates whether a header row is expected in the source data.
- Returns:
- whether a header row is expected
-
setHeader
public void setHeader(boolean header)
Configures whether to expect a header row in the source. The header row will be skipped when parsing. If schema discovery is enabled, fields will derive their names from the header. If reading multiple files, either all files or no files must have a header; a mixture of files with and without headers is not allowed.
- Parameters:
header
- indicates whether the source has a header row
-
getHeaderSkipCount
public int getHeaderSkipCount()
Gets the number of lines to skip at the beginning of the file.
- Returns:
- the number of lines at the start of the file to skip
-
setHeaderSkipCount
public void setHeaderSkipCount(int count)
Sets the number of lines to skip at the beginning of the file before reading. Skipped lines are ignored for discovery purposes (except for newline discovery). By default, no lines are skipped.
- Parameters:
count
- the number of lines at the start of the file to skip
-
getLineComment
public String getLineComment()
Gets the character sequence indicating a line comment.
- Returns:
- the sequence marking a line comment
-
setLineComment
public void setLineComment(String lineComment)
Sets the character sequence indicating a line comment. If this sequence is discovered immediately following a record, everything up to the next record separator is ignored by the parser.
- Parameters:
lineComment
- the character sequence marking the start of a line comment
-
getMaxRowLength
public int getMaxRowLength()
Gets the limit, in characters, for the first row.
- Returns:
- the maximum first row length.
-
setMaxRowLength
@Deprecated public void setMaxRowLength(int maxRowLength)
Deprecated. Use setValidateRecordSeparator(boolean) in conjunction with setAnalysisDepth(int) instead.
Sets the limit, in characters, for the first row. If set to 0, no limit is enforced; this is the default.
This setting can be used to catch errors involving a misconfigured record separator prior to graph execution. If a limit is configured, the source will be read just prior to executing the graph. If no record separator is found prior to reaching the limit, a RowTooLongException is thrown.
- Parameters:
maxRowLength
- the limit on the size of the first row
- See Also:
setAutoDiscoverNewline(boolean), setRecordSeparator(String)
-
getDelimiters
public FieldDelimiterSettings getDelimiters()
Gets the field delimiter settings used by the reader.
- Returns:
- the field delimiter settings
-
setDelimiters
public void setDelimiters(FieldDelimiterSettings settings)
Sets the field delimiter settings for the reader. This sets all field delimiter settings at once.
- Parameters:
settings
- the field delimiter settings to use
-
getValidateRecordSeparator
public boolean getValidateRecordSeparator()
Gets whether the configured record separator should be validated.
- Returns:
true if the record separator will be validated during analysis, false otherwise
-
setValidateRecordSeparator
public void setValidateRecordSeparator(boolean enabled)
Sets whether the configured record separator should be validated. If enabled and no record separator is found prior to reaching the configured value for getAnalysisDepth(), an error will be raised during file analysis prior to execution. This setting is only meaningful if a specific record separator has been provided. It is ignored if automated discovery of the record separator is enabled.
- Parameters:
enabled
- indicates whether to enable record separator validation
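A minimal sketch of the kind of check this setting implies is shown below. It assumes the separator must appear entirely within the first analysisDepth characters; SeparatorValidationSketch and separatorFound are hypothetical names, and the actual validation performed by the reader may differ.

```java
/** Illustrative sketch only; not the validation code of the reader. */
final class SeparatorValidationSketch {

    /**
     * Reports whether the record separator occurs within the first
     * analysisDepth characters of the text.
     */
    static boolean separatorFound(String text, String recordSeparator, int analysisDepth) {
        int limit = Math.max(0, Math.min(analysisDepth, text.length()));
        return text.substring(0, limit).contains(recordSeparator);
    }

    public static void main(String[] args) {
        // Correctly configured separator: found well inside the analyzed prefix.
        System.out.println(separatorFound("a,b\nc,d\n", "\n", 10)); // prints true

        // Misconfigured separator: the data uses ';' between records, not newline.
        System.out.println(separatorFound("a,b;c,d;", "\n", 10)); // prints false
    }
}
```

A false result corresponds to the situation in which the reader would raise an error before execution, which is usually a sign that the configured separator does not match the data.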
-
getRecordSeparator
public String getRecordSeparator()
Gets the value used as a record separator.
- Returns:
- the text value of the record separator
-
setRecordSeparator
public void setRecordSeparator(String separator)
Sets the value to use as a record separator. The separator is used to parse the input text into records.
By default the record separator is set to the default record separator for the installed operating system of the execution environment.
- Parameters:
separator
- the value to use as a record separator
- Throws:
com.pervasive.datarush.graphs.physical.InvalidPropertyValueException
- if the separator is null or the empty string
- See Also:
setAutoDiscoverNewline(boolean)
-
getAutoDiscoverNewline
public boolean getAutoDiscoverNewline()
Indicates whether the reader should attempt to discover the newline style (UNIX or DOS) used in the source.
- Returns:
- whether the newline style should be discovered
-
setAutoDiscoverNewline
public void setAutoDiscoverNewline(boolean enabled)
Configures whether the reader attempts to discover the newline style (UNIX or DOS) used in the source. The discovered newline is then used as the record separator. If enabled, the source will be read just prior to graph execution. If reading multiple files, the newline style is determined using the first file.
- Parameters:
enabled
- indicates whether to enable newline discovery
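One simple way to distinguish the two newline styles is to inspect the first line feed in a sample of the source. The sketch below is an assumption about how such discovery could be done, not the algorithm used by the reader; NewlineSketch and discoverNewline are hypothetical names.

```java
/** Illustrative sketch only; not the newline discovery used by the reader. */
final class NewlineSketch {

    /**
     * Returns "\r\n" (DOS) if the first line feed in the sample is preceded
     * by a carriage return, "\n" (UNIX) if it is not, and the fallback value
     * if the sample contains no line feed at all.
     */
    static String discoverNewline(String sample, String fallback) {
        int lineFeed = sample.indexOf('\n');
        if (lineFeed < 0) {
            return fallback;
        }
        boolean dos = lineFeed > 0 && sample.charAt(lineFeed - 1) == '\r';
        return dos ? "\r\n" : "\n";
    }

    public static void main(String[] args) {
        System.out.println(discoverNewline("a,b\r\nc,d\r\n", "\n").length()); // prints 2
        System.out.println(discoverNewline("a,b\nc,d\n", "\r\n").length());   // prints 1
    }
}
```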
-
getFieldSeparator
public String getFieldSeparator()
Gets the delimiter used to distinguish field boundaries.
- Returns:
- the string used to separate fields
-
setFieldSeparator
public void setFieldSeparator(String separator)
Sets the delimiter used to define the boundary between data fields.
- Parameters:
separator
- string used to separate fields
- Throws:
com.pervasive.datarush.graphs.physical.InvalidPropertyValueException
- if the delimiter is null or the empty string
-
setFieldDelimiter
public void setFieldDelimiter(String delimiter)
Sets the delimiter used to denote the boundaries of a data field.
This method is generally equivalent to calling setFieldStartDelimiter() and setFieldEndDelimiter() with the same parameter values. However, those methods do not allow the empty string as a parameter.
- Parameters:
delimiter
- string used to optionally mark the start and end of a field value. An empty string indicates field values are not delimited.
- Throws:
com.pervasive.datarush.graphs.physical.InvalidPropertyValueException
- if the delimiter is null
-
getFieldStartDelimiter
public String getFieldStartDelimiter()
Gets the start of field delimiter.
- Returns:
- the string used to mark the beginning of a field value
-
setFieldStartDelimiter
public void setFieldStartDelimiter(String delimiter)
Sets the delimiter used to denote the beginning of a data field. It is not permitted to set the start delimiter to the empty string; use setFieldDelimiter(String) instead to indicate no delimiters.
- Parameters:
delimiter
- string used to mark the start of a field value
- Throws:
com.pervasive.datarush.graphs.physical.InvalidPropertyValueException
- if the delimiter is null or the empty string
-
getFieldEndDelimiter
public String getFieldEndDelimiter()
Gets the end of field delimiter.
- Returns:
- the string used to mark the end of a field value
-
setFieldEndDelimiter
public void setFieldEndDelimiter(String delimiter)
Sets the delimiter used to denote the end of a data field. It is not permitted to set the end delimiter to the empty string; use setFieldDelimiter(String) instead to indicate no delimiters.
- Parameters:
delimiter
- string used to mark the end of a field value
- Throws:
com.pervasive.datarush.graphs.physical.InvalidPropertyValueException
- if the delimiter is null or the empty string
-
computeFormat
protected DataFormat computeFormat(CompositionContext ctx)
Description copied from class: AbstractReader
Determines the data format for the source. The returned format is used during composition to construct a ReadSource operator. If an implementation supports schema discovery, it must be performed in this method.
- Specified by:
computeFormat in class AbstractReader
- Parameters:
ctx
- the composition context for the current invocation of AbstractReader.compose(CompositionContext)
- Returns:
- the source format to use
-
autoConfigure
public void autoConfigure(FileClient ctx) throws IOException
Performs any configured discovery on the operator using the current source and applies the result to configuration. After execution, the operator will be configured so as not to require any pre-execution analysis of the source. In particular:
- All delimiter settings will be discovered, if necessary, and set; discovery of these settings will subsequently be disabled.
- A schema will be discovered, if necessary, and set; schema discovery will subsequently be disabled.
- The record separator will be validated, if necessary; record separator validation will subsequently be disabled.
- Parameters:
ctx
- the authorization context to use for accessing the source
- Throws:
IOException
- if errors occur during discovery and analysis of the source
-
discoverSchema
public RecordTextSchema<?> discoverSchema(FileClient ctx)
Runs schema discovery using the current configuration.
- Parameters:
ctx
- the authorization context to use for accessing the file
- Returns:
- the predicted schema of the source
-
-