ReadDelimitedText (Dataflow Library Distribution Project 7.0.2-37 API)

java.lang.Object
- com.pervasive.datarush.operators.AbstractLogicalOperator
  - com.pervasive.datarush.operators.CompositeOperator
    - com.pervasive.datarush.operators.io.AbstractReader
      - com.pervasive.datarush.operators.io.textfile.AbstractTextReader
        - com.pervasive.datarush.operators.io.textfile.ReadDelimitedText

All Implemented Interfaces:

LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>
```
public final class ReadDelimitedText
extends AbstractTextReader
```
Reads a text file of delimited records as record tokens. Records are identified by the presence of a non-empty, user-defined record separator sequence between each individual record. Output records contain the same fields as the input text. The reader can also filter and/or reorder the fields of the output, as requested.
Delimited text supports up to three distinct user-defined sequences within a record, used to identify field boundaries:
- a field separator, found between individual fields; by default, this is "," (comma).
- a field start delimiter, marking the beginning of a field value; by default, this is "\"" (double quote).
- a field end delimiter, marking the end of a field value; by default, this is "\"" (double quote).
The field separator cannot be empty. The start and end delimiters can be the same value. They can also both (but not individually) be empty, signifying the absence of field delimiters. It is not expected that all fields start and end with a delimiter, though if one starts with a delimiter it must end with one. Fields containing significant characters, such as whitespace and the record and field separators, must be delimited to avoid parsing errors. Should a delimited field need to contain the end delimiter, it is escaped from its normal interpretation by duplicating it. For instance, the value "ab""c" represents a delimited field value of ab"c.
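The delimiter and escaping rules above can be illustrated with a small stand-alone sketch. It is not the reader's parser: `DelimitedFieldSketch` and `splitRecord` are hypothetical names, and the sketch assumes single-character separators and delimiters, identical start and end delimiters, and a record that has already been isolated from its record separator.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; not part of the Dataflow library.
final class DelimitedFieldSketch {

    /** Splits one record into field values, honoring delimited fields and doubled delimiters. */
    public static List<String> splitRecord(String record, char separator, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inDelimited = false;
        int i = 0;
        while (i < record.length()) {
            char c = record.charAt(i);
            if (inDelimited) {
                if (c == delimiter) {
                    if (i + 1 < record.length() && record.charAt(i + 1) == delimiter) {
                        current.append(delimiter); // doubled delimiter is a literal delimiter
                        i += 2;
                        continue;
                    }
                    inDelimited = false; // single delimiter closes the field
                } else {
                    current.append(c); // separators inside a delimited field are plain text
                }
            } else if (c == delimiter && current.length() == 0) {
                inDelimited = true; // a delimiter at the start of a field opens it
            } else if (c == separator) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
            i++;
        }
        fields.add(current.toString()); // the last field has no trailing separator
        return fields;
    }

    public static void main(String[] args) {
        // The third field is written as "ab""c" and is read as ab"c.
        System.out.println(splitRecord("a,\"b,c\",\"ab\"\"c\"", ',', '"'));
    }
}
```

The second field in the usage line shows why fields containing the field separator must be delimited: without the delimiters, `b,c` would be read as two fields.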
The reader supports incomplete specification of the separators and delimiters. By default, it will attempt to automatically discover these values based on analysis of a sample of the file. See DelimitedTextAnalyzer for more information. It is strongly suggested, however, that this discovery ability not be relied upon if these values are already known, as it cannot be guaranteed to produce desirable results in all cases.
The reader requires a RecordTextSchema to provide parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed via the API provided, although this metadata is often persisted externally. StructuredSchemaReader provides support for reading in Pervasive DataIntegrator structured schema descriptors (.schema files) for use with readers. Because delimited text has explicit field markers, it is also possible to perform automated discovery of the schema based on the contents of the file; the reader provides a pluggable discovery mechanism to support this function. By default, the schema will be automatically discovered, with all fields assumed to be strings. Discovered fields are named using the header row if present. Otherwise, names are sequentially generated.
Normally, the output of the reader includes all records in the file, both those with and without parsing errors. Fields which cannot be parsed are null valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
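As a rough illustration of the two behaviors just described, the following sketch parses rows of text as integers. `ParseErrorSketch`, `parseRows`, and the `dropFailed` flag are hypothetical and do not correspond to the reader's configuration API; the sketch only shows the idea of null-valued fields versus filtered records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; not part of the Dataflow library.
final class ParseErrorSketch {

    /** Returns the parsed value, or null when the text is not a valid int. */
    public static Integer parseIntOrNull(String text) {
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Parses every field; rows containing a failed field are dropped only when dropFailed is true. */
    public static List<List<Integer>> parseRows(List<List<String>> rows, boolean dropFailed) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<String> row : rows) {
            List<Integer> parsed = new ArrayList<>();
            boolean failed = false;
            for (String field : row) {
                Integer value = parseIntOrNull(field);
                failed |= (value == null);
                parsed.add(value); // a failed field stays in the record as null
            }
            if (!(failed && dropFailed)) {
                out.add(parsed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(Arrays.asList("1", "2"), Arrays.asList("3", "x"));
        System.out.println(parseRows(rows, false)); // keeps the failed record, with a null field
        System.out.println(parseRows(rows, true));  // filters the failed record
    }
}
```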
Delimited text data may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. The reader must be told whether a header row exists or not. If it does, the parser will skip the header row; otherwise the first row is treated as a record and will appear in the output.
Delimited text files can be parsed in parallel under "optimistic" assumptions: namely, that parse splits do not occur in the middle of a delimited field value at a point before an escaped record separator. This is assumed by default, but the assumption can be disabled, at the cost of reduced scalability and performance.
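To see why the assumption matters, consider a worker that starts at an arbitrary offset and looks for the next record boundary. The sketch below is one way to model this, not the library's algorithm: `SplitAssumptionSketch` and `optimisticRecordStart` are hypothetical, and a single-character delimiter is assumed. The scan assumes the offset lies outside any delimited field, so a split that lands inside a field containing a record separator yields a false boundary.

```java
// Illustrative only; not part of the Dataflow library.
final class SplitAssumptionSketch {

    /**
     * Returns the index just past the first record separator found at or after offset,
     * assuming (optimistically) that offset is not inside a delimited field.
     */
    public static int optimisticRecordStart(String text, int offset, String recordSep, char delimiter) {
        boolean inDelimited = false; // the optimistic assumption
        int i = offset;
        while (i < text.length()) {
            if (text.charAt(i) == delimiter) {
                inDelimited = !inDelimited; // doubled delimiters toggle twice and cancel out
                i++;
            } else if (!inDelimited && text.startsWith(recordSep, i)) {
                return i + recordSep.length();
            } else {
                i++;
            }
        }
        return text.length();
    }

    public static void main(String[] args) {
        // Two records; the first has a delimited field containing a newline.
        String text = "1,\"a\nb\"\n2,c\n";
        // Split outside the delimited field: finds the real start of the second record.
        System.out.println(optimisticRecordStart(text, 1, "\n", '"'));
        // Split inside the delimited field: stops at the embedded newline instead.
        System.out.println(optimisticRecordStart(text, 3, "\n", '"'));
    }
}
```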

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`DEFAULT_ANALYSIS_DEPTH` The default number of characters analyzed when performing structure and schema discovery
- Fields inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
  encodingProps
- Fields inherited from class com.pervasive.datarush.operators.io.AbstractReader
  options, output

Constructor Summary

Constructors
Constructor and Description
`ReadDelimitedText()` Reads an empty source with default settings.
`ReadDelimitedText(ByteSource source)` Reads the specified data source using default options.
`ReadDelimitedText(Path path)` Reads the file specified by the path as delimited text using default options.
`ReadDelimitedText(String pattern)` Reads all paths matching the specified pattern as delimited text using default options.

Method Summary

Modifier and Type	Method and Description
`void`	`autoConfigure(FileClient ctx)` Performs any configured discovery on the operator using the current source and applies the result to configuration.
`ReadDelimitedText`	`clone()` Creates a copy of the reader with identical settings.
`protected DataFormat`	`computeFormat(CompositionContext ctx)` Determines the data format for the source.
`RecordTextSchema<?>`	`discoverSchema(FileClient ctx)` Run schema discovery using current configuration.
`int`	`getAnalysisDepth()` Gets the number of characters to read for schema discovery and structural analysis of the file.
`boolean`	`getAutoDiscoverNewline()` Indicates whether the reader should attempt to discover the newline style (UNIX or DOS) used in the source.
`FieldDelimiterSettings`	`getDelimiters()` Gets the field delimiter settings used by the reader.
`String`	`getDiscoveryNullIndicator()` Gets the text value used to represent null values by default in discovered schemas.
`TextTypes.StringConversion`	`getDiscoveryStringHandling()` Gets the default behavior for processing string-valued types in discovered schemas.
`String`	`getFieldEndDelimiter()` Gets the end of field delimiter.
`String`	`getFieldSeparator()` Gets the delimiter used to distinguish field boundaries.
`String`	`getFieldStartDelimiter()` Gets the start of field delimiter.
`boolean`	`getHeader()` Indicates whether a header row is expected in the source data.
`int`	`getHeaderSkipCount()` Gets the number of lines to skip at the beginning of the file.
`String`	`getLineComment()` Gets the character sequence indicating a line comment.
`int`	`getMaxRowLength()` Gets the limit, in characters, for the first row.
`String`	`getRecordSeparator()` Gets the value used as a record separator.
`RecordTextSchema<?>`	`getSchema()` Gets the record schema of the delimited text source.
`TextRecordDiscoverer`	`getSchemaDiscovery()` Gets the schema discoverer to use on the delimited text source.
`boolean`	`getValidateRecordSeparator()` Gets whether the configured record separator should be validated.
`void`	`setAnalysisDepth(int count)` Sets the number of characters to read for performing schema discovery and structural analysis.
`void`	`setAutoDiscoverNewline(boolean enabled)` Configures whether the reader attempts to discover the newline style (UNIX or DOS) used in the source.
`void`	`setDelimiters(FieldDelimiterSettings settings)` Sets the field delimiter settings for the reader.
`void`	`setDiscoveryNullIndicator(String value)` Sets the text value used to represent null values by default in discovered schemas.
`void`	`setDiscoveryStringHandling(TextTypes.StringConversion behavior)` Sets the default behavior for processing string-valued types in discovered schemas.
`void`	`setFieldDelimiter(String delimiter)` Sets the delimiter used to denote the boundaries of a data field.
`void`	`setFieldEndDelimiter(String delimiter)` Sets the delimiter used to denote the end of a data field.
`void`	`setFieldSeparator(String separator)` Sets the delimiter used to define the boundary between data fields.
`void`	`setFieldStartDelimiter(String delimiter)` Sets the delimiter used to denote the beginning of a data field.
`void`	`setHeader(boolean header)` Configures whether to expect a header row in the source.
`void`	`setHeaderSkipCount(int count)` Sets the number of lines to skip at the beginning of the file before reading.
`void`	`setLineComment(String lineComment)` Sets the character sequence indicating a line comment.
`void`	`setMaxRowLength(int maxRowLength)` Deprecated. Use `setValidateRecordSeparator(boolean)` in conjunction with `setAnalysisDepth(int)` instead.
`void`	`setRecordSeparator(String separator)` Sets the value to use as a record separator.
`void`	`setSchema(RecordTextSchema<?> schema)` Sets the record schema expected in the delimited text source.
`void`	`setSchemaDiscovery(List<TypePattern> patterns)` Enables schema discovery using the default discoverer extended with additional typing patterns.
`void`	`setSchemaDiscovery(TextRecordDiscoverer discoverer)` Sets the schema discoverer to use against the delimited text source.
`void`	`setValidateRecordSeparator(boolean enabled)` Sets whether the configured record separator should be validated.

Methods inherited from class com.pervasive.datarush.operators.io.textfile.AbstractTextReader
getCharset, getCharsetName, getDecodeBuffer, getEncoding, getErrorAction, getReplacement, setCharset, setCharsetName, setDecodeBuffer, setEncoding, setErrorAction, setReplacement

Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.pervasive.datarush.operators.LogicalOperator
disableParallelism, getInputPorts, getOutputPorts

- Field Detail
  - DEFAULT_ANALYSIS_DEPTH
```
public static final int DEFAULT_ANALYSIS_DEPTH
```
    The default number of characters analyzed when performing structure and schema discovery
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - ReadDelimitedText
```
public ReadDelimitedText()
```
    Reads an empty source with default settings. The source must be set before execution or an error will be raised.
    A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
    
    See Also:
    
    AbstractReader.setSource(ByteSource)
  - ReadDelimitedText
```
public ReadDelimitedText(String pattern)
```
    Reads all paths matching the specified pattern as delimited text using default options. Any matching path which is a directory is replaced with all files in the directory; this expansion is not applied recursively.
    A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
    
    Parameters:
    
    pattern - a path-matching pattern
    
    See Also:
    
    FileClient.matchPaths(String)
  - ReadDelimitedText
```
public ReadDelimitedText(Path path)
```
    Reads the file specified by the path as delimited text using default options. If the path refers to a directory, all files in the directory are read; this expansion is not applied recursively.
    A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
    
    Parameters:
    
    path - the path to read
  - ReadDelimitedText
```
public ReadDelimitedText(ByteSource source)
```
    Reads the specified data source using default options.
    A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).
    
    Parameters:
    
    source - the data source to read
- Method Detail
  - clone
```
public ReadDelimitedText clone()
```
    Creates a copy of the reader with identical settings. This is a deep copy; subsequent changes in the reader are not reflected in the clone and vice-versa.
    
    Overrides:
    
    clone in class Object
  - getAnalysisDepth
```
public int getAnalysisDepth()
```
    Gets the number of characters to read for schema discovery and structural analysis of the file.
    
    Returns:
    
    the number of characters which will be analyzed
  - setAnalysisDepth
```
public void setAnalysisDepth(int count)
```
    Sets the number of characters to read for performing schema discovery and structural analysis. This setting is ignored if no discovery is being performed; that is, separators and delimiters are fully specified and a schema is provided. The default setting is 1M characters.
    
    Parameters:
    
    count - the number of characters to use to determine the schema and/or file structure
  - getSchema
```
public RecordTextSchema<?> getSchema()
```
    Gets the record schema of the delimited text source. If schema discovery is enabled, this will return null.
    
    Returns:
    
    the record schema of the source
  - setSchema
```
public void setSchema(RecordTextSchema<?> schema)
```
    Sets the record schema expected in the delimited text source. Output records will have this schema, adjusted accordingly for any configured field selection.
    Setting a schema overrides any previously configured schema discovery.
    
    Parameters:
    
    schema - the expected record schema of the source
    
    See Also:
    
    setSchemaDiscovery(TextRecordDiscoverer), AbstractReader.setSelectedFields(java.util.List)
  - getSchemaDiscovery
```
public TextRecordDiscoverer getSchemaDiscovery()
```
    Gets the schema discoverer to use on the delimited text source. If schema discovery is disabled, this will return null.
    
    Returns:
    
    the configured schema discoverer
  - setSchemaDiscovery
```
public void setSchemaDiscovery(TextRecordDiscoverer discoverer)
```
    Sets the schema discoverer to use against the delimited text source. Just prior to graph execution the source will be examined using the discoverer to determine the output schema for records. If reading multiple files, the schema is determined using the first file. Output records will have the discovered schema, adjusted accordingly for any configured field selection.
    By default, the schema will be discovered automatically. All fields are assumed to be strings and the field names are taken from the header row. If no header is present, field names will be generated in sequence: "field0", "field1", ...
    Setting schema discovery overrides any previously configured schema.
    
    Parameters:
    
    discoverer - the schema discoverer to use.
    
    See Also:
    
    setSchema(RecordTextSchema), AbstractReader.setSelectedFields(java.util.List)
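    The default naming rule described above (header names if present, otherwise "field0", "field1", ...) can be sketched as follows. `FieldNameSketch` and `fieldNames` are hypothetical names used for illustration; the sketch assumes the first row has already been split into field values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; not part of the Dataflow library.
final class FieldNameSketch {

    /** Uses the first row as names when a header is present; otherwise generates names in sequence. */
    public static List<String> fieldNames(List<String> firstRow, boolean hasHeader) {
        if (hasHeader) {
            return new ArrayList<>(firstRow);
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < firstRow.size(); i++) {
            names.add("field" + i); // field0, field1, ...
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames(Arrays.asList("id", "name"), true));
        System.out.println(fieldNames(Arrays.asList("17", "Ada"), false));
    }
}
```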
  - setSchemaDiscovery
```
public void setSchemaDiscovery(List<TypePattern> patterns)
```
    Enables schema discovery using the default discoverer extended with additional typing patterns. The supplied patterns supplement, rather than replace, the normal discovery typing patterns. If overriding default rules is desired, use setSchemaDiscovery(TextRecordDiscoverer) with an appropriately configured discoverer instead.
    
    Parameters:
    
    patterns - the additional patterns to apply at lower precedence than default patterns
    
    See Also:
    
    PatternBasedDiscovery
  - getDiscoveryNullIndicator
```
public String getDiscoveryNullIndicator()
```
    Gets the text value used to represent null values by default in discovered schemas.
    
    Returns:
    
    the string indicating a null value
  - setDiscoveryNullIndicator
```
public void setDiscoveryNullIndicator(String value)
```
    Sets the text value used to represent null values by default in discovered schemas. By default, this is the empty string. If schema discovery is not enabled, this setting is ignored.
    
    Parameters:
    
    value - the string indicating a null value
  - getDiscoveryStringHandling
```
public TextTypes.StringConversion getDiscoveryStringHandling()
```
    Gets the default behavior for processing string-valued types in discovered schemas.
    
    Returns:
    
    how string-valued types should be converted from text
  - setDiscoveryStringHandling
```
public void setDiscoveryStringHandling(TextTypes.StringConversion behavior)
```
    Sets the default behavior for processing string-valued types in discovered schemas. By default, whitespace is not trimmed from values and the empty string is treated as null. If schema discovery is not enabled, this setting is ignored.
    
    Parameters:
    
    behavior - indicates how string-valued types should be converted from text
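    The default behavior stated above (no trimming, empty string treated as null) can be pictured with a small conversion function. This is a sketch under assumptions: `StringHandlingSketch`, `convert`, and its two boolean flags are hypothetical stand-ins and do not reflect the actual shape of TextTypes.StringConversion.

```java
// Illustrative only; not part of the Dataflow library.
final class StringHandlingSketch {

    /** Converts raw field text to a string value, optionally trimming and mapping empty text to null. */
    public static String convert(String raw, boolean trim, boolean emptyIsNull) {
        String value = trim ? raw.trim() : raw;
        return (emptyIsNull && value.isEmpty()) ? null : value;
    }

    public static void main(String[] args) {
        // The documented default: whitespace is kept, the empty string becomes null.
        System.out.println("[" + convert("  a ", false, true) + "]");
        System.out.println(convert("", false, true));
    }
}
```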
  - getHeader
```
public boolean getHeader()
```
    Indicates whether a header row is expected in the source data.
    
    Returns:
    
    whether a header row is expected
  - setHeader
```
public void setHeader(boolean header)
```
    Configures whether to expect a header row in the source. The header row will be skipped when parsing. If schema discovery is enabled, fields will derive their names from the header. If reading multiple files, all or no files must have a header; a mixture of files with and without headers is not allowed.
    
    Parameters:
    
    header - indicates whether the source has a header row
  - getHeaderSkipCount
```
public int getHeaderSkipCount()
```
    Gets the number of lines to skip at the beginning of the file.
    
    Returns:
    
    the number of lines at the start of the file to skip
  - setHeaderSkipCount
```
public void setHeaderSkipCount(int count)
```
    Sets the number of lines to skip at the beginning of the file before reading. Skipped lines are ignored for discovery purposes (excepting newline discovery). By default, no lines are skipped.
    
    Parameters:
    
    count - the number of lines at the start of the file to skip
  - getLineComment
```
public String getLineComment()
```
    Gets the character sequence indicating a line comment.
    
    Returns:
    
    the sequence marking a line comment
  - setLineComment
```
public void setLineComment(String lineComment)
```
    Sets the character sequence indicating a line comment. If this sequence is discovered immediately following a record, everything up to the next record separator is ignored by the parser.
    
    Parameters:
    
    lineComment - the character sequence marking the start of a line comment
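    The effect of a line comment sequence can be approximated as below. This is a simplified sketch, not the reader's parser: `LineCommentSketch` and `records` are hypothetical, and record separators inside delimited fields are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; not part of the Dataflow library.
final class LineCommentSketch {

    /** Splits text into records, ignoring any record position that begins with the comment sequence. */
    public static List<String> records(String text, String recordSep, String lineComment) {
        boolean commentsEnabled = lineComment != null && !lineComment.isEmpty();
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = text.indexOf(recordSep, pos);
            if (end < 0) {
                end = text.length(); // last record without a trailing separator
            }
            String record = text.substring(pos, end);
            if (!(commentsEnabled && record.startsWith(lineComment))) {
                out.add(record);
            }
            pos = end + recordSep.length(); // everything up to the separator has been consumed
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("a,b\n# note\nc,d\n", "\n", "#"));
    }
}
```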
  - getMaxRowLength
```
public int getMaxRowLength()
```
    Gets the limit, in characters, for the first row.
    
    Returns:
    
    the maximum first row length.
  - setMaxRowLength
```
@Deprecated
public void setMaxRowLength(int maxRowLength)
```
    Deprecated. Use setValidateRecordSeparator(boolean) in conjunction with setAnalysisDepth(int) instead.
    
    Sets the limit, in characters, for the first row. If set to 0, no limit is enforced; this is the default.
    This setting can be used to catch errors involving a misconfigured record separator prior to graph execution. If a limit is configured, the source will be read just prior to executing the graph. If no record separator is found prior to reaching the limit, a RowTooLongException is thrown.
    
    Parameters:
    
    maxRowLength - the limit on the size of the first row
    
    See Also:
    
    setAutoDiscoverNewline(boolean), setRecordSeparator(String)
  - getDelimiters
```
public FieldDelimiterSettings getDelimiters()
```
    Gets the field delimiter settings used by the reader.
    
    Returns:
    
    the field delimiter settings
  - setDelimiters
```
public void setDelimiters(FieldDelimiterSettings settings)
```
    Sets the field delimiter settings for the reader. This sets all field delimiter settings at once.
    
    Parameters:
    
    settings - the field delimiter settings to use
  - getValidateRecordSeparator
```
public boolean getValidateRecordSeparator()
```
    Gets whether the configured record separator should be validated.
    
    Returns:
    
    true if the record separator will be validated during analysis, false otherwise
  - setValidateRecordSeparator
```
public void setValidateRecordSeparator(boolean enabled)
```
    Sets whether the configured record separator should be validated. If enabled and no record separator is found prior to reaching the configured value for getAnalysisDepth(), an error will be raised during file analysis prior to execution. This setting is only meaningful if a specific record separator has been provided. It is ignored if automated discovery of the record separator is enabled.
    
    Parameters:
    
    enabled - indicates whether to enable record separator validation
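    Conceptually, the validation checks whether the configured separator appears within the analyzed prefix of the source. The sketch below illustrates that check only; `SeparatorValidationSketch` and `separatorWithinDepth` are hypothetical names, and how the reader treats a source shorter than the analysis depth is not specified here, so the sketch simply reports whether the separator was seen.

```java
// Illustrative only; not part of the Dataflow library.
final class SeparatorValidationSketch {

    /** Returns true if recordSep occurs within the first analysisDepth characters of sample. */
    public static boolean separatorWithinDepth(String sample, String recordSep, int analysisDepth) {
        int limit = Math.min(sample.length(), Math.max(analysisDepth, 0));
        return sample.substring(0, limit).contains(recordSep);
    }

    public static void main(String[] args) {
        // A DOS separator configured against UNIX-style data is never found.
        System.out.println(separatorWithinDepth("a,b\nc,d\n", "\r\n", 8));
        System.out.println(separatorWithinDepth("a,b\r\nc,d\r\n", "\r\n", 1000));
    }
}
```

    A misconfigured separator of this kind would otherwise cause the whole file to be read as a single record, which is the situation the validation is meant to catch before execution.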
  - getRecordSeparator
```
public String getRecordSeparator()
```
    Gets the value used as a record separator.
    
    Returns:
    
    the text value of the record separator
  - setRecordSeparator
```
public void setRecordSeparator(String separator)
```
    Sets the value to use as a record separator. The separator is used to parse the input text into records.
    By default the record separator is set to the default record separator for the installed operating system of the execution environment.
    
    Parameters:
    
    separator - the value to use as a record separator
    
    Throws:
    
    com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the separator is null or the empty string
    
    See Also:
    
    setAutoDiscoverNewline(boolean)
  - getAutoDiscoverNewline
```
public boolean getAutoDiscoverNewline()
```
    Indicates whether the reader should attempt to discover the newline style (UNIX or DOS) used in the source.
    
    Returns:
    
    whether the newline style should be discovered
  - setAutoDiscoverNewline
```
public void setAutoDiscoverNewline(boolean enabled)
```
    Configures whether the reader attempts to discover the newline style (UNIX or DOS) used in the source. The discovered newline is then used as the record separator. If enabled, the source will be read just prior to graph execution. If reading multiple files, the newline style is determined using the first file.
    
    Parameters:
    
    enabled - indicates whether to enable newline discovery
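    A minimal sketch of newline-style discovery on a sample of text follows. It is an assumption about how such a check could work, not the reader's implementation: `NewlineDiscoverySketch` and `discoverNewline` are hypothetical, and falling back to the platform line separator when no newline is seen is a choice made for this sketch.

```java
// Illustrative only; not part of the Dataflow library.
final class NewlineDiscoverySketch {

    /** Returns "\r\n" (DOS) or "\n" (UNIX) based on the first line break found in the sample. */
    public static String discoverNewline(String sample) {
        int lf = sample.indexOf('\n');
        if (lf < 0) {
            return System.lineSeparator(); // assumption: fall back to the platform default
        }
        boolean precededByCr = lf > 0 && sample.charAt(lf - 1) == '\r';
        return precededByCr ? "\r\n" : "\n";
    }

    public static void main(String[] args) {
        System.out.println(discoverNewline("a,b\r\nc,d\r\n").equals("\r\n") ? "DOS" : "UNIX");
        System.out.println(discoverNewline("a,b\nc,d\n").equals("\r\n") ? "DOS" : "UNIX");
    }
}
```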
  - getFieldSeparator
```
public String getFieldSeparator()
```
    Gets the delimiter used to distinguish field boundaries.
    
    Returns:
    
    the string used to separate fields
  - setFieldSeparator
```
public void setFieldSeparator(String separator)
```
    Sets the delimiter used to define the boundary between data fields.
    
    Parameters:
    
    separator - string used to separate fields
    
    Throws:
    
    com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the separator is null or the empty string
  - setFieldDelimiter
```
public void setFieldDelimiter(String delimiter)
```
    Sets the delimiter used to denote the boundaries of a data field.
    This method is generally equivalent to calling setFieldStartDelimiter() and setFieldEndDelimiter() with the same parameter values. However, those methods do not allow the empty string as a parameter.
    
    Parameters:
    
    delimiter - string used to optionally mark the start and end of a field value. An empty string indicates field values are not delimited.
    
    Throws:
    
    com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null
  - getFieldStartDelimiter
```
public String getFieldStartDelimiter()
```
    Gets the start of field delimiter.
    
    Returns:
    
    the string used to mark the beginning of a field value
  - setFieldStartDelimiter
```
public void setFieldStartDelimiter(String delimiter)
```
    Sets the delimiter used to denote the beginning of a data field. It is not permitted to set the start delimiter to the empty string; use setFieldDelimiter(String) instead to indicate no delimiters.
    
    Parameters:
    
    delimiter - string used to mark the start of a field value
    
    Throws:
    
    com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null or the empty string
  - getFieldEndDelimiter
```
public String getFieldEndDelimiter()
```
    Gets the end of field delimiter.
    
    Returns:
    
    the string used to mark the end of a field value
  - setFieldEndDelimiter
```
public void setFieldEndDelimiter(String delimiter)
```
    Sets the delimiter used to denote the end of a data field. It is not permitted to set the end delimiter to the empty string; use setFieldDelimiter(String) instead to indicate no delimiters.
    
    Parameters:
    
    delimiter - string used to mark the end of a field value
    
    Throws:
    
    com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null or the empty string
  - computeFormat
```
protected DataFormat computeFormat(CompositionContext ctx)
```
    Description copied from class: AbstractReader
    
    Determines the data format for the source. The returned format is used during composition to construct a ReadSource operator. If an implementation supports schema discovery, it must be performed in this method.
    
    Specified by:
    
    computeFormat in class AbstractReader
    
    Parameters:
    
    ctx - the composition context for the current invocation of AbstractReader.compose(CompositionContext)
    
    Returns:
    
    the source format to use
  - autoConfigure
```
public void autoConfigure(FileClient ctx)
                   throws IOException
```
    Performs any configured discovery on the operator using the current source and applies the result to configuration. After execution, the operator will be configured so as not to require any pre-execution analysis of the source. In particular:
    - All delimiter settings will be discovered, if necessary, and set; discovery of these settings will subsequently be disabled.
    - A schema will be discovered, if necessary, and set; schema discovery will subsequently be disabled.
    - The record separator will be validated, if necessary; record separator validation will subsequently be disabled.
    Parameters:
    
    ctx - the authorization context to use for accessing the source
    
    Throws:
    
    IOException - if errors occur during discovery and analysis of the source
  - discoverSchema
```
public RecordTextSchema<?> discoverSchema(FileClient ctx)
```
    Run schema discovery using current configuration.
    
    Parameters:
    
    ctx - the authorization context to use for accessing the file
    
    Returns:
    
    the predicted schema of the source
