Class ReadDelimitedText

  • All Implemented Interfaces:
    LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>

    public final class ReadDelimitedText
    extends AbstractTextReader
    Reads a text file of delimited records as record tokens. Records are identified by the presence of a non-empty, user-defined record separator sequence between each individual record. Output records contain the same fields as the input text. The reader can also filter and/or reorder the fields of the output, as requested.

    Delimited text supports up to three distinct user-defined sequences within a record, used to identify field boundaries:

    • a field separator, found between individual fields; by default, this is "," (comma).
    • a field start delimiter, marking the beginning of a field value; by default, this is "\"" (double quote).
    • a field end delimiter, marking the end of a field value; by default, this is "\"" (double quote).
    The field separator cannot be empty. The start and end delimiters can be the same value. They can also both (but not individually) be empty, signifying the absence of field delimiters. It is not expected that all fields start and end with a delimiter, though if one starts with a delimiter it must end with one. Fields containing significant characters, such as whitespace and the record and field separators, must be delimited to avoid parsing errors. Should a delimited field need to contain the end delimiter, it is escaped from its normal interpretation by duplicating it. For instance, the value "ab""c" represents a delimited field value of ab"c.
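
    The escaping rule above can be illustrated with a minimal sketch. It is not the reader's actual parser; the class name DelimitedFieldSketch and the method decode are hypothetical, introduced only for illustration.

```java
/** Sketch only: decodes one delimited field value; not the reader's actual parser. */
final class DelimitedFieldSketch {

    /**
     * Strips the start and end delimiters from a raw field and collapses each
     * doubled end delimiter inside the value into a single occurrence.
     * Fields that are not delimited are returned unchanged.
     */
    static String decode(String raw, String start, String end) {
        boolean delimited = !start.isEmpty()
                && raw.length() >= start.length() + end.length()
                && raw.startsWith(start)
                && raw.endsWith(end);
        if (!delimited) {
            return raw;
        }
        String body = raw.substring(start.length(), raw.length() - end.length());
        // An escaped end delimiter appears doubled within the field body.
        return body.replace(end + end, end);
    }

    public static void main(String[] args) {
        // The raw text "ab""c" (including its outer quotes) decodes to ab"c.
        System.out.println(decode("\"ab\"\"c\"", "\"", "\""));
    }
}
```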

    The reader supports incomplete specification of the separators and delimiters. By default, it will attempt to automatically discover these values based on analysis of a sample of the file. See DelimitedTextAnalyzer for more information. It is strongly suggested, however, that this discovery ability not be relied upon if these values are already known, as it cannot be guaranteed to produce desirable results in all cases.

    The reader requires a RecordTextSchema to provide parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed via the API provided, although this metadata is often persisted externally. StructuredSchemaReader provides support for reading in Pervasive DataIntegrator structured schema descriptors (.schema files) for use with readers. Because delimited text has explicit field markers, it is also possible to perform automated discovery of the schema based on the contents of the file; the reader provides a pluggable discovery mechanism to support this function. By default, the schema will be automatically discovered, with all fields assumed to be strings. Discovered fields are named using the header row if present. Otherwise, names are sequentially generated.

    Normally, the output of the reader includes all records in the file, both those with and without parsing errors. Fields which cannot be parsed are null valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
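
    As a rough illustration of this behavior, the following sketch parses rows of integer fields, substitutes null for any field that fails to parse, and optionally drops records containing a failure. The names ParseErrorSketch, parseIntOrNull and parseRows are hypothetical and are not part of the reader's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: unparseable fields become null; failed records may be filtered. */
final class ParseErrorSketch {

    /** Parses text as an int, yielding null instead of failing on bad input. */
    static Integer parseIntOrNull(String text) {
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Parses every row; rows with any failed field are dropped if filterFailed is set. */
    static List<List<Integer>> parseRows(List<String[]> rows, boolean filterFailed) {
        List<List<Integer>> out = new ArrayList<>();
        for (String[] row : rows) {
            List<Integer> record = new ArrayList<>();
            boolean failed = false;
            for (String field : row) {
                Integer value = parseIntOrNull(field);
                failed |= (value == null);
                record.add(value);
            }
            if (!(filterFailed && failed)) {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(new String[] {"1", "2"}, new String[] {"3", "x"});
        System.out.println(parseRows(rows, false)); // keeps both rows, with a null field
        System.out.println(parseRows(rows, true));  // keeps only the fully parsed row
    }
}
```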

    Delimited text data may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. The reader must be told whether a header row exists or not. If it does, the parser will skip the header row; otherwise the first row is treated as a record and will appear in the output.
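
    The effect of the header setting on the output can be sketched as follows, assuming the rows have already been separated. HeaderRowSketch and dataRows are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: shows which rows count as data with and without a header row. */
final class HeaderRowSketch {

    /** Returns all rows when there is no header, otherwise all rows but the first. */
    static List<String> dataRows(List<String> rows, boolean hasHeader) {
        if (hasHeader && !rows.isEmpty()) {
            return rows.subList(1, rows.size());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("id,name", "1,alice", "2,bob");
        System.out.println(dataRows(rows, true));
        System.out.println(dataRows(rows, false));
    }
}
```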

    Delimited text files can be parsed in parallel under "optimistic" assumptions: namely, that a parse split does not occur in the middle of a delimited field value at a point before a record separator escaped within that value. This is assumed by default, but can be disabled, with an accompanying reduction of scalability and performance.
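
    The following sketch suggests why the assumption matters. It compares a scan that starts at an arbitrary offset and takes the next record separator at face value with a sequential scan that tracks delimiter state. It assumes a newline record separator and double-quote delimiters; SplitSketch and its methods are hypothetical and do not reflect how the reader is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: shows why a split inside a delimited field can mislead a parser. */
final class SplitSketch {

    /** Optimistic scan: assumes the offset is not inside a delimited field. */
    static int nextRecordStart(String text, int offset) {
        int sep = text.indexOf('\n', offset);
        return sep < 0 ? text.length() : sep + 1;
    }

    /** Sequential scan from the start of the text, tracking delimiter state. */
    static List<Integer> recordStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        boolean inField = false;
        if (!text.isEmpty()) {
            starts.add(0);
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // A doubled quote toggles twice, leaving the state unchanged.
                inField = !inField;
            } else if (c == '\n' && !inField && i + 1 < text.length()) {
                starts.add(i + 1);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "a,\"x\ny\"\nb,c\n";
        System.out.println(recordStarts(text));       // true record starts
        System.out.println(nextRecordStart(text, 3)); // split lands inside the quoted field
    }
}
```

    In this sample the sequential scan finds records starting at offsets 0 and 8, while a split at offset 3 reports offset 5, which falls in the middle of the first record.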

    • Field Detail

      • DEFAULT_ANALYSIS_DEPTH

        public static final int DEFAULT_ANALYSIS_DEPTH
        The default number of characters analyzed when performing structure and schema discovery
        See Also:
        Constant Field Values
    • Constructor Detail

      • ReadDelimitedText

        public ReadDelimitedText​(String pattern)
        Reads all paths matching the specified pattern as delimited text using default options. Any matching path which is a directory is replaced with all files in the directory; this expansion is not applied recursively.

        A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).

        Parameters:
        pattern - a path-matching pattern
        See Also:
        FileClient.matchPaths(String)
      • ReadDelimitedText

        public ReadDelimitedText​(Path path)
        Reads the file specified by the path as delimited text using default options. If the path refers to a directory, all files in the directory are read; this expansion is not applied recursively.

        A default schema of all string typed fields will be generated based on analysis of the source, unless otherwise configured via setSchema(RecordTextSchema) or setSchemaDiscovery(TextRecordDiscoverer).

        Parameters:
        path - the path to read
    • Method Detail

      • clone

        public ReadDelimitedText clone()
        Creates a copy of the reader with identical settings. This is a deep copy; subsequent changes in the reader are not reflected in the clone and vice-versa.
        Overrides:
        clone in class Object
      • getAnalysisDepth

        public int getAnalysisDepth()
        Gets the number of characters to read for schema discovery and structural analysis of the file.
        Returns:
        the number of characters which will be analyzed
      • setAnalysisDepth

        public void setAnalysisDepth​(int count)
        Sets the number of characters to read for performing schema discovery and structural analysis. This setting is ignored if no discovery is being performed; that is, separators and delimiters are fully specified and a schema is provided. The default setting is 1M characters.
        Parameters:
        count - the number of characters to use to determine the schema and/or file structure
      • getSchema

        public RecordTextSchema<?> getSchema()
        Gets the record schema of the delimited text source. If schema discovery is enabled, this will return null.
        Returns:
        the record schema of the source
      • getSchemaDiscovery

        public TextRecordDiscoverer getSchemaDiscovery()
        Gets the schema discoverer to use on the delimited text source. If schema discovery is disabled, this will return null.
        Returns:
        the configured schema discoverer
      • setSchemaDiscovery

        public void setSchemaDiscovery​(TextRecordDiscoverer discoverer)
        Sets the schema discoverer to use against the delimited text source. Just prior to graph execution the source will be examined using the discoverer to determine the output schema for records. If reading multiple files, the schema is determined using the first file. Output records will have the discovered schema, adjusted accordingly for any configured field selection.

        By default, the schema will be discovered automatically. All fields are assumed to be strings and the field names are taken from the header row. If no header is present, field names will be generated in sequence: "field0", "field1", ...
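
        The default naming rule can be sketched as below, assuming the first row has already been split into fields. FieldNameSketch and fieldNames are hypothetical names; only the generated sequence "field0", "field1", ... comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: derives field names from a header row or generates them. */
final class FieldNameSketch {

    /** Uses the first row as names if it is a header; otherwise generates fieldN names. */
    static List<String> fieldNames(String[] firstRow, boolean hasHeader) {
        if (hasHeader) {
            return new ArrayList<>(Arrays.asList(firstRow));
        }
        List<String> names = new ArrayList<>();
        for (int i = 0; i < firstRow.length; i++) {
            names.add("field" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames(new String[] {"id", "name"}, true));
        System.out.println(fieldNames(new String[] {"1", "alice"}, false));
    }
}
```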

        Setting schema discovery overrides any previously configured schema.

        Parameters:
        discoverer - the schema discoverer to use.
        See Also:
        setSchema(RecordTextSchema), AbstractReader.setSelectedFields(java.util.List)
      • setSchemaDiscovery

        public void setSchemaDiscovery​(List<TypePattern> patterns)
        Enables schema discovery using the default discoverer extended with additional typing patterns. The supplied patterns supplement, rather than replace, the normal discovery typing patterns. If overriding default rules is desired, use setSchemaDiscovery(TextRecordDiscoverer) with an appropriately configured discoverer instead.
        Parameters:
        patterns - the additional patterns to apply at lower precedence than default patterns
        See Also:
        PatternBasedDiscovery
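
        The precedence rule can be pictured with a small sketch in which typing rules are tried in order, defaults first. The rule set, the type names and the class TypePatternSketch are hypothetical stand-ins; they do not reproduce TypePattern or the patterns actually used by PatternBasedDiscovery.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Sketch only: additional typing rules apply at lower precedence than defaults. */
final class TypePatternSketch {

    /** Hypothetical stand-ins for default typing rules, in precedence order. */
    static Map<String, Pattern> defaults() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("int", Pattern.compile("[+-]?\\d+"));
        rules.put("double", Pattern.compile("[+-]?\\d+\\.\\d+"));
        return rules;
    }

    /** Returns the first type whose pattern matches every sampled value, else "string". */
    static String discoverType(List<String> values, Map<String, Pattern> additional) {
        if (values.isEmpty()) {
            return "string";
        }
        Map<String, Pattern> rules = defaults();
        // New keys are appended after the defaults; an existing default is never replaced.
        additional.forEach(rules::putIfAbsent);
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (values.stream().allMatch(v -> rule.getValue().matcher(v).matches())) {
                return rule.getKey();
            }
        }
        return "string";
    }

    public static void main(String[] args) {
        Map<String, Pattern> extra = new LinkedHashMap<>();
        extra.put("date", Pattern.compile("\\d{4}-\\d{2}-\\d{2}"));
        System.out.println(discoverType(Arrays.asList("2020-01-02", "2021-03-04"), extra));
    }
}
```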
      • getDiscoveryNullIndicator

        public String getDiscoveryNullIndicator()
        Gets the text value used to represent null values by default in discovered schemas.
        Returns:
        the string indicating a null value
      • setDiscoveryNullIndicator

        public void setDiscoveryNullIndicator​(String value)
        Sets the text value used to represent null values by default in discovered schemas. By default, this is the empty string. If schema discovery is not enabled, this setting is ignored.
        Parameters:
        value - the string indicating a null value
      • getDiscoveryStringHandling

        public TextTypes.StringConversion getDiscoveryStringHandling()
        Gets the default behavior for processing string-valued types in discovered schemas.
        Returns:
        how string-valued types should be converted from text
      • setDiscoveryStringHandling

        public void setDiscoveryStringHandling​(TextTypes.StringConversion behavior)
        Sets the default behavior for processing string-valued types in discovered schemas. By default, whitespace is not trimmed from values and the empty string is treated as null. If schema discovery is not enabled, this setting is ignored.
        Parameters:
        behavior - indicates how string-valued types should be converted from text
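
        The documented defaults (no trimming, empty string treated as null) can be sketched as a simple conversion function. How the null indicator and the string handling setting interact inside the library is not stated here, so the combination below is an assumption, and StringHandlingSketch and convert are hypothetical names.

```java
/** Sketch only: converts field text to a string value under assumed default rules. */
final class StringHandlingSketch {

    /**
     * Optionally trims the text, then maps it to null when it equals the null indicator.
     * With trim = false and nullIndicator = "", this mirrors the documented defaults.
     */
    static String convert(String text, boolean trim, String nullIndicator) {
        String value = trim ? text.trim() : text;
        return value.equals(nullIndicator) ? null : value;
    }

    public static void main(String[] args) {
        System.out.println(convert("", false, ""));       // null value
        System.out.println("[" + convert(" a ", false, "") + "]"); // whitespace preserved
    }
}
```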
      • getHeader

        public boolean getHeader()
        Indicates whether a header row is expected in the source data.
        Returns:
        whether a header row is expected
      • setHeader

        public void setHeader​(boolean header)
        Configures whether to expect a header row in the source. The header row will be skipped when parsing. If schema discovery is enabled, fields will derive their names from the header. If reading multiple files, all or no files must have a header; a mixture of files with and without headers is not allowed.
        Parameters:
        header - indicates whether the source has a header row
      • getHeaderSkipCount

        public int getHeaderSkipCount()
        Gets the number of lines to skip at the beginning of the file.
        Returns:
        the number of lines at the start of the file to skip
      • setHeaderSkipCount

        public void setHeaderSkipCount​(int count)
        Sets the number of lines to skip at the beginning of the file before reading. Skipped lines are ignored for discovery purposes (excepting newline discovery). By default, no lines are skipped.
        Parameters:
        count - the number of lines at the start of the file to skip
      • getLineComment

        public String getLineComment()
        Gets the character sequence indicating a line comment.
        Returns:
        the sequence marking a line comment
      • setLineComment

        public void setLineComment​(String lineComment)
        Sets the character sequence indicating a line comment. If this sequence is discovered immediately following a record, everything up to the next record separator is ignored by the parser.
        Parameters:
        lineComment - the character sequence marking the start of a line comment
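
        One possible reading of this rule is that a line beginning with the comment sequence is dropped in its entirety. The sketch below follows that reading and ignores field delimiters for brevity; LineCommentSketch and records are hypothetical names and the reader's actual handling may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: drops lines that begin with the line comment sequence. */
final class LineCommentSketch {

    /** Splits text on the record separator, skipping lines that start with the comment. */
    static List<String> records(String text, String recordSeparator, String lineComment) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int sep = text.indexOf(recordSeparator, pos);
            int end = sep < 0 ? text.length() : sep;
            String line = text.substring(pos, end);
            // An empty comment sequence means comments are not in use.
            if (lineComment.isEmpty() || !line.startsWith(lineComment)) {
                out.add(line);
            }
            pos = sep < 0 ? text.length() : sep + recordSeparator.length();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("a,b\n# note\nc,d\n", "\n", "#"));
    }
}
```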
      • getMaxRowLength

        public int getMaxRowLength()
        Gets the limit, in characters, for the first row.
        Returns:
        the maximum first row length.
      • setMaxRowLength

        @Deprecated
        public void setMaxRowLength​(int maxRowLength)
        Deprecated.
        Sets the limit, in characters, for the first row. If set to 0, no limit is enforced; this is the default.

        This setting can be used to catch errors involving a misconfigured record separator prior to graph execution. If a limit is configured, the source will be read just prior to executing the graph. If no record separator is found prior to reaching the limit, a RowTooLongException is thrown.

        Parameters:
        maxRowLength - the limit on the size of the first row
        See Also:
        setAutoDiscoverNewline(boolean), setRecordSeparator(String)
      • getDelimiters

        public FieldDelimiterSettings getDelimiters()
        Gets the field delimiter settings used by the reader.
        Returns:
        the field delimiter settings
      • setDelimiters

        public void setDelimiters​(FieldDelimiterSettings settings)
        Sets the field delimiter settings for the reader. This sets all field delimiter settings at once.
        Parameters:
        settings - the field delimiter settings to use
      • getValidateRecordSeparator

        public boolean getValidateRecordSeparator()
        Gets whether the configured record separator should be validated.
        Returns:
        true if the record separator will be validated during analysis, false otherwise
      • setValidateRecordSeparator

        public void setValidateRecordSeparator​(boolean enabled)
        Sets whether the configured record separator should be validated. If enabled and no record separator is found prior to reaching the configured value for getAnalysisDepth(), an error will be raised during file analysis prior to execution. This setting is only meaningful if a specific record separator has been provided. It is ignored if automated discovery of the record separator is enabled.
        Parameters:
        enabled - indicates whether to enable record separator validation
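
        The check described above amounts to looking for the configured separator within the analyzed prefix of the source. A minimal sketch, assuming the source text is already in memory; SeparatorValidationSketch and separatorFound are hypothetical names, and the real reader raises an error rather than returning a flag.

```java
/** Sketch only: checks for a record separator within the analyzed prefix. */
final class SeparatorValidationSketch {

    /** Returns true if the separator occurs within the first analysisDepth characters. */
    static boolean separatorFound(String text, String separator, int analysisDepth) {
        int limit = Math.min(Math.max(analysisDepth, 0), text.length());
        return text.substring(0, limit).contains(separator);
    }

    public static void main(String[] args) {
        System.out.println(separatorFound("a,b\r\nc,d\r\n", "\r\n", 1024));
        System.out.println(separatorFound("a,b;c,d;", "\n", 1024));
    }
}
```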
      • getRecordSeparator

        public String getRecordSeparator()
        Gets the value used as a record separator.
        Returns:
        the text value of the record separator
      • setRecordSeparator

        public void setRecordSeparator​(String separator)
        Sets the value to use as a record separator. The separator is used to parse the input text into records.

        By default the record separator is set to the default record separator for the installed operating system of the execution environment.

        Parameters:
        separator - the value to use as a record separator
        Throws:
        com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the separator is null or the empty string
        See Also:
        setAutoDiscoverNewline(boolean)
      • getAutoDiscoverNewline

        public boolean getAutoDiscoverNewline()
        Indicates whether the reader should attempt to discover the newline style (UNIX or DOS) used in the source.
        Returns:
        whether the newline style should be discovered
      • setAutoDiscoverNewline

        public void setAutoDiscoverNewline​(boolean enabled)
        Configures whether the reader attempts to discover the newline style (UNIX or DOS) used in the source. The discovered newline is then used as the record separator. If enabled, the source will be read just prior to graph execution. If reading multiple files, the newline style is determined using the first file.
        Parameters:
        enabled - indicates whether to enable newline discovery
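
        Distinguishing the two newline styles can be as simple as inspecting the character before the first line feed in a sample. The sketch below takes that approach; NewlineSketch and discoverNewline are hypothetical names, and falling back to the UNIX style when no line feed is found is an assumption rather than documented behavior.

```java
/** Sketch only: guesses the newline style (UNIX or DOS) from a text sample. */
final class NewlineSketch {

    /** Returns "\r\n" if the first line feed follows a carriage return, else "\n". */
    static String discoverNewline(String sample) {
        int lf = sample.indexOf('\n');
        if (lf > 0 && sample.charAt(lf - 1) == '\r') {
            return "\r\n";
        }
        // Assumed fallback: UNIX style, also used when the sample has no line feed.
        return "\n";
    }

    public static void main(String[] args) {
        System.out.println(discoverNewline("a,b\r\nc,d\r\n").length()); // DOS style has two characters
        System.out.println(discoverNewline("a,b\nc,d\n").length());     // UNIX style has one
    }
}
```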
      • getFieldSeparator

        public String getFieldSeparator()
        Gets the delimiter used to distinguish field boundaries.
        Returns:
        the string used to separate fields
      • setFieldSeparator

        public void setFieldSeparator​(String separator)
        Sets the delimiter used to define the boundary between data fields.
        Parameters:
        separator - string used to separate fields
        Throws:
        com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the separator is null or the empty string
      • setFieldDelimiter

        public void setFieldDelimiter​(String delimiter)
        Sets the delimiter used to denote the boundaries of a data field.

        This method is generally equivalent to calling setFieldStartDelimiter() and setFieldEndDelimiter() with the same parameter values. However, those methods do not allow the empty string as a parameter.

        Parameters:
        delimiter - string used to optionally mark the start and end of a field value. An empty string indicates field values are not delimited.
        Throws:
        com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null
      • getFieldStartDelimiter

        public String getFieldStartDelimiter()
        Gets the start of field delimiter.
        Returns:
        the string used to mark the beginning of a field value
      • setFieldStartDelimiter

        public void setFieldStartDelimiter​(String delimiter)
        Sets the delimiter used to denote the beginning of a data field. It is not permitted to set the start delimiter to the empty string; use setFieldDelimiter(String) instead to indicate no delimiters.
        Parameters:
        delimiter - string used to mark the start of a field value
        Throws:
        com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null or the empty string
      • getFieldEndDelimiter

        public String getFieldEndDelimiter()
        Gets the end of field delimiter.
        Returns:
        the string used to mark the end of a field value
      • setFieldEndDelimiter

        public void setFieldEndDelimiter​(String delimiter)
        Sets the delimiter used to denote the end of a data field. It is not permitted to set the end delimiter to the empty string; use setFieldDelimiter(String) instead to indicate no delimiters.
        Parameters:
        delimiter - string used to mark the end of a field value
        Throws:
        com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the delimiter is null or the empty string
      • autoConfigure

        public void autoConfigure​(FileClient ctx)
                           throws IOException
        Performs any configured discovery on the operator using the current source and applies the result to the configuration. After execution, the operator will be configured so as not to require any pre-execution analysis of the source. In particular:
        • All delimiter settings will be discovered, if necessary, and set; discovery of these settings will subsequently be disabled.
        • A schema will be discovered, if necessary, and set; schema discovery will subsequently be disabled.
        • The record separator will be validated, if necessary; record separator validation will subsequently be disabled.
        Parameters:
        ctx - the authorization context to use for accessing the source
        Throws:
        IOException - if errors occur during discovery and analysis of the source
      • discoverSchema

        public RecordTextSchema<?> discoverSchema​(FileClient ctx)
        Runs schema discovery using the current configuration.
        Parameters:
        ctx - the authorization context to use for accessing the file
        Returns:
        the predicted schema of the source