com.pervasive.datarush.operators.io.textfile.DelimitedTextAnalyzer

public class DelimitedTextAnalyzer extends Object

An analyzer for files containing delimited text. An analysis can perform a basic parsing of the file, permitting validation of delimiter configuration. The following information is provided as a result of analyzing a file:

- The values of fields for analyzed records.
- The record separator. If the properties specify auto-detection of newline style, the analyzer will determine whether the file uses Windows-style CRLF or UNIX-style LF.
- The field separator. If the properties specify auto-detection of the field separator, the analyzer will attempt to determine the appropriate separator from a known set: comma (','), tab ('\t'), semicolon (';'), pipe ('|'), and space (' ').
- The field delimiter. If the properties specify auto-detection of the field delimiter, the analyzer will attempt to determine the appropriate delimiter from a known set: single quote (') or double quote ("). If one cannot be determined, the text is assumed to be undelimited.
- The comment marker. If the properties specify auto-detection of the comment marker, the analyzer will attempt to determine the appropriate comment marker from a known set: #, %, and //. If one cannot be determined, it is assumed there is no comment marker.

This information can be used to generate a schema for the records, but also could be used to provide a preview of how a file would be parsed with given settings.
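To make the auto-detection described above more concrete, here is a minimal sketch of one way newline style and field separator could be inferred from a text sample. It is not the DataRush implementation, whose heuristics are not documented here. The class `SeparatorSniffer` and its methods are hypothetical names, and the rule "same non-zero count on every line" is an assumption chosen for illustration.

```java
/** Illustrative only: a stand-alone heuristic, not the library's algorithm. */
final class SeparatorSniffer {
    // Candidate set taken from the class description: comma, tab, semicolon, pipe, space.
    private static final char[] CANDIDATES = {',', '\t', ';', '|', ' '};

    /** Returns "\r\n" if the first line feed is preceded by a carriage return, otherwise "\n". */
    static String detectNewline(String sample) {
        int lf = sample.indexOf('\n');
        return (lf > 0 && sample.charAt(lf - 1) == '\r') ? "\r\n" : "\n";
    }

    /**
     * Returns the candidate that occurs the same non-zero number of times on
     * every non-empty line, preferring the highest count (earlier candidates
     * win ties). Returns null when no candidate qualifies.
     */
    static Character detectSeparator(String sample) {
        String[] lines = sample.split("\r?\n");
        Character best = null;
        int bestCount = 0;
        for (char c : CANDIDATES) {
            int first = count(lines[0], c);
            if (first == 0) {
                continue; // candidate absent from the first line
            }
            boolean consistent = true;
            for (String line : lines) {
                if (!line.isEmpty() && count(line, c) != first) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && first > bestCount) {
                best = c;
                bestCount = first;
            }
        }
        return best;
    }

    // Counts occurrences of c in s.
    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }
}
```

A heuristic like this ignores quoting, so a separator character inside a delimited field would skew the counts; a real analyzer has to consider the field delimiter as well.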

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

DelimitedTextAnalyzer.Analysis

Contains the results of an analysis of a delimited text file.
Constructor Summary

Constructors

Constructor

Description

DelimitedTextAnalyzer(FieldDelimiterSpecifier delimiters)

Creates a new analyzer which uses the given delimiter information.
Method Summary

Modifier and Type

Method

Description

DelimitedTextAnalyzer.Analysis

analyze(Path file, CharsetEncoding charsetSpec)

Analyzes the specified file based on current configuration.

DelimitedTextAnalyzer.Analysis

analyze(Path file, CharsetEncoding charsetSpec, FileClient client)

Analyzes the specified file based on current configuration.

DelimitedTextAnalyzer.Analysis

analyze(Reader input)

Analyzes the specified text stream based on current configuration.

DelimitedTextAnalyzer.Analysis

analyze(String file, CharsetEncoding charsetSpec)

Analyzes the specified file based on current configuration.

void

setAnalysisSize(int count)

Sets the maximum number of characters to use in analysis.

void

setHeaderSkipCount(int count)

Sets the number of lines to skip at the beginning of the file.

void

setLineComment(String lineComment)

Sets the marker indicating that a line is a comment and should be ignored.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- DelimitedTextAnalyzer
  
  public DelimitedTextAnalyzer(FieldDelimiterSpecifier delimiters)
  
  Creates a new analyzer which uses the given delimiter information. Initially, the analyzer is configured to allow unlimited record length and to parse only the first row.
  
  Parameters:
  
  delimiters - field structure information from which to initialize settings
Method Details
- setAnalysisSize
  
  public void setAnalysisSize(int count)
  
  Sets the maximum number of characters to use in analysis. This value should be large enough to contain at least one record. By default, 1MB is analyzed.
  
  Parameters:
  
  count - the number of characters to analyze
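The effect of an analysis size limit can be pictured as reading a bounded prefix of the input. The sketch below is an assumption about the general idea, not the library's code; `SampleReader` and `readSample` are hypothetical names, and only standard `java.io` calls are used.

```java
import java.io.IOException;
import java.io.Reader;

/** Illustrative only: reads a bounded sample from a character stream. */
final class SampleReader {
    /** Reads up to maxChars characters from input, or fewer if the stream ends first. */
    static String readSample(Reader input, int maxChars) throws IOException {
        char[] buf = new char[maxChars];
        int filled = 0;
        while (filled < maxChars) {
            // read() may return fewer characters than requested, so loop until full or EOF.
            int n = input.read(buf, filled, maxChars - filled);
            if (n < 0) {
                break; // end of stream
            }
            filled += n;
        }
        return new String(buf, 0, filled);
    }
}
```

If the limit falls in the middle of a record, the sample ends with a partial row, which is why the limit should cover at least one complete record.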
- setLineComment
  
  public void setLineComment(String lineComment)
  
  Sets the marker indicating that a line is a comment and should be ignored. The marker is recognized only at the beginning of a line.
  
  Parameters:
  
  lineComment - the string value indicating a line is commented out
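The beginning-of-line rule can be illustrated with a small filter. This is a sketch under the assumption that "beginning of a line" means column zero with no leading whitespace allowed; `LineCommentFilter` and `dropCommented` are hypothetical names, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: drops lines that start with the comment marker. */
final class LineCommentFilter {
    /** Returns the lines that do not begin with lineComment; a null or empty marker keeps every line. */
    static List<String> dropCommented(List<String> lines, String lineComment) {
        List<String> kept = new ArrayList<>();
        boolean active = lineComment != null && !lineComment.isEmpty();
        for (String line : lines) {
            // A marker appearing later in the line is treated as ordinary data.
            if (active && line.startsWith(lineComment)) {
                continue;
            }
            kept.add(line);
        }
        return kept;
    }
}
```

With the marker `#`, a line such as `# exported 2020` is dropped, while `1,#2` is kept because the marker is not at the start of the line.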
- setHeaderSkipCount
  
  public void setHeaderSkipCount(int count)
  
  Sets the number of lines to skip at the beginning of the file. Skipped lines are only analyzed for newline discovery; they are ignored in the remainder of the analysis. By default, no lines are skipped.
  
  Parameters:
  
  count - the number of lines at the start of the file to skip
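As a rough picture of header skipping, the sketch below removes the first lines of a sample before any further analysis. It is an assumption-based illustration, not the library's code; `HeaderSkipper` and `skipLines` are hypothetical names. Lines are delimited by the line feed character, so the same logic covers both LF and CRLF input.

```java
/** Illustrative only: removes leading lines from a text sample. */
final class HeaderSkipper {
    /** Returns sample without its first count lines; returns "" if it has no more than count lines. */
    static String skipLines(String sample, int count) {
        int pos = 0;
        for (int i = 0; i < count && pos < sample.length(); i++) {
            int lf = sample.indexOf('\n', pos);
            if (lf < 0) {
                return ""; // the last remaining line is itself skipped
            }
            pos = lf + 1; // continue just past the line feed
        }
        return sample.substring(pos);
    }
}
```

Newline style would be detected on the full sample before calling such a helper, matching the statement above that skipped lines still take part in newline discovery.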
- analyze
  
  public DelimitedTextAnalyzer.Analysis analyze(String file, CharsetEncoding charsetSpec) throws IOException
  
  Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
  
  Parameters:
  
  file - path to the delimited text file to analyze
  
  charsetSpec - description of the file's character set encoding
  
  Returns:
  
  an analysis of the delimited text file
  
  Throws:
  
  IOException - if an error occurs while reading the file
  
  RowTooLongException - if the first row exceeds the configured length
- analyze
  
  public DelimitedTextAnalyzer.Analysis analyze(Path file, CharsetEncoding charsetSpec) throws IOException
  
  Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
  
  Parameters:
  
  file - path to the delimited text file to analyze
  
  charsetSpec - description of the file's character set encoding
  
  Returns:
  
  an analysis of the delimited text file
  
  Throws:
  
  IOException - if an error occurs while reading the file
  
  RowTooLongException - if the first row exceeds the configured length
- analyze
  
  public DelimitedTextAnalyzer.Analysis analyze(Path file, CharsetEncoding charsetSpec, FileClient client) throws IOException
  
  Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
  
  Parameters:
  
  file - path to the delimited text file to analyze
  
  charsetSpec - description of the file's character set encoding
  
  client - the authorization context to use for accessing the file
  
  Returns:
  
  an analysis of the delimited text file
  
  Throws:
  
  IOException - if an error occurs while reading the file
  
  RowTooLongException - if the first row exceeds the configured length
- analyze
  
  public DelimitedTextAnalyzer.Analysis analyze(Reader input) throws IOException, RowTooLongException
  
  Analyzes the specified text stream based on current configuration. The text will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the text. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
  
  Parameters:
  
  input - the text data to analyze
  
  Returns:
  
  an analysis of the delimited text
  
  Throws:
  
  IOException - if an error occurs while reading the input
  
  RowTooLongException - if the first row exceeds the configured length
