Module datarush.library
Class DelimitedTextAnalyzer
java.lang.Object
com.pervasive.datarush.operators.io.textfile.DelimitedTextAnalyzer
An analyzer for files containing delimited text. An analysis
can perform a basic parsing of the file, permitting validation
of delimiter configuration.
The following information is provided as a result of analyzing
a file:
- The values of fields for analyzed records.
- The record separator. If the properties specify auto-detection of newline style, the analyzer will determine whether the file uses Windows-style CRLF or UNIX-style LF.
- The field separator. If the properties specify auto-detection of the field separator, the analyzer will attempt to determine the appropriate separator from a known set: comma (','), tab ('\t'), semicolon (';'), pipe ('|'), and space (' ').
- The field delimiter. If the properties specify auto-detection of the field delimiter, the analyzer will attempt to determine the appropriate delimiter from a known set: single quote (') or double quote ("). If one cannot be determined, the text is assumed to be undelimited.
- The comment marker. If the properties specify auto-detection of the comment marker, the analyzer will attempt to determine the appropriate comment marker from a known set: #, %, and //. If one cannot be determined, it is assumed there is no comment marker.
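To make the idea of separator auto-detection concrete, here is a minimal, self-contained sketch. It is not the library's implementation: the class name SeparatorGuess and the "same count on every sample line" heuristic are assumptions made for illustration. Only the candidate set comes from the description above. Quoted fields are deliberately ignored to keep the sketch short.

```java
import java.util.List;

// Hypothetical illustration; NOT part of the DataRush library.
class SeparatorGuess {
    // Candidate separators named in the class description.
    static final char[] CANDIDATES = {',', '\t', ';', '|', ' '};

    /** Counts how many times c occurs in line. */
    static int count(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    /**
     * Returns the first candidate that occurs at least once and the same
     * number of times on every non-empty sample line, or null if no
     * candidate qualifies. (Assumed heuristic, ignores quoting.)
     */
    static Character guess(List<String> lines) {
        for (char c : CANDIDATES) {
            int expected = -1;
            boolean consistent = true;
            for (String line : lines) {
                if (line.isEmpty()) {
                    continue;
                }
                int n = count(line, c);
                if (expected < 0) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && expected > 0) {
                return c;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(guess(List.of("name;age", "bob;42")));
    }
}
```

A null result corresponds to the case where no separator can be determined from the sample.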
-
Nested Class Summary
Nested Classes (modifier and type, class, description):
- static class DelimitedTextAnalyzer.Analysis: Contains the results of an analysis of a delimited text file.
Constructor Summary
Constructors (constructor, description):
- DelimitedTextAnalyzer(FieldDelimiterSpecifier delimiters): Creates a new analyzer which uses the given delimiter information.
Method Summary
Methods (modifier and type, method, description):
- DelimitedTextAnalyzer.Analysis analyze(Path file, CharsetEncoding charsetSpec): Analyzes the specified file based on current configuration.
- DelimitedTextAnalyzer.Analysis analyze(Path file, CharsetEncoding charsetSpec, FileClient client): Analyzes the specified file based on current configuration.
- analyze, text stream overload: Analyzes the specified text stream based on current configuration.
- DelimitedTextAnalyzer.Analysis analyze(String file, CharsetEncoding charsetSpec): Analyzes the specified file based on current configuration.
- void setAnalysisSize(int count): Sets the maximum number of characters to use in analysis.
- void setHeaderSkipCount(int count): Sets the number of lines to skip at the beginning of the file.
- void setLineComment(String lineComment): Sets the value of the indicator that a line is commented and should be ignored.
-
Constructor Details
-
DelimitedTextAnalyzer
public DelimitedTextAnalyzer(FieldDelimiterSpecifier delimiters)
Creates a new analyzer which uses the given delimiter information. Initially, the analyzer is configured to allow unlimited record length and only parses the first row.
- Parameters:
delimiters - field structure information from which to initialize settings
-
-
Method Details
-
setAnalysisSize
public void setAnalysisSize(int count)
Sets the maximum number of characters to use in analysis. This value should be large enough to contain at least one record. By default, 1MB is analyzed.
- Parameters:
count - the number of characters to analyze
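The requirement that the analysis size hold at least one record can be illustrated with a small stand-alone sketch. This is not library code: AnalysisWindow, readPrefix, and containsFullRecord are hypothetical names, and treating "one record" as "one line break inside the sample" is a simplifying assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; NOT part of the DataRush library.
class AnalysisWindow {
    /** Reads at most maxChars characters from the reader and returns them. */
    static String readPrefix(Reader in, int maxChars) {
        char[] buf = new char[maxChars];
        int filled = 0;
        try {
            while (filled < maxChars) {
                int n = in.read(buf, filled, maxChars - filled);
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buf, 0, filled);
    }

    /** Assumed check: the sample holds a full record if it contains a line break. */
    static boolean containsFullRecord(String sample) {
        return sample.indexOf('\n') >= 0;
    }

    public static void main(String[] args) {
        String sample = readPrefix(new StringReader("a,b\n1,2\n"), 5);
        System.out.println(containsFullRecord(sample));
    }
}
```

With a limit that is too small, the sample ends before the first line break, so no complete record is available to analyze.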
-
setLineComment
public void setLineComment(String lineComment)
Sets the value of the indicator that a line is commented and should be ignored. This line comment indicator must be found at the beginning of a line to be considered a comment.
- Parameters:
lineComment - the string value indicating a line is commented out
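The "beginning of a line" rule can be sketched as follows. This is an illustration only: LineCommentFilter is a hypothetical class, and the sketch assumes a strict interpretation in which leading whitespace before the marker disqualifies the line. The source does not say how the library treats that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; NOT part of the DataRush library.
class LineCommentFilter {
    /** A line counts as a comment only if the marker is its very first text. */
    static boolean isComment(String line, String marker) {
        return marker != null && !marker.isEmpty() && line.startsWith(marker);
    }

    /** Returns the lines that are not comments, in their original order. */
    static List<String> dropComments(List<String> lines, String marker) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isComment(line, marker)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(dropComments(List.of("# header", "a,b # note", "1,2"), "#"));
    }
}
```

Note that a marker appearing later in a line, as in "a,b # note", leaves the line untouched.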
-
setHeaderSkipCount
public void setHeaderSkipCount(int count)
Sets the number of lines to skip at the beginning of the file. Skipped lines are only analyzed for newline discovery; they are ignored in the remainder of the analysis. By default, no lines are skipped.
- Parameters:
count - the number of lines at the start of the file to skip
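The interaction between header skipping and newline discovery can be sketched as below. This is not the library's algorithm: HeaderSkip and its methods are hypothetical, and deciding the newline style from the first line break alone is an assumption made for brevity.

```java
// Hypothetical illustration; NOT part of the DataRush library.
class HeaderSkip {
    /**
     * Returns "CRLF" if the first line break is \r\n, "LF" if it is a bare \n,
     * or null if the text contains no line break.
     */
    static String newlineStyle(String text) {
        int i = text.indexOf('\n');
        if (i < 0) {
            return null;
        }
        return (i > 0 && text.charAt(i - 1) == '\r') ? "CRLF" : "LF";
    }

    /** Returns the text that remains after skipping the first count lines. */
    static String skipLines(String text, int count) {
        int pos = 0;
        for (int k = 0; k < count; k++) {
            int nl = text.indexOf('\n', pos);
            if (nl < 0) {
                return ""; // fewer lines than requested: nothing remains
            }
            pos = nl + 1;
        }
        return text.substring(pos);
    }

    public static void main(String[] args) {
        String text = "report v1\r\nname,age\r\nbob,42\r\n";
        // The skipped header line still contributes to newline discovery.
        System.out.println(newlineStyle(text));
        System.out.println(skipLines(text, 1));
    }
}
```

In this sketch the newline style is read from the full text, including the skipped header, while the remaining analysis would only see what skipLines returns.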
-
analyze
public DelimitedTextAnalyzer.Analysis analyze(String file, CharsetEncoding charsetSpec) throws IOException
Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
- Parameters:
file - path to the delimited text file to analyze
charsetSpec - description of the file's character set encoding
- Returns:
an analysis of the delimited text file
- Throws:
IOException - if an error occurs while reading the file
RowTooLongException - if the first row exceeds the configured length
-
analyze
public DelimitedTextAnalyzer.Analysis analyze(Path file, CharsetEncoding charsetSpec) throws IOException
Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
- Parameters:
file - path to the delimited text file to analyze
charsetSpec - description of the file's character set encoding
- Returns:
an analysis of the delimited text file
- Throws:
IOException - if an error occurs while reading the file
RowTooLongException - if the first row exceeds the configured length
-
analyze
public DelimitedTextAnalyzer.Analysis analyze(Path file, CharsetEncoding charsetSpec, FileClient client) throws IOException
Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
- Parameters:
file - path to the delimited text file to analyze
charsetSpec - description of the file's character set encoding
client - the authorization context to use for accessing the file
- Returns:
an analysis of the delimited text file
- Throws:
IOException - if an error occurs while reading the file
RowTooLongException - if the first row exceeds the configured length
-
analyze
Analyzes the specified text stream based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
- Parameters:
input - the text data to analyze
- Returns:
an analysis of the delimited text
- Throws:
IOException - if an error occurs while reading the file
RowTooLongException - if the first row exceeds the configured length
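Since the analysis reports field values for the analyzed records, a rough picture of how a first row might be split under a given separator and field delimiter may help. The sketch below is not the library's parser: FirstRowParser is a hypothetical class, and treating a doubled quote inside a quoted field as a literal quote is an assumption borrowed from common CSV practice, not something the source states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; NOT part of the DataRush library.
class FirstRowParser {
    /** Splits the first line of text into fields using the given separator and quote. */
    static List<String> parseFirstRow(String text, char separator, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < text.length() && text.charAt(i + 1) == quote) {
                        cur.append(quote); // assumed: doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    cur.append(c);         // separators and newlines are literal here
                }
            } else if (c == quote) {
                inQuotes = true;           // opening quote
            } else if (c == separator) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else if (c == '\n') {
                break;                     // end of the first record
            } else if (c != '\r') {
                cur.append(c);             // drop the CR of a CRLF pair
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseFirstRow("id,\"Smith, J\",42\r\nnext,row,x\n", ',', '"'));
    }
}
```

Only the first row is parsed here, mirroring the constructor note that the analyzer initially parses only the first row.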
-