Class DelimitedTextAnalyzer


  • public class DelimitedTextAnalyzer
    extends Object
    An analyzer for files containing delimited text. An analysis can perform a basic parsing of the file, permitting validation of delimiter configuration. The following information is provided as a result of analyzing a file:
    • The values of fields for analyzed records.
    • The record separator. If the properties specify auto-detection of newline style, the analyzer will determine whether the file uses Windows-style CRLF or UNIX-style LF.
    • The field separator. If the properties specify auto-detection of the field separator, the analyzer will attempt to determine the appropriate separator from a known set: comma (','), tab ('\t'), semicolon (';'), pipe ('|'), and space (' ').
    • The field delimiter. If the properties specify auto-detection of the field delimiter, the analyzer will attempt to determine the appropriate delimiter from a known set: single quote (') or double quote ("). If one cannot be determined, the text is assumed to be undelimited.
    • The comment marker. If the properties specify auto-detection of the comment marker, the analyzer will attempt to determine the appropriate comment marker from a known set: #, %, and //. If one cannot be determined, it is assumed there is no comment marker.
    This information can be used to generate a schema for the records, but also could be used to provide a preview of how a file would be parsed with given settings.
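To make the auto-detection described above concrete, here is a minimal sketch of one possible heuristic. It is not the analyzer's actual algorithm, which this page does not specify. The class SeparatorGuess and its methods are hypothetical names introduced for illustration. The sketch assumes a separator should occur the same non-zero number of times on every sampled line, and it ignores quoting.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of the documented API. */
final class SeparatorGuess {
    // Candidate field separators, taken from the known set listed above.
    static final char[] CANDIDATES = {',', '\t', ';', '|', ' '};

    /**
     * Returns the candidate that occurs the same non-zero number of times on
     * every sample line, preferring the highest count. Returns the NUL
     * character (0) if no candidate qualifies.
     */
    static char guessSeparator(List<String> lines) {
        char best = 0;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int count = consistentCount(lines, candidate);
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }

    /** Occurrences per line if identical on all lines, otherwise 0. */
    private static int consistentCount(List<String> lines, char c) {
        int expected = -1;
        for (String line : lines) {
            int n = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == c) {
                    n++;
                }
            }
            if (expected < 0) {
                expected = n;
            } else if (n != expected) {
                return 0;
            }
        }
        return Math.max(expected, 0);
    }

    /** Returns "\r\n" or "\n" based on the first line break, or null if none is found. */
    static String guessNewline(String sample) {
        int lf = sample.indexOf('\n');
        if (lf < 0) {
            return null;
        }
        return (lf > 0 && sample.charAt(lf - 1) == '\r') ? "\r\n" : "\n";
    }
}
```

A production heuristic would also need to skip separators inside quoted fields, which is one reason the analyzer treats separator and delimiter detection as related settings.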
    • Constructor Detail

      • DelimitedTextAnalyzer

        public DelimitedTextAnalyzer​(FieldDelimiterSpecifier delimiters)
        Creates a new analyzer which uses the given delimiter information. Initially, the analyzer is configured to allow unlimited record length and to parse only the first row.
        Parameters:
        delimiters - field structure information from which to initialize settings
    • Method Detail

      • setAnalysisSize

        public void setAnalysisSize​(int count)
        Sets the maximum number of characters to use in analysis. This value should be large enough to contain at least one record. By default, 1MB is analyzed.
        Parameters:
        count - the number of characters to analyze
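The effect of a character limit on analysis can be pictured as reading only a bounded prefix of the input. The sketch below is an illustration under that assumption, not the analyzer's implementation. BoundedSample and readPrefix are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical helper for illustration; not part of the documented API. */
final class BoundedSample {
    /**
     * Reads at most maxChars characters from the reader and returns them.
     * Returns fewer characters if the input ends first.
     */
    static String readPrefix(Reader input, int maxChars) {
        char[] buffer = new char[maxChars];
        int filled = 0;
        try {
            // Reader.read may return fewer characters than requested, so loop.
            while (filled < maxChars) {
                int n = input.read(buffer, filled, maxChars - filled);
                if (n < 0) {
                    break; // end of input
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer, 0, filled);
    }
}
```

If the limit falls in the middle of the first record, no complete record is available to examine, which is why the value should cover at least one record.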
      • setLineComment

        public void setLineComment​(String lineComment)
        Sets the string that marks a line as a comment to be ignored. The marker must appear at the beginning of a line for that line to be treated as a comment.
        Parameters:
        lineComment - the string value indicating a line is commented out
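The beginning-of-line rule can be sketched as a simple prefix test. This is an illustration only. LineCommentFilter is a hypothetical name, and the sketch assumes the marker must sit at the very first character, with no leading whitespace allowed. This page does not say how the analyzer treats indented markers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the documented API. */
final class LineCommentFilter {
    /**
     * Returns the lines that do not start with the comment marker.
     * A null or empty marker means no line is treated as a comment.
     */
    static List<String> dropCommentLines(List<String> lines, String lineComment) {
        if (lineComment == null || lineComment.isEmpty()) {
            return new ArrayList<>(lines);
        }
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Only a marker at the start of the line counts (assumption: column 0).
            if (!line.startsWith(lineComment)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Under this rule a marker in the middle of a line, as in "c,#d", is ordinary field data and the line is kept.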
      • setHeaderSkipCount

        public void setHeaderSkipCount​(int count)
        Sets the number of lines to skip at the beginning of the file. Skipped lines are only analyzed for newline discovery; they are ignored in the remainder of the analysis. By default, no lines are skipped.
        Parameters:
        count - the number of lines at the start of the file to skip
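As a rough picture of header skipping, the sketch below drops the first lines of a text sample before any further processing. It is an illustration under stated assumptions, not the analyzer's code. HeaderSkip is a hypothetical name, and the sketch relies on String.lines() (Java 11 or later), which accepts both LF and CRLF line endings.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of the documented API. */
final class HeaderSkip {
    /**
     * Splits the text into lines and returns those after the first skipCount
     * lines. Returns an empty list if the text has fewer lines than that.
     */
    static List<String> linesAfterHeader(String text, int skipCount) {
        return text.lines()
                .skip(skipCount)
                .collect(Collectors.toList());
    }
}
```

In the analyzer itself the skipped lines still contribute to newline discovery. This sketch only models the part where they are ignored afterwards.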
      • analyze

        public DelimitedTextAnalyzer.Analysis analyze​(String file,
                                                      CharsetEncoding charsetSpec)
                                               throws IOException
        Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
        Parameters:
        file - path to the delimited text file to analyze
        charsetSpec - description of the file's character set encoding
        Returns:
        an analysis of the delimited text file
        Throws:
        IOException - if an error occurs while reading the file
        RowTooLongException - if the first row exceeds the configured length
      • analyze

        public DelimitedTextAnalyzer.Analysis analyze​(Path file,
                                                      CharsetEncoding charsetSpec)
                                               throws IOException
        Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
        Parameters:
        file - path to the delimited text file to analyze
        charsetSpec - description of the file's character set encoding
        Returns:
        an analysis of the delimited text file
        Throws:
        IOException - if an error occurs while reading the file
        RowTooLongException - if the first row exceeds the configured length
      • analyze

        public DelimitedTextAnalyzer.Analysis analyze​(Path file,
                                                      CharsetEncoding charsetSpec,
                                                      FileClient client)
                                               throws IOException
        Analyzes the specified file based on current configuration. The file will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the file. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
        Parameters:
        file - path to the delimited text file to analyze
        charsetSpec - description of the file's character set encoding
        client - the authorization context to use for accessing the file
        Returns:
        an analysis of the delimited text file
        Throws:
        IOException - if an error occurs while reading the file
        RowTooLongException - if the first row exceeds the configured length
      • analyze

        public DelimitedTextAnalyzer.Analysis analyze​(Reader input)
                                               throws IOException,
                                                      RowTooLongException
        Analyzes the specified text stream based on current configuration. The text will be processed assuming the delimiters with which the analyzer was constructed. The analysis will also indicate the delimiters used in the text. This will be the set of delimiters provided initially to the analyzer plus any discovered delimiters.
        Parameters:
        input - the text data to analyze
        Returns:
        an analysis of the delimited text
        Throws:
        IOException - if an error occurs while reading the input
        RowTooLongException - if the first row exceeds the configured length