All Implemented Interfaces:
LogicalOperator, RecordSourceOperator, SourceOperator<RecordPort>

public class ReadARFF extends AbstractTextReader
Reads files in the Attribute-Relation File Format (ARFF). ARFF files can be in either sparse or dense mode; this reader detects the mode and reads the data accordingly. ARFF files also contain schema information. The schema is parsed and used to determine how to parse the data lines.
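
To illustrate the two modes, the following stand-alone sketch tells a sparse data row (wrapped in braces and holding "index value" pairs) from a dense one, and expands a sparse row to dense form. It assumes the usual ARFF conventions of zero-based attribute indices and omitted values defaulting to 0. The names ArffModeSketch, isSparseRow and toDense are hypothetical and are not part of this reader, whose actual detection logic is not described here.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the reader's implementation.
final class ArffModeSketch {

    // A sparse ARFF data row is enclosed in curly braces, e.g. "{1 X, 3 Y}".
    static boolean isSparseRow(String dataLine) {
        String trimmed = dataLine.trim();
        return trimmed.startsWith("{") && trimmed.endsWith("}");
    }

    // Expands a sparse row to one value per attribute.
    // Assumption: omitted attributes take the value "0".
    // Simplification: quoted values containing commas are not handled.
    static String[] toDense(String sparseRow, int attributeCount) {
        String[] values = new String[attributeCount];
        Arrays.fill(values, "0");
        String trimmed = sparseRow.trim();
        String body = trimmed.substring(1, trimmed.length() - 1).trim();
        if (body.isEmpty()) {
            return values;
        }
        for (String entry : body.split(",")) {
            // Each entry is "<zero-based index> <value>".
            String[] parts = entry.trim().split("\\s+", 2);
            values[Integer.parseInt(parts[0])] = parts[1].trim();
        }
        return values;
    }
}
```

For example, with five attributes the sparse row {1 X, 3 Y} corresponds to the dense row 0,X,0,Y,0.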

ARFF can be parsed in parallel under "optimistic" assumptions: namely, that a parse split does not fall in the middle of a delimited field value and somewhere before an escaped record separator. This is assumed by default, but the assumption can be disabled, with an accompanying reduction in scalability and performance.
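
The following stand-alone sketch shows why the assumption matters. It is a hypothetical illustration, not this reader's splitting logic: OptimisticSplitSketch and its methods are invented names, and the data is assumed to allow a line break inside a delimited value. An optimistic worker starting at an arbitrary offset simply resumes at the next line break, which gives a wrong record boundary if the offset lies inside a delimited value.

```java
// Hypothetical illustration only; not the reader's implementation.
final class OptimisticSplitSketch {

    // Optimistic strategy: assume the offset is not inside a delimited value,
    // so the next record starts right after the next line break.
    static int nextRecordStart(String data, int offset) {
        int lineBreak = data.indexOf('\n', offset);
        return lineBreak < 0 ? data.length() : lineBreak + 1;
    }

    // Pessimistic check: counts delimiters from the start of the data to decide
    // whether the offset lies inside a delimited value. In this sketch it needs
    // a scan of everything before the offset.
    static boolean insideDelimitedValue(String data, int offset, char delimiter) {
        boolean inside = false;
        for (int i = 0; i < offset; i++) {
            if (data.charAt(i) == delimiter) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```

In this sketch the pessimistic check has to read all data preceding the split, which is the kind of sequential work the optimistic assumption avoids.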

  • Constructor Details

    • ReadARFF

      public ReadARFF()
      Reads an empty source with default settings. The source must be set before execution or an error will be raised.
    • ReadARFF

      public ReadARFF(String pattern)
      Reads all paths matching the specified pattern as ARFF data using default options. Any matching path which is a directory is replaced with all files in the directory; this expansion is not applied recursively.
      Parameters:
      pattern - a path-matching pattern
    • ReadARFF

      public ReadARFF(Path path)
      Reads the file specified by the path as ARFF data using default options. If the path refers to a directory, all files in the directory are read; this expansion is not applied recursively.
      Parameters:
      path - the path to read
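
      The non-recursive directory expansion described for the pattern and path constructors can be pictured with the stand-alone sketch below. It uses java.nio.file.Path from the JDK as a stand-in for the Path type accepted by this constructor, and DirectoryExpansionSketch.expand is a hypothetical helper, so it should be read as an approximation of the documented behavior rather than the reader's own code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not the reader's implementation.
final class DirectoryExpansionSketch {

    // Returns the path itself for a file, or the files directly inside it
    // for a directory. Subdirectories are skipped, not descended into.
    static List<Path> expand(Path path) throws IOException {
        if (!Files.isDirectory(path)) {
            return Collections.singletonList(path);
        }
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(path)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        Collections.sort(files); // stable order for the example
        return files;
    }
}
```

      Given a directory holding a.arff, b.arff and a subdirectory nested/ with c.arff, the sketch yields only a.arff and b.arff.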
    • ReadARFF

      public ReadARFF(ByteSource source)
      Reads the specified data source using default options.
      Parameters:
      source - the data source to read
  • Method Details

    • setFieldDelimiter

      public void setFieldDelimiter(char fieldDelimiter)
      Set the field delimiter to use when reading the file contents. A single quote is used by default. The only supported values are a single quote and a double quote.
      Parameters:
      fieldDelimiter - character value to use as the field delimiter
    • getFieldDelimiter

      public char getFieldDelimiter()
      Get the configured field delimiter property value.
      Returns:
      configured field delimiter
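
      The role of the field delimiter can be sketched as follows. This stand-alone example mirrors the documented contract (single quote by default, only a single or double quote accepted) and uses the delimiter to protect commas inside a value on a dense data line. FieldDelimiterSketch and splitDenseLine are hypothetical names; how the reader reacts to an unsupported character is an assumption here, and escape sequences are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the reader's implementation.
final class FieldDelimiterSketch {
    private char fieldDelimiter = '\''; // single quote by default, as documented

    // Assumption: unsupported characters are rejected with an exception.
    void setFieldDelimiter(char fieldDelimiter) {
        if (fieldDelimiter != '\'' && fieldDelimiter != '"') {
            throw new IllegalArgumentException(
                    "field delimiter must be a single or double quote");
        }
        this.fieldDelimiter = fieldDelimiter;
    }

    char getFieldDelimiter() {
        return fieldDelimiter;
    }

    // Splits a dense data line on commas that are outside delimited values.
    // Simplifications: no escape handling; each field is trimmed.
    List<String> splitDenseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean delimited = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == fieldDelimiter) {
                delimited = !delimited; // entering or leaving a delimited value
            } else if (ch == ',' && !delimited) {
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }
}
```

      With the default delimiter, the line 5.1,'Iris, setosa',x splits into three fields, the second being Iris, setosa.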
    • discoverMetadata

      public ARFFAnalyzer.Analysis discoverMetadata(FileClient ctx)
      Gets the metadata for the currently configured data source.
      Parameters:
      ctx - the authorization context to use for accessing the file
      Returns:
      the metadata of the source
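
      As a rough picture of the schema information that an ARFF source carries, the stand-alone sketch below collects attribute names and declared types from the header section. It does not model ARFFAnalyzer.Analysis or FileClient; ArffHeaderSketch and parseAttributes are hypothetical names, and quoted attribute names containing spaces are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the reader's implementation.
final class ArffHeaderSketch {

    // Maps each attribute name to its declared type, in declaration order.
    // Parsing stops at the @data marker, where the data section begins.
    static Map<String, String> parseAttributes(String arffText) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (String raw : arffText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // blank line or comment
            }
            String lower = line.toLowerCase(Locale.ROOT); // keywords are case-insensitive
            if (lower.startsWith("@data")) {
                break;
            }
            if (lower.startsWith("@attribute")) {
                // "@attribute <name> <type>"
                String[] parts = line.split("\\s+", 3);
                if (parts.length == 3) {
                    schema.put(parts[1], parts[2]);
                }
            }
        }
        return schema;
    }
}
```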
    • computeFormat

      protected DataFormat computeFormat(CompositionContext ctx)
      Description copied from class: AbstractReader
      Determines the data format for the source. The returned format is used during composition to construct a ReadSource operator. If an implementation supports schema discovery, it must be performed in this method.
      Specified by:
      computeFormat in class AbstractReader
      Parameters:
      ctx - the composition context for the current invocation of AbstractReader.compose(CompositionContext)
      Returns:
      the source format to use