Class KNNClassifier

  • All Implemented Interfaces:
    LogicalOperator

    public final class KNNClassifier
    extends CompositeOperator
    Applies the K-nearest neighbor algorithm to classify input data against a set of already classified example data. The implementation is naive: each input record is compared against every example record to find the example records closest to it, according to a user-specified nearness measure. The input record is then classified by combining the classes of those neighbors using a user-specified scheme.
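
    The brute-force approach described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the operator's code: the class NaiveKnnSketch, its Example type, and the tie-breaking rule are hypothetical, and Euclidean distance with voting is used because those are the documented defaults.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the naive approach; not the operator's actual code. */
final class NaiveKnnSketch {

    /** One labeled example record: numeric features plus its class. */
    static final class Example {
        final double[] features;
        final String label;

        Example(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Euclidean distance between two feature vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Compares the query against every example, keeps the k closest, and votes. */
    static String classify(List<Example> examples, double[] query, int k) {
        List<Example> sorted = new ArrayList<>(examples);
        sorted.sort(Comparator.comparingDouble((Example e) -> euclidean(e.features, query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (Example e : sorted.subList(0, Math.min(k, sorted.size()))) {
            int count = votes.merge(e.label, 1, Integer::sum);
            // Assumption: ties go to the label that reached the winning count first.
            if (count > bestCount) {
                best = e.label;
                bestCount = count;
            }
        }
        return best;
    }
}
```

    Sorting every example per query costs more than strictly necessary, but it keeps the sketch short and mirrors the "compare against all example records" description.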

    The field containing the classification value (also referred to as the target feature) must be specified. It is not necessary to specify the fields used to calculate nearness (also referred to as the selected features). If omitted, they will be derived from the example and query schema, using all eligible fields. The example and query records need not have the same schema. All that is required is that:

    • The selected features must be present in both the example and query records and be of a numeric type (representing continuous data). In this context, a numeric type is any type which can be widened to a TokenTypeConstant.DOUBLE.
    • The target feature must be present in the example records and be either numeric (as described above) or categorical data.
    The output consists of the query data with the resulting classification appended to it. This value is in the field named "PREDICTED_VAL".
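
    The eligibility rules above suggest how selected features might be derived when none are specified. The following is a minimal sketch, assuming a schema can be modeled as a map from field name to type. The FieldType enum, the set of types treated as widening to a double, and the exclusion of the target feature are all assumptions made for illustration, not documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of deriving selected features from two schemas. */
final class FeatureDerivationSketch {

    /** Hypothetical field types; the flag marks those assumed to widen to a double. */
    enum FieldType {
        INT(true), LONG(true), FLOAT(true), DOUBLE(true), STRING(false);

        final boolean widensToDouble;

        FieldType(boolean widensToDouble) {
            this.widensToDouble = widensToDouble;
        }
    }

    /** Returns fields that are numeric in both schemas, skipping the target (an assumption). */
    static List<String> deriveFeatures(Map<String, FieldType> example,
                                       Map<String, FieldType> query,
                                       String target) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, FieldType> field : example.entrySet()) {
            FieldType inQuery = query.get(field.getKey());
            boolean shared = inQuery != null;
            if (shared
                    && !field.getKey().equals(target)
                    && field.getValue().widensToDouble
                    && inQuery.widensToDouble) {
                selected.add(field.getKey());
            }
        }
        return selected;
    }
}
```

    Note that the two schemas need not agree on the exact type of a shared field; in this sketch an INT field in the example data and a LONG field in the query data both qualify.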

    The implementation is designed to minimize memory usage. It is possible to specify an approximate limit on the amount of memory used by the operator; it is not necessary to have sufficient memory to hold both the example and query data in memory, although performance is best in this case.

    • Field Detail

      • TRAINING_BUFFER_SIZE_MAX

        public static final long TRAINING_BUFFER_SIZE_MAX
        The largest allowable training buffer size, in bytes (16G).
        See Also:
        Constant Field Values
    • Constructor Detail

      • KNNClassifier

        public KNNClassifier()
        Defines a classifier initially configured with default settings:
        • A neighborhood set size of 1
        • The target feature is in the field "class"
        • Selected features are derived from the fields in common between the query and training data
        • Nearness is determined using Euclidean distance
        • Record classification is by voting
        • A training buffer of 128M is used
      • KNNClassifier

        public KNNClassifier​(int k,
                             String targetFeature)
        Defines a classifier initially configured with the specified neighborhood set size and target feature field. All other settings assume the default values indicated in KNNClassifier().
        Parameters:
        k - the size of the nearest neighbor set
        targetFeature - the field in the example data which contains classification data
    • Method Detail

      • getTraining

        public RecordPort getTraining()
        Gets the record port providing the training data to the operation.
        Returns:
        the training input port for the operation
      • getQuery

        public RecordPort getQuery()
        Gets the record port providing the query data to the operation.
        Returns:
        the query input port for the operation
      • getOutput

        public RecordPort getOutput()
        Gets the record port providing the output from the operation. This will be the query data tagged with its determined classification.
        Returns:
        the output port for the operation
      • setK

        public void setK​(int k)
        Sets the size of the nearest neighbor set. The algorithm will use this many neighbors to perform classification of query data.
        Parameters:
        k - the size of the nearest neighbor set.
        Throws:
        com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the size is not positive.
      • getK

        public int getK()
        Gets the size of the nearest neighbor set.
        Returns:
        the number of neighbors to use when classifying query data
      • setSelectedFeatures

        public void setSelectedFeatures​(List<String> features)
        Specifies the fields to use when determining the nearest neighbors.

        These fields must be present in both the example and query records. They must also be of a numeric type; in this context, a numeric type is any type which can be widened to a TokenTypeConstant.DOUBLE.

        Parameters:
        features - the names of fields to use in computing nearness
      • setSelectedFeatures

        public void setSelectedFeatures​(String... features)
        Specifies the fields to use when determining the nearest neighbors.

        These fields must be present in both the example and query records. They must also be of a numeric type; in this context, a numeric type is any type which can be widened to a TokenTypeConstant.DOUBLE.

        Parameters:
        features - the names of fields to use in computing nearness
      • getSelectedFeatures

        public List<String> getSelectedFeatures()
        Gets the fields which will be used when determining the nearest neighbors.
        Returns:
        the names of fields to use in computing nearness
      • setTargetFeature

        public void setTargetFeature​(String feature)
        Specifies the field in the example data which contains classification data.
        Parameters:
        feature - the name of the field to use as an example record's class
      • getTargetFeature

        public String getTargetFeature()
        Gets the field in the example data which is used to provide classification data.
        Returns:
        the name of the field used to obtain an example record's class
      • setNearnessMeasure

        public void setNearnessMeasure​(NearnessMeasure measure)
        Specifies how to determine the nearest neighbors of a record in the query data.
        Parameters:
        measure - the measure used to determine "nearness"
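
        To show why the choice of measure matters, the sketch below models a measure as a function over two feature vectors. It does not use the NearnessMeasure type itself; NearnessSketch is hypothetical. This page names only Euclidean distance, so the Manhattan measure is included purely as an assumed alternative for comparison.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical illustration of a pluggable nearness measure. */
final class NearnessSketch {

    /** Straight-line distance; the documented default for the operator. */
    static final ToDoubleBiFunction<double[], double[]> EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    };

    /** Sum of absolute differences; an assumed alternative, not named by the source. */
    static final ToDoubleBiFunction<double[], double[]> MANHATTAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    };

    /** Returns the index of the example nearest to the query, or -1 if there are none. */
    static int nearest(double[][] examples, double[] query,
                       ToDoubleBiFunction<double[], double[]> measure) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < examples.length; i++) {
            double d = measure.applyAsDouble(examples[i], query);
            if (d < bestDistance) {
                best = i;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

        With examples (3, 3) and (0, 5) and a query at the origin, the Euclidean measure selects the first example while the Manhattan measure selects the second, so the same data can yield different neighborhoods.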
      • getNearnessMeasure

        public NearnessMeasure getNearnessMeasure()
        Gets how the nearest neighbors of a record in the query data are determined.
        Returns:
        the measure used to determine "nearness"
      • setClassificationScheme

        public void setClassificationScheme​(ClassificationScheme scheme)
        Specifies how to determine the classification of a record in the query data from the classifications of its nearest neighbors in the example data.
        Parameters:
        scheme - the scheme used to classify a record
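
        The combining step can be pictured as a function from the neighbors' classes to a single result. The sketch below is hypothetical and does not use the ClassificationScheme type. Voting is the only scheme this page names; averaging is shown only as an assumed option for numeric target features.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of combining the classes of the nearest neighbors. */
final class CombiningSketch {

    /** Majority vote over categorical classes, listed nearest first. */
    static String vote(List<String> neighborClasses) {
        Map<String, Integer> counts = new HashMap<>();
        String winner = null;
        int winnerCount = 0;
        for (String c : neighborClasses) {
            int count = counts.merge(c, 1, Integer::sum);
            // Assumption: on a tie, the class that reached the count first wins.
            if (count > winnerCount) {
                winner = c;
                winnerCount = count;
            }
        }
        return winner;
    }

    /** Arithmetic mean of numeric target values (an assumed scheme, not documented here). */
    static double average(double[] neighborValues) {
        double sum = 0.0;
        for (double v : neighborValues) {
            sum += v;
        }
        return sum / neighborValues.length;
    }
}
```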
      • getClassificationScheme

        public ClassificationScheme getClassificationScheme()
        Gets how the classification of a record in the query data is determined from the classifications of its nearest neighbors in the example data.
        Returns:
        the scheme used to classify a record
      • setTrainingBuffer

        public void setTrainingBuffer​(long size)
        Specifies the amount of memory, in bytes, to use for buffering the example data.

        If this buffer is too small to hold the example data, temporary files will be used to store intermediate neighborhood data.

        Parameters:
        size - the size of the buffer to use, in bytes
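
        One way to picture processing with a limited buffer is to scan the example data in batches and carry a partial neighborhood from one batch to the next. The sketch below is an assumption about the general technique, not a description of the operator's internals: ChunkedNeighborsSketch is hypothetical, it handles a single query, and it keeps the partial neighborhood in memory rather than in temporary files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of building a neighborhood across batches of examples. */
final class ChunkedNeighborsSketch {

    /** Distance from the query to one example, plus that example's class. */
    static final class Neighbor {
        final double distance;
        final String label;

        Neighbor(double distance, String label) {
            this.distance = distance;
            this.label = label;
        }
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Returns the classes of the k nearest examples, nearest first. batchSize must be positive. */
    static List<String> nearestLabels(double[][] examples, String[] labels,
                                      double[] query, int k, int batchSize) {
        // Max-heap on distance: the head is always the farthest neighbor kept so far.
        PriorityQueue<Neighbor> neighborhood = new PriorityQueue<>(
                Comparator.comparingDouble((Neighbor n) -> n.distance).reversed());

        for (int start = 0; start < examples.length; start += batchSize) {
            int end = Math.min(start + batchSize, examples.length);
            // Only rows [start, end) would need to be resident in the buffer here.
            for (int i = start; i < end; i++) {
                neighborhood.add(new Neighbor(euclidean(examples[i], query), labels[i]));
                if (neighborhood.size() > k) {
                    neighborhood.poll(); // evict the farthest candidate
                }
            }
        }

        List<Neighbor> sorted = new ArrayList<>(neighborhood);
        sorted.sort(Comparator.comparingDouble((Neighbor n) -> n.distance));
        List<String> result = new ArrayList<>();
        for (Neighbor n : sorted) {
            result.add(n.label);
        }
        return result;
    }
}
```

        Because only the k best candidates survive each batch, the final neighborhood is the same whatever the batch size; a smaller buffer changes how much work is done per pass, not the result.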
      • setTrainingBuffer

        public void setTrainingBuffer​(String sizeSpecifier)
        Specifies the amount of memory to use for buffering the example data. The amount is specified using standard multipliers such as K, M, and G.
        Parameters:
        sizeSpecifier - the size of the buffer to use
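
        A size specifier such as "128M" could be converted to a byte count along the following lines. This is a sketch under assumptions, not the operator's parser: SizeSpecifierSketch is hypothetical, the multipliers are assumed to be binary (K = 1024), a bare number is assumed to mean bytes, and rejecting values above the 16G limit with an IllegalArgumentException is an assumed behavior.

```java
import java.util.Locale;

/** Hypothetical parser for size specifiers using K, M, and G multipliers. */
final class SizeSpecifierSketch {

    /** 16G, mirroring TRAINING_BUFFER_SIZE_MAX under the binary-unit assumption. */
    static final long MAX_BYTES = 16L * 1024 * 1024 * 1024;

    /** Converts a specifier such as "64K", "128M", or "4096" to a number of bytes. */
    static long parse(String spec) {
        String s = spec.trim().toUpperCase(Locale.ROOT);
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size specifier");
        }
        long multiplier = 1L;
        switch (s.charAt(s.length() - 1)) {
            case 'K': multiplier = 1L << 10; break;
            case 'M': multiplier = 1L << 20; break;
            case 'G': multiplier = 1L << 30; break;
            default: break; // no suffix: the value is already in bytes
        }
        String digits = multiplier == 1L ? s : s.substring(0, s.length() - 1).trim();
        // parseLong rejects non-numeric input; multiplyExact guards against overflow.
        long bytes = Math.multiplyExact(Long.parseLong(digits), multiplier);
        if (bytes <= 0 || bytes > MAX_BYTES) {
            throw new IllegalArgumentException("size out of range: " + spec);
        }
        return bytes;
    }
}
```

        Under these assumptions, the default buffer of 128M corresponds to 134217728 bytes.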
      • getTrainingBuffer

        public long getTrainingBuffer()
        Gets the size of the memory buffer used to hold the example data.
        Returns:
        the size of the buffer, in bytes
      • compose

        protected void compose​(CompositionContext ctx)
        Description copied from class: CompositeOperator
        Compose the body of this operator. Implementations should do the following:
        1. Perform any validation of configuration, input types, etc.
        2. Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
        3. Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.
        Specified by:
        compose in class CompositeOperator
        Parameters:
        ctx - the context