com.pervasive.datarush.analytics.knn.KNNClassifier

All Implemented Interfaces: LogicalOperator

public final class KNNClassifier extends CompositeOperator

Applies the K-nearest neighbor algorithm to classify input data against an already classified set of example data. A naive implementation is used: each input record is compared against all example records to find the set of example records closest to it, as determined by a user-specified nearness measure. The input record is then classified according to a user-specified method of combining the classes of its neighbors.
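The naive approach described above can be sketched in plain Java. This is a minimal illustration, not the operator's implementation: the `NaiveKnnSketch` class, its `Example` type, and the tie-breaking rule are all hypothetical, and Euclidean distance with majority voting is used because those are the documented defaults.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative stand-in only; not the DataRush implementation. */
final class NaiveKnnSketch {

    /** A labeled example record: feature values plus its class (hypothetical type). */
    static final class Example {
        final double[] features;
        final String label;

        Example(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Euclidean distance between two equal-length feature vectors. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Compares the query against every example, keeps the k closest, and votes. */
    static String classify(List<Example> examples, double[] query, int k) {
        List<Example> sorted = new ArrayList<>(examples);
        sorted.sort(Comparator.comparingDouble((Example e) -> euclidean(e.features, query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (Example e : sorted.subList(0, Math.min(k, sorted.size()))) {
            int count = votes.merge(e.label, 1, Integer::sum);
            // Assumption: a tie goes to the label that reached the winning count first.
            if (count > bestCount) {
                best = e.label;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Example> examples = Arrays.asList(
            new Example(new double[] {0.0, 0.0}, "a"),
            new Example(new double[] {0.0, 1.0}, "a"),
            new Example(new double[] {5.0, 5.0}, "b"));
        System.out.println(classify(examples, new double[] {0.2, 0.4}, 3)); // prints "a"
    }
}
```

Sorting all examples for every query keeps the sketch short; it costs more than a bounded heap would, but it matches the "compare against all example records" description directly.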

The field containing the classification value (also referred to as the target feature) must be specified. It is not necessary to specify the fields used to calculate nearness (also referred to as the selected features). If omitted, they will be derived from the example and query schema, using all eligible fields. The example and query records need not have the same schema. All that is required is that:

- The selected features must be present in both the example and query records and be of a numeric type (representing continuous data). In this context, a numeric type is any type that can be widened to TokenTypeConstant.DOUBLE.
- The target feature must be present in the example records and contain either numeric (as described above) or categorical data.
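The eligibility rules above can be illustrated with a small sketch of how selected features might be derived when none are specified. Everything here is hypothetical: the `FieldType` enum stands in for the library's token types, the set of types treated as widenable to a double is an assumption, and excluding the target feature from the derived list is also an assumption.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative stand-in only; not the DataRush implementation. */
final class FeatureDerivationSketch {

    /** Hypothetical field types standing in for the library's token types. */
    enum FieldType { INT, LONG, FLOAT, DOUBLE, STRING }

    /** Assumption: these are the types that can be widened to a double. */
    static final Set<FieldType> NUMERIC =
        EnumSet.of(FieldType.INT, FieldType.LONG, FieldType.FLOAT, FieldType.DOUBLE);

    /** Returns the numeric fields present in both schemas, in example-schema order. */
    static List<String> deriveFeatures(Map<String, FieldType> exampleSchema,
                                       Map<String, FieldType> querySchema,
                                       String targetFeature) {
        List<String> features = new ArrayList<>();
        for (Map.Entry<String, FieldType> field : exampleSchema.entrySet()) {
            String name = field.getKey();
            if (name.equals(targetFeature)) {
                continue; // assumption: the target is never used to measure nearness
            }
            FieldType queryType = querySchema.get(name);
            if (queryType != null
                    && NUMERIC.contains(field.getValue())
                    && NUMERIC.contains(queryType)) {
                features.add(name);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, FieldType> example = new LinkedHashMap<>();
        example.put("x", FieldType.DOUBLE);
        example.put("y", FieldType.INT);
        example.put("name", FieldType.STRING);
        example.put("class", FieldType.STRING);

        Map<String, FieldType> query = new LinkedHashMap<>();
        query.put("x", FieldType.DOUBLE);
        query.put("y", FieldType.LONG);
        query.put("z", FieldType.DOUBLE);

        System.out.println(deriveFeatures(example, query, "class")); // prints [x, y]
    }
}
```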

The output consists of the query data with the resulting classification appended to it. This value is in the field named "PREDICTED_VAL".
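To make the output shape concrete, the following sketch appends a prediction to a query record modeled as an ordered map. The map-based record and the `PredictedFieldSketch` class are hypothetical stand-ins; only the field name "PREDICTED_VAL" comes from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-in only; records are modeled as ordered maps. */
final class PredictedFieldSketch {

    static final String PREDICTED_FIELD = "PREDICTED_VAL";

    /** Returns a copy of the query record with the prediction as its last field. */
    static Map<String, Object> appendPrediction(Map<String, Object> queryRecord, Object predicted) {
        Map<String, Object> output = new LinkedHashMap<>(queryRecord);
        output.put(PREDICTED_FIELD, predicted);
        return output;
    }

    public static void main(String[] args) {
        Map<String, Object> query = new LinkedHashMap<>();
        query.put("x", 0.2);
        query.put("y", 0.4);
        System.out.println(appendPrediction(query, "a").keySet()); // prints [x, y, PREDICTED_VAL]
    }
}
```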

The implementation is designed to minimize memory usage. An approximate limit on the amount of memory used by the operator can be specified. The example and query data do not both need to fit in memory, although performance is best when they do.
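One way a bounded buffer can still produce exact neighbors is to read the example data in fixed-size chunks and carry each query's best k candidates from chunk to chunk. The sketch below assumes that approach; it is not a description of the operator's internals, and the `ChunkedKnnSketch` class and its chunk-size parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative stand-in only; not the DataRush implementation. */
final class ChunkedKnnSketch {

    /** Squared Euclidean distance; it orders neighbors the same way as the true distance. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * For each query, returns the ascending squared distances to its k nearest
     * examples while only examining chunkSize examples at a time.
     */
    static List<double[]> nearestDistances(List<double[]> examples, List<double[]> queries,
                                           int k, int chunkSize) {
        if (k <= 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("k and chunkSize must be positive");
        }
        // One max-heap per query: the farthest neighbor kept so far sits on top.
        List<PriorityQueue<Double>> heaps = new ArrayList<>();
        for (int q = 0; q < queries.size(); q++) {
            PriorityQueue<Double> heap = new PriorityQueue<>(Comparator.reverseOrder());
            heaps.add(heap);
        }
        for (int start = 0; start < examples.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, examples.size());
            List<double[]> chunk = examples.subList(start, end); // the "buffer"
            for (int q = 0; q < queries.size(); q++) {
                PriorityQueue<Double> heap = heaps.get(q);
                for (double[] example : chunk) {
                    heap.add(squaredDistance(example, queries.get(q)));
                    if (heap.size() > k) {
                        heap.poll(); // discard the farthest candidate
                    }
                }
            }
        }
        List<double[]> result = new ArrayList<>();
        for (PriorityQueue<Double> heap : heaps) {
            double[] distances = new double[heap.size()];
            int i = 0;
            for (Double d : heap) {
                distances[i++] = d;
            }
            Arrays.sort(distances);
            result.add(distances);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> examples = Arrays.asList(
            new double[] {0}, new double[] {1}, new double[] {2}, new double[] {10});
        List<double[]> queries = Collections.singletonList(new double[] {1.5});
        System.out.println(Arrays.toString(nearestDistances(examples, queries, 2, 3).get(0)));
    }
}
```

Because each heap holds at most k entries, the per-query state stays small regardless of how many chunks are read, and the result does not depend on the chunk size.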

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final long | TRAINING_BUFFER_SIZE_MAX | The largest allowable training buffer size, in bytes (16G). |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| KNNClassifier() | Defines a classifier initially configured with default settings: a neighborhood set size of 1; the target feature in the field "class"; selected features derived from the fields in common between the query and training data; nearness determined using Euclidean distance; record classification by voting; and a training buffer of 128M. |
| KNNClassifier(int k, String targetFeature) | Defines a classifier initially configured with the specified neighborhood set size and target feature field. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected void | compose(CompositionContext ctx) | Compose the body of this operator. |
| ClassificationScheme | getClassificationScheme() | Gets how the classification of a record in the query data is determined from the classifications of its nearest neighbors in the example data. |
| int | getK() | Gets the size of the nearest neighbor set. |
| NearnessMeasure | getNearnessMeasure() | Gets how the nearest neighbors of a record in the query data are determined. |
| RecordPort | getOutput() | Gets the record port providing the output from the operation. |
| RecordPort | getQuery() | Gets the record port providing the query data to the operation. |
| List<String> | getSelectedFeatures() | Gets the fields which will be used when determining the nearest neighbors. |
| String | getTargetFeature() | Gets the field in the example data which is used to provide classification data. |
| RecordPort | getTraining() | Gets the record port providing the training data to the operation. |
| long | getTrainingBuffer() | Gets the size of the memory buffer used to hold the example data. |
| void | setClassificationScheme(ClassificationScheme scheme) | Specifies how to determine the classification of a record in the query data from the classifications of its nearest neighbors in the example data. |
| void | setK(int k) | Sets the size of the nearest neighbor set. |
| void | setNearnessMeasure(NearnessMeasure measure) | Specifies how to determine the nearest neighbors of a record in the query data. |
| void | setSelectedFeatures(String... features) | Specifies the fields to use when determining the nearest neighbors. |
| void | setSelectedFeatures(List<String> features) | Specifies the fields to use when determining the nearest neighbors. |
| void | setTargetFeature(String feature) | Specifies the field in the example data which contains classification data. |
| void | setTrainingBuffer(long size) | Specifies the amount of memory, in bytes, to use for buffering the example data. |
| void | setTrainingBuffer(String sizeSpecifier) | Specifies the amount of memory to use for buffering the example data. |

Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- TRAINING_BUFFER_SIZE_MAX
  
  public static final long TRAINING_BUFFER_SIZE_MAX
  
  The largest allowable training buffer size, in bytes (16G).
  See Also:
  
  Constant Field Values
Constructor Details
- KNNClassifier
  
  public KNNClassifier()

  Defines a classifier initially configured with default settings:

  - A neighborhood set size of 1
  - The target feature is in the field "class"
  - Selected features are derived from the fields in common between the query and training data
  - Nearness is determined using Euclidean distance
  - Record classification is by voting
  - A training buffer of 128M is used
- KNNClassifier
  
  public KNNClassifier(int k, String targetFeature)
  
  Defines a classifier initially configured with the specified neighborhood set size and target feature field. All other settings assume the default values indicated in KNNClassifier().
  
  Parameters:
  
  k - the size of the nearest neighbor set
  
  targetFeature - the field in the example data which contains classification data
Method Details
- getTraining
  
  public RecordPort getTraining()
  
  Gets the record port providing the training data to the operation.
  
  Returns:
  
  the training input port for the operation
- getQuery
  
  public RecordPort getQuery()
  
  Gets the record port providing the query data to the operation.
  
  Returns:
  
  the query input port for the operation
- getOutput
  
  public RecordPort getOutput()
  
  Gets the record port providing the output from the operation. This will be the query data tagged with its determined classification.
  
  Returns:
  
  the output port for the operation
- setK
  
  public void setK(int k)
  
  Sets the size of the nearest neighbor set. The algorithm will use this many neighbors to perform classification of query data.
  
  Parameters:
  
  k - the size of the nearest neighbor set.
  
  Throws:
  
  com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the size is not positive.
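The positivity check on k can be sketched as a guarded setter. This stand-in is hypothetical: it throws the standard `IllegalArgumentException` where the operator is documented to throw `InvalidPropertyValueException`, and the `KSettingSketch` class is not part of the library.

```java
/** Illustrative stand-in only; not the DataRush implementation. */
final class KSettingSketch {

    // Documented default for the no-argument constructor: a neighborhood of 1.
    private int k = 1;

    /** Rejects non-positive sizes and leaves the previous value untouched. */
    void setK(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive, got " + k);
        }
        this.k = k;
    }

    int getK() {
        return k;
    }

    public static void main(String[] args) {
        KSettingSketch settings = new KSettingSketch();
        settings.setK(3);
        try {
            settings.setK(0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println(settings.getK()); // prints 3
    }
}
```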
- getK
  
  public int getK()
  
  Gets the size of the nearest neighbor set.
  
  Returns:
  
  the number of neighbors to use when classifying query data
- setSelectedFeatures
  
  public void setSelectedFeatures(List<String> features)
  
  Specifies the fields to use when determining the nearest neighbors.
  These fields must be present in both the example and query records. They must also be of a numeric type, which in this context means any type that can be widened to TokenTypeConstant.DOUBLE.
  
  Parameters:
  
  features - the names of fields to use in computing nearness
- setSelectedFeatures
  
  public void setSelectedFeatures(String... features)
  
  Specifies the fields to use when determining the nearest neighbors.
  These fields must be present in both the example and query records. They must also be of a numeric type, which in this context means any type that can be widened to TokenTypeConstant.DOUBLE.
  
  Parameters:
  
  features - the names of fields to use in computing nearness
- getSelectedFeatures
  
  public List<String> getSelectedFeatures()
  
  Gets the fields which will be used when determining the nearest neighbors.
  
  Returns:
  
  the names of fields to use in computing nearness
- setTargetFeature
  
  public void setTargetFeature(String feature)
  
  Specifies the field in the example data which contains classification data.
  
  Parameters:
  
  feature - the name of the field to use as an example record's class
- getTargetFeature
  
  public String getTargetFeature()
  
  Gets the field in the example data which is used to provide classification data.
  
  Returns:
  
  the name of the field used to obtain an example record's class
- setNearnessMeasure
  
  public void setNearnessMeasure(NearnessMeasure measure)
  
  Specifies how to determine the nearest neighbors of a record in the query data.
  
  Parameters:
  
  measure - the measure used to determine "nearness"
- getNearnessMeasure
  
  public NearnessMeasure getNearnessMeasure()
  
  Gets how the nearest neighbors of a record in the query data are determined.
  
  Returns:
  
  the measure used to determine "nearness"
- setClassificationScheme
  
  public void setClassificationScheme(ClassificationScheme scheme)
  
  Specifies how to determine the classification of a record in the query data from the classifications of its nearest neighbors in the example data.
  
  Parameters:
  
  scheme - the scheme used to classify a record
- getClassificationScheme
  
  public ClassificationScheme getClassificationScheme()
  
  Gets how the classification of a record in the query data is determined from the classifications of its nearest neighbors in the example data.
  
  Returns:
  
  the scheme used to classify a record
- setTrainingBuffer
  
  public void setTrainingBuffer(long size)
  
  Specifies the amount of memory, in bytes, to use for buffering the example data.
  If this buffer is too small, temporary files will be used to store intermediate neighborhood data.
  
  Parameters:
  
  size - the size of the buffer to use, in bytes
- setTrainingBuffer
  
  public void setTrainingBuffer(String sizeSpecifier)
  
  Specifies the amount of memory to use for buffering the example data. The amount is specified using standard multipliers such as K, M, and G.
  
  Parameters:
  
  sizeSpecifier - the size of the buffer to use
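A size specifier such as "128M" could be converted to a byte count along the following lines. This is a sketch under stated assumptions: the multipliers are taken to be binary (K = 1024), a bare number is taken to mean bytes, and the grammar the operator actually accepts may differ. The `SizeSpecifierSketch` class is hypothetical.

```java
import java.util.Locale;

/** Illustrative stand-in only; the accepted grammar is an assumption. */
final class SizeSpecifierSketch {

    /** Parses strings such as "4096", "64K", "128M", or "16G" into a byte count. */
    static long parseSize(String specifier) {
        String s = specifier.trim().toUpperCase(Locale.ROOT);
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size specifier");
        }
        long multiplier;
        switch (s.charAt(s.length() - 1)) {
            case 'K': multiplier = 1L << 10; break; // assumption: binary multipliers
            case 'M': multiplier = 1L << 20; break;
            case 'G': multiplier = 1L << 30; break;
            default:  multiplier = 1L;       break; // no suffix: plain bytes
        }
        String digits = multiplier == 1L ? s : s.substring(0, s.length() - 1).trim();
        long value = Long.parseLong(digits); // NumberFormatException on malformed input
        if (value <= 0) {
            throw new IllegalArgumentException("size must be positive: " + specifier);
        }
        return Math.multiplyExact(value, multiplier); // ArithmeticException on overflow
    }

    public static void main(String[] args) {
        System.out.println(parseSize("128M")); // prints 134217728
    }
}
```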
- getTrainingBuffer
  
  public long getTrainingBuffer()
  
  Gets the size of the memory buffer used to hold the example data.
  
  Returns:
  
  the size of the buffer, in bytes
- compose
  
  protected void compose(CompositionContext ctx)
  
  Description copied from class: CompositeOperator
  Compose the body of this operator. Implementations should do the following:

  - Perform any validation of configuration, input types, etc.
  - Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O).
  - Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports.

  Specified by:
  
  compose in class CompositeOperator
  
  Parameters:
  
  ctx - the context
