- java.lang.Object
-
- com.pervasive.datarush.operators.AbstractLogicalOperator
-
- com.pervasive.datarush.operators.CompositeOperator
-
- com.pervasive.datarush.analytics.knn.KNNClassifier
-
- All Implemented Interfaces:
LogicalOperator
public final class KNNClassifier extends CompositeOperator
Applies the K-nearest neighbor algorithm to classify input data against an already classified set of example data. A naive implementation is used, with each input record being compared against all example records to find the set of example records closest to it, as measured by a user-specified measure. The input record is then classified according to a user-specified method of combining the classes of the neighbors.
The field containing the classification value (also referred to as the target feature) must be specified. It is not necessary to specify the fields used to calculate nearness (also referred to as the selected features). If omitted, they will be derived from the example and query schema, using all eligible fields. The example and query records need not have the same schema. All that is required is that:
- The selected features must be present in both the example and query records and be of a numeric type (representing continuous data). In this context, a numeric type is any type which can be widened to a TokenTypeConstant.DOUBLE.
- The target feature must be present in the example records and be either numeric (as described above) or categorical data.
The implementation is designed to minimize memory usage. It is possible to specify an approximate limit on the amount of memory used by the operator; it is not necessary to have sufficient memory to hold both the example and query data in memory, although performance is best in this case.
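The brute-force approach described above can be illustrated with a small, self-contained sketch. It is not the operator's implementation: the class name NaiveKnnSketch and its methods are hypothetical, records are simplified to double arrays, and Euclidean distance with majority voting (the documented defaults) stand in for the configurable measure and scheme.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; hypothetical names, not part of the DataRush API. */
final class NaiveKnnSketch {

    /** Euclidean distance between two feature vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Classifies one query record by comparing it against every example
     * record, keeping the k closest, and taking a majority vote.
     */
    static String classify(double[] query, double[][] examples, String[] classes, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Order all example indices by distance to the query (brute force).
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < examples.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble(i -> euclidean(query, examples[i])));

        // Count the class labels of the k nearest examples.
        Map<String, Integer> votes = new HashMap<>();
        int limit = Math.min(k, order.size());
        for (int n = 0; n < limit; n++) {
            votes.merge(classes[order.get(n)], 1, Integer::sum);
        }

        // Return the label with the most votes; ties are broken arbitrarily here.
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Sorting every example for every query keeps the sketch short; it is only meant to show the compare-all, keep-k, combine-classes flow.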
-
-
Field Summary
Fields
Modifier and Type | Field | Description
static long | TRAINING_BUFFER_SIZE_MAX | The largest allowable training buffer, in bytes, 16G.
-
Constructor Summary
Constructors
Constructor | Description
KNNClassifier() | Defines a classifier initially configured with default settings: a neighborhood set size of 1; the target feature is in the field "class"; selected features are derived from the fields in common between the query and training data; nearness is determined using Euclidean distance; record classification is by voting; a training buffer of 128M is used.
KNNClassifier(int k, String targetFeature) | Defines a classifier initially configured with the specified neighborhood set size and target feature field.
-
Method Summary
Modifier and Type | Method | Description
protected void | compose(CompositionContext ctx) | Compose the body of this operator.
ClassificationScheme | getClassificationScheme() | Gets how the classification of a record in the query data is determined from the classifications of its nearest neighbors in the example data.
int | getK() | Gets the size of the nearest neighbor set.
NearnessMeasure | getNearnessMeasure() | Gets how the nearest neighbors of a record in the query data are determined.
RecordPort | getOutput() | Gets the record port providing the output from the operation.
RecordPort | getQuery() | Gets the record port providing the query data to the operation.
List<String> | getSelectedFeatures() | Gets the fields which will be used when determining the nearest neighbors.
String | getTargetFeature() | Gets the field in the example data which is used to provide classification data.
RecordPort | getTraining() | Gets the record port providing the training data to the operation.
long | getTrainingBuffer() | Gets the size of the memory buffer used to hold the example data.
void | setClassificationScheme(ClassificationScheme scheme) | Specifies how to determine the classification of a record in the query data from the classifications of its nearest neighbors in the example data.
void | setK(int k) | Sets the size of the nearest neighbor set.
void | setNearnessMeasure(NearnessMeasure measure) | Specifies how to determine the nearest neighbors of a record in the query data.
void | setSelectedFeatures(String... features) | Specifies the fields to use when determining the nearest neighbors.
void | setSelectedFeatures(List<String> features) | Specifies the fields to use when determining the nearest neighbors.
void | setTargetFeature(String feature) | Specifies the field in the example data which contains classification data.
void | setTrainingBuffer(long size) | Specifies the amount of memory, in bytes, to use for buffering the example data.
void | setTrainingBuffer(String sizeSpecifier) | Specifies the amount of memory to use for buffering the example data.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
-
-
-
Field Detail
-
TRAINING_BUFFER_SIZE_MAX
public static final long TRAINING_BUFFER_SIZE_MAX
The largest allowable training buffer, in bytes, 16G.
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
KNNClassifier
public KNNClassifier()
Defines a classifier initially configured with default settings:
- A neighborhood set size of 1
- The target feature is in the field "class"
- Selected features are derived from the fields in common between the query and training data
- Nearness is determined using Euclidean distance
- Record classification is by voting
- A training buffer of 128M is used
-
KNNClassifier
public KNNClassifier(int k, String targetFeature)
Defines a classifier initially configured with the specified neighborhood set size and target feature field. All other settings assume the default values indicated in KNNClassifier().
- Parameters:
k - the size of the nearest neighbor set
targetFeature - the field in the example data which contains classification data
-
-
Method Detail
-
getTraining
public RecordPort getTraining()
Gets the record port providing the training data to the operation.
- Returns:
- the training input port for the operation
-
getQuery
public RecordPort getQuery()
Gets the record port providing the query data to the operation.
- Returns:
- the query input port for the operation
-
getOutput
public RecordPort getOutput()
Gets the record port providing the output from the operation. This will be the query data tagged with its determined classification.
- Returns:
- the output port for the operation
-
setK
public void setK(int k)
Sets the size of the nearest neighbor set. The algorithm will use this many neighbors to perform classification of query data.
- Parameters:
k - the size of the nearest neighbor set
- Throws:
com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the size is not positive
-
getK
public int getK()
Gets the size of the nearest neighbor set.
- Returns:
- the number of neighbors to use when classifying query data
-
setSelectedFeatures
public void setSelectedFeatures(List<String> features)
Specifies the fields to use when determining the nearest neighbors.
These fields must be present in both the example and query records. They must also be of a numeric type; in this context, that means any type which can be widened to a TokenTypeConstant.DOUBLE.
- Parameters:
features - the names of fields to use in computing nearness
-
setSelectedFeatures
public void setSelectedFeatures(String... features)
Specifies the fields to use when determining the nearest neighbors.
These fields must be present in both the example and query records. They must also be of a numeric type; in this context, that means any type which can be widened to a TokenTypeConstant.DOUBLE.
- Parameters:
features - the names of fields to use in computing nearness
-
getSelectedFeatures
public List<String> getSelectedFeatures()
Gets the fields which will be used when determining the nearest neighbors.
- Returns:
- the names of fields to use in computing nearness
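The class description says that, when no selected features are specified, they are derived from the example and query schemas using all eligible fields. The sketch below shows one plausible reading of that rule. It is an assumption-laden illustration: schemas are modeled as hypothetical name-to-type maps, the set of type names standing in for "widenable to double" is invented, and excluding the target feature from the result is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only; hypothetical names, not part of the DataRush API. */
final class FeatureDerivationSketch {

    // Hypothetical stand-in for "types which can be widened to double".
    private static final Set<String> NUMERIC =
            new HashSet<>(Arrays.asList("int", "long", "float", "double"));

    /**
     * Returns the fields present in both schemas with a numeric type in each,
     * in example-schema order, skipping the target feature (an assumption).
     */
    static List<String> deriveFeatures(Map<String, String> exampleSchema,
                                       Map<String, String> querySchema,
                                       String targetFeature) {
        List<String> features = new ArrayList<>();
        for (Map.Entry<String, String> field : exampleSchema.entrySet()) {
            String name = field.getKey();
            if (name.equals(targetFeature)) {
                continue; // the class label is not used to measure nearness
            }
            String queryType = querySchema.get(name);
            if (queryType != null
                    && NUMERIC.contains(field.getValue())
                    && NUMERIC.contains(queryType)) {
                features.add(name);
            }
        }
        return features;
    }
}
```

Note that a field may have different numeric types on each side (for example int in the examples and long in the queries) and still qualify, since both can be widened to double.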
-
setTargetFeature
public void setTargetFeature(String feature)
Specifies the field in the example data which contains classification data.
- Parameters:
feature - the name of the field to use as an example record's class
-
getTargetFeature
public String getTargetFeature()
Gets the field in the example data which is used to provide classification data.
- Returns:
- the name of the field used to obtain an example record's class
-
setNearnessMeasure
public void setNearnessMeasure(NearnessMeasure measure)
Specifies how to determine the nearest neighbors of a record in the query data.
- Parameters:
measure - the measure used to determine "nearness"
-
getNearnessMeasure
public NearnessMeasure getNearnessMeasure()
Gets how the nearest neighbors of a record in the query data are determined.
- Returns:
- the measure used to determine "nearness"
-
setClassificationScheme
public void setClassificationScheme(ClassificationScheme scheme)
Specifies how to determine the classification of a record in the query data from the classifications of its nearest neighbors in the example data.
- Parameters:
scheme - the scheme used to classify a record
-
getClassificationScheme
public ClassificationScheme getClassificationScheme()
Gets how the classification of a record in the query data is determined from the classifications of its nearest neighbors in the example data.
- Returns:
- the scheme used to classify a record
-
setTrainingBuffer
public void setTrainingBuffer(long size)
Specifies the amount of memory, in bytes, to use for buffering the example data.
If this buffer is too small, temporary files will be used to store intermediate neighborhood data.
- Parameters:
size - the size of the buffer to use, in bytes
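The source does not describe how the operator handles example data that exceeds the buffer, beyond noting that intermediate neighborhood data is spilled to temporary files. One common way to bound memory, sketched below under that caveat, is to process the examples in batches and carry a running set of the k best candidates for each query from one batch to the next. The class and method names are hypothetical, and only distances are tracked (a real classifier would carry class labels as well).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only; hypothetical names, not part of the DataRush API. */
final class ChunkedNeighborsSketch {

    /** Squared Euclidean distance; ordering is the same as for true distance. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Folds one batch of examples into the running k-best set for a query. */
    static void offerBatch(PriorityQueue<Double> kBest, double[] query,
                           double[][] batch, int k) {
        for (double[] example : batch) {
            double d = squaredDistance(query, example);
            if (kBest.size() < k) {
                kBest.add(d);
            } else if (d < kBest.peek()) {
                kBest.poll(); // evict the current worst of the k best
                kBest.add(d);
            }
        }
    }

    /** Returns the k smallest squared distances, ascending, across all batches. */
    static List<Double> nearestDistances(double[] query, List<double[][]> batches, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Max-heap: the head is the largest distance currently retained.
        PriorityQueue<Double> kBest = new PriorityQueue<>(Collections.reverseOrder());
        for (double[][] batch : batches) {
            offerBatch(kBest, query, batch, k);
        }
        List<Double> result = new ArrayList<>(kBest);
        Collections.sort(result);
        return result;
    }
}
```

Because only k candidates per query survive each batch, the result is the same as a single pass over all examples, while only one batch needs to be resident at a time.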
-
setTrainingBuffer
public void setTrainingBuffer(String sizeSpecifier)
Specifies the amount of memory to use for buffering the example data. The amount is specified using standard multipliers such as K, M, and G.
- Parameters:
sizeSpecifier - the size of the buffer to use
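To make the size specifier format concrete, the following sketch converts strings such as "128M" into a byte count. It is a hypothetical helper, not the library's parser: the source does not say whether the multipliers are binary or decimal, so powers of 1024 are assumed, and the 16G cap is likewise computed under that assumption.

```java
import java.util.Locale;

/** Illustrative only; hypothetical names, not part of the DataRush API. */
final class SizeSpecifierSketch {

    // Assumed byte value of the documented 16G maximum (binary multipliers).
    static final long ASSUMED_MAX_BYTES = 16L * 1024 * 1024 * 1024;

    /** Parses a size such as "4096", "512K", "128M" or "2G" into bytes. */
    static long parseSize(String spec) {
        String s = spec.trim().toUpperCase(Locale.ROOT);
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size specifier");
        }
        long multiplier = 1L;
        switch (s.charAt(s.length() - 1)) {
            case 'K': multiplier = 1L << 10; break;
            case 'M': multiplier = 1L << 20; break;
            case 'G': multiplier = 1L << 30; break;
            default: break; // no suffix: the value is already in bytes
        }
        String digits = (multiplier == 1L) ? s : s.substring(0, s.length() - 1).trim();
        long bytes = Math.multiplyExact(Long.parseLong(digits), multiplier);
        if (bytes <= 0 || bytes > ASSUMED_MAX_BYTES) {
            throw new IllegalArgumentException("size out of range: " + spec);
        }
        return bytes;
    }
}
```

Under these assumptions the documented default of 128M corresponds to 134217728 bytes.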
-
getTrainingBuffer
public long getTrainingBuffer()
Gets the size of the memory buffer used to hold the example data.
- Returns:
- the size of the buffer, in bytes
-
compose
protected void compose(CompositionContext ctx)
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
- Specified by:
compose in class CompositeOperator
- Parameters:
ctx - the context
-
-