- All Implemented Interfaces:
LogicalOperator
The field containing the classification value (also referred to as the target feature) must be specified. It is not necessary to specify the fields used to calculate nearness (also referred to as the selected features). If omitted, they will be derived from the example and query schema, using all eligible fields. The example and query records need not have the same schema. All that is required is that:
- The selected features must be present in both the example and query records and be of a numeric type (representing continuous data). In this context, a numeric type is any type which can be widened to a TokenTypeConstant.DOUBLE.
- The target feature must be present in the example records and be either numeric (as described above) or categorical data.
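The two requirements above suggest how selected features could be derived when none are specified. The following is a minimal sketch of that idea, not the operator's actual logic. It assumes a schema can be modeled as a map from field name to type name, that the type names `INT`, `LONG`, `FLOAT` and `DOUBLE` stand in for types widenable to a double, and that the target feature is excluded from the derived set. `FeatureDerivationSketch` and `deriveSelectedFeatures` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for schema handling; not part of the library.
class FeatureDerivationSketch {

    // Assumed type names that can be widened to a double (illustrative only).
    static final Set<String> NUMERIC_TYPES = Set.of("INT", "LONG", "FLOAT", "DOUBLE");

    // Returns the fields present in both schemas with a numeric type in each,
    // skipping the target feature (an assumption of this sketch).
    static List<String> deriveSelectedFeatures(Map<String, String> exampleSchema,
                                               Map<String, String> querySchema,
                                               String targetFeature) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> field : exampleSchema.entrySet()) {
            String name = field.getKey();
            if (name.equals(targetFeature)) {
                continue;
            }
            String queryType = querySchema.get(name);
            if (queryType != null
                    && NUMERIC_TYPES.contains(field.getValue())
                    && NUMERIC_TYPES.contains(queryType)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> example = new LinkedHashMap<>();
        example.put("sepalLength", "DOUBLE");
        example.put("petalCount", "INT");
        example.put("color", "STRING");
        example.put("class", "STRING");

        Map<String, String> query = new LinkedHashMap<>();
        query.put("sepalLength", "FLOAT");
        query.put("petalCount", "INT");
        query.put("color", "STRING");
        query.put("id", "LONG");

        // Only fields that are common to both schemas and numeric are kept.
        System.out.println(deriveSelectedFeatures(example, query, "class"));
    }
}
```

Note that the example and query schemas differ here (`class` appears only in the example data, `id` only in the query data), which matches the statement that the two need not share a schema.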
The implementation is designed to minimize memory usage. It is possible to specify an approximate limit on the amount of memory used by the operator; it is not necessary to have sufficient memory to hold both the example and query data in memory, although performance is best in this case.
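One general way a neighbor search can avoid holding all of the example data in memory at once is to read the examples in batches and keep only a bounded set of the k nearest candidates per query record. The sketch below illustrates that idea only; it is not the operator's implementation, and `ChunkedNeighborsSketch`, `Neighbor`, `newNearestSet` and `mergeBatch` are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of batch-wise neighbor tracking; not library code.
class ChunkedNeighborsSketch {

    // A candidate neighbor: its squared distance to the query and its class label.
    static final class Neighbor {
        final double distance;
        final String label;

        Neighbor(double distance, String label) {
            this.distance = distance;
            this.label = label;
        }
    }

    // Creates a heap that keeps the farthest retained neighbor on top,
    // so it can be evicted when a closer candidate arrives.
    static PriorityQueue<Neighbor> newNearestSet() {
        return new PriorityQueue<>(
                Comparator.comparingDouble((Neighbor n) -> n.distance).reversed());
    }

    // Squared Euclidean distance; it orders candidates the same way as the
    // true distance, so the square root can be skipped while ranking.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Folds one batch of examples into the running k-nearest set for a query.
    static void mergeBatch(PriorityQueue<Neighbor> nearest, int k, double[] query,
                           List<double[]> batchFeatures, List<String> batchLabels) {
        for (int i = 0; i < batchFeatures.size(); i++) {
            double d = squaredDistance(query, batchFeatures.get(i));
            if (nearest.size() < k) {
                nearest.add(new Neighbor(d, batchLabels.get(i)));
            } else if (d < nearest.peek().distance) {
                nearest.poll();
                nearest.add(new Neighbor(d, batchLabels.get(i)));
            }
        }
    }
}
```

Because each batch can be discarded after it is merged, only one batch of example data plus k candidates per query needs to be resident at a time.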
-
Field Summary
Fields (modifier and type, field, description):
- static final long TRAINING_BUFFER_SIZE_MAX: The largest allowable training buffer, in bytes, 16G.
Constructor Summary
Constructors (constructor, description):
- KNNClassifier(): Defines a classifier initially configured with default settings: a neighborhood set size of 1; the target feature is in the field "class"; selected features are derived from the fields in common between the query and training data; nearness is determined using Euclidean distance; record classification is by voting; a training buffer of 128M is used.
- KNNClassifier(int k, String targetFeature): Defines a classifier initially configured with the specified neighborhood set size and target feature field.
Method Summary
Methods (modifier and type, method, description):
- protected void compose: Compose the body of this operator.
- getClassificationScheme(): Gets how the classification of a record in the query data is determined from the classifications of its nearest neighbors in the example data.
- int getK(): Gets the size of the nearest neighbor set.
- getNearnessMeasure(): Gets how the nearest neighbors of a record in the query data are determined.
- getOutput(): Gets the record port providing the output from the operation.
- getQuery(): Gets the record port providing the query data to the operation.
- getSelectedFeatures(): Gets the fields which will be used when determining the nearest neighbors.
- getTargetFeature(): Gets the field in the example data which is used to provide classification data.
- getTraining(): Gets the record port providing the training data to the operation.
- long getTrainingBuffer(): Gets the size of the memory buffer used to hold the example data.
- void setClassificationScheme: Specifies how to determine the classification of a record in the query data from the classifications of its nearest neighbors in the example data.
- void setK(int k): Sets the size of the nearest neighbor set.
- void setNearnessMeasure(NearnessMeasure measure): Specifies how to determine the nearest neighbors of a record in the query data.
- void setSelectedFeatures(String... features): Specifies the fields to use when determining the nearest neighbors.
- void setSelectedFeatures(List<String> features): Specifies the fields to use when determining the nearest neighbors.
- void setTargetFeature(String feature): Specifies the field in the example data which contains classification data.
- void setTrainingBuffer(long size): Specifies the amount of memory, in bytes, to use for buffering the example data.
- void setTrainingBuffer(String sizeSpecifier): Specifies the amount of memory to use for buffering the example data.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
Field Details
-
TRAINING_BUFFER_SIZE_MAX
public static final long TRAINING_BUFFER_SIZE_MAX
The largest allowable training buffer, in bytes, 16G.
-
-
Constructor Details
-
KNNClassifier
public KNNClassifier()
Defines a classifier initially configured with default settings:
- A neighborhood set size of 1
- The target feature is in the field "class"
- Selected features are derived from the fields in common between the query and training data
- Nearness is determined using Euclidean distance
- Record classification is by voting
- A training buffer of 128M is used
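To make the default behavior concrete, here is a minimal, self-contained sketch of nearest-neighbor classification using Euclidean distance and voting. It illustrates the general technique only and does not use or reproduce the operator's code. `KnnVotingSketch`, `Example` and `classify` are hypothetical names, and the tie-breaking rule (the label that first reaches the highest vote count, scanning neighbors from closest to farthest) is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of k-NN with Euclidean distance and voting.
class KnnVotingSketch {

    // One example record: numeric features plus its known class label.
    static final class Example {
        final double[] features;
        final String label;

        Example(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Classifies one query by majority vote among its k nearest examples.
    // Returns null when there are no examples.
    static String classify(List<Example> examples, double[] query, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<Example> sorted = new ArrayList<>(examples);
        sorted.sort(Comparator.comparingDouble((Example e) -> euclidean(e.features, query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (Example e : sorted.subList(0, Math.min(k, sorted.size()))) {
            int count = votes.merge(e.label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = e.label;
            }
        }
        return best;
    }
}
```

With a neighborhood set size of 1, as in the default configuration, the vote reduces to taking the class of the single closest example.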
-
KNNClassifier
KNNClassifier(int k, String targetFeature)
Defines a classifier initially configured with the specified neighborhood set size and target feature field. All other settings assume the default values indicated in KNNClassifier().
- Parameters:
k - the size of the nearest neighbor set
targetFeature - the field in the example data which contains classification data
-
-
Method Details
-
getTraining
Gets the record port providing the training data to the operation.
- Returns:
- the training input port for the operation
-
getQuery
Gets the record port providing the query data to the operation.
- Returns:
- the query input port for the operation
-
getOutput
Gets the record port providing the output from the operation. This will be the query data tagged with its determined classification.
- Returns:
- the output port for the operation
-
setK
public void setK(int k)
Sets the size of the nearest neighbor set. The algorithm will use this many neighbors to perform classification of query data.
- Parameters:
k - the size of the nearest neighbor set
- Throws:
com.pervasive.datarush.graphs.physical.InvalidPropertyValueException - if the size is not positive
-
getK
public int getK()
Gets the size of the nearest neighbor set.
- Returns:
- the number of neighbors to use when classifying query data
-
setSelectedFeatures
Specifies the fields to use when determining the nearest neighbors. These fields must be present in both the example and query records. They must also be of a numeric type; in this context, that means any type which can be widened to a TokenTypeConstant.DOUBLE.
- Parameters:
features - the names of fields to use in computing nearness
-
setSelectedFeatures
Specifies the fields to use when determining the nearest neighbors. These fields must be present in both the example and query records. They must also be of a numeric type; in this context, that means any type which can be widened to a TokenTypeConstant.DOUBLE.
- Parameters:
features - the names of fields to use in computing nearness
-
getSelectedFeatures
Gets the fields which will be used when determining the nearest neighbors.
- Returns:
- the names of fields to use in computing nearness
-
setTargetFeature
Specifies the field in the example data which contains classification data.
- Parameters:
feature - the name of the field to use as an example record's class
-
getTargetFeature
Gets the field in the example data which is used to provide classification data.
- Returns:
- the name of the field used to obtain an example record's class
-
setNearnessMeasure
Specifies how to determine the nearest neighbors of a record in the query data.
- Parameters:
measure - the measure used to determine "nearness"
-
getNearnessMeasure
Gets how the nearest neighbors of a record in the query data are determined.
- Returns:
- the measure used to determine "nearness"
-
setClassificationScheme
Specifies how to determine the classification of a record in the query data from the classifications of its nearest neighbors in the example data.
- Parameters:
scheme - the scheme used to classify a record
-
getClassificationScheme
Gets how the classification of a record in the query data is determined from the classifications of its nearest neighbors in the example data.
- Returns:
- the scheme used to classify a record
-
setTrainingBuffer
public void setTrainingBuffer(long size)
Specifies the amount of memory, in bytes, to use for buffering the example data. If this buffer is too small, temporary files will be used to store intermediate neighborhood data.
- Parameters:
size - the size of the buffer to use, in bytes
-
setTrainingBuffer
Specifies the amount of memory to use for buffering the example data. The amount is specified using standard multipliers such as K, M, and G.
- Parameters:
sizeSpecifier - the size of the buffer to use
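As an illustration of how a size specifier such as "128M" could map to a byte count, the sketch below parses an optional K, M or G suffix. It is not the library's parser. It assumes binary multipliers (1024-based), case-insensitive suffixes, and rejection of values above a 16G ceiling; `SizeSpecifierSketch`, `parseSize` and `MAX_BYTES` are hypothetical names.

```java
// Hypothetical illustration of size-specifier parsing; not library code.
class SizeSpecifierSketch {

    // 16G ceiling, assuming G means 1024^3 bytes.
    static final long MAX_BYTES = 16L * 1024 * 1024 * 1024;

    // Converts a specifier such as "512", "64K", "128M" or "1G" to bytes.
    static long parseSize(String spec) {
        String s = spec.trim().toUpperCase();
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size specifier");
        }
        long multiplier = 1L;
        switch (s.charAt(s.length() - 1)) {
            case 'K': multiplier = 1L << 10; break;
            case 'M': multiplier = 1L << 20; break;
            case 'G': multiplier = 1L << 30; break;
            default: break; // no suffix: the value is already in bytes
        }
        String digits = multiplier == 1L ? s : s.substring(0, s.length() - 1).trim();
        long value = Long.parseLong(digits); // NumberFormatException on malformed input
        if (value <= 0) {
            throw new IllegalArgumentException("size must be positive: " + spec);
        }
        long bytes = Math.multiplyExact(value, multiplier); // guards against overflow
        if (bytes > MAX_BYTES) {
            throw new IllegalArgumentException("size exceeds 16G: " + spec);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parseSize("128M")); // prints 134217728
    }
}
```

Under these assumptions, the default buffer of 128M corresponds to 134,217,728 bytes.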
-
getTrainingBuffer
public long getTrainingBuffer()
Gets the size of the memory buffer used to hold the example data.
- Returns:
- the size of the buffer, in bytes
-
compose
Description copied from class: CompositeOperator
Compose the body of this operator. Implementations should do the following:
- Perform any validation of configuration, input types, etc.
- Instantiate and configure sub-operators, adding them to the provided context via the method OperatorComposable.add(O)
- Create necessary connections via the method OperatorComposable.connect(P, P). This includes connections from the composite's input ports to sub-operators, connections between sub-operators, and connections from sub-operators' output ports to the composite's output ports
- Specified by:
compose in class CompositeOperator
- Parameters:
ctx - the context
-