Class KMeans

  • All Implemented Interfaces:
    LogicalOperator, Serializable

    public final class KMeans
    extends IterativeOperator
    implements Serializable
    Computes a clustering model for the given input based on the k-Means algorithm. All included fields of the given input must be of type double, float, long or int. The k-Means algorithm chooses k random data points as the initial cluster centers. Within each iteration, it assigns input data points to the current clusters based on the distance to the cluster centers and recomputes the centers based on the new assignments. Computation stops when one of the following conditions is true:
    1. maxIterations is exceeded
    2. the cluster centers do not change significantly between two iterations (see #isSignificantlyDifferent())
    See Also:
    Serialized Form
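
    The sketch below shows one way this operator might be configured and wired into a logical graph. Only the KMeans accessors documented on this page come from the documentation; the graph API calls (LogicalGraphFactory.newLogicalGraph, add, connect, run), the ReadDelimitedText source, the WritePMML sink, the package names in the imports and the column names are assumptions and may need adjusting for a real application.

        // Package names below are assumptions; adjust them to your installation.
        import java.util.Arrays;

        import com.pervasive.datarush.analytics.cluster.kmeans.KMeans;
        import com.pervasive.datarush.analytics.pmml.WritePMML;
        import com.pervasive.datarush.graphs.LogicalGraph;
        import com.pervasive.datarush.graphs.LogicalGraphFactory;
        import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;

        public class KMeansExample {
            public static void main(String[] args) {
                LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("kmeansExample");

                // Upstream source of the learning data; any operator whose record
                // output port carries double/float/long/int fields would do here.
                ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/points.csv"));

                KMeans kmeans = graph.add(new KMeans());
                kmeans.setK(5);                                      // 5 centroids instead of the default 3
                kmeans.setMaxIterations(50);                         // at most 50 iterations instead of 99
                kmeans.setIncludedColumns(Arrays.asList("x", "y"));  // placeholder column names

                graph.connect(reader.getOutput(), kmeans.getInput());

                // Persist the resulting PMML clustering model.
                WritePMML writer = graph.add(new WritePMML("results/kmeans.pmml"));
                graph.connect(kmeans.getModel(), writer.getModel());

                graph.run();
            }
        }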
    • Constructor Detail

      • KMeans

        public KMeans()
        Create a new KMeans operator, initializing settings to their default values.
    • Method Detail

      • getInput

        public RecordPort getInput()
        Returns the input port. This is the learning data from which the model is built.
        Returns:
        the input port
      • getModel

        public PMMLPort getModel()
        Returns the model port. This is the model which is built from the learning data.
        Returns:
        the model port
      • getMaxIterations

        public int getMaxIterations()
        Returns the maximum number of iterations. Defaults to 99.
        Returns:
        The maximum number of iterations.
      • setMaxIterations

        public void setMaxIterations(int max)
        Sets the maximum number of iterations. Defaults to 99.
        Parameters:
        max - The maximum number of iterations.
      • getK

        public int getK()
        Returns the "k" value, where k is the number of centroids to compute. Defaults to 3.
        Returns:
        the "k" value.
      • setK

        public void setK(int k)
        Sets the "k" value, where k is the number of centroids to compute. Defaults to 3.
        Parameters:
        k - the "k" value.
      • getIncludedColumns

        public List<String> getIncludedColumns()
        Returns the list of columns to include for k-Means. An empty list means all columns of type double.
        Returns:
        The list of columns to include
      • setIncludedColumns

        public void setIncludedColumns(List<String> includedColumns)
        Sets the list of columns to include for k-Means. An empty list means all columns of type double.
        Parameters:
        includedColumns - The list of columns to include
      • getDistanceMeasure

        public DistanceMeasure getDistanceMeasure()
        Returns the distance measure used to compute the distance between two points when building the model.
        Returns:
        the distance measure
      • setDistanceMeasure

        public void setDistanceMeasure(DistanceMeasure distanceMeasure)
        Sets the distance measure used to compute the distance between two points when building the model.
        Parameters:
        distanceMeasure - the distance measure
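
        A brief illustration of the column-selection and distance-measure accessors follows; the column names are placeholders, java.util.Arrays and java.util.Collections are assumed to be imported, and no specific DistanceMeasure constant is named because the available values are not listed on this page.

            KMeans kmeans = new KMeans();

            // Restrict clustering to two numeric input columns (placeholder names).
            kmeans.setIncludedColumns(Arrays.asList("x", "y"));

            // An empty list restores the documented default: all columns of type double.
            kmeans.setIncludedColumns(Collections.<String>emptyList());

            // The current distance measure can be read back and passed on unchanged.
            DistanceMeasure measure = kmeans.getDistanceMeasure();
            kmeans.setDistanceMeasure(measure);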
      • computeMetadata

        protected void computeMetadata(IterativeMetadataContext ctx)
        Description copied from class: IterativeOperator
        Implementations must adhere to the following contracts

        General

        Regardless of input ports/output port types, all implementations must do the following:

        1. Validation. Validation of configuration should always be performed first.
        2. Declare operator parallelizability. Implementations must declare this by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
        3. Declare output port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
        4. Declare input port parallelizability. Implementations must declare this by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
        NOTE: There is a convenience method for performing steps 2-4 for the case where all record ports are parallelizable and where we are to determine parallelism based on source:
        • MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords

        Input record ports

        Implementations with input record ports must declare the following:
        1. Required data ordering: Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering, otherwise iteration will proceed on an input dataset whose order is undefined.
        2. Required data distribution (only applies to parallelizable input ports): Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution, otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
        Note that if the upstream operator's output distribution/ordering is compatible with those required, we avoid a re-sort/re-distribution, which is generally a very large savings from a performance standpoint.

        Output record ports (static metadata)

        Implementations with output record ports must declare the following:
        1. Type: Implementations must declare their output type by calling RecordPort#setType.
        Implementations with output record ports may declare the following:
        1. Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering
        2. Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution
        Note that both of these properties are optional; if unspecified, performance may suffer since the framework may unnecessarily re-sort/re-distribute the data.

        Input model ports

        In general, iterative operators tend not to have model input ports, but if they do, there is nothing special to declare for input model ports. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.

        Output model ports (static metadata)

        SimpleModelPorts have no associated metadata and therefore there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
        1. pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.

        Output ports with dynamic metadata

        If an output port has dynamic metadata, implementations can declare this by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). In the case that metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc. are not allowed and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)). A sketch of an implementation that follows this contract is shown after this method entry.

        Specified by:
        computeMetadata in class IterativeOperator
        Parameters:
        ctx - the context
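
        The fragment below sketches how an iterative operator might satisfy the contract above for a single record input and a single PMML model output. The method names come from the contract text; the exact argument lists, the validation check, the k and model fields, and the buildClusteringModelSpec() helper are assumptions for illustration, not the actual KMeans implementation.

            @Override
            protected void computeMetadata(IterativeMetadataContext ctx) {
                // 1. Validation of the configuration always comes first.
                if (k <= 0) {
                    throw new IllegalArgumentException("k must be positive");
                }

                // 2.-4. Declare operator, output-port and input-port parallelizability.
                // This convenience method covers the common case where all record ports
                // are parallelizable and parallelism is negotiated from the source.
                MetadataUtil.negotiateParallelismBasedOnSourceAssumingParallelizableRecords(ctx);

                // Input record ports: no required ordering or distribution is declared,
                // so iteration proceeds on data whose order and distribution are unspecified.

                // Output model port (static metadata): every PMMLPort must declare its
                // model spec; how the spec is built here is purely illustrative.
                model.setPMMLModelSpec(ctx, buildClusteringModelSpec());
            }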
      • isSignificantlyDifferent

        public static boolean isSignificantlyDifferent(Cluster first,
                                                       Cluster second)
        Returns true if the centers of the two clusters are significantly different from each other. Two centers are considered significantly different if, for any index i, the following is true:

        Math.abs(first.getNumArray()[i] - second.getNumArray()[i]) > 1e-10*(first.getNumArray()[i] + second.getNumArray()[i])

        Parameters:
        first - the reference cluster
        second - the cluster to be compared to the reference
        Returns:
        true if the cluster centers are significantly different