- java.lang.Object
-
- com.pervasive.datarush.operators.AbstractLogicalOperator
-
- com.pervasive.datarush.operators.IterativeOperator
-
- com.pervasive.datarush.analytics.cluster.kmeans.KMeans
-
- All Implemented Interfaces:
LogicalOperator, Serializable
public final class KMeans extends IterativeOperator implements Serializable
Computes a clustering model for the given input based on the k-Means algorithm. All included fields of the given input must be of type double, float, long or int. The k-Means algorithm chooses k random data points as the initial cluster centers. Within each iteration, it assigns input data points to the current clusters based on their distance to the cluster centers and recomputes the centers based on the new assignments. Computation stops when one of the following conditions is true:
- maxIterations is exceeded
- the cluster centers do not change significantly between two iterations (see #isSignificantlyDifferent())
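The iteration described above can be sketched in plain Java. This is a minimal illustration, not the operator's actual implementation: the class `KMeansSketch`, its method names, the `seed` parameter, the use of squared Euclidean distance, and the handling of empty clusters (the old center is kept) are all assumptions made for the example. The convergence test is only modeled on the documented criterion and uses absolute values on the right-hand side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative k-means loop (hypothetical; not the DataRush implementation). */
final class KMeansSketch {

    /** Squared Euclidean distance; the distance measure is an assumption. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Relative-tolerance check modeled on the documented convergence criterion. */
    static boolean changed(double[] oldCenter, double[] newCenter) {
        for (int i = 0; i < oldCenter.length; i++) {
            double diff = Math.abs(oldCenter[i] - newCenter[i]);
            if (diff > 1e-10 * (Math.abs(oldCenter[i]) + Math.abs(newCenter[i]))) {
                return true;
            }
        }
        return false;
    }

    /** Returns k cluster centers; assumes 1 <= k <= points.length. */
    static double[][] cluster(double[][] points, int k, int maxIterations, long seed) {
        // Choose k distinct data points at random as the initial centers.
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[indices.get(c)].clone();
        }

        int dim = points[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: attach each point to its nearest center.
            for (double[] p : points) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best])) {
                        best = c;
                    }
                }
                counts[best]++;
                for (int d = 0; d < dim; d++) {
                    sums[best][d] += p[d];
                }
            }
            // Update step: recompute each center as the mean of its points.
            boolean anyChanged = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster: keep the previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (changed(centers[c], updated)) {
                    anyChanged = true;
                }
                centers[c] = updated;
            }
            // Stop early when no center moved significantly.
            if (!anyChanged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] data = {{0.0}, {1.0}, {10.0}, {11.0}};
        // 99 mirrors the documented default for maxIterations.
        System.out.println(Arrays.deepToString(cluster(data, 2, 99, 42L)));
    }
}
```

The order of the returned centers depends on the random initial selection, so callers should not rely on it.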
- See Also:
- Serialized Form
-
-
Constructor Summary
Constructors
KMeans()
Create a new KMeans operator, initializing settings to their default values.
-
Method Summary
Modifier and Type, Method, Description
protected void computeMetadata(IterativeMetadataContext ctx)
Implementations must adhere to the following contracts.
protected CompositionIterator createIterator(MetadataContext ctx)
Invoked at the start of execution.
DistanceMeasure getDistanceMeasure()
Returns the distance measure used to measure the distance between two points when building the model.
List<String> getIncludedColumns()
Returns the list of columns to include for k-Means.
RecordPort getInput()
Returns the input port.
int getK()
Returns the "k" value, where k is the number of centroids to compute.
int getMaxIterations()
Returns the maximum number of iterations.
PMMLPort getModel()
Returns the model port.
static boolean isSignificantlyDifferent(Cluster first, Cluster second)
Returns true if the centers of both clusters are significantly different compared to each other.
void setDistanceMeasure(DistanceMeasure distanceMeasure)
Sets the distance measure used to measure the distance between two points when building the model.
void setIncludedColumns(List<String> includedColumns)
Sets the list of columns to include for k-Means.
void setK(int k)
Sets the "k" value, where k is the number of centroids to compute.
void setMaxIterations(int max)
Sets the maximum number of iterations.
Methods inherited from class com.pervasive.datarush.operators.AbstractLogicalOperator
disableParallelism, getInputPorts, getOutputPorts, newInput, newInput, newOutput, newRecordInput, newRecordInput, newRecordOutput, notifyError
-
-
-
-
Method Detail
-
getInput
public RecordPort getInput()
Returns the input port. This is the learning data from which the model is built.
- Returns:
- the input port
-
getModel
public PMMLPort getModel()
Returns the model port. This is the model which is built from the learning data.
- Returns:
- the model port
-
getMaxIterations
public int getMaxIterations()
Returns the maximum number of iterations. Defaults to 99.
- Returns:
- The maximum number of iterations.
-
setMaxIterations
public void setMaxIterations(int max)
Sets the maximum number of iterations. Defaults to 99.
- Parameters:
max - The maximum number of iterations.
-
getK
public int getK()
Returns the "k" value, where k is the number of centroids to compute. Defaults to 3.
- Returns:
- the "k" value.
-
setK
public void setK(int k)
Sets the "k" value, where k is the number of centroids to compute. Defaults to 3.
- Parameters:
k - the "k" value.
-
getIncludedColumns
public List<String> getIncludedColumns()
Returns the list of columns to include for k-Means. An empty list means all columns of type double.
- Returns:
- The list of columns to include
-
setIncludedColumns
public void setIncludedColumns(List<String> includedColumns)
Sets the list of columns to include for k-Means. An empty list means all columns of type double.
- Parameters:
includedColumns - The list of columns to include
-
getDistanceMeasure
public DistanceMeasure getDistanceMeasure()
Returns the distance measure used to measure the distance between two points when building the model.
- Returns:
- the distance measure
-
setDistanceMeasure
public void setDistanceMeasure(DistanceMeasure distanceMeasure)
Sets the distance measure used to measure the distance between two points when building the model.
- Parameters:
distanceMeasure - the distance measure
-
computeMetadata
protected void computeMetadata(IterativeMetadataContext ctx)
Description copied from class: IterativeOperator
Implementations must adhere to the following contracts.
General
Regardless of input ports/output port types, all implementations must do the following:
- Validation. Validation of configuration should always be performed first.
- Declare operator parallelizability. Implementations must declare by calling IterativeMetadataContext.parallelize(ParallelismStrategy).
- Declare output port parallelizability. Implementations must declare by calling IterativeMetadataContext.setOutputParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
- Declare input port parallelizability. Implementations must declare by calling IterativeMetadataContext.setIterationParallelizable(com.pervasive.datarush.ports.LogicalPort, boolean).
MetadataUtil#negotiateParallelismBasedOnSourceAssumingParallelizableRecords
Input record ports
Implementations with input record ports must declare the following:
- Required data ordering: Implementations that have data ordering requirements must declare them by calling RecordPort#setRequiredDataOrdering, otherwise iteration will proceed on an input dataset whose order is undefined.
- Required data distribution (only applies to parallelizable input ports): Implementations that have data distribution requirements must declare them by calling RecordPort#setRequiredDataDistribution, otherwise iteration will proceed on an input dataset whose distribution is the unspecified partial distribution.
Output record ports (static metadata)
Implementations with output record ports must declare the following:
- Type: Implementations must declare their output type by calling RecordPort#setType.
- Output data ordering: Implementations that can make guarantees as to their output ordering may do so by calling RecordPort#setOutputDataOrdering.
- Output data distribution (only applies to parallelizable output ports): Implementations that can make guarantees as to their output distribution may do so by calling RecordPort#setOutputDataDistribution.
Input model ports
In general, iterative operators will tend not to have model input ports, but if they do, there is nothing special to declare for input model ports. Models are implicitly duplicated to all partitions when going from non-parallel to parallel ports.
Output model ports (static metadata)
SimpleModelPorts have no associated metadata and therefore there is never any output metadata to declare. PMMLPorts, on the other hand, do have associated metadata. For all PMMLPorts, implementations must declare the following:
- pmmlModelSpec: Implementations must declare the PMML model spec by calling PMMLPort.setPMMLModelSpec.
Output ports with dynamic metadata
If an output port has dynamic metadata, implementations can declare this by calling IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean). In the case that metadata is dynamic, calls to RecordPort#setType, RecordPort#setOutputDataOrdering, etc. are not allowed and thus the sections above entitled "Output record ports (static metadata)" and "Output model ports (static metadata)" must be skipped. Note that, if possible, dynamic metadata should be avoided (see IterativeMetadataContext.setOutputMetadataDynamic(com.pervasive.datarush.ports.LogicalPort, boolean)).
- Specified by:
computeMetadata in class IterativeOperator
- Parameters:
ctx - the context
-
createIterator
protected CompositionIterator createIterator(MetadataContext ctx)
Description copied from class: IterativeOperator
Invoked at the start of execution. The iterator is expected to return a handle that is then used for execution.
- Specified by:
createIterator in class IterativeOperator
- Parameters:
ctx - a context in which the iterative operator can find input port metadata, etc. This information was available in the previous call to IterativeOperator.computeMetadata(IterativeMetadataContext), but is available here as well so that the iterative operator need not cache any metadata in its instance variables.
- Returns:
- a handle that is used for iteration
-
isSignificantlyDifferent
public static boolean isSignificantlyDifferent(Cluster first, Cluster second)
Returns true if the centers of both clusters are significantly different compared to each other. Two centers are considered significantly different if, for any i, the following is true:
Math.abs(first.getNumArray()[i] - second.getNumArray()[i]) > 1e-10*(first.getNumArray()[i] + second.getNumArray()[i])
- Parameters:
first - the reference cluster
second - the cluster to be compared to the reference
- Returns:
- true if the cluster centers are significantly different
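As a rough illustration of the criterion above, the comparison can be written against plain arrays. This sketch assumes the cluster centers are available as `double[]` values of equal length; the class `CenterComparisonSketch` is hypothetical and stands in for the `Cluster` type, whose `getNumArray()` return type is not shown here.

```java
/** Hypothetical stand-alone version of the documented comparison. */
final class CenterComparisonSketch {

    /**
     * Returns true if, for any index i, the absolute difference of the
     * coordinates exceeds 1e-10 times their sum, as in the documented formula.
     */
    static boolean isSignificantlyDifferent(double[] first, double[] second) {
        for (int i = 0; i < first.length; i++) {
            double difference = Math.abs(first[i] - second[i]);
            double threshold = 1e-10 * (first[i] + second[i]);
            if (difference > threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[] reference = {1.0, 2.0};
        System.out.println(isSignificantlyDifferent(reference, new double[] {1.0, 2.0}));
        System.out.println(isSignificantlyDifferent(reference, new double[] {1.0, 2.5}));
    }
}
```

Because the threshold scales with the sum of the two coordinates, the test is relative rather than absolute: larger coordinate values tolerate proportionally larger differences.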
-
-