Examples with KMeans - com.alibaba.alink.pipeline.clustering.KMeans

Example 1 with KMeans

Use of com.alibaba.alink.pipeline.clustering.KMeans in the Alink project by Alibaba.

From class Chap19, method c_3.

static void c_3() throws Exception {
    AkSourceBatchOp source = new AkSourceBatchOp().setFilePath(DATA_DIR + SPARSE_TRAIN_FILE);
    // Train a 39-component PCA model on the vector column and save it to disk.
    source.link(new PcaTrainBatchOp().setK(39).setCalculationType(CalculationType.COV).setVectorCol(VECTOR_COL_NAME).lazyPrintModelInfo()).link(new AkSinkBatchOp().setFilePath(DATA_DIR + PCA_MODEL_FILE).setOverwriteSink(true));
    BatchOperator.execute();
    // Apply the saved PCA model; the reduced vectors overwrite the original vector column.
    BatchOperator<?> pcaResult = new PcaPredictBatchOp().setVectorCol(VECTOR_COL_NAME).setPredictionCol(VECTOR_COL_NAME).linkFrom(new AkSourceBatchOp().setFilePath(DATA_DIR + PCA_MODEL_FILE), source);
    Stopwatch sw = new Stopwatch();
    KMeans kmeans = new KMeans().setK(10).setVectorCol(VECTOR_COL_NAME).setPredictionCol(PREDICTION_COL_NAME);
    // Baseline: cluster the original vectors, evaluate against the labels, and time the run.
    sw.reset();
    sw.start();
    kmeans.fit(source).transform(source).link(new EvalClusterBatchOp().setVectorCol(VECTOR_COL_NAME).setPredictionCol(PREDICTION_COL_NAME).setLabelCol(LABEL_COL_NAME).lazyPrintMetrics("KMeans"));
    BatchOperator.execute();
    sw.stop();
    System.out.println(sw.getElapsedTimeSpan());
    // Same clustering and evaluation on the PCA-reduced vectors, for comparison.
    sw.reset();
    sw.start();
    kmeans.fit(pcaResult).transform(pcaResult).link(new EvalClusterBatchOp().setVectorCol(VECTOR_COL_NAME).setPredictionCol(PREDICTION_COL_NAME).setLabelCol(LABEL_COL_NAME).lazyPrintMetrics("KMeans + PCA"));
    BatchOperator.execute();
    sw.stop();
    System.out.println(sw.getElapsedTimeSpan());
}

Also used : AkSourceBatchOp(com.alibaba.alink.operator.batch.source.AkSourceBatchOp) KMeans(com.alibaba.alink.pipeline.clustering.KMeans) PcaPredictBatchOp(com.alibaba.alink.operator.batch.feature.PcaPredictBatchOp) PcaTrainBatchOp(com.alibaba.alink.operator.batch.feature.PcaTrainBatchOp) Stopwatch(com.alibaba.alink.common.utils.Stopwatch) AkSinkBatchOp(com.alibaba.alink.operator.batch.sink.AkSinkBatchOp) EvalClusterBatchOp(com.alibaba.alink.operator.batch.evaluation.EvalClusterBatchOp)

Example 2 with KMeans

Use of com.alibaba.alink.pipeline.clustering.KMeans in the Alink project by Alibaba.

From class EvalClusterBatchOpTest, method testNoVector.

@Test
public void testNoVector() throws Exception {
    MemSourceBatchOp inOp = new MemSourceBatchOp(Arrays.asList(rows), new String[] { "label", "Y" });
    KMeans train = new KMeans().setVectorCol("Y").setPredictionCol("pred").setK(2);
    ClusterMetrics metrics = new EvalClusterBatchOp().setPredictionCol("pred").linkFrom(train.fit(inOp).transform(inOp)).collectMetrics();
    // JUnit expects the expected value first, then the actual value.
    Assert.assertEquals(6, metrics.getCount().intValue());
    Assert.assertArrayEquals(new String[] { "0", "1" }, metrics.getClusterArray());
}

Also used : MemSourceBatchOp(com.alibaba.alink.operator.batch.source.MemSourceBatchOp) KMeans(com.alibaba.alink.pipeline.clustering.KMeans) ClusterMetrics(com.alibaba.alink.operator.common.evaluation.ClusterMetrics) Test(org.junit.Test)

Example 3 with KMeans

Use of com.alibaba.alink.pipeline.clustering.KMeans in the Alink project by Alibaba.

From class GridSearchCVTest, method findBestCluster.

@Test
public void findBestCluster() {
    ColumnsToVector columnsToVector = new ColumnsToVector().setSelectedCols(colNames[0], colNames[1]).setVectorCol("vector");
    KMeans kMeans = new KMeans().setVectorCol("vector").setPredictionCol("pred");
    ParamGrid grid = new ParamGrid().addGrid(kMeans, KMeans.DISTANCE_TYPE, new HasKMeansDistanceType.DistanceType[] { EUCLIDEAN, COSINE });
    Pipeline pipeline = new Pipeline().add(columnsToVector).add(kMeans);
    GridSearchCV gridSearchCV = new GridSearchCV().setEstimator(pipeline).setParamGrid(grid).setNumFolds(2).setTuningEvaluator(new ClusterTuningEvaluator().setTuningClusterMetric(TuningClusterMetric.RI).setPredictionCol("pred").setVectorCol("vector").setLabelCol("label"));
    GridSearchCVModel model = gridSearchCV.fit(memSourceBatchOp);
    Assert.assertEquals(testArray.length, model.transform(memSourceBatchOp).collect().size());
}

Also used : KMeans(com.alibaba.alink.pipeline.clustering.KMeans) ColumnsToVector(com.alibaba.alink.pipeline.dataproc.format.ColumnsToVector) HasKMeansDistanceType(com.alibaba.alink.params.shared.clustering.HasKMeansDistanceType) Pipeline(com.alibaba.alink.pipeline.Pipeline) Test(org.junit.Test)
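
The grid above asks the search to compare two distance types, EUCLIDEAN and COSINE. As a rough illustration of how the two measures differ, here is a minimal plain-Java sketch. It is not Alink's implementation: the class and method names are hypothetical, and cosine distance is assumed to mean 1 minus cosine similarity.

```java
// Illustrative only: hypothetical helper, not part of Alink.
final class DistanceSketch {

    /** Euclidean distance: square root of the summed squared differences. */
    static double euclidean(double[] a, double[] b) {
        checkSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Cosine distance, taken here as 1 - cosine similarity (an assumption). */
    static double cosine(double[] a, double[] b) {
        checkSameLength(a, b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("cosine distance is undefined for a zero vector");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        double[] p = {1.0, 0.0};
        double[] q = {2.0, 0.0};
        // Same direction, different magnitude: the two measures disagree on how far apart p and q are.
        System.out.println("euclidean = " + euclidean(p, q));
        System.out.println("cosine    = " + cosine(p, q));
    }
}
```

Vectors that point the same way but differ in length are far apart under the Euclidean measure and coincide under the cosine one, which is why the choice can change the clusters and is worth tuning.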

Example 4 with KMeans

Use of com.alibaba.alink.pipeline.clustering.KMeans in the Alink project by Alibaba.

From class GridSearchTVSplitTest, method findBestCluster.

@Test
public void findBestCluster() throws Exception {
    ColumnsToVector columnsToVector = new ColumnsToVector().setSelectedCols(colNames[0], colNames[1]).setVectorCol("vector");
    KMeans kMeans = new KMeans().setVectorCol("vector").setPredictionCol("pred");
    ParamGrid grid = new ParamGrid().addGrid(kMeans, "distanceType", new HasKMeansDistanceType.DistanceType[] { EUCLIDEAN, COSINE });
    Pipeline pipeline = new Pipeline().add(columnsToVector).add(kMeans);
    GridSearchTVSplit gridSearchTVSplit = new GridSearchTVSplit().setEstimator(pipeline).setParamGrid(grid).setTrainRatio(0.5).setTuningEvaluator(new ClusterTuningEvaluator().setTuningClusterMetric(TuningClusterMetric.RI).setPredictionCol("pred").setVectorCol("vector").setLabelCol("label"));
    GridSearchTVSplitModel model = gridSearchTVSplit.fit(memSourceBatchOp);
    Assert.assertEquals(testArray.length, model.transform(memSourceBatchOp).collect().size());
}
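
Both grid-search examples score candidates with TuningClusterMetric.RI. Assuming RI stands for the Rand index, the sketch below shows one way that score can be computed from label and prediction columns: the fraction of point pairs on which the two groupings agree. It is a plain-Java illustration with hypothetical names, not the evaluator Alink uses.

```java
// Illustrative only: hypothetical helper, not part of Alink.
final class RandIndexSketch {

    /**
     * Rand index: the share of point pairs that are either grouped together
     * in both labelings or separated in both labelings.
     */
    static double randIndex(int[] labels, int[] clusters) {
        if (labels.length != clusters.length) {
            throw new IllegalArgumentException("labels and clusters must have the same length");
        }
        int n = labels.length;
        if (n < 2) {
            throw new IllegalArgumentException("at least two points are required");
        }
        long agreeing = 0;
        long pairs = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                boolean sameLabel = labels[i] == labels[j];
                boolean sameCluster = clusters[i] == clusters[j];
                if (sameLabel == sameCluster) {
                    agreeing++;
                }
                pairs++;
            }
        }
        return (double) agreeing / pairs;
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 1, 1};
        // Cluster ids are swapped relative to the labels, but the grouping is identical.
        int[] clusters = {1, 1, 0, 0};
        System.out.println(randIndex(labels, clusters));
    }
}
```

Because only pair membership is compared, the score does not depend on which numeric id a cluster receives, which suits KMeans output where cluster ids are arbitrary.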

Example 5 with KMeans

Use of com.alibaba.alink.pipeline.clustering.KMeans in the Alink project by Alibaba.

From class KMeansExample, method main.

public static void main(String[] args) throws Exception {
    String url = "https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/iris.csv";
    String schemaStr = "sepal_length double, sepal_width double, petal_length double, petal_width double, category string";
    // Use the wildcard type rather than the raw BatchOperator type.
    BatchOperator<?> data = new CsvSourceBatchOp().setFilePath(url).setSchemaStr(schemaStr);
    VectorAssembler va = new VectorAssembler().setSelectedCols(new String[] { "sepal_length", "sepal_width", "petal_length", "petal_width" }).setOutputCol("features");
    KMeans kMeans = new KMeans().setVectorCol("features").setK(3).setPredictionCol("prediction_result").setPredictionDetailCol("prediction_detail").setReservedCols("category").setMaxIter(100);
    Pipeline pipeline = new Pipeline().add(va).add(kMeans);
    pipeline.fit(data).transform(data).print();
}

Also used : KMeans(com.alibaba.alink.pipeline.clustering.KMeans) VectorAssembler(com.alibaba.alink.pipeline.dataproc.vector.VectorAssembler) BatchOperator(com.alibaba.alink.operator.batch.BatchOperator) CsvSourceBatchOp(com.alibaba.alink.operator.batch.source.CsvSourceBatchOp) Pipeline(com.alibaba.alink.pipeline.Pipeline)
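
The example above sets only K and a maximum iteration count and leaves the rest to the library. To make those two parameters concrete, here is a minimal plain-Java sketch of the textbook k-means loop (Lloyd's algorithm) with squared Euclidean distance. It is an assumption-laden illustration, not Alink's implementation: the class name is hypothetical, and the first k points are used as initial centers purely to keep the run deterministic.

```java
import java.util.Arrays;

// Illustrative only: hypothetical helper, not part of Alink.
final class KMeansSketch {

    /** Returns the cluster index of each point after at most maxIter iterations. */
    static int[] cluster(double[][] points, int k, int maxIter) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        int dim = points[0].length;
        // Assumption: seed the centers with the first k points so results are reproducible.
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach every point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0.0;
                    for (int d = 0; d < dim; d++) {
                        double diff = points[i][d] - centers[c][d];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // Converged before reaching maxIter.
            }
            // Update step: move each center to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // An empty cluster keeps its previous center.
                    for (int d = 0; d < dim; d++) {
                        centers[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}, {1, 0}, {11, 10}};
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

The iteration cap plays the same role as setMaxIter in the example: the loop stops early once assignments stop changing, and the cap only bounds the worst case.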

Aggregations

KMeans (com.alibaba.alink.pipeline.clustering.KMeans): 10
Test (org.junit.Test): 5
EvalClusterBatchOp (com.alibaba.alink.operator.batch.evaluation.EvalClusterBatchOp): 4
AkSourceBatchOp (com.alibaba.alink.operator.batch.source.AkSourceBatchOp): 4
Pipeline (com.alibaba.alink.pipeline.Pipeline): 4
Stopwatch (com.alibaba.alink.common.utils.Stopwatch): 3
MemSourceBatchOp (com.alibaba.alink.operator.batch.source.MemSourceBatchOp): 3
AkSinkBatchOp (com.alibaba.alink.operator.batch.sink.AkSinkBatchOp): 2
CsvSourceBatchOp (com.alibaba.alink.operator.batch.source.CsvSourceBatchOp): 2
ClusterMetrics (com.alibaba.alink.operator.common.evaluation.ClusterMetrics): 2
HasKMeansDistanceType (com.alibaba.alink.params.shared.clustering.HasKMeansDistanceType): 2
BisectingKMeans (com.alibaba.alink.pipeline.clustering.BisectingKMeans): 2
ColumnsToVector (com.alibaba.alink.pipeline.dataproc.format.ColumnsToVector): 2
SparseVector (com.alibaba.alink.common.linalg.SparseVector): 1
BatchOperator (com.alibaba.alink.operator.batch.BatchOperator): 1
KMeansPredictBatchOp (com.alibaba.alink.operator.batch.clustering.KMeansPredictBatchOp): 1
KMeansTrainBatchOp (com.alibaba.alink.operator.batch.clustering.KMeansTrainBatchOp): 1
VectorAssemblerBatchOp (com.alibaba.alink.operator.batch.dataproc.vector.VectorAssemblerBatchOp): 1
PcaPredictBatchOp (com.alibaba.alink.operator.batch.feature.PcaPredictBatchOp): 1
PcaTrainBatchOp (com.alibaba.alink.operator.batch.feature.PcaTrainBatchOp): 1