Example 1 with RegressionEvaluator

Use of org.apache.spark.ml.evaluation.RegressionEvaluator in project mmtf-spark by sbl-sdsc.

Method fit of class SparkRegressor.

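/**
 * Fits the configured regressor and evaluates it on a held-out test set.
 * Dataset must at least contain the following two columns:
 * label: the numeric target values to predict
 * features: feature vector
 * @param data dataset to split into training and test sets
 * @return map with the method name and the RMSE on the test set
 */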
public Map<String, String> fit(Dataset<Row> data) {
    // Split the data into training and test sets (testFraction of the rows held out for testing)
    Dataset<Row>[] splits = data.randomSplit(new double[] { 1.0 - testFraction, testFraction }, seed);
    Dataset<Row> trainingData = splits[0];
    Dataset<Row> testData = splits[1];
    // Configure the label and feature columns of the regressor.
    predictor.setLabelCol(label).setFeaturesCol("features");
    // Wrap the regressor in a single-stage Pipeline
    Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { predictor });
    // Train the model on the training set.
    PipelineModel model = pipeline.fit(trainingData);
    // Make predictions.
    Dataset<Row> predictions = model.transform(testData);
    // Display some sample predictions
    System.out.println("Sample predictions: " + predictor.getClass().getSimpleName());
    String primaryKey = predictions.columns()[0];
    predictions.select(primaryKey, label, "prediction").sample(false, 0.1, seed).show(50);
    Map<String, String> metrics = new LinkedHashMap<>();
    metrics.put("Method", predictor.getClass().getSimpleName());
    // Compute the root mean squared error (RMSE) of the predictions on the test set
    RegressionEvaluator evaluator = new RegressionEvaluator().setLabelCol(label).setPredictionCol("prediction").setMetricName("rmse");
    metrics.put("rmse", Double.toString(evaluator.evaluate(predictions)));
    return metrics;
}
Also used: Dataset (org.apache.spark.sql.Dataset), Row (org.apache.spark.sql.Row), RegressionEvaluator (org.apache.spark.ml.evaluation.RegressionEvaluator), Pipeline (org.apache.spark.ml.Pipeline), PipelineModel (org.apache.spark.ml.PipelineModel), LinkedHashMap (java.util.LinkedHashMap)
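
For readers unfamiliar with the "rmse" metric requested from RegressionEvaluator above, the following is a minimal plain-Java sketch of what that number means: the square root of the mean squared difference between label and prediction. It is not Spark's implementation and does not use Spark at all. The class name RmseSketch, the helper rmse, and the sample values are hypothetical and chosen only for illustration.

```java
import java.util.Locale;

// Illustrative only: a stand-alone RMSE computation, independent of Spark.
public final class RmseSketch {

    // Hypothetical helper (not a Spark API): root mean squared error of paired values.
    public static double rmse(double[] labels, double[] predictions) {
        if (labels.length == 0 || labels.length != predictions.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquaredError = 0.0;
        for (int i = 0; i < labels.length; i++) {
            double error = labels[i] - predictions[i];
            sumSquaredError += error * error;
        }
        // Mean of the squared errors, then the square root.
        return Math.sqrt(sumSquaredError / labels.length);
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the label and "prediction" columns.
        double[] labels = { 3.0, -0.5, 2.0, 7.0 };
        double[] predictions = { 2.5, 0.0, 2.0, 8.0 };
        // Squared errors: 0.25, 0.25, 0.0, 1.0; mean 0.375; prints "rmse = 0.6124"
        System.out.println(String.format(Locale.ROOT, "rmse = %.4f", rmse(labels, predictions)));
    }
}
```

A lower value means the predictions are closer to the labels on average; the value is in the same units as the label column.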
