
Example 6 with LabeledPoint

Use of org.apache.spark.mllib.regression.LabeledPoint in project spring-boot-quick by vector4wang.

The class EmailFilter, method main.

public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local").setAppName("Spam Classification");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Load the ham (non-spam) and spam training texts, one email per line.
    JavaRDD<String> ham = sc.textFile("D:\\githubspace\\springbootquick\\src\\main\\resources\\ham.txt");
    JavaRDD<String> spam = sc.textFile("D:\\githubspace\\springbootquick\\src\\main\\resources\\spam.txt");
    // Map each email's words into a 10,000-dimensional term-frequency vector.
    final HashingTF tf = new HashingTF(10000);
    // Label spam as 1 (positive) and ham as 0 (negative).
    JavaRDD<LabeledPoint> posExamples = spam.map(email -> new LabeledPoint(1, tf.transform(Arrays.asList(email.split(" ")))));
    JavaRDD<LabeledPoint> negExamples = ham.map(email -> new LabeledPoint(0, tf.transform(Arrays.asList(email.split(" ")))));
    JavaRDD<LabeledPoint> trainingData = posExamples.union(negExamples);
    // Cache the data: logistic regression is iterative and re-reads it.
    trainingData.cache();
    LogisticRegressionWithSGD lrLearner = new LogisticRegressionWithSGD();
    LogisticRegressionModel model = lrLearner.run(trainingData.rdd());
    // Apply the same HashingTF transformation to the test examples, then predict.
    Vector posTestExample = tf.transform(Arrays.asList("O M G GET cheap stuff by sending money to ...".split(" ")));
    System.out.println(posTestExample.toJson());
    Vector negTestExample = tf.transform(Arrays.asList("Hi Dad, I started studying Spark the other ...".split(" ")));
    System.out.println("Prediction for positive test example: " + model.predict(posTestExample));
    System.out.println("Prediction for negative test example: " + model.predict(negTestExample));
}
Also used: HashingTF (org.apache.spark.mllib.feature.HashingTF), LogisticRegressionWithSGD (org.apache.spark.mllib.classification.LogisticRegressionWithSGD), LogisticRegressionModel (org.apache.spark.mllib.classification.LogisticRegressionModel), JavaSparkContext (org.apache.spark.api.java.JavaSparkContext), LabeledPoint (org.apache.spark.mllib.regression.LabeledPoint), SparkConf (org.apache.spark.SparkConf), Vector (org.apache.spark.mllib.linalg.Vector)
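
The example above prints predictions only for two hand-written test strings. As a quick sanity check (not part of the original code), the model can also be scored against its own training set. A minimal sketch, assuming the model and trainingData variables from the example above are still in scope:

// Hedged sketch: fraction of training examples the model classifies correctly.
long correct = trainingData.filter(p -> model.predict(p.features()) == p.label()).count();
double accuracy = (double) correct / trainingData.count();
System.out.println("Training accuracy: " + accuracy);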

Example 7 with LabeledPoint

Use of org.apache.spark.mllib.regression.LabeledPoint in project cdap by caskdata.

The class NaiveBayesTrainer, method run.

@Override
public void run(SparkExecutionPluginContext sparkContext, JavaRDD<StructuredRecord> input) throws Exception {
    Preconditions.checkArgument(input.count() != 0, "Input RDD is empty.");
    // Map each record's text field into a 100-dimensional term-frequency vector.
    final HashingTF tf = new HashingTF(100);
    JavaRDD<LabeledPoint> trainingData = input.map(new Function<StructuredRecord, LabeledPoint>() {

        @Override
        public LabeledPoint call(StructuredRecord record) throws Exception {
            // Should never happen; this check exists to verify app correctness in unit tests.
            if (inputSchema != null && !inputSchema.equals(record.getSchema())) {
                throw new IllegalStateException("Runtime schema does not match what was set at configure time.");
            }
            String text = record.get(config.fieldToClassify);
            return new LabeledPoint((Double) record.get(config.predictionField), tf.transform(Lists.newArrayList(text.split(" "))));
        }
    });
    trainingData.cache();
    // Train a Naive Bayes model with smoothing parameter lambda = 1.0.
    final NaiveBayesModel model = NaiveBayes.train(trainingData.rdd(), 1.0);
    // Save the model to a file in the output FileSet.
    JavaSparkContext javaSparkContext = sparkContext.getSparkContext();
    FileSet outputFS = sparkContext.getDataset(config.fileSetName);
    model.save(JavaSparkContext.toSparkContext(javaSparkContext), outputFS.getBaseLocation().append(config.path).toURI().getPath());
    // Read the texts to classify from the TEXTS_TO_CLASSIFY stream.
    JavaPairRDD<Long, String> textsToClassify = sparkContext.fromStream(TEXTS_TO_CLASSIFY, String.class);
    JavaRDD<Vector> featuresToClassify = textsToClassify.map(new Function<Tuple2<Long, String>, Vector>() {

        @Override
        public Vector call(Tuple2<Long, String> longWritableTextTuple2) throws Exception {
            String text = longWritableTextTuple2._2();
            return tf.transform(Lists.newArrayList(text.split(" ")));
        }
    });
    JavaRDD<Double> predict = model.predict(featuresToClassify);
    LOG.info("Predictions: {}", predict.collect());
    // Key the predictions by the original message text.
    JavaPairRDD<String, Double> keyedPredictions = textsToClassify.values().zip(predict);
    // Convert keys and values to byte[] so the pairs can be written to the dataset.
    JavaPairRDD<byte[], byte[]> bytesRDD = keyedPredictions.mapToPair(new PairFunction<Tuple2<String, Double>, byte[], byte[]>() {

        @Override
        public Tuple2<byte[], byte[]> call(Tuple2<String, Double> tuple) throws Exception {
            return new Tuple2<>(Bytes.toBytes(tuple._1()), Bytes.toBytes(tuple._2()));
        }
    });
    sparkContext.saveAsDataset(bytesRDD, CLASSIFIED_TEXTS);
}
Also used: LabeledPoint (org.apache.spark.mllib.regression.LabeledPoint), NaiveBayesModel (org.apache.spark.mllib.classification.NaiveBayesModel), StructuredRecord (co.cask.cdap.api.data.format.StructuredRecord), HashingTF (org.apache.spark.mllib.feature.HashingTF), JavaSparkContext (org.apache.spark.api.java.JavaSparkContext), Vector (org.apache.spark.mllib.linalg.Vector), FileSet (co.cask.cdap.api.dataset.lib.FileSet), Tuple2 (scala.Tuple2)
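
The plugin saves the model but never reads it back. For completeness (this is not part of the original plugin), the persisted model can be restored with NaiveBayesModel's loader API. A minimal sketch, assuming the same variables as in the run method above:

// Hedged sketch: reload the model saved by run() and reuse it for prediction.
String modelPath = outputFS.getBaseLocation().append(config.path).toURI().getPath();
NaiveBayesModel restored = NaiveBayesModel.load(JavaSparkContext.toSparkContext(javaSparkContext), modelPath);
JavaRDD<Double> restoredPredictions = restored.predict(featuresToClassify);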

Example 8 with LabeledPoint

Use of org.apache.spark.mllib.regression.LabeledPoint in project learning-spark by databricks.

The class MLlib, method main.

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("JavaBookExample");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    JavaRDD<String> spam = sc.textFile("files/spam.txt");
    JavaRDD<String> ham = sc.textFile("files/ham.txt");
    // Create a HashingTF instance to map email text to vectors of 100 features.
    final HashingTF tf = new HashingTF(100);
    // Each email is split into words, and each word is mapped to one feature.
    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    JavaRDD<LabeledPoint> positiveExamples = spam.map(new Function<String, LabeledPoint>() {

        @Override
        public LabeledPoint call(String email) {
            return new LabeledPoint(1, tf.transform(Arrays.asList(email.split(" "))));
        }
    });
    JavaRDD<LabeledPoint> negativeExamples = ham.map(new Function<String, LabeledPoint>() {

        @Override
        public LabeledPoint call(String email) {
            return new LabeledPoint(0, tf.transform(Arrays.asList(email.split(" "))));
        }
    });
    JavaRDD<LabeledPoint> trainingData = positiveExamples.union(negativeExamples);
    // Cache data since Logistic Regression is an iterative algorithm.
    trainingData.cache();
    // Create a Logistic Regression learner which uses the SGD (stochastic gradient descent) optimizer.
    LogisticRegressionWithSGD lrLearner = new LogisticRegressionWithSGD();
    // Run the actual learning algorithm on the training data.
    LogisticRegressionModel model = lrLearner.run(trainingData.rdd());
    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    Vector posTestExample = tf.transform(Arrays.asList("O M G GET cheap stuff by sending money to ...".split(" ")));
    Vector negTestExample = tf.transform(Arrays.asList("Hi Dad, I started studying Spark the other ...".split(" ")));
    // Now use the learned model to predict spam/ham for new emails.
    System.out.println("Prediction for positive test example: " + model.predict(posTestExample));
    System.out.println("Prediction for negative test example: " + model.predict(negTestExample));
    sc.stop();
}
Also used: HashingTF (org.apache.spark.mllib.feature.HashingTF), LogisticRegressionWithSGD (org.apache.spark.mllib.classification.LogisticRegressionWithSGD), LogisticRegressionModel (org.apache.spark.mllib.classification.LogisticRegressionModel), JavaSparkContext (org.apache.spark.api.java.JavaSparkContext), LabeledPoint (org.apache.spark.mllib.regression.LabeledPoint), SparkConf (org.apache.spark.SparkConf), Vector (org.apache.spark.mllib.linalg.Vector)
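
The SGD-based learner shown above was later deprecated in MLlib in favor of the LBFGS variant (which also appears in the Aggregations list below). Swapping it in is a small change; a minimal sketch, assuming the same trainingData as the example above:

// Hedged alternative: train with the LBFGS optimizer instead of SGD.
LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
        .setNumClasses(2)  // binary spam/ham classification
        .run(trainingData.rdd());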

Example 9 with LabeledPoint

Use of org.apache.spark.mllib.regression.LabeledPoint in project deeplearning4j by deeplearning4j.

The class MLLibUtil, method toLabeledPoint.

/**
 * Convert a dataset (feature vector) to a labeled point.
 * @param point the point to convert
 * @return the labeled point derived from this dataset
 */
private static LabeledPoint toLabeledPoint(DataSet point) {
    if (!point.getFeatureMatrix().isVector()) {
        throw new IllegalArgumentException("Feature matrix must be a vector");
    }
    Vector features = toVector(point.getFeatureMatrix().dup());
    // iamax gives the index of the largest entry in the label vector,
    // i.e. the class index of a one-hot encoded label.
    double label = Nd4j.getBlasWrapper().iamax(point.getLabels());
    return new LabeledPoint(label, features);
}
Also used: LabeledPoint (org.apache.spark.mllib.regression.LabeledPoint), Vector (org.apache.spark.mllib.linalg.Vector)
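
The toVector helper called above is defined elsewhere in MLLibUtil and is not shown here. A plausible sketch of such a conversion, copying an ND4J row vector into an MLlib dense vector (the body below is an assumption, not the project's actual implementation):

// Hypothetical sketch of a toVector helper; not the actual MLLibUtil code.
private static Vector toVector(INDArray arr) {
    double[] values = new double[(int) arr.length()];
    for (int i = 0; i < values.length; i++) {
        values[i] = arr.getDouble(i);  // copy each element of the row vector
    }
    return Vectors.dense(values);
}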

Example 10 with LabeledPoint

Use of org.apache.spark.mllib.regression.LabeledPoint in project deeplearning4j by deeplearning4j.

The class MLLibUtil, method pointOf.

/**
 * Returns a labeled point for the given writables,
 * where the final item is the label and the
 * preceding items are the features.
 * @param writables the writables
 * @return the labeled point
 */
public static LabeledPoint pointOf(Collection<Writable> writables) {
    double[] ret = new double[writables.size() - 1];
    int count = 0;
    double target = 0;
    for (Writable w : writables) {
        // All but the last writable are features; parse at double precision.
        if (count < writables.size() - 1)
            ret[count++] = Double.parseDouble(w.toString());
        else
            target = Double.parseDouble(w.toString());
    }
    if (target < 0)
        throw new IllegalStateException("Target must be >= 0");
    return new LabeledPoint(target, Vectors.dense(ret));
}
Also used: Writable (org.datavec.api.writable.Writable), LabeledPoint (org.apache.spark.mllib.regression.LabeledPoint)
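
For context (not in the original source), a call to pointOf on a small DataVec record might look like this; DoubleWritable is one concrete Writable implementation:

// Hedged usage sketch: two features followed by the label as the last item.
List<Writable> row = Arrays.<Writable>asList(
        new DoubleWritable(5.1), new DoubleWritable(3.5),  // features
        new DoubleWritable(0));                            // label
LabeledPoint p = MLLibUtil.pointOf(row);  // => label 0.0, features [5.1, 3.5]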

Aggregations

LabeledPoint (org.apache.spark.mllib.regression.LabeledPoint): 15 uses
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext): 6 uses
DataSet (org.nd4j.linalg.dataset.DataSet): 6 uses
SparkConf (org.apache.spark.SparkConf): 4 uses
LogisticRegressionModel (org.apache.spark.mllib.classification.LogisticRegressionModel): 4 uses
Vector (org.apache.spark.mllib.linalg.Vector): 4 uses
DateFormat (java.text.DateFormat): 3 uses
JavaRDD (org.apache.spark.api.java.JavaRDD): 3 uses
HashingTF (org.apache.spark.mllib.feature.HashingTF): 3 uses
BoostingStrategy (org.apache.spark.mllib.tree.configuration.BoostingStrategy): 3 uses
IrisDataSetIterator (org.deeplearning4j.datasets.iterator.impl.IrisDataSetIterator): 3 uses
BaseSparkTest (org.deeplearning4j.spark.BaseSparkTest): 3 uses
Tuple2 (scala.Tuple2): 3 uses
CmdlineParser (de.tototec.cmdoption.CmdlineParser): 2 uses
java.util (java.util): 2 uses
JavaPairRDD (org.apache.spark.api.java.JavaPairRDD): 2 uses
PairFunction (org.apache.spark.api.java.function.PairFunction): 2 uses
org.apache.spark.ml.feature (org.apache.spark.ml.feature): 2 uses
LogisticRegressionWithLBFGS (org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS): 2 uses
LogisticRegressionWithSGD (org.apache.spark.mllib.classification.LogisticRegressionWithSGD): 2 uses