
Example 1 with Tokenizer

Use of org.apache.flink.test.testfunctions.Tokenizer in the Apache Flink project.

From the class TextOutputFormatITCase, method testProgram.

@Override
protected void testProgram() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> text = env.fromElements(WordCountData.TEXT);
    DataStream<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).keyBy(0).sum(1);
    counts.writeAsText(resultPath);
    env.execute("WriteAsTextTest");
}
Also used : Tuple2(org.apache.flink.api.java.tuple.Tuple2) StreamExecutionEnvironment(org.apache.flink.streaming.api.environment.StreamExecutionEnvironment) Tokenizer(org.apache.flink.test.testfunctions.Tokenizer)
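
The source of Tokenizer itself is not shown on this page. As a rough illustration of what the flatMap(new Tokenizer()) step followed by groupBy(0).sum(1) computes, here is a plain-Java sketch with no Flink dependency. The class name WordCountSketch and the methods tokenize and count are hypothetical, and the splitting rule (lowercase the line, split on non-word characters, drop empty tokens) is an assumption about how Tokenizer behaves, not something confirmed by the examples here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java sketch of the tokenize -> group -> sum pipeline.
// It does not use Flink and is not the Tokenizer class from the project.
class WordCountSketch {

    // Assumed tokenizing rule: lowercase the line, split on runs of
    // non-word characters, and skip empty tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Adds 1 per occurrence of each word, which mirrors the final totals
    // of emitting (word, 1) pairs, grouping on field 0 and summing field 1.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be, or not to be", "that is the question");
        System.out.println(count(lines));
    }
}
```

Note that in the DataStream examples keyBy(0).sum(1) emits a running total for every incoming record, so the written output contains intermediate counts as well; the sketch only reproduces the final totals.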

Example 2 with Tokenizer

Use of org.apache.flink.test.testfunctions.Tokenizer in the Apache Flink project.

From the class WordCountMapredITCase, method internalRun.

private void internalRun(boolean isTestDeprecatedAPI) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<LongWritable, Text>> input;
    if (isTestDeprecatedAPI) {
        input = env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
    } else {
        input = env.createInput(readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath));
    }
    DataSet<String> text = input.map(new MapFunction<Tuple2<LongWritable, Text>, String>() {

        @Override
        public String map(Tuple2<LongWritable, Text> value) throws Exception {
            return value.f1.toString();
        }
    });
    // split up the lines in pairs (2-tuples) containing: (word,1)
    DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).groupBy(0).sum(1);
    DataSet<Tuple2<Text, LongWritable>> words = counts.map(new MapFunction<Tuple2<String, Integer>, Tuple2<Text, LongWritable>>() {

        @Override
        public Tuple2<Text, LongWritable> map(Tuple2<String, Integer> value) throws Exception {
            return new Tuple2<Text, LongWritable>(new Text(value.f0), new LongWritable(value.f1));
        }
    });
    // Set up Hadoop Output Format
    HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat = new HadoopOutputFormat<Text, LongWritable>(new TextOutputFormat<Text, LongWritable>(), new JobConf());
    hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
    TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(resultPath));
    // Output & Execute
    words.output(hadoopOutputFormat);
    env.execute("Hadoop Compat WordCount");
}
Also used : Path(org.apache.hadoop.fs.Path) ExecutionEnvironment(org.apache.flink.api.java.ExecutionEnvironment) Text(org.apache.hadoop.io.Text) HadoopOutputFormat(org.apache.flink.api.java.hadoop.mapred.HadoopOutputFormat) TextInputFormat(org.apache.hadoop.mapred.TextInputFormat) Tuple2(org.apache.flink.api.java.tuple.Tuple2) LongWritable(org.apache.hadoop.io.LongWritable) Tokenizer(org.apache.flink.test.testfunctions.Tokenizer) JobConf(org.apache.hadoop.mapred.JobConf)

Example 3 with Tokenizer

Use of org.apache.flink.test.testfunctions.Tokenizer in the Apache Flink project.

From the class WordCountMapreduceITCase, method internalRun.

private void internalRun(boolean isTestDeprecatedAPI) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<LongWritable, Text>> input;
    if (isTestDeprecatedAPI) {
        input = env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
    } else {
        input = env.createInput(readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath));
    }
    DataSet<String> text = input.map(new MapFunction<Tuple2<LongWritable, Text>, String>() {

        @Override
        public String map(Tuple2<LongWritable, Text> value) throws Exception {
            return value.f1.toString();
        }
    });
    // split up the lines in pairs (2-tuples) containing: (word,1)
    DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).groupBy(0).sum(1);
    DataSet<Tuple2<Text, LongWritable>> words = counts.map(new MapFunction<Tuple2<String, Integer>, Tuple2<Text, LongWritable>>() {

        @Override
        public Tuple2<Text, LongWritable> map(Tuple2<String, Integer> value) throws Exception {
            return new Tuple2<Text, LongWritable>(new Text(value.f0), new LongWritable(value.f1));
        }
    });
    // Set up Hadoop Output Format
    Job job = Job.getInstance();
    HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat = new HadoopOutputFormat<Text, LongWritable>(new TextOutputFormat<Text, LongWritable>(), job);
    job.getConfiguration().set("mapred.textoutputformat.separator", " ");
    TextOutputFormat.setOutputPath(job, new Path(resultPath));
    // Output & Execute
    words.output(hadoopOutputFormat);
    env.execute("Hadoop Compat WordCount");
}
Also used : Path(org.apache.hadoop.fs.Path) ExecutionEnvironment(org.apache.flink.api.java.ExecutionEnvironment) Text(org.apache.hadoop.io.Text) HadoopOutputFormat(org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat) TextInputFormat(org.apache.hadoop.mapreduce.lib.input.TextInputFormat) Tuple2(org.apache.flink.api.java.tuple.Tuple2) LongWritable(org.apache.hadoop.io.LongWritable) Job(org.apache.hadoop.mapreduce.Job) Tokenizer(org.apache.flink.test.testfunctions.Tokenizer)

Example 4 with Tokenizer

Use of org.apache.flink.test.testfunctions.Tokenizer in the Apache Flink project.

From the class LocalExecutorITCase, method getWordCountPlan.

private Plan getWordCountPlan(File inFile, File outFile, int parallelism) {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(parallelism);
    env.readTextFile(inFile.getAbsolutePath()).flatMap(new Tokenizer()).groupBy(0).sum(1).writeAsCsv(outFile.getAbsolutePath());
    return env.createProgramPlan();
}
Also used : ExecutionEnvironment(org.apache.flink.api.java.ExecutionEnvironment) Tokenizer(org.apache.flink.test.testfunctions.Tokenizer)

Example 5 with Tokenizer

Use of org.apache.flink.test.testfunctions.Tokenizer in the Apache Flink project.

From the class CsvOutputFormatITCase, method testProgram.

@Override
protected void testProgram() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> text = env.fromElements(WordCountData.TEXT);
    DataStream<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).keyBy(0).sum(1);
    counts.writeAsCsv(resultPath);
    env.execute("WriteAsCsvTest");
}
Also used : Tuple2(org.apache.flink.api.java.tuple.Tuple2) StreamExecutionEnvironment(org.apache.flink.streaming.api.environment.StreamExecutionEnvironment) Tokenizer(org.apache.flink.test.testfunctions.Tokenizer)

Aggregations

Tokenizer (org.apache.flink.test.testfunctions.Tokenizer): 5
Tuple2 (org.apache.flink.api.java.tuple.Tuple2): 4
ExecutionEnvironment (org.apache.flink.api.java.ExecutionEnvironment): 3
StreamExecutionEnvironment (org.apache.flink.streaming.api.environment.StreamExecutionEnvironment): 2
Path (org.apache.hadoop.fs.Path): 2
LongWritable (org.apache.hadoop.io.LongWritable): 2
Text (org.apache.hadoop.io.Text): 2
HadoopOutputFormat (org.apache.flink.api.java.hadoop.mapred.HadoopOutputFormat): 1
HadoopOutputFormat (org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat): 1
JobConf (org.apache.hadoop.mapred.JobConf): 1
TextInputFormat (org.apache.hadoop.mapred.TextInputFormat): 1
Job (org.apache.hadoop.mapreduce.Job): 1
TextInputFormat (org.apache.hadoop.mapreduce.lib.input.TextInputFormat): 1