Example 1 with CLI

Use of org.apache.flink.streaming.examples.wordcount.util.CLI in the Apache Flink project.

Class WindowWordCount, method main.

// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
    final CLI params = CLI.fromArgs(args);
    // Create the execution environment. This is the main entrypoint
    // to building a Flink application.
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Apache Flink’s unified approach to stream and batch processing means that a DataStream
    // application executed over bounded input will produce the same final results regardless
    // of the configured execution mode. It is important to note what final means here: a job
    // executing in STREAMING mode might produce incremental updates (think upserts in
    // a database) while a BATCH job would only produce one final result at the end. The final
    // result will be the same if interpreted correctly, but getting there can be different.
    // 
    // The “classic” execution behavior of the DataStream API is called STREAMING execution
    // mode. Applications should use streaming execution for unbounded jobs that require
    // continuous incremental processing and are expected to stay online indefinitely.
    // 
    // By enabling BATCH execution, we allow Flink to apply additional optimizations that we
    // can only do when we know that our input is bounded. For example, different
    // join/aggregation strategies can be used, in addition to a different shuffle
    // implementation that allows more efficient task scheduling and failure recovery behavior.
    // 
    // By setting the runtime mode to AUTOMATIC, Flink will choose BATCH if all sources
    // are bounded and otherwise STREAMING.
    env.setRuntimeMode(params.getExecutionMode());
    // This optional step makes the input parameters
    // available in the Flink UI.
    env.getConfig().setGlobalJobParameters(params);
    DataStream<String> text;
    if (params.getInputs().isPresent()) {
        // Create a new file source that will read files from a given set of directories.
        // Each file will be processed as plain text and split based on newlines.
        FileSource.FileSourceBuilder<String> builder = FileSource.forRecordStreamFormat(new TextLineInputFormat(), params.getInputs().get());
        // If a discovery interval is provided, the source will
        // continuously watch the given directories for new files.
        params.getDiscoveryInterval().ifPresent(builder::monitorContinuously);
        text = env.fromSource(builder.build(), WatermarkStrategy.noWatermarks(), "file-input");
    } else {
        text = env.fromElements(WordCountData.WORDS).name("in-memory-input");
    }
    int windowSize = params.getInt("window").orElse(250);
    int slideSize = params.getInt("slide").orElse(150);
    // The tokenizer outputs each word as a 2-tuple (word, 1). The tuples are keyed by
    // word and summed over a sliding count window.
    DataStream<Tuple2<String, Integer>> counts =
            text.flatMap(new WordCount.Tokenizer())
                    .name("tokenizer")
                    .keyBy(value -> value.f0)
                    .countWindow(windowSize, slideSize)
                    .sum(1)
                    .name("counter");
    if (params.getOutput().isPresent()) {
        // Given an output directory, Flink will write the results to a file
        // using a simple string encoding. In a production environment, this might
        // be something more structured like CSV, Avro, JSON, or Parquet.
        counts.sinkTo(
                        FileSink.<Tuple2<String, Integer>>forRowFormat(
                                        params.getOutput().get(), new SimpleStringEncoder<>())
                                .withRollingPolicy(
                                        DefaultRollingPolicy.builder()
                                                .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                                .withRolloverInterval(Duration.ofSeconds(10))
                                                .build())
                                .build())
                .name("file-sink");
    } else {
        counts.print().name("print-sink");
    }
    // Apache Flink applications are composed lazily. Calling execute
    // submits the Job and begins processing.
    env.execute("WindowWordCount");
}
Also used : Tuple2(org.apache.flink.api.java.tuple.Tuple2) WordCount(org.apache.flink.streaming.examples.wordcount.WordCount) WatermarkStrategy(org.apache.flink.api.common.eventtime.WatermarkStrategy) FileSink(org.apache.flink.connector.file.sink.FileSink) MemorySize(org.apache.flink.configuration.MemorySize) FileSource(org.apache.flink.connector.file.src.FileSource) DataStream(org.apache.flink.streaming.api.datastream.DataStream) TextLineInputFormat(org.apache.flink.connector.file.src.reader.TextLineInputFormat) SimpleStringEncoder(org.apache.flink.api.common.serialization.SimpleStringEncoder) DefaultRollingPolicy(org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy) Duration(java.time.Duration) CLI(org.apache.flink.streaming.examples.wordcount.util.CLI) StreamExecutionEnvironment(org.apache.flink.streaming.api.environment.StreamExecutionEnvironment) WordCountData(org.apache.flink.streaming.examples.wordcount.util.WordCountData)
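
The countWindow(windowSize, slideSize) call in the listing above can be modelled in plain Java. The sketch below assumes that a sliding count window fires after every `slide` elements of a key and evaluates at most the last `size` elements; that is an assumption about Flink's behavior, not something the listing states. The class and method names are hypothetical and no Flink API is used.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical plain-Java model of a sliding count window for a single key.
final class SlidingCountWindowSketch {

    // Returns the sum emitted at each firing of the window.
    static List<Integer> windowSums(int[] values, int size, int slide) {
        Deque<Integer> buffer = new ArrayDeque<>();
        List<Integer> sums = new ArrayList<>();
        int seen = 0;
        for (int v : values) {
            buffer.addLast(v);
            seen++;
            // Assumed trigger rule: fire after every `slide` elements.
            if (seen % slide == 0) {
                // Assumed eviction rule: keep only the newest `size` elements.
                while (buffer.size() > size) {
                    buffer.removeFirst();
                }
                int sum = 0;
                for (int b : buffer) {
                    sum += b;
                }
                sums.add(sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] ones = {1, 1, 1, 1, 1, 1, 1};
        System.out.println(windowSums(ones, 5, 3)); // prints [3, 5]
    }
}
```

Under the same assumption, the defaults in the listing (window 250, slide 150) would produce a first count for a word after its 150th occurrence, and a second count after the 300th occurrence that covers only the last 250.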

Example 2 with CLI

Use of org.apache.flink.streaming.examples.wordcount.util.CLI in the Apache Flink project.

Class WordCount, method main.

// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
    final CLI params = CLI.fromArgs(args);
    // Create the execution environment. This is the main entrypoint
    // to building a Flink application.
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Apache Flink’s unified approach to stream and batch processing means that a DataStream
    // application executed over bounded input will produce the same final results regardless
    // of the configured execution mode. It is important to note what final means here: a job
    // executing in STREAMING mode might produce incremental updates (think upserts in
    // a database) while in BATCH mode, it would only produce one final result at the end. The
    // final result will be the same if interpreted correctly, but getting there can be
    // different.
    // 
    // The “classic” execution behavior of the DataStream API is called STREAMING execution
    // mode. Applications should use streaming execution for unbounded jobs that require
    // continuous incremental processing and are expected to stay online indefinitely.
    // 
    // By enabling BATCH execution, we allow Flink to apply additional optimizations that we
    // can only do when we know that our input is bounded. For example, different
    // join/aggregation strategies can be used, in addition to a different shuffle
    // implementation that allows more efficient task scheduling and failure recovery behavior.
    // 
    // By setting the runtime mode to AUTOMATIC, Flink will choose BATCH if all sources
    // are bounded and otherwise STREAMING.
    env.setRuntimeMode(params.getExecutionMode());
    // This optional step makes the input parameters
    // available in the Flink UI.
    env.getConfig().setGlobalJobParameters(params);
    DataStream<String> text;
    if (params.getInputs().isPresent()) {
        // Create a new file source that will read files from a given set of directories.
        // Each file will be processed as plain text and split based on newlines.
        FileSource.FileSourceBuilder<String> builder = FileSource.forRecordStreamFormat(new TextLineInputFormat(), params.getInputs().get());
        // If a discovery interval is provided, the source will
        // continuously watch the given directories for new files.
        params.getDiscoveryInterval().ifPresent(builder::monitorContinuously);
        text = env.fromSource(builder.build(), WatermarkStrategy.noWatermarks(), "file-input");
    } else {
        text = env.fromElements(WordCountData.WORDS).name("in-memory-input");
    }
    // The tokenizer outputs each word as a 2-tuple (word, 1). The tuples are keyed by
    // word and field 1 is summed per key.
    DataStream<Tuple2<String, Integer>> counts =
            text.flatMap(new Tokenizer())
                    .name("tokenizer")
                    .keyBy(value -> value.f0)
                    .sum(1)
                    .name("counter");
    if (params.getOutput().isPresent()) {
        // Given an output directory, Flink will write the results to a file
        // using a simple string encoding. In a production environment, this might
        // be something more structured like CSV, Avro, JSON, or Parquet.
        counts.sinkTo(
                        FileSink.<Tuple2<String, Integer>>forRowFormat(
                                        params.getOutput().get(), new SimpleStringEncoder<>())
                                .withRollingPolicy(
                                        DefaultRollingPolicy.builder()
                                                .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                                .withRolloverInterval(Duration.ofSeconds(10))
                                                .build())
                                .build())
                .name("file-sink");
    } else {
        counts.print().name("print-sink");
    }
    // Apache Flink applications are composed lazily. Calling execute
    // submits the Job and begins processing.
    env.execute("WordCount");
}
Also used : CLI(org.apache.flink.streaming.examples.wordcount.util.CLI) TextLineInputFormat(org.apache.flink.connector.file.src.reader.TextLineInputFormat) Tuple2(org.apache.flink.api.java.tuple.Tuple2) FileSource(org.apache.flink.connector.file.src.FileSource) StreamExecutionEnvironment(org.apache.flink.streaming.api.environment.StreamExecutionEnvironment)
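
The comments in the listing above say that a STREAMING job may emit incremental updates while a BATCH job emits one final result, and that both agree "if interpreted correctly". The following plain-Java sketch illustrates that idea for a keyed word count. It is only a model: the class and method names are hypothetical, updates are encoded as "word=count" strings for brevity, and no Flink API is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of incremental (streaming) versus final (batch) results.
final class ExecutionModeSketch {

    // Streaming style: emit an updated "word=count" after every input record.
    static List<String> streamingUpdates(List<String> words) {
        Map<String, Integer> state = new LinkedHashMap<>();
        List<String> updates = new ArrayList<>();
        for (String word : words) {
            int count = state.merge(word, 1, Integer::sum);
            updates.add(word + "=" + count);
        }
        return updates;
    }

    // Batch style: emit a single final count per word.
    static Map<String, Integer> batchResult(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Interpret the streaming updates as upserts: the latest value per key wins.
    static Map<String, Integer> applyUpserts(List<String> updates) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (String update : updates) {
            int sep = update.lastIndexOf('=');
            table.put(update.substring(0, sep), Integer.parseInt(update.substring(sep + 1)));
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");
        System.out.println(streamingUpdates(words)); // prints [to=1, be=1, or=1, not=1, to=2, be=2]
        System.out.println(batchResult(words));      // prints {to=2, be=2, or=1, not=1}
    }
}
```

Applying the six streaming updates as upserts yields the same table as the batch result, which is the sense in which the two modes agree.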

Example 3 with CLI

Use of org.apache.flink.streaming.examples.wordcount.util.CLI in the Apache Flink project.

Class TopSpeedWindowing, method main.

// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
    final CLI params = CLI.fromArgs(args);
    // Create the execution environment. This is the main entrypoint
    // to building a Flink application.
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Apache Flink’s unified approach to stream and batch processing means that a DataStream
    // application executed over bounded input will produce the same final results regardless
    // of the configured execution mode. It is important to note what final means here: a job
    // executing in STREAMING mode might produce incremental updates (think upserts in
    // a database) while a BATCH job would only produce one final result at the end. The final
    // result will be the same if interpreted correctly, but getting there can be different.
    // 
    // The “classic” execution behavior of the DataStream API is called STREAMING execution
    // mode. Applications should use streaming execution for unbounded jobs that require
    // continuous incremental processing and are expected to stay online indefinitely.
    // 
    // By enabling BATCH execution, we allow Flink to apply additional optimizations that we
    // can only do when we know that our input is bounded. For example, different
    // join/aggregation strategies can be used, in addition to a different shuffle
    // implementation that allows more efficient task scheduling and failure recovery behavior.
    // 
    // By setting the runtime mode to AUTOMATIC, Flink will choose BATCH if all sources
    // are bounded and otherwise STREAMING.
    env.setRuntimeMode(params.getExecutionMode());
    // This optional step makes the input parameters
    // available in the Flink UI.
    env.getConfig().setGlobalJobParameters(params);
    DataStream<Tuple4<Integer, Integer, Double, Long>> carData;
    if (params.getInputs().isPresent()) {
        // Create a new file source that will read files from a given set of directories.
        // Each file will be processed as plain text and split based on newlines.
        FileSource.FileSourceBuilder<String> builder = FileSource.forRecordStreamFormat(new TextLineInputFormat(), params.getInputs().get());
        // If a discovery interval is provided, the source will
        // continuously watch the given directories for new files.
        params.getDiscoveryInterval().ifPresent(builder::monitorContinuously);
        carData = env.fromSource(builder.build(), WatermarkStrategy.noWatermarks(), "file-input").map(new ParseCarData()).name("parse-input");
    } else {
        carData = env.addSource(CarSource.create(2)).name("in-memory-source");
    }
    int evictionSec = 10;
    double triggerMeters = 50;
    // Per car (f0), keep the last evictionSec seconds of records and, whenever the
    // distance (f2) has grown by more than triggerMeters, emit the record with the
    // highest value in field 1.
    DataStream<Tuple4<Integer, Integer, Double, Long>> topSpeeds =
            carData.assignTimestampsAndWatermarks(
                            WatermarkStrategy
                                    .<Tuple4<Integer, Integer, Double, Long>>forMonotonousTimestamps()
                                    .withTimestampAssigner((car, ts) -> car.f3))
                    .keyBy(value -> value.f0)
                    .window(GlobalWindows.create())
                    .evictor(TimeEvictor.of(Time.of(evictionSec, TimeUnit.SECONDS)))
                    .trigger(
                            DeltaTrigger.of(
                                    triggerMeters,
                                    new DeltaFunction<Tuple4<Integer, Integer, Double, Long>>() {

                                        private static final long serialVersionUID = 1L;

                                        @Override
                                        public double getDelta(
                                                Tuple4<Integer, Integer, Double, Long> oldDataPoint,
                                                Tuple4<Integer, Integer, Double, Long> newDataPoint) {
                                            return newDataPoint.f2 - oldDataPoint.f2;
                                        }
                                    },
                                    carData.getType().createSerializer(env.getConfig())))
                    .maxBy(1);
    if (params.getOutput().isPresent()) {
        // Given an output directory, Flink will write the results to a file
        // using a simple string encoding. In a production environment, this might
        // be something more structured like CSV, Avro, JSON, or Parquet.
        topSpeeds.sinkTo(
                        FileSink.<Tuple4<Integer, Integer, Double, Long>>forRowFormat(
                                        params.getOutput().get(), new SimpleStringEncoder<>())
                                .withRollingPolicy(
                                        DefaultRollingPolicy.builder()
                                                .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                                .withRolloverInterval(Duration.ofSeconds(10))
                                                .build())
                                .build())
                .name("file-sink");
    } else {
        topSpeeds.print();
    }
    env.execute("CarTopSpeedWindowingExample");
}
Also used : CLI(org.apache.flink.streaming.examples.wordcount.util.CLI) TextLineInputFormat(org.apache.flink.connector.file.src.reader.TextLineInputFormat) FileSource(org.apache.flink.connector.file.src.FileSource) Tuple4(org.apache.flink.api.java.tuple.Tuple4) StreamExecutionEnvironment(org.apache.flink.streaming.api.environment.StreamExecutionEnvironment)
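
The DeltaTrigger in the listing above compares field f2 of a new record with that of an earlier one and fires once the difference exceeds triggerMeters. A minimal plain-Java model of that rule is sketched below. It assumes the trigger remembers the reading at which it last fired (or the first reading seen) and compares every later reading against it; this is an assumption about Flink's behavior rather than something shown in the listing. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java model of a delta trigger for a single car.
final class DeltaTriggerSketch {

    // Returns the indices of the readings at which the trigger fires.
    static List<Integer> firingIndices(double[] distances, double threshold) {
        List<Integer> fired = new ArrayList<>();
        Double reference = null; // first reading, then the reading at the last firing
        for (int i = 0; i < distances.length; i++) {
            if (reference == null) {
                reference = distances[i];
            } else if (distances[i] - reference > threshold) {
                // Same delta as getDelta in the listing: new distance minus old distance.
                reference = distances[i];
                fired.add(i);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        double[] distances = {0, 20, 45, 60, 90, 115, 170};
        System.out.println(firingIndices(distances, 50)); // prints [3, 5, 6]
    }
}
```

With a threshold of 50, the model fires at 60 (delta 60 from 0), at 115 (delta 55 from 60) and at 170 (delta 55 from 115); the readings in between do not move far enough from the reference to fire.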

Aggregations

FileSource (org.apache.flink.connector.file.src.FileSource): 3
TextLineInputFormat (org.apache.flink.connector.file.src.reader.TextLineInputFormat): 3
StreamExecutionEnvironment (org.apache.flink.streaming.api.environment.StreamExecutionEnvironment): 3
CLI (org.apache.flink.streaming.examples.wordcount.util.CLI): 3
Tuple2 (org.apache.flink.api.java.tuple.Tuple2): 2
Duration (java.time.Duration): 1
WatermarkStrategy (org.apache.flink.api.common.eventtime.WatermarkStrategy): 1
SimpleStringEncoder (org.apache.flink.api.common.serialization.SimpleStringEncoder): 1
Tuple4 (org.apache.flink.api.java.tuple.Tuple4): 1
MemorySize (org.apache.flink.configuration.MemorySize): 1
FileSink (org.apache.flink.connector.file.sink.FileSink): 1
DataStream (org.apache.flink.streaming.api.datastream.DataStream): 1
DefaultRollingPolicy (org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy): 1
WordCount (org.apache.flink.streaming.examples.wordcount.WordCount): 1
WordCountData (org.apache.flink.streaming.examples.wordcount.util.WordCountData): 1