Search in sources:

Example 1 with SimpleStringEncoder

Use of org.apache.flink.api.common.serialization.SimpleStringEncoder in project flink by apache.

Class WindowWordCount, method main.

// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
    final CLI params = CLI.fromArgs(args);
    // Create the execution environment. This is the main entrypoint
    // to building a Flink application.
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Apache Flink’s unified approach to stream and batch processing means that a DataStream
    // application executed over bounded input will produce the same final results regardless
    // of the configured execution mode. It is important to note what final means here: a job
    // executing in STREAMING mode might produce incremental updates (think upserts in
    // a database) while a BATCH job would only produce one final result at the end. The final
    // result will be the same if interpreted correctly, but getting there can be different.
    // 
    // The “classic” execution behavior of the DataStream API is called STREAMING execution
    // mode. Applications should use streaming execution for unbounded jobs that require
    // continuous incremental processing and are expected to stay online indefinitely.
    // 
    // By enabling BATCH execution, we allow Flink to apply additional optimizations that we
    // can only do when we know that our input is bounded. For example, different
    // join/aggregation strategies can be used, in addition to a different shuffle
    // implementation that allows more efficient task scheduling and failure recovery behavior.
    // 
    // By setting the runtime mode to AUTOMATIC, Flink will choose BATCH if all sources
    // are bounded and otherwise STREAMING.
    env.setRuntimeMode(params.getExecutionMode());
    // This optional step makes the input parameters
    // available in the Flink UI.
    env.getConfig().setGlobalJobParameters(params);
    DataStream<String> text;
    if (params.getInputs().isPresent()) {
        // Create a new file source that will read files from a given set of directories.
        // Each file will be processed as plain text and split based on newlines.
        FileSource.FileSourceBuilder<String> builder = FileSource.forRecordStreamFormat(new TextLineInputFormat(), params.getInputs().get());
        // If a discovery interval is provided, the source will
        // continuously watch the given directories for new files.
        params.getDiscoveryInterval().ifPresent(builder::monitorContinuously);
        text = env.fromSource(builder.build(), WatermarkStrategy.noWatermarks(), "file-input");
    } else {
        text = env.fromElements(WordCountData.WORDS).name("in-memory-input");
    }
    int windowSize = params.getInt("window").orElse(250);
    int slideSize = params.getInt("slide").orElse(150);
    // This DataStream will output each word as a (2-tuple) containing (word, 1)
    DataStream<Tuple2<String, Integer>> counts =
            text.flatMap(new WordCount.Tokenizer())
                    .name("tokenizer")
                    .keyBy(value -> value.f0)
                    .countWindow(windowSize, slideSize)
                    .sum(1)
                    .name("counter");
    if (params.getOutput().isPresent()) {
        // Given an output directory, Flink will write the results to a file
        // using a simple string encoding. In a production environment, this might
        // be something more structured like CSV, Avro, JSON, or Parquet.
        counts.sinkTo(
                        FileSink.<Tuple2<String, Integer>>forRowFormat(
                                        params.getOutput().get(), new SimpleStringEncoder<>())
                                .withRollingPolicy(
                                        DefaultRollingPolicy.builder()
                                                .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                                .withRolloverInterval(Duration.ofSeconds(10))
                                                .build())
                                .build())
                .name("file-sink");
    } else {
        counts.print().name("print-sink");
    }
    // Apache Flink applications are composed lazily. Calling execute
    // submits the Job and begins processing.
    env.execute("WindowWordCount");
}
Also used: Tuple2 (org.apache.flink.api.java.tuple.Tuple2) WordCount (org.apache.flink.streaming.examples.wordcount.WordCount) WatermarkStrategy (org.apache.flink.api.common.eventtime.WatermarkStrategy) FileSink (org.apache.flink.connector.file.sink.FileSink) MemorySize (org.apache.flink.configuration.MemorySize) FileSource (org.apache.flink.connector.file.src.FileSource) DataStream (org.apache.flink.streaming.api.datastream.DataStream) TextLineInputFormat (org.apache.flink.connector.file.src.reader.TextLineInputFormat) SimpleStringEncoder (org.apache.flink.api.common.serialization.SimpleStringEncoder) DefaultRollingPolicy (org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy) Duration (java.time.Duration) CLI (org.apache.flink.streaming.examples.wordcount.util.CLI) StreamExecutionEnvironment (org.apache.flink.streaming.api.environment.StreamExecutionEnvironment) WordCountData (org.apache.flink.streaming.examples.wordcount.util.WordCountData)
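
For context, SimpleStringEncoder writes each record's toString() followed by a newline; the no-argument constructor uses UTF-8, and a charset name can also be passed explicitly. Below is a minimal, self-contained sketch of a row-format FileSink like the one above, with the charset spelled out. The class name and output path are placeholders for illustration, not part of the original example.

import java.time.Duration;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class FileSinkSketch {

    // Builds a row-format FileSink: each Tuple2 is written as its toString()
    // followed by a newline. The output path below is a placeholder.
    static FileSink<Tuple2<String, Integer>> buildSink() {
        return FileSink.<Tuple2<String, Integer>>forRowFormat(
                        new Path("/tmp/wordcount-out"),
                        // Explicit charset; the no-arg constructor defaults to UTF-8.
                        new SimpleStringEncoder<Tuple2<String, Integer>>("UTF-8"))
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                .withRolloverInterval(Duration.ofSeconds(10))
                                .build())
                .build();
    }
}

Wired into the job above, counts.sinkTo(buildSink()).name("file-sink") would produce the same kind of line-per-record part files as the inline version.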

Example 2 with SimpleStringEncoder

Use of org.apache.flink.api.common.serialization.SimpleStringEncoder in project flink by apache.

Class StateMachineExample, method main.

/**
 * Main entry point for the program.
 *
 * @param args The command line arguments.
 */
public static void main(String[] args) throws Exception {
    // ---- print some usage help ----
    System.out.println("Usage with built-in data generator: StateMachineExample [--error-rate <probability-of-invalid-transition>] [--sleep <sleep-per-record-in-ms>]");
    System.out.println("Usage with Kafka: StateMachineExample --kafka-topic <topic> [--brokers <brokers>]");
    System.out.println("Options for both the above setups: ");
    System.out.println("\t[--backend <hashmap|rocks>]");
    System.out.println("\t[--checkpoint-dir <filepath>]");
    System.out.println("\t[--incremental-checkpoints <true|false>]");
    System.out.println("\t[--output <filepath> OR null for stdout]");
    System.out.println();
    // ---- determine whether to use the built-in source, or read from Kafka ----
    final DataStream<Event> events;
    final ParameterTool params = ParameterTool.fromArgs(args);
    // create the environment to create streams and configure execution
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(2000L);
    final String stateBackend = params.get("backend", "memory");
    if ("hashmap".equals(stateBackend)) {
        final String checkpointDir = params.get("checkpoint-dir");
        env.setStateBackend(new HashMapStateBackend());
        env.getCheckpointConfig().setCheckpointStorage(checkpointDir);
    } else if ("rocks".equals(stateBackend)) {
        final String checkpointDir = params.get("checkpoint-dir");
        boolean incrementalCheckpoints = params.getBoolean("incremental-checkpoints", false);
        env.setStateBackend(new EmbeddedRocksDBStateBackend(incrementalCheckpoints));
        env.getCheckpointConfig().setCheckpointStorage(checkpointDir);
    }
    if (params.has("kafka-topic")) {
        // set up the Kafka reader
        String kafkaTopic = params.get("kafka-topic");
        String brokers = params.get("brokers", "localhost:9092");
        System.out.printf("Reading from kafka topic %s @ %s\n", kafkaTopic, brokers);
        System.out.println();
        KafkaSource<Event> source =
                KafkaSource.<Event>builder()
                        .setBootstrapServers(brokers)
                        .setGroupId("stateMachineExample")
                        .setTopics(kafkaTopic)
                        .setDeserializer(
                                KafkaRecordDeserializationSchema.valueOnly(
                                        new EventDeSerializationSchema()))
                        .setStartingOffsets(OffsetsInitializer.latest())
                        .build();
        events = env.fromSource(source, WatermarkStrategy.noWatermarks(), "StateMachineExampleSource");
    } else {
        double errorRate = params.getDouble("error-rate", 0.0);
        int sleep = params.getInt("sleep", 1);
        System.out.printf("Using standalone source with error rate %f and sleep delay %s millis\n", errorRate, sleep);
        System.out.println();
        events = env.addSource(new EventsGeneratorSource(errorRate, sleep));
    }
    // ---- main program ----
    final String outputFile = params.get("output");
    // make parameters available in the web interface
    env.getConfig().setGlobalJobParameters(params);
    DataStream<Alert> alerts = events.keyBy(Event::sourceAddress).flatMap(new StateMachineMapper());
    // output the alerts to std-out
    if (outputFile == null) {
        alerts.print();
    } else {
        alerts.sinkTo(
                        FileSink.<Alert>forRowFormat(
                                        new Path(outputFile), new SimpleStringEncoder<>())
                                .withRollingPolicy(
                                        DefaultRollingPolicy.builder()
                                                .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                                .withRolloverInterval(Duration.ofSeconds(10))
                                                .build())
                                .build())
                .setParallelism(1)
                .name("output");
    }
    // trigger program execution
    env.execute("State machine job");
}
Also used: ParameterTool (org.apache.flink.api.java.utils.ParameterTool) Path (org.apache.flink.core.fs.Path) EmbeddedRocksDBStateBackend (org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend) EventsGeneratorSource (org.apache.flink.streaming.examples.statemachine.generator.EventsGeneratorSource) Event (org.apache.flink.streaming.examples.statemachine.event.Event) StreamExecutionEnvironment (org.apache.flink.streaming.api.environment.StreamExecutionEnvironment) HashMapStateBackend (org.apache.flink.runtime.state.hashmap.HashMapStateBackend) Alert (org.apache.flink.streaming.examples.statemachine.event.Alert) EventDeSerializationSchema (org.apache.flink.streaming.examples.statemachine.kafka.EventDeSerializationSchema) SimpleStringEncoder (org.apache.flink.api.common.serialization.SimpleStringEncoder)
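
As a side note on what the row-format sink delegates to the encoder, the short sketch below pushes two sample strings through a SimpleStringEncoder into an in-memory stream. The Alert-like strings are made up for illustration and are not the actual Alert.toString() format.

import java.io.ByteArrayOutputStream;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;

public class EncoderSketch {

    public static void main(String[] args) throws Exception {
        SimpleStringEncoder<String> encoder = new SimpleStringEncoder<>();
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Each call appends the element's toString() plus a newline, which is why
        // part files written by the row-format sink contain one record per line.
        encoder.encode("Alert{address=1, state=Terminal}", out);
        encoder.encode("Alert{address=2, state=InvalidTransition}", out);

        // Prints the two lines that would land in a part file.
        System.out.print(out.toString("UTF-8"));
    }
}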

Example 3 with SimpleStringEncoder

Use of org.apache.flink.api.common.serialization.SimpleStringEncoder in project flink by apache.

Class CompactFileWriterTest, method testEmitEndCheckpointAfterEndInput.

@Test
public void testEmitEndCheckpointAfterEndInput() throws Exception {
    CompactFileWriter<RowData> compactFileWriter =
            new CompactFileWriter<>(
                    1000, StreamingFileSink.forRowFormat(folder, new SimpleStringEncoder<>()));
    try (OneInputStreamOperatorTestHarness<RowData, CoordinatorInput> harness = new OneInputStreamOperatorTestHarness<>(compactFileWriter)) {
        harness.setup();
        harness.open();
        harness.processElement(row("test"), 0);
        harness.snapshot(1, 1);
        harness.notifyOfCompletedCheckpoint(1);
        List<CoordinatorInput> coordinatorInputs = harness.extractOutputValues();
        Assert.assertEquals(2, coordinatorInputs.size());
        // assert emit InputFile
        Assert.assertTrue(coordinatorInputs.get(0) instanceof InputFile);
        // assert emit EndCheckpoint
        Assert.assertEquals(1, ((EndCheckpoint) coordinatorInputs.get(1)).getCheckpointId());
        harness.processElement(row("test1"), 0);
        harness.processElement(row("test2"), 0);
        harness.getOutput().clear();
        // end input
        harness.endInput();
        coordinatorInputs = harness.extractOutputValues();
        // assert emit EndCheckpoint with Long.MAX_VALUE lastly
        EndCheckpoint endCheckpoint = (EndCheckpoint) coordinatorInputs.get(coordinatorInputs.size() - 1);
        Assert.assertEquals(Long.MAX_VALUE, endCheckpoint.getCheckpointId());
    }
}
Also used: GenericRowData (org.apache.flink.table.data.GenericRowData) RowData (org.apache.flink.table.data.RowData) CoordinatorInput (org.apache.flink.connector.file.table.stream.compact.CompactMessages.CoordinatorInput) EndCheckpoint (org.apache.flink.connector.file.table.stream.compact.CompactMessages.EndCheckpoint) OneInputStreamOperatorTestHarness (org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness) SimpleStringEncoder (org.apache.flink.api.common.serialization.SimpleStringEncoder) InputFile (org.apache.flink.connector.file.table.stream.compact.CompactMessages.InputFile) Test (org.junit.Test)
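
The test hands a StreamingFileSink row-format builder straight to CompactFileWriter. For reference, the same builder shape can also be turned into a standalone sink; a minimal sketch follows, with a placeholder class name and path.

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.table.data.RowData;

public class RowFormatSinkSketch {

    // Same row format as in the test above, but built into a usable sink rather
    // than handed to CompactFileWriter. The path is a placeholder.
    static StreamingFileSink<RowData> buildSink() {
        return StreamingFileSink.forRowFormat(
                        new Path("/tmp/compact-out"), new SimpleStringEncoder<RowData>())
                .build();
    }
}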

Example 4 with SimpleStringEncoder

Use of org.apache.flink.api.common.serialization.SimpleStringEncoder in project flink by apache.

Class BucketAssignerITCases, method testAssembleBucketPath.

@Test
public void testAssembleBucketPath() throws Exception {
    final File outDir = TEMP_FOLDER.newFolder();
    final Path basePath = new Path(outDir.toURI());
    final long time = 1000L;
    final RollingPolicy<String, String> rollingPolicy = DefaultRollingPolicy.builder().withMaxPartSize(new MemorySize(7L)).build();
    final Buckets<String, String> buckets =
            new Buckets<>(
                    basePath,
                    new BasePathBucketAssigner<>(),
                    new DefaultBucketFactoryImpl<>(),
                    new RowWiseBucketWriter<>(
                            FileSystem.get(basePath.toUri()).createRecoverableWriter(),
                            new SimpleStringEncoder<>()),
                    rollingPolicy,
                    0,
                    OutputFileConfig.builder().build());
    Bucket<String, String> bucket = buckets.onElement("abc", new TestUtils.MockSinkContext(time, time, time));
    Assert.assertEquals(new Path(basePath.toUri()), bucket.getBucketPath());
}
Also used: Path (org.apache.flink.core.fs.Path) MemorySize (org.apache.flink.configuration.MemorySize) SimpleStringEncoder (org.apache.flink.api.common.serialization.SimpleStringEncoder) File (java.io.File) Test (org.junit.Test)

Example 5 with SimpleStringEncoder

Use of org.apache.flink.api.common.serialization.SimpleStringEncoder in project flink by apache.

Class BucketsTest, method testCorrectTimestampPassingInContext.

private void testCorrectTimestampPassingInContext(Long timestamp, long watermark, long processingTime) throws Exception {
    final File outDir = TEMP_FOLDER.newFolder();
    final Path path = new Path(outDir.toURI());
    final Buckets<String, String> buckets =
            new Buckets<>(
                    path,
                    new VerifyingBucketAssigner(timestamp, watermark, processingTime),
                    new DefaultBucketFactoryImpl<>(),
                    new RowWiseBucketWriter<>(
                            FileSystem.get(path.toUri()).createRecoverableWriter(),
                            new SimpleStringEncoder<>()),
                    DefaultRollingPolicy.builder().build(),
                    2,
                    OutputFileConfig.builder().build());
    buckets.onElement("test", new TestUtils.MockSinkContext(timestamp, watermark, processingTime));
}
Also used: Path (org.apache.flink.core.fs.Path) SimpleStringEncoder (org.apache.flink.api.common.serialization.SimpleStringEncoder) File (java.io.File)
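
Both bucket tests wire a SimpleStringEncoder into a RowWiseBucketWriter together with a bucket assigner. Outside the test harness, the usual way to pick an assigner is through the FileSink builder; below is a rough sketch under that assumption, with a placeholder path and a time-based assigner instead of the base-path one exercised above.

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class BucketAssignerSketch {

    // Routes each record into a date/hour-formatted subdirectory of the base path,
    // rather than writing everything into the base path itself.
    static FileSink<String> buildSink() {
        return FileSink.<String>forRowFormat(
                        new Path("/tmp/bucketed-out"), new SimpleStringEncoder<String>())
                .withBucketAssigner(new DateTimeBucketAssigner<String>())
                .build();
    }
}

The rolling policy and output file configuration shown in the earlier examples can be chained onto the same builder.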

Aggregations

SimpleStringEncoder (org.apache.flink.api.common.serialization.SimpleStringEncoder): 7
Path (org.apache.flink.core.fs.Path): 5
StreamExecutionEnvironment (org.apache.flink.streaming.api.environment.StreamExecutionEnvironment): 4
File (java.io.File): 3
Test (org.junit.Test): 3
Tuple2 (org.apache.flink.api.java.tuple.Tuple2): 2
ParameterTool (org.apache.flink.api.java.utils.ParameterTool): 2
MemorySize (org.apache.flink.configuration.MemorySize): 2
Duration (java.time.Duration): 1
WatermarkStrategy (org.apache.flink.api.common.eventtime.WatermarkStrategy): 1
Tuple5 (org.apache.flink.api.java.tuple.Tuple5): 1
FileSink (org.apache.flink.connector.file.sink.FileSink): 1
FileSource (org.apache.flink.connector.file.src.FileSource): 1
TextLineInputFormat (org.apache.flink.connector.file.src.reader.TextLineInputFormat): 1
CoordinatorInput (org.apache.flink.connector.file.table.stream.compact.CompactMessages.CoordinatorInput): 1
EndCheckpoint (org.apache.flink.connector.file.table.stream.compact.CompactMessages.EndCheckpoint): 1
InputFile (org.apache.flink.connector.file.table.stream.compact.CompactMessages.InputFile): 1
EmbeddedRocksDBStateBackend (org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend): 1
HashMapStateBackend (org.apache.flink.runtime.state.hashmap.HashMapStateBackend): 1
DataStream (org.apache.flink.streaming.api.datastream.DataStream): 1