Use of org.apache.flink.connector.file.src.reader.TextLineInputFormat in project flink by apache: class WindowWordCount, method main().
// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
final CLI params = CLI.fromArgs(args);
// Create the execution environment. This is the main entrypoint
// to building a Flink application.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Apache Flink’s unified approach to stream and batch processing means that a DataStream
// application executed over bounded input will produce the same final results regardless
// of the configured execution mode. It is important to note what final means here: a job
// executing in STREAMING mode might produce incremental updates (think upserts in
// a database) while a BATCH job would only produce one final result at the end. The final
// result will be the same if interpreted correctly, but getting there can be different.
//
// The “classic” execution behavior of the DataStream API is called STREAMING execution
// mode. Applications should use streaming execution for unbounded jobs that require
// continuous incremental processing and are expected to stay online indefinitely.
//
// By enabling BATCH execution, we allow Flink to apply additional optimizations that we
// can only do when we know that our input is bounded. For example, different
// join/aggregation strategies can be used, in addition to a different shuffle
// implementation that allows more efficient task scheduling and failure recovery behavior.
//
// By setting the runtime mode to AUTOMATIC, Flink will choose BATCH if all sources
// are bounded and otherwise STREAMING.
env.setRuntimeMode(params.getExecutionMode());
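// (Illustrative note, not part of the original example: the mode can also be pinned
// explicitly instead of being taken from the CLI, e.g.
// env.setRuntimeMode(RuntimeExecutionMode.BATCH); for a one-off bounded run.)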
// This optional step makes the input parameters
// available in the Flink UI.
env.getConfig().setGlobalJobParameters(params);
DataStream<String> text;
if (params.getInputs().isPresent()) {
// Create a new file source that will read files from a given set of directories.
// Each file will be processed as plain text and split based on newlines.
FileSource.FileSourceBuilder<String> builder = FileSource.forRecordStreamFormat(new TextLineInputFormat(), params.getInputs().get());
// If a discovery interval is provided, the source will
// continuously watch the given directories for new files.
params.getDiscoveryInterval().ifPresent(builder::monitorContinuously);
text = env.fromSource(builder.build(), WatermarkStrategy.noWatermarks(), "file-input");
} else {
text = env.fromElements(WordCountData.WORDS).name("in-memory-input");
}
int windowSize = params.getInt("window").orElse(250);
int slideSize = params.getInt("slide").orElse(150);
DataStream<Tuple2<String, Integer>> counts =
        // will output each word as a (2-tuple) containing (word, 1)
        text.flatMap(new WordCount.Tokenizer())
                .name("tokenizer")
                .keyBy(value -> value.f0)
                .countWindow(windowSize, slideSize)
                .sum(1)
                .name("counter");
if (params.getOutput().isPresent()) {
// Given an output directory, Flink will write the results to a file
// using a simple string encoding. In a production environment, this might
// be something more structured like CSV, Avro, JSON, or Parquet.
counts.sinkTo(
                FileSink.<Tuple2<String, Integer>>forRowFormat(
                                params.getOutput().get(), new SimpleStringEncoder<>())
                        .withRollingPolicy(
                                DefaultRollingPolicy.builder()
                                        .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                        .withRolloverInterval(Duration.ofSeconds(10))
                                        .build())
                        .build())
        .name("file-sink");
} else {
counts.print().name("print-sink");
}
// Apache Flink applications are composed lazily. Calling execute
// submits the Job and begins processing.
env.execute("WindowWordCount");
}
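The WordCount.Tokenizer flat-map used above is not shown in this snippet. For reference, the tokenizer in the Flink WordCount example looks roughly like the sketch below (it relies on org.apache.flink.api.common.functions.FlatMapFunction, org.apache.flink.api.java.tuple.Tuple2, and org.apache.flink.util.Collector); treat it as a sketch rather than the verbatim source.
public static final class Tokenizer
        implements FlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize the line and split it into words
        String[] tokens = value.toLowerCase().split("\\W+");
        // emit a (word, 1) pair for every non-empty token
        for (String token : tokens) {
            if (token.length() > 0) {
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}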
Use of org.apache.flink.connector.file.src.reader.TextLineInputFormat in project flink by apache: class WordCount, method main().
// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
final CLI params = CLI.fromArgs(args);
// Create the execution environment. This is the main entrypoint
// to building a Flink application.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Apache Flink’s unified approach to stream and batch processing means that a DataStream
// application executed over bounded input will produce the same final results regardless
// of the configured execution mode. It is important to note what final means here: a job
// executing in STREAMING mode might produce incremental updates (think upserts in
// a database) while in BATCH mode, it would only produce one final result at the end. The
// final result will be the same if interpreted correctly, but getting there can be
// different.
//
// The “classic” execution behavior of the DataStream API is called STREAMING execution
// mode. Applications should use streaming execution for unbounded jobs that require
// continuous incremental processing and are expected to stay online indefinitely.
//
// By enabling BATCH execution, we allow Flink to apply additional optimizations that we
// can only do when we know that our input is bounded. For example, different
// join/aggregation strategies can be used, in addition to a different shuffle
// implementation that allows more efficient task scheduling and failure recovery behavior.
//
// By setting the runtime mode to AUTOMATIC, Flink will choose BATCH if all sources
// are bounded and otherwise STREAMING.
env.setRuntimeMode(params.getExecutionMode());
// This optional step makes the input parameters
// available in the Flink UI.
env.getConfig().setGlobalJobParameters(params);
DataStream<String> text;
if (params.getInputs().isPresent()) {
// Create a new file source that will read files from a given set of directories.
// Each file will be processed as plain text and split based on newlines.
FileSource.FileSourceBuilder<String> builder = FileSource.forRecordStreamFormat(new TextLineInputFormat(), params.getInputs().get());
// If a discovery interval is provided, the source will
// continuously watch the given directories for new files.
params.getDiscoveryInterval().ifPresent(builder::monitorContinuously);
text = env.fromSource(builder.build(), WatermarkStrategy.noWatermarks(), "file-input");
} else {
text = env.fromElements(WordCountData.WORDS).name("in-memory-input");
}
DataStream<Tuple2<String, Integer>> counts =
        // will output each word as a (2-tuple) containing (word, 1)
        text.flatMap(new Tokenizer())
                .name("tokenizer")
                .keyBy(value -> value.f0)
                .sum(1)
                .name("counter");
if (params.getOutput().isPresent()) {
// Given an output directory, Flink will write the results to a file
// using a simple string encoding. In a production environment, this might
// be something more structured like CSV, Avro, JSON, or Parquet.
counts.sinkTo(
                FileSink.<Tuple2<String, Integer>>forRowFormat(
                                params.getOutput().get(), new SimpleStringEncoder<>())
                        .withRollingPolicy(
                                DefaultRollingPolicy.builder()
                                        .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                        .withRolloverInterval(Duration.ofSeconds(10))
                                        .build())
                        .build())
        .name("file-sink");
} else {
counts.print().name("print-sink");
}
// Apache Flink applications are composed lazily. Calling execute
// submits the Job and begins processing.
env.execute("WordCount");
}
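As the comment above notes, SimpleStringEncoder is only the simplest possible row encoding. Below is a hedged sketch of a hand-rolled alternative that writes "word,count" CSV lines; the class name and format are illustrative and not part of the Flink examples. It implements org.apache.flink.api.common.serialization.Encoder and would be passed to FileSink.forRowFormat(...) in place of the SimpleStringEncoder above.
public static final class CsvWordCountEncoder implements Encoder<Tuple2<String, Integer>> {
    private static final long serialVersionUID = 1L;

    @Override
    public void encode(Tuple2<String, Integer> element, OutputStream stream) throws IOException {
        // one "word,count" line per record
        stream.write((element.f0 + "," + element.f1 + "\n").getBytes(StandardCharsets.UTF_8));
    }
}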
Use of org.apache.flink.connector.file.src.reader.TextLineInputFormat in project flink by apache: class TopSpeedWindowing, method main().
// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
final CLI params = CLI.fromArgs(args);
// Create the execution environment. This is the main entrypoint
// to building a Flink application.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Apache Flink’s unified approach to stream and batch processing means that a DataStream
// application executed over bounded input will produce the same final results regardless
// of the configured execution mode. It is important to note what final means here: a job
// executing in STREAMING mode might produce incremental updates (think upserts in
// a database) while a BATCH job would only produce one final result at the end. The final
// result will be the same if interpreted correctly, but getting there can be different.
//
// The “classic” execution behavior of the DataStream API is called STREAMING execution
// mode. Applications should use streaming execution for unbounded jobs that require
// continuous incremental processing and are expected to stay online indefinitely.
//
// By enabling BATCH execution, we allow Flink to apply additional optimizations that we
// can only do when we know that our input is bounded. For example, different
// join/aggregation strategies can be used, in addition to a different shuffle
// implementation that allows more efficient task scheduling and failure recovery behavior.
//
// By setting the runtime mode to AUTOMATIC, Flink will choose BATCH if all sources
// are bounded and otherwise STREAMING.
env.setRuntimeMode(params.getExecutionMode());
// This optional step makes the input parameters
// available in the Flink UI.
env.getConfig().setGlobalJobParameters(params);
DataStream<Tuple4<Integer, Integer, Double, Long>> carData;
if (params.getInputs().isPresent()) {
// Create a new file source that will read files from a given set of directories.
// Each file will be processed as plain text and split based on newlines.
FileSource.FileSourceBuilder<String> builder = FileSource.forRecordStreamFormat(new TextLineInputFormat(), params.getInputs().get());
// If a discovery interval is provided, the source will
// continuously watch the given directories for new files.
params.getDiscoveryInterval().ifPresent(builder::monitorContinuously);
carData =
        env.fromSource(builder.build(), WatermarkStrategy.noWatermarks(), "file-input")
                .map(new ParseCarData())
                .name("parse-input");
} else {
carData = env.addSource(CarSource.create(2)).name("in-memory-source");
}
int evictionSec = 10;
double triggerMeters = 50;
DataStream<Tuple4<Integer, Integer, Double, Long>> topSpeeds =
        carData.assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple4<Integer, Integer, Double, Long>>forMonotonousTimestamps()
                                .withTimestampAssigner((car, ts) -> car.f3))
                .keyBy(value -> value.f0)
                .window(GlobalWindows.create())
                .evictor(TimeEvictor.of(Time.of(evictionSec, TimeUnit.SECONDS)))
                .trigger(
                        DeltaTrigger.of(
                                triggerMeters,
                                new DeltaFunction<Tuple4<Integer, Integer, Double, Long>>() {
                                    private static final long serialVersionUID = 1L;

                                    @Override
                                    public double getDelta(
                                            Tuple4<Integer, Integer, Double, Long> oldDataPoint,
                                            Tuple4<Integer, Integer, Double, Long> newDataPoint) {
                                        return newDataPoint.f2 - oldDataPoint.f2;
                                    }
                                },
                                carData.getType().createSerializer(env.getConfig())))
                .maxBy(1);
if (params.getOutput().isPresent()) {
// Given an output directory, Flink will write the results to a file
// using a simple string encoding. In a production environment, this might
// be something more structured like CSV, Avro, JSON, or Parquet.
topSpeeds.sinkTo(
                FileSink.<Tuple4<Integer, Integer, Double, Long>>forRowFormat(
                                params.getOutput().get(), new SimpleStringEncoder<>())
                        .withRollingPolicy(
                                DefaultRollingPolicy.builder()
                                        .withMaxPartSize(MemorySize.ofMebiBytes(1))
                                        .withRolloverInterval(Duration.ofSeconds(10))
                                        .build())
                        .build())
        .name("file-sink");
} else {
topSpeeds.print();
}
env.execute("CarTopSpeedWindowingExample");
}
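The ParseCarData mapper and CarSource referenced above are defined elsewhere in the example and are not shown here. A hedged sketch of what the mapper does follows (extending org.apache.flink.api.common.functions.RichMapFunction), assuming each input line is the textual form of a Tuple4 such as "(carId,speed,distance,timestamp)"; treat the parsing details as assumptions rather than the verbatim source.
private static class ParseCarData
        extends RichMapFunction<String, Tuple4<Integer, Integer, Double, Long>> {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple4<Integer, Integer, Double, Long> map(String record) {
        // strip the surrounding "(" and ")" and split the comma-separated fields
        String rawData = record.substring(1, record.length() - 1);
        String[] data = rawData.split(",");
        return new Tuple4<>(
                Integer.valueOf(data[0]),
                Integer.valueOf(data[1]),
                Double.valueOf(data[2]),
                Long.valueOf(data[3]));
    }
}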
Use of org.apache.flink.connector.file.src.reader.TextLineInputFormat in project flink by apache: class LimitableBulkFormatTest, method testLimitOverBatches().
@Test
public void testLimitOverBatches() throws IOException {
// set limit
Long limit = 2048L;
// configuration for small batches
Configuration conf = new Configuration();
conf.set(StreamFormat.FETCH_IO_SIZE, MemorySize.parse("4k"));
// read
BulkFormat<String, FileSourceSplit> format =
        LimitableBulkFormat.create(new StreamFormatAdapter<>(new TextLineInputFormat()), limit);
BulkFormat.Reader<String> reader =
        format.createReader(
                conf,
                new FileSourceSplit(
                        "id", new Path(file.toURI()), 0, file.length(),
                        file.lastModified(), file.length()));
// check
AtomicInteger i = new AtomicInteger(0);
Utils.forEachRemaining(reader, s -> i.incrementAndGet());
Assert.assertEquals(limit.intValue(), i.get());
}
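The file used to build the FileSourceSplit above is a fixture prepared elsewhere in the test class and is not part of this snippet. A hypothetical setup that would make the limit meaningful is sketched below; the rule, field names, and line count are assumptions, not the actual test fixture.
@ClassRule public static final TemporaryFolder TEMP_FOLDER = new TemporaryFolder();

private static File file;

@BeforeClass
public static void prepareFile() throws IOException {
    // write far more records than the 2048-record limit so that the limit,
    // not the end of the file, is what stops the reader
    file = TEMP_FOLDER.newFile();
    try (PrintWriter writer = new PrintWriter(new FileWriter(file))) {
        for (int i = 0; i < 100_000; i++) {
            writer.println("line-" + i);
        }
    }
}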
Use of org.apache.flink.connector.file.src.reader.TextLineInputFormat in project flink by apache: class FileSourceTextLinesITCase, method testBoundedTextFileSource().
private void testBoundedTextFileSource(FailoverType failoverType) throws Exception {
final File testDir = TMP_FOLDER.newFolder();
// our main test data
writeAllFiles(testDir);
// write some junk to hidden files to test that common hidden file patterns
// are filtered by default
writeHiddenJunkFiles(testDir);
final FileSource<String> source = FileSource.forRecordStreamFormat(new TextLineInputFormat(), Path.fromLocalFile(testDir)).build();
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(PARALLELISM);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1, 0));
final DataStream<String> stream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
final DataStream<String> streamFailingInTheMiddleOfReading =
        RecordCounterToFail.wrapWithFailureAfter(stream, LINES.length / 2);
final ClientAndIterator<String> client =
        DataStreamUtils.collectWithClient(
                streamFailingInTheMiddleOfReading, "Bounded TextFiles Test");
final JobID jobId = client.client.getJobID();
RecordCounterToFail.waitToFail();
triggerFailover(failoverType, jobId, RecordCounterToFail::continueProcessing, miniClusterResource.getMiniCluster());
final List<String> result = new ArrayList<>();
while (client.iterator.hasNext()) {
result.add(client.iterator.next());
}
verifyResult(result);
}
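The writeAllFiles and writeHiddenJunkFiles helpers referenced above are defined elsewhere in the test class. A hedged sketch of what the hidden-junk helper could look like follows; it relies on the file source's default behavior of skipping files whose names start with '.' or '_', while the concrete file names and contents are assumptions.
private static void writeHiddenJunkFiles(File testDir) throws IOException {
    for (String name : new String[] {".junk", "_junk", "_SUCCESS"}) {
        // none of these lines should ever show up in the verified result
        Files.write(
                new File(testDir, name).toPath(),
                Collections.singletonList("dummy line that must be filtered out"),
                StandardCharsets.UTF_8);
    }
}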