Examples with DataSource - org.apache.flink.api.java.operators.DataSource

Example 1 with DataSource

use of org.apache.flink.api.java.operators.DataSource in project flink by apache.

the class ExecutionEnvironment method readTextFileWithValue.

/**
	 * Creates a {@link DataSet} that represents the Strings produced by reading the given file line wise.
	 * This method is similar to {@link #readTextFile(String, String)}, but it produces a DataSet with mutable
	 * {@link StringValue} objects, rather than Java Strings. StringValues can be used to tune implementations
	 * to be less object and garbage collection heavy.
	 * <p>
	 * The {@link java.nio.charset.Charset} with the given name will be used to read the files.
	 *
	 * @param filePath The path of the file, as a URI (e.g., "file:///some/local/file" or "hdfs://host:port/file/path").
	 * @param charsetName The name of the character set used to read the file.
	 * @param skipInvalidLines A flag to indicate whether to skip lines that cannot be read with the given character set.
	 *
	 * @return A DataSet that represents the data read from the given file as text lines.
	 */
public DataSource<StringValue> readTextFileWithValue(String filePath, String charsetName, boolean skipInvalidLines) {
    Preconditions.checkNotNull(filePath, "The file path may not be null.");
    TextValueInputFormat format = new TextValueInputFormat(new Path(filePath));
    format.setCharsetName(charsetName);
    format.setSkipInvalidLines(skipInvalidLines);
    return new DataSource<>(this, format, new ValueTypeInfo<>(StringValue.class), Utils.getCallLocationName());
}

Also used : Path(org.apache.flink.core.fs.Path) TextValueInputFormat(org.apache.flink.api.java.io.TextValueInputFormat) StringValue(org.apache.flink.types.StringValue) DataSource(org.apache.flink.api.java.operators.DataSource)

Example 2 with DataSource

use of org.apache.flink.api.java.operators.DataSource in project flink by apache.

the class DistCp method main.

public static void main(String[] args) throws Exception {
    // set up the execution environment
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    ParameterTool params = ParameterTool.fromArgs(args);
    if (!params.has("input") || !params.has("output")) {
        System.err.println("Usage: --input <path> --output <path> [--parallelism <n>]");
        return;
    }
    final Path sourcePath = new Path(params.get("input"));
    final Path targetPath = new Path(params.get("output"));
    if (!isLocal(env) && !(isOnDistributedFS(sourcePath) && isOnDistributedFS(targetPath))) {
        System.out.println("In a distributed mode only HDFS input/output paths are supported");
        return;
    }
    final int parallelism = params.getInt("parallelism", 10);
    if (parallelism <= 0) {
        System.err.println("Parallelism should be greater than 0");
        return;
    }
    // make parameters available in the web interface
    env.getConfig().setGlobalJobParameters(params);
    env.setParallelism(parallelism);
    long startTime = System.currentTimeMillis();
    LOGGER.info("Initializing copy tasks");
    List<FileCopyTask> tasks = getCopyTasks(sourcePath);
    LOGGER.info("Copy task initialization took " + (System.currentTimeMillis() - startTime) + "ms");
    DataSet<FileCopyTask> inputTasks = new DataSource<>(env, new FileCopyTaskInputFormat(tasks), new GenericTypeInfo<>(FileCopyTask.class), "fileCopyTasks");
    FlatMapOperator<FileCopyTask, Object> res = inputTasks.flatMap(new RichFlatMapFunction<FileCopyTask, Object>() {

        private static final long serialVersionUID = 1109254230243989929L;

        private LongCounter fileCounter;

        private LongCounter bytesCounter;

        @Override
        public void open(Configuration parameters) throws Exception {
            bytesCounter = getRuntimeContext().getLongCounter(BYTES_COPIED_CNT_NAME);
            fileCounter = getRuntimeContext().getLongCounter(FILES_COPIED_CNT_NAME);
        }

        @Override
        public void flatMap(FileCopyTask task, Collector<Object> out) throws Exception {
            LOGGER.info("Processing task: " + task);
            Path outPath = new Path(targetPath, task.getRelativePath());
            FileSystem targetFs = targetPath.getFileSystem();
            // creating parent folders in case of a local FS
            if (!targetFs.isDistributedFS()) {
                //dealing with cases like file:///tmp or just /tmp
                File outFile = outPath.toUri().isAbsolute() ? new File(outPath.toUri()) : new File(outPath.toString());
                File parentFile = outFile.getParentFile();
                if (!parentFile.mkdirs() && !parentFile.exists()) {
                    throw new RuntimeException("Cannot create local file system directories: " + parentFile);
                }
            }
            FSDataOutputStream outputStream = null;
            FSDataInputStream inputStream = null;
            try {
                outputStream = targetFs.create(outPath, true);
                inputStream = task.getPath().getFileSystem().open(task.getPath());
                int bytes = IOUtils.copy(inputStream, outputStream);
                bytesCounter.add(bytes);
            } finally {
                IOUtils.closeQuietly(inputStream);
                IOUtils.closeQuietly(outputStream);
            }
            fileCounter.add(1l);
        }
    });
    // no data sinks are needed, therefore just printing an empty result
    res.print();
    Map<String, Object> accumulators = env.getLastJobExecutionResult().getAllAccumulatorResults();
    LOGGER.info("== COUNTERS ==");
    for (Map.Entry<String, Object> e : accumulators.entrySet()) {
        LOGGER.info(e.getKey() + ": " + e.getValue());
    }
}

Also used : ParameterTool(org.apache.flink.api.java.utils.ParameterTool) ExecutionEnvironment(org.apache.flink.api.java.ExecutionEnvironment) Configuration(org.apache.flink.configuration.Configuration) LongCounter(org.apache.flink.api.common.accumulators.LongCounter) FileSystem(org.apache.flink.core.fs.FileSystem) FSDataOutputStream(org.apache.flink.core.fs.FSDataOutputStream) Path(org.apache.flink.core.fs.Path) IOException(java.io.IOException) DataSource(org.apache.flink.api.java.operators.DataSource) FSDataInputStream(org.apache.flink.core.fs.FSDataInputStream) File(java.io.File) Map(java.util.Map)

Example 3 with DataSource

use of org.apache.flink.api.java.operators.DataSource in project flink by apache.

the class ExecutionEnvironment method readTextFile.

/**
	 * Creates a {@link DataSet} that represents the Strings produced by reading the given file line wise.
	 * The {@link java.nio.charset.Charset} with the given name will be used to read the files.
	 *
	 * @param filePath The path of the file, as a URI (e.g., "file:///some/local/file" or "hdfs://host:port/file/path").
	 * @param charsetName The name of the character set used to read the file.
	 * @return A {@link DataSet} that represents the data read from the given file as text lines.
	 */
public DataSource<String> readTextFile(String filePath, String charsetName) {
    Preconditions.checkNotNull(filePath, "The file path may not be null.");
    TextInputFormat format = new TextInputFormat(new Path(filePath));
    format.setCharsetName(charsetName);
    return new DataSource<>(this, format, BasicTypeInfo.STRING_TYPE_INFO, Utils.getCallLocationName());
}

Also used : Path(org.apache.flink.core.fs.Path) TextInputFormat(org.apache.flink.api.java.io.TextInputFormat) DataSource(org.apache.flink.api.java.operators.DataSource)

Aggregations

DataSource (org.apache.flink.api.java.operators.DataSource)3 Path (org.apache.flink.core.fs.Path)3 File (java.io.File)1 IOException (java.io.IOException)1 Map (java.util.Map)1 LongCounter (org.apache.flink.api.common.accumulators.LongCounter)1 ExecutionEnvironment (org.apache.flink.api.java.ExecutionEnvironment)1 TextInputFormat (org.apache.flink.api.java.io.TextInputFormat)1 TextValueInputFormat (org.apache.flink.api.java.io.TextValueInputFormat)1 ParameterTool (org.apache.flink.api.java.utils.ParameterTool)1 Configuration (org.apache.flink.configuration.Configuration)1 FSDataInputStream (org.apache.flink.core.fs.FSDataInputStream)1 FSDataOutputStream (org.apache.flink.core.fs.FSDataOutputStream)1 FileSystem (org.apache.flink.core.fs.FileSystem)1 StringValue (org.apache.flink.types.StringValue)1