Example 1 with MatchResult

use of org.apache.beam.sdk.io.fs.MatchResult in project beam by apache.

the class HadoopFileSystem method match.

@Override
protected List<MatchResult> match(List<String> specs) {
    ImmutableList.Builder<MatchResult> resultsBuilder = ImmutableList.builder();
    for (String spec : specs) {
        try {
            FileStatus[] fileStatuses = fileSystem.globStatus(new Path(spec));
            if (fileStatuses == null) {
                resultsBuilder.add(MatchResult.create(Status.NOT_FOUND, Collections.<Metadata>emptyList()));
                continue;
            }
            List<Metadata> metadata = new ArrayList<>();
            for (FileStatus fileStatus : fileStatuses) {
                if (fileStatus.isFile()) {
                    URI uri = dropEmptyAuthority(fileStatus.getPath().toUri().toString());
                    metadata.add(Metadata.builder().setResourceId(new HadoopResourceId(uri)).setIsReadSeekEfficient(true).setSizeBytes(fileStatus.getLen()).build());
                }
            }
            resultsBuilder.add(MatchResult.create(Status.OK, metadata));
        } catch (IOException e) {
            resultsBuilder.add(MatchResult.create(Status.ERROR, e));
        }
    }
    return resultsBuilder.build();
}
Also used : Path(org.apache.hadoop.fs.Path) FileStatus(org.apache.hadoop.fs.FileStatus) ImmutableList(com.google.common.collect.ImmutableList) Metadata(org.apache.beam.sdk.io.fs.MatchResult.Metadata) ArrayList(java.util.ArrayList) IOException(java.io.IOException) MatchResult(org.apache.beam.sdk.io.fs.MatchResult) URI(java.net.URI)
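
The method above reports one result per spec and records an IOException as an ERROR result instead of letting it abort the whole batch. A minimal stand-alone sketch of that pattern follows. It does not use Beam or Hadoop: the Status enum, the Lister interface and the class name are hypothetical stand-ins, and a null return models globStatus finding nothing.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class PerSpecResults {

    // Hypothetical stand-in for MatchResult.Status.
    enum Status { OK, NOT_FOUND, ERROR }

    // Hypothetical stand-in for fileSystem.globStatus; returns null when nothing matches.
    interface Lister {
        List<String> list(String spec) throws IOException;
    }

    // Produces exactly one status per spec, in input order.
    static List<Status> matchAll(List<String> specs, Lister lister) {
        List<Status> results = new ArrayList<>();
        for (String spec : specs) {
            try {
                List<String> found = lister.list(spec);
                results.add(found == null ? Status.NOT_FOUND : Status.OK);
            } catch (IOException e) {
                // A failure for one spec is recorded and the loop moves on.
                results.add(Status.ERROR);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Lister lister = spec -> {
            if (spec.equals("broken")) {
                throw new IOException("simulated failure");
            }
            return spec.equals("missing") ? null : Collections.singletonList(spec);
        };
        // prints [OK, NOT_FOUND, ERROR]
        System.out.println(matchAll(Arrays.asList("a.txt", "missing", "broken"), lister));
    }
}
```

Because the result list has the same length and order as the input, a caller can pair each spec with its outcome by index.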

Example 2 with MatchResult

use of org.apache.beam.sdk.io.fs.MatchResult in project beam by apache.

the class GcsFileSystem method match.

@Override
protected List<MatchResult> match(List<String> specs) throws IOException {
    List<GcsPath> gcsPaths = toGcsPaths(specs);
    List<GcsPath> globs = Lists.newArrayList();
    List<GcsPath> nonGlobs = Lists.newArrayList();
    List<Boolean> isGlobBooleans = Lists.newArrayList();
    for (GcsPath path : gcsPaths) {
        if (GcsUtil.isWildcard(path)) {
            globs.add(path);
            isGlobBooleans.add(true);
        } else {
            nonGlobs.add(path);
            isGlobBooleans.add(false);
        }
    }
    Iterator<MatchResult> globsMatchResults = matchGlobs(globs).iterator();
    Iterator<MatchResult> nonGlobsMatchResults = matchNonGlobs(nonGlobs).iterator();
    ImmutableList.Builder<MatchResult> ret = ImmutableList.builder();
    for (Boolean isGlob : isGlobBooleans) {
        if (isGlob) {
            checkState(globsMatchResults.hasNext(), "Expect globsMatchResults has next.");
            ret.add(globsMatchResults.next());
        } else {
            checkState(nonGlobsMatchResults.hasNext(), "Expect nonGlobsMatchResults has next.");
            ret.add(nonGlobsMatchResults.next());
        }
    }
    checkState(!globsMatchResults.hasNext(), "Expect no more elements in globsMatchResults.");
    checkState(!nonGlobsMatchResults.hasNext(), "Expect no more elements in nonGlobsMatchResults.");
    return ret.build();
}
Also used : ImmutableList(com.google.common.collect.ImmutableList) GcsPath(org.apache.beam.sdk.util.gcsfs.GcsPath) MatchResult(org.apache.beam.sdk.io.fs.MatchResult)
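
The method above splits the specs into two batches, resolves each batch separately, then uses the recorded flags to interleave the results back into the original order. A minimal stand-alone sketch of that split-and-merge step follows, using plain strings instead of MatchResult. The isGlob and resolve helpers are hypothetical stand-ins for GcsUtil.isWildcard and matchGlobs/matchNonGlobs, not Beam APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class OrderPreservingMerge {

    // Hypothetical stand-in for GcsUtil.isWildcard: treats '*' and '?' as glob characters.
    static boolean isGlob(String spec) {
        return spec.indexOf('*') >= 0 || spec.indexOf('?') >= 0;
    }

    // Hypothetical stand-in for matchGlobs/matchNonGlobs: tags every spec in the batch.
    static List<String> resolve(String tag, List<String> batch) {
        List<String> out = new ArrayList<>();
        for (String spec : batch) {
            out.add(tag + ":" + spec);
        }
        return out;
    }

    static List<String> matchAll(List<String> specs) {
        List<String> globs = new ArrayList<>();
        List<String> nonGlobs = new ArrayList<>();
        List<Boolean> isGlobFlags = new ArrayList<>();
        for (String spec : specs) {
            boolean glob = isGlob(spec);
            (glob ? globs : nonGlobs).add(spec);
            isGlobFlags.add(glob); // remembers which batch each position went to
        }
        Iterator<String> globResults = resolve("glob", globs).iterator();
        Iterator<String> nonGlobResults = resolve("file", nonGlobs).iterator();
        List<String> merged = new ArrayList<>();
        for (boolean glob : isGlobFlags) {
            // Pull the next result from the batch this position was sent to.
            merged.add(glob ? globResults.next() : nonGlobResults.next());
        }
        if (globResults.hasNext() || nonGlobResults.hasNext()) {
            throw new IllegalStateException("Unconsumed results after merge");
        }
        return merged;
    }

    public static void main(String[] args) {
        // prints [file:gs://b/a.txt, glob:gs://b/*.csv, file:gs://b/c.txt]
        System.out.println(matchAll(Arrays.asList("gs://b/a.txt", "gs://b/*.csv", "gs://b/c.txt")));
    }
}
```

The merge relies on each batch resolver returning exactly one result per input, in input order; the final check mirrors the checkState calls in the original method.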

Example 3 with MatchResult

use of org.apache.beam.sdk.io.fs.MatchResult in project beam by apache.

the class FileSystems method matchSingleFileSpec.

/**
 * Returns the {@link Metadata} for a single file resource. Expects a resource specification
 * {@code spec} that matches a single result.
 *
 * @param spec a resource specification that matches exactly one result.
 * @return the {@link Metadata} for the specified resource.
 * @throws FileNotFoundException if the file resource is not found.
 * @throws IOException in the event of an error in the inner call to {@link #match},
 * or if the given spec does not match exactly 1 result.
 */
public static Metadata matchSingleFileSpec(String spec) throws IOException {
    List<MatchResult> matches = FileSystems.match(Collections.singletonList(spec));
    MatchResult matchResult = Iterables.getOnlyElement(matches);
    if (matchResult.status() == Status.NOT_FOUND) {
        throw new FileNotFoundException(String.format("File spec %s not found", spec));
    } else if (matchResult.status() != Status.OK) {
        throw new IOException(String.format("Error matching file spec %s: status %s", spec, matchResult.status()));
    } else {
        List<Metadata> metadata = matchResult.metadata();
        if (metadata.size() != 1) {
            throw new IOException(String.format("Expecting spec %s to match exactly one file, but matched %s: %s", spec, metadata.size(), metadata));
        }
        return metadata.get(0);
    }
}
Also used : FileNotFoundException(java.io.FileNotFoundException) ArrayList(java.util.ArrayList) List(java.util.List) IOException(java.io.IOException) MatchResult(org.apache.beam.sdk.io.fs.MatchResult)
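
The contract above distinguishes "not found" (FileNotFoundException) from "not exactly one match" (IOException). Since FileNotFoundException is a subclass of IOException, a caller that wants to tell the two apart has to catch the more specific type first. A minimal stand-alone sketch follows. The requireSingle and describe helpers are hypothetical, plain strings stand in for Metadata, and a null list models Status.NOT_FOUND.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SingleMatchContract {

    // Hypothetical stand-in for matchSingleFileSpec; a null list models Status.NOT_FOUND.
    static String requireSingle(String spec, List<String> matched) throws IOException {
        if (matched == null) {
            throw new FileNotFoundException(String.format("File spec %s not found", spec));
        }
        if (matched.size() != 1) {
            throw new IOException(String.format(
                "Expecting spec %s to match exactly one file, but matched %s: %s",
                spec, matched.size(), matched));
        }
        return matched.get(0);
    }

    // Caller-side handling: the subclass exception is caught before IOException.
    static String describe(String spec, List<String> matched) {
        try {
            return "found " + requireSingle(spec, matched);
        } catch (FileNotFoundException e) {
            return "missing";
        } catch (IOException e) {
            return "not exactly one";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("a.txt", Collections.singletonList("a.txt"))); // prints found a.txt
        System.out.println(describe("b.txt", null));                               // prints missing
        System.out.println(describe("*.txt", Arrays.asList("a.txt", "c.txt")));    // prints not exactly one
    }
}
```

Swapping the two catch clauses would not compile, because the FileNotFoundException clause would be unreachable.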

Example 4 with MatchResult

use of org.apache.beam.sdk.io.fs.MatchResult in project beam by apache.

the class FileBasedSource method getEstimatedSizeBytes.

@Override
public final long getEstimatedSizeBytes(PipelineOptions options) throws IOException {
    // This implementation of method getEstimatedSizeBytes is provided to simplify subclasses. Here
    // we perform the size estimation of files and file patterns using the interface provided by
    // FileSystem.
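    // Guava's Preconditions substitutes only %s placeholders; "{}" is SLF4J syntax and would be printed literally.
    checkState(fileOrPatternSpec.isAccessible(), "Cannot estimate size of a FileBasedSource with inaccessible file pattern: %s.", fileOrPatternSpec);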
    String fileOrPattern = fileOrPatternSpec.get();
    if (mode == Mode.FILEPATTERN) {
        long totalSize = 0;
        List<MatchResult> inputs = FileSystems.match(Collections.singletonList(fileOrPattern));
        MatchResult result = Iterables.getOnlyElement(inputs);
        checkArgument(result.status() == Status.OK, "Error matching the pattern or glob %s: status %s", fileOrPattern, result.status());
        List<Metadata> allMatches = result.metadata();
        for (Metadata metadata : allMatches) {
            totalSize += metadata.sizeBytes();
        }
        LOG.info("Filepattern {} matched {} files with total size {}", fileOrPattern, allMatches.size(), totalSize);
        return totalSize;
    } else {
        long start = getStartOffset();
        long end = Math.min(getEndOffset(), getMaxEndOffset(options));
        return end - start;
    }
}
Also used : Metadata(org.apache.beam.sdk.io.fs.MatchResult.Metadata) MatchResult(org.apache.beam.sdk.io.fs.MatchResult)

Example 5 with MatchResult

use of org.apache.beam.sdk.io.fs.MatchResult in project beam by apache.

the class FileBasedSource method expandFilePattern.

private static List<Metadata> expandFilePattern(String fileOrPatternSpec) throws IOException {
    MatchResult matches = Iterables.getOnlyElement(FileSystems.match(Collections.singletonList(fileOrPatternSpec)));
    LOG.info("Matched {} files for pattern {}", matches.metadata().size(), fileOrPatternSpec);
    return ImmutableList.copyOf(matches.metadata());
}
Also used : MatchResult(org.apache.beam.sdk.io.fs.MatchResult)

Aggregations

MatchResult (org.apache.beam.sdk.io.fs.MatchResult) 10
ImmutableList (com.google.common.collect.ImmutableList) 4
ArrayList (java.util.ArrayList) 4
File (java.io.File) 3
Metadata (org.apache.beam.sdk.io.fs.MatchResult.Metadata) 3
Test (org.junit.Test) 3
FileNotFoundException (java.io.FileNotFoundException) 2
IOException (java.io.IOException) 2
List (java.util.List) 2
GcsPath (org.apache.beam.sdk.util.gcsfs.GcsPath) 2
Objects (com.google.api.services.storage.model.Objects) 1
StorageObject (com.google.api.services.storage.model.StorageObject) 1
VisibleForTesting (com.google.common.annotations.VisibleForTesting) 1
BufferedReader (java.io.BufferedReader) 1
FileReader (java.io.FileReader) 1
URI (java.net.URI) 1
StorageObjectOrIOException (org.apache.beam.sdk.util.GcsUtil.StorageObjectOrIOException) 1
FileStatus (org.apache.hadoop.fs.FileStatus) 1
Path (org.apache.hadoop.fs.Path) 1
Matchers.anyString (org.mockito.Matchers.anyString) 1