
Example 11 with NumberedShardedFile

Use of org.apache.beam.sdk.util.NumberedShardedFile in project beam by apache.

In the class TfIdfIT, method testE2ETfIdf.

@Test
public void testE2ETfIdf() throws Exception {
    TfIdfITOptions options = TestPipeline.testingPipelineOptions().as(TfIdfITOptions.class);
    options.setInput(DEFAULT_INPUT);
    options.setOutput(
        FileSystems.matchNewResource(options.getTempRoot(), true)
            .resolve(String.format("TfIdfIT-%tF-%<tH-%<tM-%<tS-%<tL", new Date()), StandardResolveOptions.RESOLVE_DIRECTORY)
            .resolve("output", StandardResolveOptions.RESOLVE_DIRECTORY)
            .resolve("results", StandardResolveOptions.RESOLVE_FILE)
            .toString());
    TfIdf.runTfIdf(options);
    assertThat(
        new NumberedShardedFile(options.getOutput() + "*-of-*.csv", DEFAULT_SHARD_TEMPLATE),
        fileContentsHaveChecksum(EXPECTED_OUTPUT_CHECKSUM));
}
Also used: NumberedShardedFile (org.apache.beam.sdk.util.NumberedShardedFile), Date (java.util.Date), Test (org.junit.Test)
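
The output directory name in testE2ETfIdf is built with java.util.Formatter date conversions. A minimal sketch of what that format string produces, using a fixed, made-up date instead of new Date(). The class and method names are illustrative and not part of Beam, and Locale.ROOT is added here only to keep the output reproducible:

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

public class TimestampedNameSketch {
    // Hypothetical helper: formats a directory name with the same format string as the test.
    // %tF is the ISO date (yyyy-MM-dd); each %<t... conversion reuses the previous argument
    // for hour (H), minute (M), second (S) and millisecond (L).
    static String timestampedName(Date date) {
        return String.format(Locale.ROOT, "TfIdfIT-%tF-%<tH-%<tM-%<tS-%<tL", date);
    }

    public static void main(String[] args) {
        // Fixed example date in the default time zone: 2020-01-02 03:04:05.000
        Date fixed = new GregorianCalendar(2020, Calendar.JANUARY, 2, 3, 4, 5).getTime();
        System.out.println(timestampedName(fixed)); // prints TfIdfIT-2020-01-02-03-04-05-000
    }
}
```

Because the name includes milliseconds, repeated test runs are unlikely to write into the same directory.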

Example 12 with NumberedShardedFile

Use of org.apache.beam.sdk.util.NumberedShardedFile in project beam by apache.

In the class WindowedWordCountIT, method testWindowedWordCountPipeline.

private void testWindowedWordCountPipeline(WindowedWordCountITOptions options) throws Exception {
    ResourceId output = FileBasedSink.convertToFileResourceIfPossible(options.getOutput());
    PerWindowFiles filenamePolicy = new PerWindowFiles(output);
    List<ShardedFile> expectedOutputFiles = Lists.newArrayListWithCapacity(6);
    for (int startMinute : ImmutableList.of(0, 10, 20, 30, 40, 50)) {
        final Instant windowStart =
            new Instant(options.getMinTimestampMillis()).plus(Duration.standardMinutes(startMinute));
        String filePrefix =
            filenamePolicy.filenamePrefixForWindow(
                new IntervalWindow(windowStart, windowStart.plus(Duration.standardMinutes(10))));
        expectedOutputFiles.add(
            new NumberedShardedFile(
                output.getCurrentDirectory().resolve(filePrefix, StandardResolveOptions.RESOLVE_FILE).toString() + "*"));
    }
    ShardedFile inputFile = new ExplicitShardedFile(Collections.singleton(options.getInputFile()));
    // For this integration test, input is tiny and we can build the expected counts
    SortedMap<String, Long> expectedWordCounts = new TreeMap<>();
    for (String line : inputFile.readFilesWithRetries(Sleeper.DEFAULT, BACK_OFF_FACTORY.backoff())) {
        String[] words = line.split(ExampleUtils.TOKENIZER_PATTERN, -1);
        for (String word : words) {
            if (!word.isEmpty()) {
                expectedWordCounts.put(word, MoreObjects.firstNonNull(expectedWordCounts.get(word), 0L) + 1L);
            }
        }
    }
    WindowedWordCount.runWindowedWordCount(options);
    assertThat(expectedOutputFiles, containsWordCounts(expectedWordCounts));
}
Also used: ExplicitShardedFile (org.apache.beam.sdk.util.ExplicitShardedFile), ShardedFile (org.apache.beam.sdk.util.ShardedFile), NumberedShardedFile (org.apache.beam.sdk.util.NumberedShardedFile), Instant (org.joda.time.Instant), PerWindowFiles (org.apache.beam.examples.common.WriteOneFilePerWindow.PerWindowFiles), TreeMap (java.util.TreeMap), ResourceId (org.apache.beam.sdk.io.fs.ResourceId), IntervalWindow (org.apache.beam.sdk.transforms.windowing.IntervalWindow)
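
The expected-counts loop in testWindowedWordCountPipeline can be tried without Beam or Guava. This is a sketch under assumptions: the tokenizer pattern below (split on runs of non-letters) stands in for ExampleUtils.TOKENIZER_PATTERN, whose value is not shown on this page, and Map.merge replaces the MoreObjects.firstNonNull idiom. The class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class ExpectedCountsSketch {
    // Assumption: split on runs of non-letter characters.
    static final String TOKENIZER_PATTERN = "[^\\p{L}]+";

    // Counts non-empty tokens across all lines, keyed in sorted order.
    static SortedMap<String, Long> countWords(List<String> lines) {
        SortedMap<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // Limit -1 keeps trailing empty strings, so they are filtered explicitly.
            for (String word : line.split(TOKENIZER_PATTERN, -1)) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("to be, or not", "to be.")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```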

Example 13 with NumberedShardedFile

Use of org.apache.beam.sdk.util.NumberedShardedFile in project beam by apache.

In the class FileChecksumMatcherTest, method testMatcherThatVerifiesMultipleFiles.

@Test
public void testMatcherThatVerifiesMultipleFiles() throws IOException {
    // TODO: Java core test failing on windows, https://issues.apache.org/jira/browse/BEAM-10747
    assumeFalse(SystemUtils.IS_OS_WINDOWS);
    File tmpFile1 = tmpFolder.newFile("result-000-of-002");
    File tmpFile2 = tmpFolder.newFile("result-001-of-002");
    File tmpFile3 = tmpFolder.newFile("tmp");
    Files.write("To be or not to be, ", tmpFile1, StandardCharsets.UTF_8);
    Files.write("it is not a question.", tmpFile2, StandardCharsets.UTF_8);
    Files.write("tmp", tmpFile3, StandardCharsets.UTF_8);
    assertThat(
        new NumberedShardedFile(tmpFolder.getRoot().toPath().resolve("result-*").toString()),
        fileContentsHaveChecksum("90552392c28396935fe4f123bd0b5c2d0f6260c8"));
}
Also used: NumberedShardedFile (org.apache.beam.sdk.util.NumberedShardedFile), File (java.io.File), Test (org.junit.Test)

Example 14 with NumberedShardedFile

Use of org.apache.beam.sdk.util.NumberedShardedFile in project beam by apache.

In the class FileChecksumMatcherTest, method testMatcherThatUsesCustomizedTemplate.

@Test
public void testMatcherThatUsesCustomizedTemplate() throws Exception {
    // Customized template: resultSSS-totalNNN
    // TODO: Java core test failing on windows, https://issues.apache.org/jira/browse/BEAM-10749
    assumeFalse(SystemUtils.IS_OS_WINDOWS);
    File tmpFile1 = tmpFolder.newFile("result0-total2");
    File tmpFile2 = tmpFolder.newFile("result1-total2");
    Files.write("To be or not to be, ", tmpFile1, StandardCharsets.UTF_8);
    Files.write("it is not a question.", tmpFile2, StandardCharsets.UTF_8);
    Pattern customizedTemplate = Pattern.compile("(?x) result (?<shardnum>\\d+) - total (?<numshards>\\d+)");
    assertThat(
        new NumberedShardedFile(tmpFolder.getRoot().toPath().resolve("*").toString(), customizedTemplate),
        fileContentsHaveChecksum("90552392c28396935fe4f123bd0b5c2d0f6260c8"));
}
Also used: Pattern (java.util.regex.Pattern), NumberedShardedFile (org.apache.beam.sdk.util.NumberedShardedFile), File (java.io.File), Test (org.junit.Test)
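
The customized template in testMatcherThatUsesCustomizedTemplate is an ordinary java.util.regex pattern with named groups. The sketch below only illustrates how that regex reads a file name; how NumberedShardedFile applies the template internally is not shown on this page, and the class and method names here are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShardTemplateSketch {
    // Same pattern string as in the test. The (?x) flag enables comments mode,
    // so the spaces in the pattern are ignored and it matches e.g. "result0-total2".
    static final Pattern CUSTOM_TEMPLATE =
        Pattern.compile("(?x) result (?<shardnum>\\d+) - total (?<numshards>\\d+)");

    // Returns {shardnum, numshards} for a matching file name, or null if the name does not match.
    static int[] parseShard(String fileName) {
        Matcher m = CUSTOM_TEMPLATE.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group("shardnum")),
            Integer.parseInt(m.group("numshards"))
        };
    }

    public static void main(String[] args) {
        int[] shard = parseShard("result1-total2");
        System.out.println(shard[0] + " of " + shard[1]); // prints 1 of 2
        System.out.println(parseShard("tmp") == null);    // prints true
    }
}
```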

Aggregations

NumberedShardedFile (org.apache.beam.sdk.util.NumberedShardedFile): 14
Test (org.junit.Test): 13
File (java.io.File): 8
Date (java.util.Date): 5
ResourceId (org.apache.beam.sdk.io.fs.ResourceId): 2
IOException (java.io.IOException): 1
TreeMap (java.util.TreeMap): 1
Pattern (java.util.regex.Pattern): 1
PerWindowFiles (org.apache.beam.examples.common.WriteOneFilePerWindow.PerWindowFiles): 1
Pipeline (org.apache.beam.sdk.Pipeline): 1
PipelineResult (org.apache.beam.sdk.PipelineResult): 1
State (org.apache.beam.sdk.PipelineResult.State): 1
GcsOptions (org.apache.beam.sdk.extensions.gcp.options.GcsOptions): 1
GcsUtil (org.apache.beam.sdk.extensions.gcp.util.GcsUtil): 1
MatchResult (org.apache.beam.sdk.io.fs.MatchResult): 1
Metadata (org.apache.beam.sdk.io.fs.MatchResult.Metadata): 1
TestPipeline (org.apache.beam.sdk.testing.TestPipeline): 1
TestPipelineOptions (org.apache.beam.sdk.testing.TestPipelineOptions): 1
IntervalWindow (org.apache.beam.sdk.transforms.windowing.IntervalWindow): 1
ExplicitShardedFile (org.apache.beam.sdk.util.ExplicitShardedFile): 1