Example 1 with DatasetHelper

Use of org.apache.gobblin.compaction.dataset.DatasetHelper in project incubator-gobblin by apache.

Class MRCompactorJobPropCreator, method obtainDatasetWithJobProps.

private Optional<Dataset> obtainDatasetWithJobProps(State jobProps, Dataset dataset) throws IOException {
    if (this.recompactFromInputPaths) {
        LOG.info(String.format("Will recompact for %s.", dataset.outputPath()));
        addInputLateFilesForFirstTimeCompaction(jobProps, dataset);
    } else {
        Set<Path> newDataFiles = new HashSet<>();
        do {
            if (renameSourceDirEnabled) {
                Set<Path> newUnrenamedDirs = MRCompactor.getDeepestLevelUnrenamedDirsWithFileExistence(this.fs, dataset.inputPaths());
                if (newUnrenamedDirs.isEmpty()) {
                    LOG.info("[{}] doesn't have unprocessed directories", dataset.getDatasetName());
                    break;
                }
                Set<Path> allFiles = getAllFilePathsRecursively(newUnrenamedDirs);
                if (allFiles.isEmpty()) {
                    LOG.info("[{}] has unprocessed directories but all empty: {}", dataset.getDatasetName(), newUnrenamedDirs);
                    break;
                }
                dataset.setRenamePaths(newUnrenamedDirs);
                newDataFiles.addAll(allFiles);
                LOG.info("[{}] has unprocessed directories: {}", dataset.getDatasetName(), newUnrenamedDirs);
            } else {
                newDataFiles = getNewDataInFolder(dataset.inputPaths(), dataset.outputPath());
                Set<Path> newDataFilesInLatePath = getNewDataInFolder(dataset.inputLatePaths(), dataset.outputPath());
                newDataFiles.addAll(newDataFilesInLatePath);
                if (newDataFiles.isEmpty()) {
                    break;
                }
                if (!newDataFilesInLatePath.isEmpty()) {
                    dataset.addAdditionalInputPaths(dataset.inputLatePaths());
                }
            }
        } while (false);
        if (newDataFiles.isEmpty()) {
            // re-compaction flow will run.
            if (isOutputLateDataExists(dataset)) {
                LOG.info("{} don't have new data, but previous late data still remains, check if it requires to move", dataset.getDatasetName());
                dataset.setJobProps(jobProps);
                dataset.checkIfNeedToRecompact(new DatasetHelper(dataset, this.fs, Lists.newArrayList("avro")));
                if (dataset.needToRecompact()) {
                    MRCompactor.modifyDatasetStateToRecompact(dataset);
                } else {
                    return Optional.absent();
                }
            } else {
                return Optional.absent();
            }
        } else {
            LOG.info(String.format("Will copy %d new data files for %s", newDataFiles.size(), dataset.outputPath()));
            jobProps.setProp(MRCompactor.COMPACTION_JOB_LATE_DATA_MOVEMENT_TASK, true);
            jobProps.setProp(MRCompactor.COMPACTION_JOB_LATE_DATA_FILES, Joiner.on(",").join(newDataFiles));
        }
    }
    dataset.setJobProps(jobProps);
    return Optional.of(dataset);
}
Also used : Path(org.apache.hadoop.fs.Path) DatasetHelper(org.apache.gobblin.compaction.dataset.DatasetHelper) HashSet(java.util.HashSet)
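
The method above wraps its file-discovery logic in `do { ... } while (false)`, so each `break` skips the rest of the block and falls through to the `newDataFiles.isEmpty()` check. A minimal sketch of that idiom in isolation follows; the class, method, and parameter names are hypothetical and nothing here is Gobblin code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the single-pass do/while(false) idiom.
class SinglePassBlockSketch {

    // Returns the files to process, or an empty list if any precondition fails.
    static List<String> collect(List<String> dirs, List<String> files) {
        List<String> result = new ArrayList<>();
        do {
            if (dirs.isEmpty()) {
                break; // nothing to scan: leave the block early
            }
            if (files.isEmpty()) {
                break; // directories exist but hold no files
            }
            result.addAll(files);
        } while (false); // the body runs exactly once
        return result;   // code after the block runs on every path
    }

    public static void main(String[] args) {
        System.out.println(collect(Arrays.asList("d"), Arrays.asList("a", "b"))); // [a, b]
        System.out.println(collect(Collections.<String>emptyList(), Arrays.asList("a"))); // []
    }
}
```

The same control flow could be expressed with a helper method and early `return` statements; the loop form keeps everything in one method.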

Example 2 with DatasetHelper

Use of org.apache.gobblin.compaction.dataset.DatasetHelper in project incubator-gobblin by apache.

Class RecompactionConditionTest, method testRecompactionConditionBasedOnFileCount.

@Test
public void testRecompactionConditionBasedOnFileCount() {
    try {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.delete(outputLatePath, true);
        fs.mkdirs(outputLatePath);
        RecompactionConditionFactory factory = new RecompactionConditionBasedOnFileCount.Factory();
        RecompactionCondition conditionBasedOnFileCount = factory.createRecompactionCondition(dataset);
        DatasetHelper helper = new DatasetHelper(dataset, fs, Lists.newArrayList("avro"));
        fs.createNewFile(new Path(outputLatePath, new Path("1.avro")));
        fs.createNewFile(new Path(outputLatePath, new Path("2.avro")));
        Assert.assertEquals(conditionBasedOnFileCount.isRecompactionNeeded(helper), false);
        fs.createNewFile(new Path(outputLatePath, new Path("3.avro")));
        Assert.assertEquals(conditionBasedOnFileCount.isRecompactionNeeded(helper), true);
        fs.delete(outputLatePath, true);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Also used : Path(org.apache.hadoop.fs.Path) Configuration(org.apache.hadoop.conf.Configuration) FileSystem(org.apache.hadoop.fs.FileSystem) LoggerFactory(org.slf4j.LoggerFactory) RecompactionConditionFactory(org.apache.gobblin.compaction.conditions.RecompactionConditionFactory) RecompactionCondition(org.apache.gobblin.compaction.conditions.RecompactionCondition) DatasetHelper(org.apache.gobblin.compaction.dataset.DatasetHelper) IOException(java.io.IOException) Test(org.testng.annotations.Test)

Example 3 with DatasetHelper

Use of org.apache.gobblin.compaction.dataset.DatasetHelper in project incubator-gobblin by apache.

Class RecompactionConditionTest, method testRecompactionConditionBasedOnRatio.

@Test
public void testRecompactionConditionBasedOnRatio() {
    RecompactionConditionFactory factory = new RecompactionConditionBasedOnRatio.Factory();
    RecompactionCondition conditionBasedOnRatio = factory.createRecompactionCondition(dataset);
    DatasetHelper helper = mock(DatasetHelper.class);
    when(helper.getLateOutputRecordCount()).thenReturn(6L);
    when(helper.getOutputRecordCount()).thenReturn(94L);
    Assert.assertEquals(conditionBasedOnRatio.isRecompactionNeeded(helper), false);
    when(helper.getLateOutputRecordCount()).thenReturn(21L);
    when(helper.getOutputRecordCount()).thenReturn(79L);
    Assert.assertEquals(conditionBasedOnRatio.isRecompactionNeeded(helper), true);
}
Also used : LoggerFactory(org.slf4j.LoggerFactory) RecompactionConditionFactory(org.apache.gobblin.compaction.conditions.RecompactionConditionFactory) RecompactionCondition(org.apache.gobblin.compaction.conditions.RecompactionCondition) DatasetHelper(org.apache.gobblin.compaction.dataset.DatasetHelper) Test(org.testng.annotations.Test)

Example 4 with DatasetHelper

Use of org.apache.gobblin.compaction.dataset.DatasetHelper in project incubator-gobblin by apache.

Class RecompactionConditionTest, method testRecompactionCombineCondition.

@Test
public void testRecompactionCombineCondition() {
    DatasetHelper helper = mock(DatasetHelper.class);
    RecompactionCondition cond1 = mock(RecompactionConditionBasedOnRatio.class);
    RecompactionCondition cond2 = mock(RecompactionConditionBasedOnFileCount.class);
    RecompactionCondition cond3 = mock(RecompactionConditionBasedOnDuration.class);
    RecompactionCombineCondition combineConditionOr = new RecompactionCombineCondition(Arrays.asList(cond1, cond2, cond3), RecompactionCombineCondition.CombineOperation.OR);
    when(cond1.isRecompactionNeeded(helper)).thenReturn(false);
    when(cond2.isRecompactionNeeded(helper)).thenReturn(false);
    when(cond3.isRecompactionNeeded(helper)).thenReturn(false);
    Assert.assertEquals(combineConditionOr.isRecompactionNeeded(helper), false);
    when(cond1.isRecompactionNeeded(helper)).thenReturn(false);
    when(cond2.isRecompactionNeeded(helper)).thenReturn(true);
    when(cond3.isRecompactionNeeded(helper)).thenReturn(false);
    Assert.assertEquals(combineConditionOr.isRecompactionNeeded(helper), true);
    RecompactionCombineCondition combineConditionAnd = new RecompactionCombineCondition(Arrays.asList(cond1, cond2, cond3), RecompactionCombineCondition.CombineOperation.AND);
    when(cond1.isRecompactionNeeded(helper)).thenReturn(true);
    when(cond2.isRecompactionNeeded(helper)).thenReturn(true);
    when(cond3.isRecompactionNeeded(helper)).thenReturn(false);
    Assert.assertEquals(combineConditionAnd.isRecompactionNeeded(helper), false);
    when(cond1.isRecompactionNeeded(helper)).thenReturn(true);
    when(cond2.isRecompactionNeeded(helper)).thenReturn(true);
    when(cond3.isRecompactionNeeded(helper)).thenReturn(true);
    Assert.assertEquals(combineConditionAnd.isRecompactionNeeded(helper), true);
}
Also used : RecompactionCondition(org.apache.gobblin.compaction.conditions.RecompactionCondition) RecompactionCombineCondition(org.apache.gobblin.compaction.conditions.RecompactionCombineCondition) DatasetHelper(org.apache.gobblin.compaction.dataset.DatasetHelper) Test(org.testng.annotations.Test)
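
The assertions in this test amount to a truth table: with OR, the combined condition fires when at least one member fires; with AND, only when all members fire. The sketch below reproduces that behaviour with plain `java.util.function.Predicate` objects. It is a hypothetical stand-in, not the actual `RecompactionCombineCondition` implementation, and the class, enum, and method names are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in; not Gobblin's RecompactionCombineCondition.
class CombineConditionSketch {

    enum CombineOperation { OR, AND }

    // Builds one predicate from several, using OR (any match) or AND (all match).
    static <H> Predicate<H> combine(List<Predicate<H>> conditions, CombineOperation op) {
        return helper -> op == CombineOperation.OR
            ? conditions.stream().anyMatch(c -> c.test(helper))
            : conditions.stream().allMatch(c -> c.test(helper));
    }

    public static void main(String[] args) {
        Object helper = new Object(); // plays the role of the mocked helper
        List<Predicate<Object>> mixed = Arrays.asList(h -> false, h -> true, h -> false);
        System.out.println(combine(mixed, CombineOperation.OR).test(helper));  // true
        System.out.println(combine(mixed, CombineOperation.AND).test(helper)); // false
    }
}
```

Note that in this sketch an empty condition list yields `false` under OR and `true` under AND; the test above does not show how the real class treats that case.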

Example 5 with DatasetHelper

Use of org.apache.gobblin.compaction.dataset.DatasetHelper in project incubator-gobblin by apache.

Class RecompactionConditionTest, method testRecompactionConditionBasedOnDuration.

@Test
public void testRecompactionConditionBasedOnDuration() {
    RecompactionConditionFactory factory = new RecompactionConditionBasedOnDuration.Factory();
    RecompactionCondition conditionBasedOnDuration = factory.createRecompactionCondition(dataset);
    DatasetHelper helper = mock(DatasetHelper.class);
    when(helper.getDataset()).thenReturn(dataset);
    PeriodFormatter periodFormatter = new PeriodFormatterBuilder().appendMonths().appendSuffix("m").appendDays().appendSuffix("d").appendHours().appendSuffix("h").appendMinutes().appendSuffix("min").toFormatter();
    DateTime currentTime = getCurrentTime();
    Period period_A = periodFormatter.parsePeriod("11h59min");
    DateTime earliest_A = currentTime.minus(period_A);
    when(helper.getEarliestLateFileModificationTime()).thenReturn(Optional.of(earliest_A));
    when(helper.getCurrentTime()).thenReturn(currentTime);
    Assert.assertEquals(conditionBasedOnDuration.isRecompactionNeeded(helper), false);
    Period period_B = periodFormatter.parsePeriod("12h01min");
    DateTime earliest_B = currentTime.minus(period_B);
    when(helper.getEarliestLateFileModificationTime()).thenReturn(Optional.of(earliest_B));
    when(helper.getCurrentTime()).thenReturn(currentTime);
    Assert.assertEquals(conditionBasedOnDuration.isRecompactionNeeded(helper), true);
}
Also used : PeriodFormatterBuilder(org.joda.time.format.PeriodFormatterBuilder) PeriodFormatter(org.joda.time.format.PeriodFormatter) LoggerFactory(org.slf4j.LoggerFactory) RecompactionConditionFactory(org.apache.gobblin.compaction.conditions.RecompactionConditionFactory) Period(org.joda.time.Period) RecompactionCondition(org.apache.gobblin.compaction.conditions.RecompactionCondition) DatasetHelper(org.apache.gobblin.compaction.dataset.DatasetHelper) DateTime(org.joda.time.DateTime) Test(org.testng.annotations.Test)
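
The test expects no recompaction when the earliest late file is 11h59min old and recompaction when it is 12h01min old, so the threshold configured in the test fixture presumably lies between those two values. The sketch below assumes a 12-hour threshold and uses `java.time` rather than Joda-Time to stay dependency-free. The class name, the constant, and the `>=` comparison at exactly 12 hours are all assumptions, not taken from `RecompactionConditionBasedOnDuration`.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in; not Gobblin's RecompactionConditionBasedOnDuration.
class DurationConditionSketch {

    // Assumed value: the test only shows the threshold lies between 11h59min and 12h01min.
    static final Duration ASSUMED_THRESHOLD = Duration.ofHours(12);

    // True when the earliest late file is at least ASSUMED_THRESHOLD old.
    static boolean isRecompactionNeeded(Instant earliestLateFile, Instant now) {
        Duration age = Duration.between(earliestLateFile, now);
        return age.compareTo(ASSUMED_THRESHOLD) >= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2020-01-02T00:00:00Z");
        Instant earliestA = now.minus(Duration.ofHours(11).plusMinutes(59));
        Instant earliestB = now.minus(Duration.ofHours(12).plusMinutes(1));
        System.out.println(isRecompactionNeeded(earliestA, now)); // false
        System.out.println(isRecompactionNeeded(earliestB, now)); // true
    }
}
```

Passing the current time in as a parameter, as the mocked `getCurrentTime()` does in the test, keeps the check deterministic.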

Aggregations

DatasetHelper (org.apache.gobblin.compaction.dataset.DatasetHelper) 5
RecompactionCondition (org.apache.gobblin.compaction.conditions.RecompactionCondition) 4
Test (org.testng.annotations.Test) 4
RecompactionConditionFactory (org.apache.gobblin.compaction.conditions.RecompactionConditionFactory) 3
LoggerFactory (org.slf4j.LoggerFactory) 3
Path (org.apache.hadoop.fs.Path) 2
IOException (java.io.IOException) 1
HashSet (java.util.HashSet) 1
RecompactionCombineCondition (org.apache.gobblin.compaction.conditions.RecompactionCombineCondition) 1
Configuration (org.apache.hadoop.conf.Configuration) 1
FileSystem (org.apache.hadoop.fs.FileSystem) 1
DateTime (org.joda.time.DateTime) 1
Period (org.joda.time.Period) 1
PeriodFormatter (org.joda.time.format.PeriodFormatter) 1
PeriodFormatterBuilder (org.joda.time.format.PeriodFormatterBuilder) 1