Examples with HadoopIngestionSpec - org.apache.druid.indexer.HadoopIngestionSpec

Example 1 with HadoopIngestionSpec

use of org.apache.druid.indexer.HadoopIngestionSpec in project druid by druid-io.

the class StaticPathSpecTest method testAddingPaths.

@Test
public void testAddingPaths() throws Exception {
    Job job = new Job();
    StaticPathSpec pathSpec = new StaticPathSpec("/a/c,/a/b/{c,d}", null);
    DataSchema schema = new DataSchema("ds", null, new AggregatorFactory[0], null, null, jsonMapper);
    HadoopIOConfig io = new HadoopIOConfig(null, null, null);
    pathSpec.addInputPaths(new HadoopDruidIndexerConfig(new HadoopIngestionSpec(schema, io, null)), job);
    String paths = job.getConfiguration().get(MultipleInputs.DIR_FORMATS);
    String formatter = TextInputFormat.class.getName();
    String[] expected = { "/a/c;" + formatter, "/a/b/c;" + formatter, "/a/b/d;" + formatter };
    Assert.assertArrayEquals(expected, paths.split(","));
}

Also used : DataSchema(org.apache.druid.segment.indexing.DataSchema) HadoopIngestionSpec(org.apache.druid.indexer.HadoopIngestionSpec) Job(org.apache.hadoop.mapreduce.Job) HadoopDruidIndexerConfig(org.apache.druid.indexer.HadoopDruidIndexerConfig) HadoopIOConfig(org.apache.druid.indexer.HadoopIOConfig) Test(org.junit.Test)

Example 2 with HadoopIngestionSpec

use of org.apache.druid.indexer.HadoopIngestionSpec in project druid by druid-io.

the class GranularityPathSpecTest method testIntervalTrimming.

@Test
public void testIntervalTrimming() throws Exception {
    UserGroupInformation.setLoginUser(UserGroupInformation.createUserForTesting("test", new String[] { "testGroup" }));
    HadoopIngestionSpec spec = new HadoopIngestionSpec(new DataSchema("foo", null, new AggregatorFactory[0], new UniformGranularitySpec(Granularities.DAY, Granularities.ALL, ImmutableList.of(Intervals.of("2015-01-01T11Z/2015-01-02T05Z"))), null, jsonMapper), new HadoopIOConfig(null, null, null), DEFAULT_TUNING_CONFIG);
    granularityPathSpec.setDataGranularity(Granularities.HOUR);
    granularityPathSpec.setPathFormat("yyyy/MM/dd/HH");
    granularityPathSpec.setFilePattern(".*");
    granularityPathSpec.setInputFormat(TextInputFormat.class);
    Job job = Job.getInstance();
    String formatStr = "file:%s/%s;org.apache.hadoop.mapreduce.lib.input.TextInputFormat";
    createFile(testFolder, "test/2015/01/01/00/file1", "test/2015/01/01/10/file2", "test/2015/01/01/18/file3", "test/2015/01/02/00/file1", "test/2015/01/02/03/file2", "test/2015/01/02/05/file3", "test/2015/01/02/07/file4", "test/2015/01/02/09/file5");
    granularityPathSpec.setInputPath(testFolder.getRoot().getPath() + "/test");
    granularityPathSpec.addInputPaths(HadoopDruidIndexerConfig.fromSpec(spec), job);
    String actual = job.getConfiguration().get("mapreduce.input.multipleinputs.dir.formats");
    String expected = Joiner.on(",").join(Lists.newArrayList(StringUtils.format(formatStr, testFolder.getRoot(), "test/2015/01/01/18/file3"), StringUtils.format(formatStr, testFolder.getRoot(), "test/2015/01/02/00/file1"), StringUtils.format(formatStr, testFolder.getRoot(), "test/2015/01/02/03/file2")));
    Assert.assertEquals("Did not find expected input paths", expected, actual);
}

Also used : HadoopIngestionSpec(org.apache.druid.indexer.HadoopIngestionSpec) DataSchema(org.apache.druid.segment.indexing.DataSchema) UniformGranularitySpec(org.apache.druid.segment.indexing.granularity.UniformGranularitySpec) AggregatorFactory(org.apache.druid.query.aggregation.AggregatorFactory) Job(org.apache.hadoop.mapreduce.Job) HadoopIOConfig(org.apache.druid.indexer.HadoopIOConfig) Test(org.junit.Test)

Example 3 with HadoopIngestionSpec

use of org.apache.druid.indexer.HadoopIngestionSpec in project druid by druid-io.

the class HadoopIndexTask method runInternal.

@SuppressWarnings("unchecked")
private TaskStatus runInternal(TaskToolbox toolbox) throws Exception {
    boolean indexGeneratorJobAttempted = false;
    boolean indexGeneratorJobSuccess = false;
    HadoopIngestionSpec indexerSchema = null;
    try {
        registerResourceCloserOnAbnormalExit(config -> killHadoopJob());
        String hadoopJobIdFile = getHadoopJobIdFileName();
        final ClassLoader loader = buildClassLoader(toolbox);
        boolean determineIntervals = spec.getDataSchema().getGranularitySpec().inputIntervals().isEmpty();
        HadoopIngestionSpec.updateSegmentListIfDatasourcePathSpecIsUsed(spec, jsonMapper, new OverlordActionBasedUsedSegmentsRetriever(toolbox));
        Object determinePartitionsInnerProcessingRunner = getForeignClassloaderObject("org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopDetermineConfigInnerProcessingRunner", loader);
        determinePartitionsStatsGetter = new InnerProcessingStatsGetter(determinePartitionsInnerProcessingRunner);
        String[] determinePartitionsInput = new String[] { toolbox.getJsonMapper().writeValueAsString(spec), toolbox.getConfig().getHadoopWorkingPath(), toolbox.getSegmentPusher().getPathForHadoop(), hadoopJobIdFile };
        final ClassLoader oldLoader = Thread.currentThread().getContextClassLoader();
        Class<?> determinePartitionsRunnerClass = determinePartitionsInnerProcessingRunner.getClass();
        Method determinePartitionsInnerProcessingRunTask = determinePartitionsRunnerClass.getMethod("runTask", determinePartitionsInput.getClass());
        try {
            Thread.currentThread().setContextClassLoader(loader);
            ingestionState = IngestionState.DETERMINE_PARTITIONS;
            final String determineConfigStatusString = (String) determinePartitionsInnerProcessingRunTask.invoke(determinePartitionsInnerProcessingRunner, new Object[] { determinePartitionsInput });
            determineConfigStatus = toolbox.getJsonMapper().readValue(determineConfigStatusString, HadoopDetermineConfigInnerProcessingStatus.class);
            indexerSchema = determineConfigStatus.getSchema();
            if (indexerSchema == null) {
                errorMsg = determineConfigStatus.getErrorMsg();
                toolbox.getTaskReportFileWriter().write(getId(), getTaskCompletionReports());
                return TaskStatus.failure(getId(), errorMsg);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            Thread.currentThread().setContextClassLoader(oldLoader);
        }
        // We should have a lock from before we started running only if interval was specified
        String version;
        if (determineIntervals) {
            Interval interval = JodaUtils.umbrellaInterval(JodaUtils.condenseIntervals(indexerSchema.getDataSchema().getGranularitySpec().sortedBucketIntervals()));
            final long lockTimeoutMs = getContextValue(Tasks.LOCK_TIMEOUT_KEY, Tasks.DEFAULT_LOCK_TIMEOUT_MILLIS);
            // Note: if lockTimeoutMs is larger than ServerConfig.maxIdleTime, the below line can incur http timeout error.
            final TaskLock lock = Preconditions.checkNotNull(toolbox.getTaskActionClient().submit(new TimeChunkLockAcquireAction(TaskLockType.EXCLUSIVE, interval, lockTimeoutMs)), "Cannot acquire a lock for interval[%s]", interval);
            if (lock.isRevoked()) {
                throw new ISE(StringUtils.format("Lock for interval [%s] was revoked.", interval));
            }
            version = lock.getVersion();
        } else {
            Iterable<TaskLock> locks = getTaskLocks(toolbox.getTaskActionClient());
            final TaskLock myLock = Iterables.getOnlyElement(locks);
            version = myLock.getVersion();
        }
        final String specVersion = indexerSchema.getTuningConfig().getVersion();
        if (indexerSchema.getTuningConfig().isUseExplicitVersion()) {
            if (specVersion.compareTo(version) < 0) {
                version = specVersion;
            } else {
                String errMsg = StringUtils.format("Spec version can not be greater than or equal to the lock version, Spec version: [%s] Lock version: [%s].", specVersion, version);
                log.error(errMsg);
                toolbox.getTaskReportFileWriter().write(getId(), null);
                return TaskStatus.failure(getId(), errMsg);
            }
        }
        log.info("Setting version to: %s", version);
        Object innerProcessingRunner = getForeignClassloaderObject("org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessingRunner", loader);
        buildSegmentsStatsGetter = new InnerProcessingStatsGetter(innerProcessingRunner);
        String[] buildSegmentsInput = new String[] { toolbox.getJsonMapper().writeValueAsString(indexerSchema), version, hadoopJobIdFile };
        Class<?> buildSegmentsRunnerClass = innerProcessingRunner.getClass();
        Method innerProcessingRunTask = buildSegmentsRunnerClass.getMethod("runTask", buildSegmentsInput.getClass());
        try {
            Thread.currentThread().setContextClassLoader(loader);
            ingestionState = IngestionState.BUILD_SEGMENTS;
            indexGeneratorJobAttempted = true;
            final String jobStatusString = (String) innerProcessingRunTask.invoke(innerProcessingRunner, new Object[] { buildSegmentsInput });
            buildSegmentsStatus = toolbox.getJsonMapper().readValue(jobStatusString, HadoopIndexGeneratorInnerProcessingStatus.class);
            List<DataSegmentAndIndexZipFilePath> dataSegmentAndIndexZipFilePaths = buildSegmentsStatus.getDataSegmentAndIndexZipFilePaths();
            if (dataSegmentAndIndexZipFilePaths != null) {
                indexGeneratorJobSuccess = true;
                renameSegmentIndexFilesJob(toolbox.getJsonMapper().writeValueAsString(indexerSchema), toolbox.getJsonMapper().writeValueAsString(dataSegmentAndIndexZipFilePaths));
                ArrayList<DataSegment> segments = new ArrayList<>(dataSegmentAndIndexZipFilePaths.stream().map(DataSegmentAndIndexZipFilePath::getSegment).collect(Collectors.toList()));
                toolbox.publishSegments(segments);
                // for awaitSegmentAvailabilityTimeoutMillis
                if (spec.getTuningConfig().getAwaitSegmentAvailabilityTimeoutMillis() > 0) {
                    ingestionState = IngestionState.SEGMENT_AVAILABILITY_WAIT;
                    waitForSegmentAvailability(toolbox, segments, spec.getTuningConfig().getAwaitSegmentAvailabilityTimeoutMillis());
                }
                ingestionState = IngestionState.COMPLETED;
                toolbox.getTaskReportFileWriter().write(getId(), getTaskCompletionReports());
                return TaskStatus.success(getId());
            } else {
                errorMsg = buildSegmentsStatus.getErrorMsg();
                toolbox.getTaskReportFileWriter().write(getId(), getTaskCompletionReports());
                return TaskStatus.failure(getId(), errorMsg);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            Thread.currentThread().setContextClassLoader(oldLoader);
        }
    } finally {
        indexerGeneratorCleanupJob(indexGeneratorJobAttempted, indexGeneratorJobSuccess, indexerSchema == null ? null : toolbox.getJsonMapper().writeValueAsString(indexerSchema));
    }
}

Also used : HadoopIngestionSpec(org.apache.druid.indexer.HadoopIngestionSpec) ArrayList(java.util.ArrayList) OverlordActionBasedUsedSegmentsRetriever(org.apache.druid.indexing.hadoop.OverlordActionBasedUsedSegmentsRetriever) DataSegment(org.apache.druid.timeline.DataSegment) TaskLock(org.apache.druid.indexing.common.TaskLock) ISE(org.apache.druid.java.util.common.ISE) DataSegmentAndIndexZipFilePath(org.apache.druid.indexer.DataSegmentAndIndexZipFilePath) Method(java.lang.reflect.Method) InvocationTargetException(java.lang.reflect.InvocationTargetException) TimeChunkLockAcquireAction(org.apache.druid.indexing.common.actions.TimeChunkLockAcquireAction) Interval(org.joda.time.Interval)

Example 4 with HadoopIngestionSpec

use of org.apache.druid.indexer.HadoopIngestionSpec in project druid by druid-io.

the class MaterializedViewSupervisorSpec method createTask.

public HadoopIndexTask createTask(Interval interval, String version, List<DataSegment> segments) {
    String taskId = StringUtils.format("%s_%s_%s", TASK_PREFIX, dataSourceName, DateTimes.nowUtc());
    // generate parser
    Map<String, Object> parseSpec = new HashMap<>();
    parseSpec.put("format", "timeAndDims");
    parseSpec.put("dimensionsSpec", dimensionsSpec);
    Map<String, Object> parser = new HashMap<>();
    parser.put("type", "map");
    parser.put("parseSpec", parseSpec);
    // generate HadoopTuningConfig
    HadoopTuningConfig tuningConfigForTask = new HadoopTuningConfig(tuningConfig.getWorkingPath(), version, tuningConfig.getPartitionsSpec(), tuningConfig.getShardSpecs(), tuningConfig.getIndexSpec(), tuningConfig.getIndexSpecForIntermediatePersists(), tuningConfig.getAppendableIndexSpec(), tuningConfig.getMaxRowsInMemory(), tuningConfig.getMaxBytesInMemory(), tuningConfig.isLeaveIntermediate(), tuningConfig.isCleanupOnFailure(), tuningConfig.isOverwriteFiles(), tuningConfig.isIgnoreInvalidRows(), tuningConfig.getJobProperties(), tuningConfig.isCombineText(), tuningConfig.getUseCombiner(), tuningConfig.getMaxRowsInMemory(), tuningConfig.getNumBackgroundPersistThreads(), tuningConfig.isForceExtendableShardSpecs(), true, tuningConfig.getUserAllowedHadoopPrefix(), tuningConfig.isLogParseExceptions(), tuningConfig.getMaxParseExceptions(), tuningConfig.isUseYarnRMJobStatusFallback(), tuningConfig.getAwaitSegmentAvailabilityTimeoutMillis());
    // generate granularity
    ArbitraryGranularitySpec granularitySpec = new ArbitraryGranularitySpec(Granularities.NONE, ImmutableList.of(interval));
    // generate DataSchema
    DataSchema dataSchema = new DataSchema(dataSourceName, parser, aggregators, granularitySpec, TransformSpec.NONE, objectMapper);
    // generate DatasourceIngestionSpec
    DatasourceIngestionSpec datasourceIngestionSpec = new DatasourceIngestionSpec(baseDataSource, null, ImmutableList.of(interval), segments, null, null, null, false, null);
    // generate HadoopIOConfig
    Map<String, Object> inputSpec = new HashMap<>();
    inputSpec.put("type", "dataSource");
    inputSpec.put("ingestionSpec", datasourceIngestionSpec);
    HadoopIOConfig hadoopIOConfig = new HadoopIOConfig(inputSpec, null, null);
    // generate HadoopIngestionSpec
    HadoopIngestionSpec spec = new HadoopIngestionSpec(dataSchema, hadoopIOConfig, tuningConfigForTask);
    // generate HadoopIndexTask
    HadoopIndexTask task = new HadoopIndexTask(taskId, spec, hadoopCoordinates, hadoopDependencyCoordinates, classpathPrefix, objectMapper, context, authorizerMapper, chatHandlerProvider);
    return task;
}

Also used : DataSchema(org.apache.druid.segment.indexing.DataSchema) DatasourceIngestionSpec(org.apache.druid.indexer.hadoop.DatasourceIngestionSpec) HadoopIngestionSpec(org.apache.druid.indexer.HadoopIngestionSpec) HashMap(java.util.HashMap) HadoopTuningConfig(org.apache.druid.indexer.HadoopTuningConfig) ArbitraryGranularitySpec(org.apache.druid.segment.indexing.granularity.ArbitraryGranularitySpec) HadoopIndexTask(org.apache.druid.indexing.common.task.HadoopIndexTask) HadoopIOConfig(org.apache.druid.indexer.HadoopIOConfig)

Example 5 with HadoopIngestionSpec

use of org.apache.druid.indexer.HadoopIngestionSpec in project druid by druid-io.

the class HdfsDataSegmentPusherTest method shouldMakeDefaultSegmentOutputPathIfNotHDFS.

@Test
public void shouldMakeDefaultSegmentOutputPathIfNotHDFS() {
    final HadoopIngestionSpec schema;
    try {
        schema = objectMapper.readValue("{\n" + "    \"dataSchema\": {\n" + "        \"dataSource\": \"the:data:source\",\n" + "        \"metricsSpec\": [],\n" + "        \"granularitySpec\": {\n" + "            \"type\": \"uniform\",\n" + "            \"segmentGranularity\": \"hour\",\n" + "            \"intervals\": [\"2012-07-10/P1D\"]\n" + "        }\n" + "    },\n" + "    \"ioConfig\": {\n" + "        \"type\": \"hadoop\",\n" + "        \"segmentOutputPath\": \"/tmp/dru:id/data:test\"\n" + "    }\n" + "}", HadoopIngestionSpec.class);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    HadoopDruidIndexerConfig cfg = new HadoopDruidIndexerConfig(schema.withTuningConfig(schema.getTuningConfig().withVersion("some:brand:new:version")));
    Bucket bucket = new Bucket(4711, new DateTime(2012, 07, 10, 5, 30, ISOChronology.getInstanceUTC()), 4712);
    Path path = JobHelper.makeFileNamePath(new Path(cfg.getSchema().getIOConfig().getSegmentOutputPath()), new LocalFileSystem(), new DataSegment(cfg.getSchema().getDataSchema().getDataSource(), cfg.getSchema().getDataSchema().getGranularitySpec().bucketInterval(bucket.time).get(), cfg.getSchema().getTuningConfig().getVersion(), null, null, null, new NumberedShardSpec(bucket.partitionNum, 5000), -1, 0), JobHelper.INDEX_ZIP, new LocalDataSegmentPusher(new LocalDataSegmentPusherConfig()));
    Assert.assertEquals("file:/tmp/dru:id/data:test/the:data:source/2012-07-10T05:00:00.000Z_2012-07-10T06:00:00.000Z/some:brand:new:" + "version/4712/index.zip", path.toString());
    path = JobHelper.makeTmpPath(new Path(cfg.getSchema().getIOConfig().getSegmentOutputPath()), new LocalFileSystem(), new DataSegment(cfg.getSchema().getDataSchema().getDataSource(), cfg.getSchema().getDataSchema().getGranularitySpec().bucketInterval(bucket.time).get(), cfg.getSchema().getTuningConfig().getVersion(), null, null, null, new NumberedShardSpec(bucket.partitionNum, 5000), -1, 0), new TaskAttemptID("abc", 123, TaskType.REDUCE, 1, 0), new LocalDataSegmentPusher(new LocalDataSegmentPusherConfig()));
    Assert.assertEquals("file:/tmp/dru:id/data:test/the:data:source/2012-07-10T05:00:00.000Z_2012-07-10T06:00:00.000Z/some:brand:new:" + "version/4712/index.zip.0", path.toString());
}

Also used : HadoopIngestionSpec(org.apache.druid.indexer.HadoopIngestionSpec) Path(org.apache.hadoop.fs.Path) TaskAttemptID(org.apache.hadoop.mapreduce.TaskAttemptID) LocalDataSegmentPusherConfig(org.apache.druid.segment.loading.LocalDataSegmentPusherConfig) HadoopDruidIndexerConfig(org.apache.druid.indexer.HadoopDruidIndexerConfig) DataSegment(org.apache.druid.timeline.DataSegment) LocalDataSegmentPusher(org.apache.druid.segment.loading.LocalDataSegmentPusher) ExpectedException(org.junit.rules.ExpectedException) IOException(java.io.IOException) DateTime(org.joda.time.DateTime) Bucket(org.apache.druid.indexer.Bucket) LocalFileSystem(org.apache.hadoop.fs.LocalFileSystem) NumberedShardSpec(org.apache.druid.timeline.partition.NumberedShardSpec) Test(org.junit.Test)

Aggregations

HadoopIngestionSpec (org.apache.druid.indexer.HadoopIngestionSpec)9 Test (org.junit.Test)7 HadoopIOConfig (org.apache.druid.indexer.HadoopIOConfig)6 DataSchema (org.apache.druid.segment.indexing.DataSchema)6 DataSegment (org.apache.druid.timeline.DataSegment)4 HadoopDruidIndexerConfig (org.apache.druid.indexer.HadoopDruidIndexerConfig)3 UniformGranularitySpec (org.apache.druid.segment.indexing.granularity.UniformGranularitySpec)3 Job (org.apache.hadoop.mapreduce.Job)3 IOException (java.io.IOException)2 HashMap (java.util.HashMap)2 Bucket (org.apache.druid.indexer.Bucket)2 HadoopIndexTask (org.apache.druid.indexing.common.task.HadoopIndexTask)2 AggregatorFactory (org.apache.druid.query.aggregation.AggregatorFactory)2 NumberedShardSpec (org.apache.druid.timeline.partition.NumberedShardSpec)2 Path (org.apache.hadoop.fs.Path)2 TaskAttemptID (org.apache.hadoop.mapreduce.TaskAttemptID)2 DateTime (org.joda.time.DateTime)2 Interval (org.joda.time.Interval)2 ExpectedException (org.junit.rules.ExpectedException)2 ImmutableMap (com.google.common.collect.ImmutableMap)1