Example 1 with ParseExceptionHandler

Use of org.apache.druid.segment.incremental.ParseExceptionHandler in project druid by druid-io.

From the class UnifiedIndexerAppenderatorsManagerTest, method setup:

@Before
public void setup() {
    appenderatorConfig = EasyMock.createMock(AppenderatorConfig.class);
    EasyMock.expect(appenderatorConfig.getMaxPendingPersists()).andReturn(0);
    EasyMock.expect(appenderatorConfig.isSkipBytesInMemoryOverheadCheck()).andReturn(false);
    EasyMock.replay(appenderatorConfig);
    appenderator = manager.createClosedSegmentsOfflineAppenderatorForTask(
        "taskId",
        new DataSchema(
            "myDataSource",
            new TimestampSpec("__time", "millis", null),
            null,
            null,
            new UniformGranularitySpec(Granularities.HOUR, Granularities.HOUR, false, Collections.emptyList()),
            null
        ),
        appenderatorConfig,
        new FireDepartmentMetrics(),
        new NoopDataSegmentPusher(),
        TestHelper.makeJsonMapper(),
        TestHelper.getTestIndexIO(),
        TestHelper.getTestIndexMergerV9(OnHeapMemorySegmentWriteOutMediumFactory.instance()),
        new NoopRowIngestionMeters(),
        new ParseExceptionHandler(new NoopRowIngestionMeters(), false, 0, 0),
        true
    );
}
Also used : DataSchema(org.apache.druid.segment.indexing.DataSchema) UniformGranularitySpec(org.apache.druid.segment.indexing.granularity.UniformGranularitySpec) NoopDataSegmentPusher(org.apache.druid.segment.loading.NoopDataSegmentPusher) FireDepartmentMetrics(org.apache.druid.segment.realtime.FireDepartmentMetrics) NoopRowIngestionMeters(org.apache.druid.segment.incremental.NoopRowIngestionMeters) TimestampSpec(org.apache.druid.data.input.impl.TimestampSpec) ParseExceptionHandler(org.apache.druid.segment.incremental.ParseExceptionHandler) Before(org.junit.Before)
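
The test configures its handler with (meters, false, 0, 0): no logging, zero tolerated parse exceptions, nothing saved for the report. Taken out of the appenderator wiring, a minimal sketch of constructing and feeding the handler directly could look like the following; the threshold values are illustrative, and the format-string ParseException constructor and handle() behavior are assumed from this era of the codebase:

import org.apache.druid.java.util.common.parsers.ParseException;
import org.apache.druid.segment.incremental.NoopRowIngestionMeters;
import org.apache.druid.segment.incremental.ParseExceptionHandler;
import org.apache.druid.segment.incremental.RowIngestionMeters;

public class ParseExceptionHandlerSketch {
    public static void main(String[] args) {
        RowIngestionMeters meters = new NoopRowIngestionMeters();
        // Arguments, in order: the meters that count unparseable rows, whether to
        // log each parse exception, how many parse exceptions to tolerate before
        // failing the task, and how many exceptions to keep for the task report.
        ParseExceptionHandler handler = new ParseExceptionHandler(meters, true, 10, 5);
        // A null argument means "row parsed fine" and is ignored; a non-null
        // ParseException is counted against the configured maximum.
        handler.handle(null);
        handler.handle(new ParseException("row [%s] was unparseable", "bad-row"));
    }
}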

Example 2 with ParseExceptionHandler

Use of org.apache.druid.segment.incremental.ParseExceptionHandler in project druid by druid-io.

From the class PartialSegmentGenerateTask, method generateSegments:

private List<DataSegment> generateSegments(final TaskToolbox toolbox, final ParallelIndexSupervisorTaskClient taskClient, final InputSource inputSource, final File tmpDir) throws IOException, InterruptedException, ExecutionException, TimeoutException {
    final DataSchema dataSchema = ingestionSchema.getDataSchema();
    final FireDepartment fireDepartmentForMetrics = new FireDepartment(dataSchema, new RealtimeIOConfig(null, null), null);
    final FireDepartmentMetrics fireDepartmentMetrics = fireDepartmentForMetrics.getMetrics();
    final RowIngestionMeters buildSegmentsMeters = toolbox.getRowIngestionMetersFactory().createRowIngestionMeters();
    toolbox.addMonitor(new RealtimeMetricsMonitor(Collections.singletonList(fireDepartmentForMetrics), Collections.singletonMap(DruidMetrics.TASK_ID, new String[] { getId() })));
    final ParallelIndexTuningConfig tuningConfig = ingestionSchema.getTuningConfig();
    final PartitionsSpec partitionsSpec = tuningConfig.getGivenOrDefaultPartitionsSpec();
    final long pushTimeout = tuningConfig.getPushTimeout();
    final SegmentAllocatorForBatch segmentAllocator = createSegmentAllocator(toolbox, taskClient);
    final SequenceNameFunction sequenceNameFunction = segmentAllocator.getSequenceNameFunction();
    final ParseExceptionHandler parseExceptionHandler = new ParseExceptionHandler(
        buildSegmentsMeters,
        tuningConfig.isLogParseExceptions(),
        tuningConfig.getMaxParseExceptions(),
        tuningConfig.getMaxSavedParseExceptions()
    );
    final boolean useMaxMemoryEstimates = getContextValue(Tasks.USE_MAX_MEMORY_ESTIMATES, Tasks.DEFAULT_USE_MAX_MEMORY_ESTIMATES);
    final Appenderator appenderator = BatchAppenderators.newAppenderator(
        getId(),
        toolbox.getAppenderatorsManager(),
        fireDepartmentMetrics,
        toolbox,
        dataSchema,
        tuningConfig,
        new ShuffleDataSegmentPusher(supervisorTaskId, getId(), toolbox.getIntermediaryDataManager()),
        buildSegmentsMeters,
        parseExceptionHandler,
        useMaxMemoryEstimates
    );
    boolean exceptionOccurred = false;
    try (final BatchAppenderatorDriver driver = BatchAppenderators.newDriver(appenderator, toolbox, segmentAllocator)) {
        driver.startJob();
        final SegmentsAndCommitMetadata pushed = InputSourceProcessor.process(
            dataSchema,
            driver,
            partitionsSpec,
            inputSource,
            inputSource.needsFormat() ? ParallelIndexSupervisorTask.getInputFormat(ingestionSchema) : null,
            tmpDir,
            sequenceNameFunction,
            inputRowIteratorBuilder,
            buildSegmentsMeters,
            parseExceptionHandler,
            pushTimeout
        );
        return pushed.getSegments();
    } catch (Exception e) {
        exceptionOccurred = true;
        throw e;
    } finally {
        if (exceptionOccurred) {
            appenderator.closeNow();
        } else {
            appenderator.close();
        }
    }
}
Also used : RealtimeIOConfig(org.apache.druid.segment.indexing.RealtimeIOConfig) ShuffleDataSegmentPusher(org.apache.druid.indexing.worker.shuffle.ShuffleDataSegmentPusher) SegmentsAndCommitMetadata(org.apache.druid.segment.realtime.appenderator.SegmentsAndCommitMetadata) BatchAppenderatorDriver(org.apache.druid.segment.realtime.appenderator.BatchAppenderatorDriver) TimeoutException(java.util.concurrent.TimeoutException) IOException(java.io.IOException) ExecutionException(java.util.concurrent.ExecutionException) DataSchema(org.apache.druid.segment.indexing.DataSchema) FireDepartment(org.apache.druid.segment.realtime.FireDepartment) FireDepartmentMetrics(org.apache.druid.segment.realtime.FireDepartmentMetrics) SegmentAllocatorForBatch(org.apache.druid.indexing.common.task.SegmentAllocatorForBatch) Appenderator(org.apache.druid.segment.realtime.appenderator.Appenderator) PartitionsSpec(org.apache.druid.indexer.partitions.PartitionsSpec) ParseExceptionHandler(org.apache.druid.segment.incremental.ParseExceptionHandler) RealtimeMetricsMonitor(org.apache.druid.segment.realtime.RealtimeMetricsMonitor) SequenceNameFunction(org.apache.druid.indexing.common.task.SequenceNameFunction) RowIngestionMeters(org.apache.druid.segment.incremental.RowIngestionMeters)
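
The exceptionOccurred flag in the finally block chooses between the appenderator's two shutdown paths: close() finishes and persists in-flight work, while closeNow() abandons it. Plain try-with-resources cannot express this, since it always invokes the same close(). A minimal, self-contained sketch of the control flow, with a hypothetical two-phase resource standing in for the Appenderator:

// Hypothetical stand-in for a resource that, like Appenderator, distinguishes
// a graceful close (finish outstanding work) from an abrupt one (discard it).
interface TwoPhaseCloseable {
    void close();     // graceful: used on the success path
    void closeNow();  // abrupt: used when an exception occurred
}

final class CleanupPatternSketch {
    static void runAndClose(TwoPhaseCloseable resource, Runnable work) {
        boolean exceptionOccurred = false;
        try {
            work.run();
        } catch (Exception e) {
            // Remember the failure so the finally block takes the abrupt path,
            // then rethrow; the pattern never swallows the original exception.
            exceptionOccurred = true;
            throw e;
        } finally {
            if (exceptionOccurred) {
                resource.closeNow();
            } else {
                resource.close();
            }
        }
    }
}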

Example 3 with ParseExceptionHandler

Use of org.apache.druid.segment.incremental.ParseExceptionHandler in project druid by druid-io.

From the class IndexTask, method runTask:

@Override
public TaskStatus runTask(final TaskToolbox toolbox) {
    try {
        log.debug("Found chat handler of class[%s]", toolbox.getChatHandlerProvider().getClass().getName());
        if (toolbox.getChatHandlerProvider().get(getId()).isPresent()) {
            // This is a workaround for ParallelIndexSupervisorTask to avoid double registering when it runs in the
            // sequential mode. See ParallelIndexSupervisorTask.runSequential().
            // Note that the HTTP endpoints are not available in this case. This works only for
            // ParallelIndexSupervisorTask because it doesn't support APIs for live ingestion reports.
            log.warn("Chat handler is already registered. Skipping chat handler registration.");
        } else {
            toolbox.getChatHandlerProvider().register(getId(), this, false);
        }
        this.authorizerMapper = toolbox.getAuthorizerMapper();
        this.determinePartitionsMeters = toolbox.getRowIngestionMetersFactory().createRowIngestionMeters();
        this.buildSegmentsMeters = toolbox.getRowIngestionMetersFactory().createRowIngestionMeters();
        this.determinePartitionsParseExceptionHandler = new ParseExceptionHandler(
            determinePartitionsMeters,
            ingestionSchema.getTuningConfig().isLogParseExceptions(),
            ingestionSchema.getTuningConfig().getMaxParseExceptions(),
            ingestionSchema.getTuningConfig().getMaxSavedParseExceptions()
        );
        this.buildSegmentsParseExceptionHandler = new ParseExceptionHandler(
            buildSegmentsMeters,
            ingestionSchema.getTuningConfig().isLogParseExceptions(),
            ingestionSchema.getTuningConfig().getMaxParseExceptions(),
            ingestionSchema.getTuningConfig().getMaxSavedParseExceptions()
        );
        final boolean determineIntervals = ingestionSchema.getDataSchema().getGranularitySpec().inputIntervals().isEmpty();
        final InputSource inputSource = ingestionSchema.getIOConfig().getNonNullInputSource(ingestionSchema.getDataSchema().getParser());
        final File tmpDir = toolbox.getIndexingTmpDir();
        ingestionState = IngestionState.DETERMINE_PARTITIONS;
        // Initialize maxRowsPerSegment and maxTotalRows lazily
        final IndexTuningConfig tuningConfig = ingestionSchema.tuningConfig;
        final PartitionsSpec partitionsSpec = tuningConfig.getGivenOrDefaultPartitionsSpec();
        final PartitionAnalysis partitionAnalysis = determineShardSpecs(toolbox, inputSource, tmpDir, partitionsSpec);
        final List<Interval> allocateIntervals = new ArrayList<>(partitionAnalysis.getAllIntervalsToIndex());
        final DataSchema dataSchema;
        if (determineIntervals) {
            final boolean gotLocks = determineLockGranularityAndTryLock(toolbox.getTaskActionClient(), allocateIntervals, ingestionSchema.getIOConfig());
            if (!gotLocks) {
                throw new ISE("Failed to get locks for intervals[%s]", allocateIntervals);
            }
            dataSchema = ingestionSchema.getDataSchema().withGranularitySpec(ingestionSchema.getDataSchema().getGranularitySpec().withIntervals(JodaUtils.condenseIntervals(allocateIntervals)));
        } else {
            dataSchema = ingestionSchema.getDataSchema();
        }
        ingestionState = IngestionState.BUILD_SEGMENTS;
        return generateAndPublishSegments(toolbox, dataSchema, inputSource, tmpDir, partitionAnalysis);
    } catch (Exception e) {
        log.error(e, "Encountered exception in %s.", ingestionState);
        errorMsg = Throwables.getStackTraceAsString(e);
        toolbox.getTaskReportFileWriter().write(getId(), getTaskCompletionReports());
        return TaskStatus.failure(getId(), errorMsg);
    } finally {
        toolbox.getChatHandlerProvider().unregister(getId());
    }
}
Also used : InputSource(org.apache.druid.data.input.InputSource) CompletePartitionAnalysis(org.apache.druid.indexing.common.task.batch.partition.CompletePartitionAnalysis) LinearPartitionAnalysis(org.apache.druid.indexing.common.task.batch.partition.LinearPartitionAnalysis) HashPartitionAnalysis(org.apache.druid.indexing.common.task.batch.partition.HashPartitionAnalysis) PartitionAnalysis(org.apache.druid.indexing.common.task.batch.partition.PartitionAnalysis) ArrayList(java.util.ArrayList) IOException(java.io.IOException) ExecutionException(java.util.concurrent.ExecutionException) TimeoutException(java.util.concurrent.TimeoutException) DataSchema(org.apache.druid.segment.indexing.DataSchema) PartitionsSpec(org.apache.druid.indexer.partitions.PartitionsSpec) DynamicPartitionsSpec(org.apache.druid.indexer.partitions.DynamicPartitionsSpec) HashedPartitionsSpec(org.apache.druid.indexer.partitions.HashedPartitionsSpec) ParseExceptionHandler(org.apache.druid.segment.incremental.ParseExceptionHandler) ISE(org.apache.druid.java.util.common.ISE) File(java.io.File) Interval(org.joda.time.Interval)
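
IndexTask builds two independent handlers from the same tuning config, one per ingestion phase, each wrapping its own RowIngestionMeters so that determine-partitions and build-segments statistics never mix. A small sketch of that wiring as a helper; the ParseExceptionPolicy holder is hypothetical, and in IndexTask the three values come straight from ingestionSchema.getTuningConfig():

import org.apache.druid.segment.incremental.ParseExceptionHandler;
import org.apache.druid.segment.incremental.RowIngestionMeters;

final class HandlerWiringSketch {
    // Hypothetical holder for the three knobs the handler needs.
    static final class ParseExceptionPolicy {
        final boolean logParseExceptions;
        final int maxParseExceptions;
        final int maxSavedParseExceptions;

        ParseExceptionPolicy(boolean logParseExceptions, int maxParseExceptions, int maxSavedParseExceptions) {
            this.logParseExceptions = logParseExceptions;
            this.maxParseExceptions = maxParseExceptions;
            this.maxSavedParseExceptions = maxSavedParseExceptions;
        }
    }

    // One handler per phase: pass a fresh RowIngestionMeters for each so the
    // per-phase counters stay separate in the completion report.
    static ParseExceptionHandler forPhase(RowIngestionMeters phaseMeters, ParseExceptionPolicy policy) {
        return new ParseExceptionHandler(
            phaseMeters,
            policy.logParseExceptions,
            policy.maxParseExceptions,
            policy.maxSavedParseExceptions
        );
    }
}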

Example 4 with ParseExceptionHandler

Use of org.apache.druid.segment.incremental.ParseExceptionHandler in project druid by druid-io.

From the class PartialDimensionCardinalityTask, method runTask:

@Override
public TaskStatus runTask(TaskToolbox toolbox) throws Exception {
    DataSchema dataSchema = ingestionSchema.getDataSchema();
    GranularitySpec granularitySpec = dataSchema.getGranularitySpec();
    ParallelIndexTuningConfig tuningConfig = ingestionSchema.getTuningConfig();
    HashedPartitionsSpec partitionsSpec = (HashedPartitionsSpec) tuningConfig.getPartitionsSpec();
    Preconditions.checkNotNull(partitionsSpec, "partitionsSpec required in tuningConfig");
    InputSource inputSource = ingestionSchema.getIOConfig().getNonNullInputSource(ingestionSchema.getDataSchema().getParser());
    InputFormat inputFormat = inputSource.needsFormat() ? ParallelIndexSupervisorTask.getInputFormat(ingestionSchema) : null;
    final RowIngestionMeters buildSegmentsMeters = toolbox.getRowIngestionMetersFactory().createRowIngestionMeters();
    final ParseExceptionHandler parseExceptionHandler = new ParseExceptionHandler(
        buildSegmentsMeters,
        tuningConfig.isLogParseExceptions(),
        tuningConfig.getMaxParseExceptions(),
        tuningConfig.getMaxSavedParseExceptions()
    );
    final boolean determineIntervals = granularitySpec.inputIntervals().isEmpty();
    try (final CloseableIterator<InputRow> inputRowIterator = AbstractBatchIndexTask.inputSourceReader(
        toolbox.getIndexingTmpDir(),
        dataSchema,
        inputSource,
        inputFormat,
        determineIntervals ? Objects::nonNull : AbstractBatchIndexTask.defaultRowFilter(granularitySpec),
        buildSegmentsMeters,
        parseExceptionHandler
    )) {
        Map<Interval, byte[]> cardinalities = determineCardinalities(inputRowIterator, granularitySpec);
        sendReport(toolbox, new DimensionCardinalityReport(getId(), cardinalities));
    }
    return TaskStatus.success(getId());
}
Also used : HashedPartitionsSpec(org.apache.druid.indexer.partitions.HashedPartitionsSpec) InputSource(org.apache.druid.data.input.InputSource) DataSchema(org.apache.druid.segment.indexing.DataSchema) GranularitySpec(org.apache.druid.segment.indexing.granularity.GranularitySpec) InputFormat(org.apache.druid.data.input.InputFormat) ParseExceptionHandler(org.apache.druid.segment.incremental.ParseExceptionHandler) InputRow(org.apache.druid.data.input.InputRow) RowIngestionMeters(org.apache.druid.segment.incremental.RowIngestionMeters) Interval(org.joda.time.Interval)
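
The try-with-resources block is doing real work here: the iterator returned by AbstractBatchIndexTask.inputSourceReader(...) holds the underlying input open, and unparseable rows never reach the loop body because the reader routes them through the ParseExceptionHandler as it iterates. A minimal consumption sketch, with a simple row count standing in for determineCardinalities(...):

import java.io.IOException;

import org.apache.druid.data.input.InputRow;
import org.apache.druid.java.util.common.parsers.CloseableIterator;

final class IteratorConsumptionSketch {
    // The caller hands over ownership; close() on the iterator releases the
    // underlying input source.
    static long countRows(CloseableIterator<InputRow> rows) throws IOException {
        long count = 0;
        try (CloseableIterator<InputRow> it = rows) {
            while (it.hasNext()) {
                InputRow row = it.next();
                if (row != null) { // defensive; the reader's row filter normally screens these out
                    count++;
                }
            }
        }
        return count;
    }
}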

Example 5 with ParseExceptionHandler

Use of org.apache.druid.segment.incremental.ParseExceptionHandler in project druid by druid-io.

From the class PartialDimensionDistributionTask, method runTask:

@Override
public TaskStatus runTask(TaskToolbox toolbox) throws Exception {
    DataSchema dataSchema = ingestionSchema.getDataSchema();
    GranularitySpec granularitySpec = dataSchema.getGranularitySpec();
    ParallelIndexTuningConfig tuningConfig = ingestionSchema.getTuningConfig();
    DimensionRangePartitionsSpec partitionsSpec = (DimensionRangePartitionsSpec) tuningConfig.getPartitionsSpec();
    Preconditions.checkNotNull(partitionsSpec, "partitionsSpec required in tuningConfig");
    final List<String> partitionDimensions = partitionsSpec.getPartitionDimensions();
    Preconditions.checkArgument(partitionDimensions != null && !partitionDimensions.isEmpty(), "partitionDimension required in partitionsSpec");
    boolean isAssumeGrouped = partitionsSpec.isAssumeGrouped();
    InputSource inputSource = ingestionSchema.getIOConfig().getNonNullInputSource(ingestionSchema.getDataSchema().getParser());
    InputFormat inputFormat = inputSource.needsFormat() ? ParallelIndexSupervisorTask.getInputFormat(ingestionSchema) : null;
    final RowIngestionMeters buildSegmentsMeters = toolbox.getRowIngestionMetersFactory().createRowIngestionMeters();
    final ParseExceptionHandler parseExceptionHandler = new ParseExceptionHandler(
        buildSegmentsMeters,
        tuningConfig.isLogParseExceptions(),
        tuningConfig.getMaxParseExceptions(),
        tuningConfig.getMaxSavedParseExceptions()
    );
    final boolean determineIntervals = granularitySpec.inputIntervals().isEmpty();
    try (final CloseableIterator<InputRow> inputRowIterator = AbstractBatchIndexTask.inputSourceReader(
            toolbox.getIndexingTmpDir(),
            dataSchema,
            inputSource,
            inputFormat,
            determineIntervals ? Objects::nonNull : AbstractBatchIndexTask.defaultRowFilter(granularitySpec),
            buildSegmentsMeters,
            parseExceptionHandler
        );
        HandlingInputRowIterator iterator = new RangePartitionIndexTaskInputRowIteratorBuilder(partitionDimensions, SKIP_NULL)
            .delegate(inputRowIterator)
            .granularitySpec(granularitySpec)
            .build()) {
        Map<Interval, StringDistribution> distribution = determineDistribution(iterator, granularitySpec, partitionDimensions, isAssumeGrouped);
        sendReport(toolbox, new DimensionDistributionReport(getId(), distribution));
    }
    return TaskStatus.success(getId());
}
Also used : InputSource(org.apache.druid.data.input.InputSource) StringDistribution(org.apache.druid.indexing.common.task.batch.parallel.distribution.StringDistribution) DimensionRangePartitionsSpec(org.apache.druid.indexer.partitions.DimensionRangePartitionsSpec) HandlingInputRowIterator(org.apache.druid.data.input.HandlingInputRowIterator) DataSchema(org.apache.druid.segment.indexing.DataSchema) GranularitySpec(org.apache.druid.segment.indexing.granularity.GranularitySpec) InputFormat(org.apache.druid.data.input.InputFormat) ParseExceptionHandler(org.apache.druid.segment.incremental.ParseExceptionHandler) InputRow(org.apache.druid.data.input.InputRow) RangePartitionIndexTaskInputRowIteratorBuilder(org.apache.druid.indexing.common.task.batch.parallel.iterator.RangePartitionIndexTaskInputRowIteratorBuilder) RowIngestionMeters(org.apache.druid.segment.incremental.RowIngestionMeters) Interval(org.joda.time.Interval)
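
Example 5 adds one step to the previous pattern: before consumption, the raw row iterator is wrapped via RangePartitionIndexTaskInputRowIteratorBuilder, which with the SKIP_NULL strategy drops rows it cannot partition instead of surfacing them. A simplified stand-in for that delegate-and-wrap shape, using only java.util types rather than the Druid builder:

import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Simplified analogue of the delegate(...).build() wrapping above: the source
// iterator is decorated so that elements failing a predicate are silently
// skipped (akin to SKIP_NULL for rows whose partition dimension is missing).
// Assumes the source never yields null elements.
final class FilteringIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Predicate<T> keep;
    private T next; // look-ahead slot; null means exhausted

    FilteringIterator(Iterator<T> delegate, Predicate<T> keep) {
        this.delegate = delegate;
        this.keep = keep;
        advance();
    }

    private void advance() {
        next = null;
        while (delegate.hasNext()) {
            T candidate = delegate.next();
            if (keep.test(candidate)) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public T next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        T result = next;
        advance();
        return result;
    }
}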

Aggregations

ParseExceptionHandler (org.apache.druid.segment.incremental.ParseExceptionHandler): 9 usages
DataSchema (org.apache.druid.segment.indexing.DataSchema): 6 usages
IOException (java.io.IOException): 5 usages
ExecutionException (java.util.concurrent.ExecutionException): 5 usages
TimeoutException (java.util.concurrent.TimeoutException): 5 usages
RowIngestionMeters (org.apache.druid.segment.incremental.RowIngestionMeters): 5 usages
InputRow (org.apache.druid.data.input.InputRow): 4 usages
InputSource (org.apache.druid.data.input.InputSource): 4 usages
InputFormat (org.apache.druid.data.input.InputFormat): 3 usages
DynamicPartitionsSpec (org.apache.druid.indexer.partitions.DynamicPartitionsSpec): 3 usages
IngestionStatsAndErrorsTaskReport (org.apache.druid.indexing.common.IngestionStatsAndErrorsTaskReport): 3 usages
TaskReport (org.apache.druid.indexing.common.TaskReport): 3 usages
ISE (org.apache.druid.java.util.common.ISE): 3 usages
Interval (org.joda.time.Interval): 3 usages
VisibleForTesting (com.google.common.annotations.VisibleForTesting): 2 usages
Preconditions (com.google.common.base.Preconditions): 2 usages
Supplier (com.google.common.base.Supplier): 2 usages
Throwables (com.google.common.base.Throwables): 2 usages
ImmutableMap (com.google.common.collect.ImmutableMap): 2 usages
Futures (com.google.common.util.concurrent.Futures): 2 usages