
Example 11 with DistributedLock

Use of org.talend.dataprep.lock.DistributedLock in project data-prep by Talend.

The class FormatAnalysis, method analyze:

/**
 * @see SynchronousDataSetAnalyzer#analyze(String)
 */
@Override
public void analyze(String dataSetId) {
    if (StringUtils.isEmpty(dataSetId)) {
        throw new IllegalArgumentException("Data set id cannot be null or empty.");
    }
    final Marker marker = Markers.dataset(dataSetId);
    DistributedLock datasetLock = repository.createDatasetMetadataLock(dataSetId);
    datasetLock.lock();
    try {
        DataSetMetadata metadata = repository.get(dataSetId);
        if (metadata != null) {
            Format detectedFormat = null;
            for (byte[] bom : BOMS) {
                try (InputStream content = store.getAsRaw(metadata, 10)) {
                    // 10 lines should be enough to detect the format
                    detectedFormat = detector.detect(addBOM(content, bom));
                } catch (IOException e) {
                    throw new TDPException(DataSetErrorCodes.UNABLE_TO_READ_DATASET_CONTENT, e);
                }
                if (detectedFormat != null && !(detectedFormat.getFormatFamily() instanceof UnsupportedFormatFamily)) {
                    break;
                }
            }
            LOG.debug(marker, "using {} to parse the dataset", detectedFormat);
            verifyFormat(detectedFormat);
            internalUpdateMetadata(metadata, detectedFormat);
            LOG.debug(marker, "format analysed for dataset");
        } else {
            LOG.info(marker, "Data set no longer exists.");
        }
    } finally {
        datasetLock.unlock();
    }
}
Also used : TDPException(org.talend.dataprep.exception.TDPException) DistributedLock(org.talend.dataprep.lock.DistributedLock) SequenceInputStream(java.io.SequenceInputStream) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) Marker(org.slf4j.Marker) IOException(java.io.IOException) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata)
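The snippet above calls an `addBOM(content, bom)` helper that is not shown. Judging from the imports (SequenceInputStream, ByteArrayInputStream), prepending a byte-order mark to a content stream can be sketched as follows; this is a plausible reconstruction, not the verified Talend implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;

public class BomPrepend {

    // Prepend the given BOM bytes to a content stream so the format
    // detector sees the content as if it started with that BOM.
    static InputStream addBOM(InputStream content, byte[] bom) {
        return new SequenceInputStream(new ByteArrayInputStream(bom), content);
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8Bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
        InputStream in = addBOM(
                new ByteArrayInputStream("a;b".getBytes(StandardCharsets.UTF_8)), utf8Bom);
        // The first byte read is now the first BOM byte (0xEF).
        System.out.println(Integer.toHexString(in.read()));
    }
}
```

SequenceInputStream simply exhausts the first stream before switching to the second, so no copying of the dataset content is needed.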

Example 12 with DistributedLock

Use of org.talend.dataprep.lock.DistributedLock in project data-prep by Talend.

The class ObjectDataSetMetadataRepository, method clear:

@Override
public void clear() {
    // Remove all data sets (but use the lock to protect remaining asynchronous processes).
    list().forEach(m -> {
        if (m != null) {
            final DistributedLock lock = createDatasetMetadataLock(m.getId());
            try {
                lock.lock();
                remove(m.getId());
            } finally {
                lock.unlock();
            }
        }
    });
    LOGGER.debug("dataset metadata repository cleared.");
}
Also used : DistributedLock(org.talend.dataprep.lock.DistributedLock)
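All of these examples follow the same pattern: fetch a per-dataset lock, acquire it, do the work inside try, release in finally. Note that Example 12 calls lock() inside the try block; if lock() ever threw, the finally would call unlock() on a lock that is not held. The conventional ordering acquires the lock before entering try, as Examples 11, 13, and 14 do. A minimal sketch of a per-key lock registry illustrating the pattern (a hypothetical in-JVM stand-in, not Talend's DistributedLock implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class DatasetLocks {

    // One lock per dataset id; computeIfAbsent makes lock creation race-free.
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    ReentrantLock createDatasetMetadataLock(String dataSetId) {
        return locks.computeIfAbsent(dataSetId, id -> new ReentrantLock());
    }

    public static void main(String[] args) {
        DatasetLocks registry = new DatasetLocks();
        ReentrantLock lock = registry.createDatasetMetadataLock("ds-1");
        lock.lock(); // acquire BEFORE try, so finally never unlocks a lock we don't hold
        try {
            System.out.println("holding=" + lock.isHeldByCurrentThread());
        } finally {
            lock.unlock();
        }
        // The registry hands back the same lock instance for the same id.
        System.out.println("same=" + (lock == registry.createDatasetMetadataLock("ds-1")));
    }
}
```

Returning the same lock instance per id is what makes the mutual exclusion effective: two callers working on the same dataset contend on one lock object.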

Example 13 with DistributedLock

Use of org.talend.dataprep.lock.DistributedLock in project data-prep by Talend.

The class QualityAnalysis, method analyze:

/**
 * Analyse the dataset metadata quality.
 *
 * @param dataSetId the dataset id.
 */
@Override
public void analyze(String dataSetId) {
    if (StringUtils.isEmpty(dataSetId)) {
        throw new IllegalArgumentException("Data set id cannot be null or empty.");
    }
    DistributedLock datasetLock = repository.createDatasetMetadataLock(dataSetId);
    datasetLock.lock();
    try {
        DataSetMetadata metadata = repository.get(dataSetId);
        if (metadata == null) {
            LOGGER.info("Unable to analyze quality of data set #{}: seems to be removed.", dataSetId);
            return;
        }
        // e.g. an Excel multi-sheet dataset when the user has not chosen the sheet yet
        if (!metadata.getLifecycle().isInProgress()) {
            LOGGER.debug("No need to recompute quality of data set #{} (statistics are completed).", dataSetId);
            return;
        }
        try (Stream<DataSetRow> stream = store.stream(metadata)) {
            if (!metadata.getLifecycle().schemaAnalyzed()) {
                LOGGER.debug("Schema information must be computed before quality analysis can be performed, ignoring message");
                // no acknowledge to allow re-poll.
                return;
            }
            LOGGER.debug("Analyzing quality of dataset #{}...", metadata.getId());
            // New data set, or reached the max limit of records for synchronous analysis, trigger a full scan (but
            // async).
            final long dataSetSize = metadata.getContent().getNbRecords();
            final boolean isNewDataSet = dataSetSize == 0;
            if (isNewDataSet || dataSetSize == maxRecord) {
                // If data set size is maxRecord, performs a full scan, otherwise only take first maxRecord
                // records.
                computeQuality(metadata, stream, dataSetSize == maxRecord ? -1 : maxRecord);
            }
            // Turn on / off "in progress" flag
            if (isNewDataSet && metadata.getContent().getNbRecords() >= maxRecord) {
                metadata.getLifecycle().setInProgress(true);
            } else {
                metadata.getLifecycle().setInProgress(false);
            }
            // ... all quality is now analyzed, mark it so.
            metadata.getLifecycle().qualityAnalyzed(true);
            repository.save(metadata);
            LOGGER.debug("Analyzed quality of dataset #{}.", dataSetId);
        } catch (Exception e) {
            LOGGER.warn("Dataset '{}' generated an error, message: {}", dataSetId, e.getMessage());
            throw new TDPException(DataSetErrorCodes.UNABLE_TO_ANALYZE_DATASET_QUALITY, e);
        }
    } finally {
        datasetLock.unlock();
    }
}
Also used : TDPException(org.talend.dataprep.exception.TDPException) DistributedLock(org.talend.dataprep.lock.DistributedLock) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow)
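The sampling decision in Example 13 can be isolated: a size of 0 means a new dataset, and a size exactly equal to maxRecord means the synchronous analysis cap was hit, so a full scan (limit -1) is requested; otherwise only the first maxRecord rows would be analyzed, and datasets outside both cases skip recomputation entirely. A hedged sketch of just that decision (helper name and sentinel are hypothetical, not Talend's API):

```java
public class QualitySampling {

    static final long SKIP = Long.MIN_VALUE; // sentinel: no recompute needed

    // Mirrors the decision in QualityAnalysis#analyze: a new dataset gets a
    // scan capped at maxRecord; a dataset that hit the synchronous cap gets
    // a full scan (-1); anything else is already analyzed and is skipped.
    static long recordLimit(long dataSetSize, long maxRecord) {
        boolean isNewDataSet = dataSetSize == 0;
        if (isNewDataSet || dataSetSize == maxRecord) {
            return dataSetSize == maxRecord ? -1 : maxRecord;
        }
        return SKIP;
    }

    public static void main(String[] args) {
        System.out.println(recordLimit(0, 10_000));      // new dataset: cap at maxRecord
        System.out.println(recordLimit(10_000, 10_000)); // hit the cap: full scan
        long l = recordLimit(42, 10_000);                // already analyzed
        System.out.println(l == SKIP ? "skip" : Long.toString(l));
    }
}
```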

Example 14 with DistributedLock

Use of org.talend.dataprep.lock.DistributedLock in project data-prep by Talend.

The class SchemaAnalysis, method analyze:

@Override
public void analyze(String dataSetId) {
    if (StringUtils.isEmpty(dataSetId)) {
        throw new IllegalArgumentException("Data set id cannot be null or empty.");
    }
    DistributedLock datasetLock = repository.createDatasetMetadataLock(dataSetId);
    datasetLock.lock();
    try {
        DataSetMetadata metadata = repository.get(dataSetId);
        if (metadata == null) {
            LOGGER.info("Unable to analyze schema of data set #{}: seems to be removed.", dataSetId);
            return;
        }
        // Schema analysis
        try (Stream<DataSetRow> stream = store.stream(metadata, 100)) {
            LOGGER.info("Analyzing schema in dataset #{}...", dataSetId);
            // Configure analyzers
            final List<ColumnMetadata> columns = metadata.getRowMetadata().getColumns();
            try (Analyzer<Analyzers.Result> analyzer = analyzerService.schemaAnalysis(columns)) {
                // Determine schema for the content.
                stream.limit(100).map(row -> row.toArray(DataSetRow.SKIP_TDP_ID)).forEach(analyzer::analyze);
                // Find the best suitable type
                adapter.adapt(columns, analyzer.getResult());
                LOGGER.info("Analyzed schema in dataset #{}.", dataSetId);
                metadata.getLifecycle().schemaAnalyzed(true);
                repository.save(metadata);
            }
        } catch (Exception e) {
            LOGGER.error("Unable to analyse schema for dataset " + dataSetId + ".", e);
            TDPException.rethrowOrWrap(e, UNABLE_TO_ANALYZE_COLUMN_TYPES);
        }
    } finally {
        datasetLock.unlock();
    }
}
Also used : Analyzers(org.talend.dataquality.common.inference.Analyzers) StringUtils(org.apache.commons.lang.StringUtils) TDPException(org.talend.dataprep.exception.TDPException) Logger(org.slf4j.Logger) DataSetMetadataRepository(org.talend.dataprep.dataset.store.metadata.DataSetMetadataRepository) LoggerFactory(org.slf4j.LoggerFactory) Autowired(org.springframework.beans.factory.annotation.Autowired) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) List(java.util.List) Component(org.springframework.stereotype.Component) Stream(java.util.stream.Stream) UNABLE_TO_ANALYZE_COLUMN_TYPES(org.talend.dataprep.exception.error.DataSetErrorCodes.UNABLE_TO_ANALYZE_COLUMN_TYPES) DistributedLock(org.talend.dataprep.lock.DistributedLock) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) Analyzer(org.talend.dataquality.common.inference.Analyzer) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) ContentStoreRouter(org.talend.dataprep.dataset.store.content.ContentStoreRouter) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata)
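Example 14 limits schema analysis to the first 100 rows and then asks the adapter to pick the best-suited type per column. The internals of AnalyzerService and StatisticsAdapter are not shown; a generic majority-vote type inference over a bounded sample conveys the idea (this is an illustrative stand-in, not Talend's algorithm, and all names here are hypothetical):

```java
import java.util.List;

public class TypeInference {

    enum ColType { INTEGER, DOUBLE, STRING }

    // Classify a single cell value by the most specific type it parses as.
    static ColType typeOf(String value) {
        try { Long.parseLong(value); return ColType.INTEGER; } catch (NumberFormatException ignored) { }
        try { Double.parseDouble(value); return ColType.DOUBLE; } catch (NumberFormatException ignored) { }
        return ColType.STRING;
    }

    // Majority vote over a sample of values, mirroring the idea of finding
    // the "best suitable type" from a bounded row sample.
    static ColType inferType(List<String> sample) {
        int[] counts = new int[ColType.values().length];
        for (String v : sample) {
            counts[typeOf(v).ordinal()]++;
        }
        ColType best = ColType.STRING;
        int bestCount = -1;
        for (ColType t : ColType.values()) {
            if (counts[t.ordinal()] > bestCount) {
                bestCount = counts[t.ordinal()];
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "2", "3", "oops")));
        System.out.println(inferType(List.of("1.5", "2.25", "7")));
    }
}
```

Sampling keeps the analysis cheap; the trade-off is that a rare value beyond the first 100 rows (e.g. one non-numeric cell) will not influence the inferred type.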

Aggregations

DistributedLock (org.talend.dataprep.lock.DistributedLock)14 DataSetMetadata (org.talend.dataprep.api.dataset.DataSetMetadata)13 TDPException (org.talend.dataprep.exception.TDPException)8 ApiOperation (io.swagger.annotations.ApiOperation)5 RequestMapping (org.springframework.web.bind.annotation.RequestMapping)5 ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata)5 Timed (org.talend.dataprep.metrics.Timed)5 InputStream (java.io.InputStream)4 DataSetRow (org.talend.dataprep.api.dataset.row.DataSetRow)4 IOException (java.io.IOException)3 PipedInputStream (java.io.PipedInputStream)3 Marker (org.slf4j.Marker)3 SemanticDomain (org.talend.dataprep.api.dataset.statistics.SemanticDomain)3 OutputStream (java.io.OutputStream)2 PipedOutputStream (java.io.PipedOutputStream)2 List (java.util.List)2 Stream (java.util.stream.Stream)2 Logger (org.slf4j.Logger)2 LoggerFactory (org.slf4j.LoggerFactory)2 Autowired (org.springframework.beans.factory.annotation.Autowired)2