Example 6 with Analyzer

Use of org.talend.dataquality.common.inference.Analyzer in project data-prep by Talend.

From the class SchemaAnalysis, method analyze.

@Override
public void analyze(String dataSetId) {
    if (StringUtils.isEmpty(dataSetId)) {
        throw new IllegalArgumentException("Data set id cannot be null or empty.");
    }
    DistributedLock datasetLock = repository.createDatasetMetadataLock(dataSetId);
    datasetLock.lock();
    try {
        DataSetMetadata metadata = repository.get(dataSetId);
        if (metadata == null) {
            LOGGER.info("Unable to analyze schema of data set #{}: seems to be removed.", dataSetId);
            return;
        }
        // Schema analysis
        try (Stream<DataSetRow> stream = store.stream(metadata, 100)) {
            LOGGER.info("Analyzing schema in dataset #{}...", dataSetId);
            // Configure analyzers
            final List<ColumnMetadata> columns = metadata.getRowMetadata().getColumns();
            try (Analyzer<Analyzers.Result> analyzer = analyzerService.schemaAnalysis(columns)) {
                // Determine schema for the content.
                stream.limit(100).map(row -> row.toArray(DataSetRow.SKIP_TDP_ID)).forEach(analyzer::analyze);
                // Find the best suitable type
                adapter.adapt(columns, analyzer.getResult());
                LOGGER.info("Analyzed schema in dataset #{}.", dataSetId);
                metadata.getLifecycle().schemaAnalyzed(true);
                repository.save(metadata);
            }
        } catch (Exception e) {
            // Parameterized message; SLF4J treats the trailing exception as the throwable to log.
            LOGGER.error("Unable to analyze schema for dataset {}.", dataSetId, e);
            TDPException.rethrowOrWrap(e, UNABLE_TO_ANALYZE_COLUMN_TYPES);
        }
    } finally {
        datasetLock.unlock();
    }
}
Also used: Analyzers(org.talend.dataquality.common.inference.Analyzers) StringUtils(org.apache.commons.lang.StringUtils) TDPException(org.talend.dataprep.exception.TDPException) Logger(org.slf4j.Logger) DataSetMetadataRepository(org.talend.dataprep.dataset.store.metadata.DataSetMetadataRepository) LoggerFactory(org.slf4j.LoggerFactory) Autowired(org.springframework.beans.factory.annotation.Autowired) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) List(java.util.List) Component(org.springframework.stereotype.Component) Stream(java.util.stream.Stream) UNABLE_TO_ANALYZE_COLUMN_TYPES(org.talend.dataprep.exception.error.DataSetErrorCodes.UNABLE_TO_ANALYZE_COLUMN_TYPES) DistributedLock(org.talend.dataprep.lock.DistributedLock) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) Analyzer(org.talend.dataquality.common.inference.Analyzer) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) ContentStoreRouter(org.talend.dataprep.dataset.store.content.ContentStoreRouter) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata)
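
The method above follows a common shape: take a lock, open the analyzer in a try-with-resources block, feed it a bounded sample of rows, read the result, and always release the lock. The sketch below reproduces that shape with standard-library types only. `RowAnalyzer`, `TypeGuessAnalyzer` and `analyzeSample` are hypothetical names invented for illustration; they are not the Talend `Analyzer` API, and the integer-or-string guess is a stand-in for the real type inference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.stream.Stream;

final class SchemaSampleSketch {

    /** Hypothetical stand-in for an analyzer; NOT the Talend interface. */
    interface RowAnalyzer extends AutoCloseable {
        void analyze(String... row);

        List<String> getResult();

        @Override
        void close(); // narrowed so callers need no checked-exception handling
    }

    /** Reports "integer" for a column until a non-integer value is seen, then "string". */
    static final class TypeGuessAnalyzer implements RowAnalyzer {
        private final boolean[] allIntegers;

        TypeGuessAnalyzer(int columnCount) {
            allIntegers = new boolean[columnCount];
            Arrays.fill(allIntegers, true);
        }

        @Override
        public void analyze(String... row) {
            for (int i = 0; i < allIntegers.length && i < row.length; i++) {
                if (!row[i].matches("-?\\d+")) {
                    allIntegers[i] = false;
                }
            }
        }

        @Override
        public List<String> getResult() {
            List<String> types = new ArrayList<>();
            for (boolean isInteger : allIntegers) {
                types.add(isInteger ? "integer" : "string");
            }
            return types;
        }

        @Override
        public void close() {
            // Nothing to release in this sketch; a real analyzer may hold resources.
        }
    }

    /** Analyzes at most sampleSize rows while holding the given lock. */
    static List<String> analyzeSample(Stream<String[]> rows, int columnCount, int sampleSize, Lock lock) {
        lock.lock();
        try (TypeGuessAnalyzer analyzer = new TypeGuessAnalyzer(columnCount)) {
            // The varargs method reference accepts each String[] row directly.
            rows.limit(sampleSize).forEach(analyzer::analyze);
            return analyzer.getResult();
        } finally {
            lock.unlock(); // released even if analysis throws
        }
    }

    public static void main(String[] args) {
        Stream<String[]> rows = Stream.of(
                new String[] {"1", "alice"},
                new String[] {"2", "bob"},
                new String[] {"x", "carol"}); // falls outside the 2-row sample
        System.out.println(analyzeSample(rows, 2, 2, new ReentrantLock())); // prints [integer, string]
    }
}
```

Because only the first `sampleSize` rows are inspected, a value that appears later in the data (the `"x"` in the third row here) does not influence the guessed type. The same trade-off applies to any analysis that samples a fixed number of rows.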

Aggregations

ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata): 6
Analyzer (org.talend.dataquality.common.inference.Analyzer): 6
Analyzers (org.talend.dataquality.common.inference.Analyzers): 6
Stream (java.util.stream.Stream): 5
Logger (org.slf4j.Logger): 5
LoggerFactory (org.slf4j.LoggerFactory): 5
Autowired (org.springframework.beans.factory.annotation.Autowired): 5
DataSetMetadata (org.talend.dataprep.api.dataset.DataSetMetadata): 5
TDPException (org.talend.dataprep.exception.TDPException): 5
AnalyzerService (org.talend.dataprep.quality.AnalyzerService): 5
List (java.util.List): 4
StringUtils (org.apache.commons.lang.StringUtils): 4
Collectors (java.util.stream.Collectors): 3
Value (org.springframework.beans.factory.annotation.Value): 3
DataSetRow (org.talend.dataprep.api.dataset.row.DataSetRow): 3
StatisticsAdapter (org.talend.dataprep.dataset.StatisticsAdapter): 3
DistributedLock (org.talend.dataprep.lock.DistributedLock): 3
Api (io.swagger.annotations.Api): 2
ApiOperation (io.swagger.annotations.ApiOperation): 2
ApiParam (io.swagger.annotations.ApiParam): 2