Example 1 with Analyzer

use of org.talend.dataquality.common.inference.Analyzer in project data-prep by Talend.

the class DataSetContentStore method stream.

/**
 * Similarly to {@link #get(DataSetMetadata)} returns the content of the data set but as a {@link Stream stream} of
 * {@link DataSetRow rows} instead of JSON content.
 *
 * @param dataSetMetadata The {@link DataSetMetadata data set} to read rows from.
 * @param limit A limit to pass to raw content supplier (use -1 for "no limit"). Used as parameter to call
 * {@link #get(DataSetMetadata, long)}.
 * @return A valid <b>{@link DataSetRow}</b> stream.
 */
public Stream<DataSetRow> stream(DataSetMetadata dataSetMetadata, long limit) {
    final InputStream inputStream = get(dataSetMetadata, limit);
    final DataSetRowIterator iterator = new DataSetRowIterator(inputStream);
    final Iterable<DataSetRow> rowIterable = () -> iterator;
    Stream<DataSetRow> dataSetRowStream = StreamSupport.stream(rowIterable.spliterator(), false);
    AtomicLong tdpId = new AtomicLong(1);
    final List<ColumnMetadata> columns = dataSetMetadata.getRowMetadata().getColumns();
    final Analyzer<Analyzers.Result> analyzer = service.build(columns, AnalyzerService.Analysis.QUALITY);
    dataSetRowStream = dataSetRowStream
            .filter(r -> !r.isEmpty())
            .map(r -> {
                final String[] values = r.order(columns).toArray(DataSetRow.SKIP_TDP_ID);
                analyzer.analyze(values);
                return r;
            })
            // Mark invalid columns as detected by provided analyzer.
            .map(new InvalidMarker(columns, analyzer))
            .map(r -> {
                r.setTdpId(tdpId.getAndIncrement());
                return r;
            })
            // make sure to close the original input stream when closing this one
            .onClose(() -> {
                try {
                    inputStream.close();
                } catch (Exception e) {
                    throw new TDPException(CommonErrorCodes.UNEXPECTED_EXCEPTION, e);
                }
            });
    return dataSetRowStream;
}
Also used : Analyzers(org.talend.dataquality.common.inference.Analyzers) DataSetRowIterator(org.talend.dataprep.api.dataset.json.DataSetRowIterator) TDPException(org.talend.dataprep.exception.TDPException) FormatFamilyFactory(org.talend.dataprep.schema.FormatFamilyFactory) ObjectMapper(com.fasterxml.jackson.databind.ObjectMapper) Autowired(org.springframework.beans.factory.annotation.Autowired) DataSetContent(org.talend.dataprep.api.dataset.DataSetContent) Value(org.springframework.beans.factory.annotation.Value) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) AtomicLong(java.util.concurrent.atomic.AtomicLong) List(java.util.List) Stream(java.util.stream.Stream) InvalidMarker(org.talend.dataprep.api.dataset.row.InvalidMarker) Serializer(org.talend.dataprep.schema.Serializer) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) CommonErrorCodes(org.talend.dataprep.exception.error.CommonErrorCodes) Analyzer(org.talend.dataquality.common.inference.Analyzer) StreamSupport(java.util.stream.StreamSupport) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) InputStream(java.io.InputStream) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata)
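
The pattern in Example 1 (wrap an input stream in a lazy Stream, number the non-empty rows, and release the underlying resource through onClose) can be tried without the data-prep classes. The following is a minimal sketch, assuming plain text lines stand in for DataSetRow; StreamCloseSketch and numberedLines are hypothetical names and are not part of data-prep.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

final class StreamCloseSketch {

    /**
     * Hypothetical stand-in for DataSetContentStore#stream: reads lines lazily,
     * skips empty ones, prefixes each kept line with a 1-based id, and closes
     * the input stream when the returned stream is closed.
     */
    static Stream<String> numberedLines(InputStream in) {
        final AtomicLong id = new AtomicLong(1);
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.lines()
                .filter(line -> !line.isEmpty())
                .map(line -> id.getAndIncrement() + ":" + line)
                // the caller's close() on the stream is what releases the input
                .onClose(() -> {
                    try {
                        in.close();
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
    }
}
```

The onClose handler only runs if the caller closes the stream, so callers would normally use try-with-resources, as Example 4 does with contentStore.stream(metadata).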

Example 2 with Analyzer

use of org.talend.dataquality.common.inference.Analyzer in project data-prep by Talend.

the class TransformationService method getSemanticDomains.

/**
 * Return the semantic domains for the given parameters.
 *
 * @param metadata the dataset metadata.
 * @param columnId the column id to analyze.
 * @param records the dataset records.
 * @return the semantic domains for the given parameters.
 * @throws IOException can happen...
 */
private List<SemanticDomain> getSemanticDomains(DataSetMetadata metadata, String columnId, InputStream records) throws IOException {
    // copy the column metadata and set the semantic domain forced flag to false to make sure the statistics adapter sets all
    // available domains
    final ColumnMetadata columnMetadata = column()
            .copy(metadata.getRowMetadata().getById(columnId))
            .semanticDomainForce(false)
            .build();
    final Analyzer<Analyzers.Result> analyzer = analyzerService.build(columnMetadata, SEMANTIC);
    analyzer.init();
    try (final JsonParser parser = mapper.getFactory().createParser(new InputStreamReader(records, UTF_8))) {
        final DataSet dataSet = mapper.readerFor(DataSet.class).readValue(parser);
        dataSet.getRecords()
                .map(r -> r.get(columnId))
                .forEach(analyzer::analyze);
        analyzer.end();
    }
    final List<Analyzers.Result> analyzerResult = analyzer.getResult();
    statisticsAdapter.adapt(singletonList(columnMetadata), analyzerResult);
    return columnMetadata.getSemanticDomains();
}
Also used : VolumeMetered(org.talend.dataprep.metrics.VolumeMetered) LocaleContextHolder(org.springframework.context.i18n.LocaleContextHolder) StringUtils(org.apache.commons.lang.StringUtils) ContentCacheKey(org.talend.dataprep.cache.ContentCacheKey) TdqCategories(org.talend.dataquality.semantic.broadcast.TdqCategories) Autowired(org.springframework.beans.factory.annotation.Autowired) ApiParam(io.swagger.annotations.ApiParam) PreviewParameters(org.talend.dataprep.transformation.preview.api.PreviewParameters) ExportFormatMessage(org.talend.dataprep.format.export.ExportFormatMessage) ActionContext(org.talend.dataprep.transformation.api.action.context.ActionContext) Collections.singletonList(java.util.Collections.singletonList) ScopeCategory(org.talend.dataprep.transformation.actions.category.ScopeCategory) Valid(javax.validation.Valid) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) BeanConversionService(org.talend.dataprep.conversions.BeanConversionService) GetPrepMetadataAsyncCondition(org.talend.dataprep.async.conditional.GetPrepMetadataAsyncCondition) TaskExecutor(org.springframework.core.task.TaskExecutor) DataSet(org.talend.dataprep.api.dataset.DataSet) ExportParametersUtil(org.talend.dataprep.api.export.ExportParametersUtil) StepDiff(org.talend.dataprep.api.preparation.StepDiff) PreparationDetailsGet(org.talend.dataprep.command.preparation.PreparationDetailsGet) HEAD(org.talend.dataprep.api.export.ExportParameters.SourceType.HEAD) APPLICATION_OCTET_STREAM_VALUE(org.springframework.http.MediaType.APPLICATION_OCTET_STREAM_VALUE) Resource(javax.annotation.Resource) StreamingResponseBody(org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody) JSON(org.talend.dataprep.transformation.format.JsonFormat.JSON) PreparationGetContentUrlGenerator(org.talend.dataprep.async.result.PreparationGetContentUrlGenerator) SecurityProxy(org.talend.dataprep.security.SecurityProxy) Stream(java.util.stream.Stream) 
Builder.column(org.talend.dataprep.api.dataset.ColumnMetadata.Builder.column) org.springframework.web.bind.annotation(org.springframework.web.bind.annotation) GZIPOutputStream(java.util.zip.GZIPOutputStream) DynamicType(org.talend.dataprep.transformation.api.action.dynamic.DynamicType) RunnableAction(org.talend.dataprep.transformation.actions.common.RunnableAction) Analyzers(org.talend.dataquality.common.inference.Analyzers) java.util(java.util) TransformationErrorCodes(org.talend.dataprep.exception.error.TransformationErrorCodes) GenericParameter(org.talend.dataprep.transformation.api.action.dynamic.GenericParameter) Configuration(org.talend.dataprep.transformation.api.transformer.configuration.Configuration) PreviewConfiguration(org.talend.dataprep.transformation.api.transformer.configuration.PreviewConfiguration) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) TransformationMetadataCacheKey(org.talend.dataprep.cache.TransformationMetadataCacheKey) PrepMetadataExecutionIdGenerator(org.talend.dataprep.async.generator.PrepMetadataExecutionIdGenerator) PREPARATION_DOES_NOT_EXIST(org.talend.dataprep.exception.error.PreparationErrorCodes.PREPARATION_DOES_NOT_EXIST) Api(io.swagger.annotations.Api) Preparation(org.talend.dataprep.api.preparation.Preparation) ActionRegistry(org.talend.dataprep.transformation.pipeline.ActionRegistry) GetPrepContentAsyncCondition(org.talend.dataprep.async.conditional.GetPrepContentAsyncCondition) TransformationContext(org.talend.dataprep.transformation.api.action.context.TransformationContext) AggregationService(org.talend.dataprep.transformation.aggregation.AggregationService) NullOutputStream(org.apache.commons.io.output.NullOutputStream) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) DataSetGetMetadata(org.talend.dataprep.command.dataset.DataSetGetMetadata) Timed(org.talend.dataprep.metrics.Timed) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) 
DataSetGet(org.talend.dataprep.command.dataset.DataSetGet) AggregationResult(org.talend.dataprep.transformation.aggregation.api.AggregationResult) LoggerFactory(org.slf4j.LoggerFactory) Flag(org.talend.dataprep.api.dataset.row.Flag) SEMANTIC(org.talend.dataprep.quality.AnalyzerService.Analysis.SEMANTIC) ActionParser(org.talend.dataprep.transformation.api.action.ActionParser) CacheKeyGenerator(org.talend.dataprep.cache.CacheKeyGenerator) ApiOperation(io.swagger.annotations.ApiOperation) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) ExportParameters(org.talend.dataprep.api.export.ExportParameters) PrepMetadataGetContentUrlGenerator(org.talend.dataprep.async.result.PrepMetadataGetContentUrlGenerator) MediaType(org.springframework.http.MediaType) PublicAPI(org.talend.dataprep.security.PublicAPI) RequestMethod(org.springframework.web.bind.annotation.RequestMethod) Collectors(java.util.stream.Collectors) ContentCache(org.talend.dataprep.cache.ContentCache) UNEXPECTED_EXCEPTION(org.talend.dataprep.exception.error.TransformationErrorCodes.UNEXPECTED_EXCEPTION) TransformerFactory(org.talend.dataprep.transformation.api.transformer.TransformerFactory) CommonErrorCodes(org.talend.dataprep.exception.error.CommonErrorCodes) Analyzer(org.talend.dataquality.common.inference.Analyzer) ActionDefinition(org.talend.dataprep.api.action.ActionDefinition) PreparationExportStrategy(org.talend.dataprep.transformation.service.export.PreparationExportStrategy) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) ExportFormat(org.talend.dataprep.format.export.ExportFormat) TDPException(org.talend.dataprep.exception.TDPException) JsonErrorCodeDescription(org.talend.dataprep.exception.json.JsonErrorCodeDescription) ExceptionContext.build(org.talend.daikon.exception.ExceptionContext.build) ExportParametersExecutionIdGenerator(org.talend.dataprep.async.generator.ExportParametersExecutionIdGenerator) ExceptionContext(org.talend.daikon.exception.ExceptionContext) 
org.talend.dataprep.async(org.talend.dataprep.async) Suggestion(org.talend.dataprep.transformation.api.transformer.suggestion.Suggestion) SuggestionEngine(org.talend.dataprep.transformation.api.transformer.suggestion.SuggestionEngine) Logger(org.slf4j.Logger) LocaleContextHolder.getLocale(org.springframework.context.i18n.LocaleContextHolder.getLocale) JsonParser(com.fasterxml.jackson.core.JsonParser) UTF_8(java.nio.charset.StandardCharsets.UTF_8) Step(org.talend.dataprep.api.preparation.Step) APPLICATION_JSON_VALUE(org.springframework.http.MediaType.APPLICATION_JSON_VALUE) ApplicationContext(org.springframework.context.ApplicationContext) ActionForm(org.talend.dataprep.api.action.ActionForm) AggregationParameters(org.talend.dataprep.transformation.aggregation.api.AggregationParameters) java.io(java.io) TdqCategoriesFactory(org.talend.dataquality.semantic.broadcast.TdqCategoriesFactory)
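
Examples 2 and 4 both drive the analyzer through the same sequence: init(), analyze() for each value, end(), then getResult(). The sketch below illustrates that lifecycle with a deliberately simplified contract. ValueAnalyzer, IntegerRatioAnalyzer and run are hypothetical names introduced for illustration; they are not the Talend Analyzer interface, which works on whole records and has more methods.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

final class AnalyzerLifecycleSketch {

    /** Hypothetical, simplified analyzer contract (not the Talend interface). */
    interface ValueAnalyzer<R> {
        void init();

        void analyze(String value);

        void end();

        List<R> getResult();
    }

    /** Reports the share of analyzed values that look like integers. */
    static final class IntegerRatioAnalyzer implements ValueAnalyzer<Double> {

        private long total;

        private long integers;

        @Override
        public void init() {
            total = 0;
            integers = 0;
        }

        @Override
        public void analyze(String value) {
            total++;
            if (value != null && value.matches("-?\\d+")) {
                integers++;
            }
        }

        @Override
        public void end() {
            // nothing to flush in this sketch
        }

        @Override
        public List<Double> getResult() {
            return Collections.singletonList(total == 0 ? 0.0 : (double) integers / total);
        }
    }

    /** Same call order as in the examples: init, analyze each value, end, read the result. */
    static <R> List<R> run(ValueAnalyzer<R> analyzer, Stream<String> values) {
        analyzer.init();
        values.forEach(analyzer::analyze);
        analyzer.end();
        return analyzer.getResult();
    }
}
```

Keeping the result read after end() matters for analyzers that buffer values and only finalize their statistics when told the input is complete.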

Example 3 with Analyzer

use of org.talend.dataquality.common.inference.Analyzer in project data-prep by Talend.

the class AnalyzerService method build.

/**
 * Build an {@link Analyzer} to analyze records with columns (in <code>columns</code>). <code>settings</code> gives
 * all the wanted analysis settings for the analyzer.
 *
 * @param columns  A list of columns, may be null or empty.
 * @param settings A varargs with {@link Analysis}. Duplicates are possible in varargs but will be considered only
 *                 once.
 * @return A ready to use {@link Analyzer}.
 */
public Analyzer<Analyzers.Result> build(List<ColumnMetadata> columns, Analysis... settings) {
    if (columns == null || columns.isEmpty()) {
        return Analyzers.with(NullAnalyzer.INSTANCE);
    }
    // Get all needed analysis
    final Set<Analysis> all = EnumSet.noneOf(Analysis.class);
    for (Analysis setting : settings) {
        if (setting != null) {
            all.add(setting);
            all.addAll(Arrays.asList(setting.dependencies));
        }
    }
    if (all.isEmpty()) {
        return Analyzers.with(NullAnalyzer.INSTANCE);
    }
    // Column types
    DataTypeEnum[] types = TypeUtils.convert(columns);
    // Semantic domains
    List<String> domainList = columns.stream()
            .map(ColumnMetadata::getDomain)
            .map(d -> StringUtils.isBlank(d) ? SemanticCategoryEnum.UNKNOWN.getId() : d)
            .collect(Collectors.toList());
    final String[] domains = domainList.toArray(new String[domainList.size()]);
    DictionarySnapshot dictionarySnapshot = dictionarySnapshotProvider.get();
    // Build all analyzers
    List<Analyzer> analyzers = new ArrayList<>();
    for (Analysis setting : settings) {
        switch(setting) {
            case SEMANTIC:
                final SemanticAnalyzer semanticAnalyzer = new SemanticAnalyzer(dictionarySnapshot);
                semanticAnalyzer.setLimit(Integer.MAX_VALUE);
                semanticAnalyzer.setMetadata(Metadata.HEADER_NAME, extractColumnNames(columns));
                analyzers.add(semanticAnalyzer);
                break;
            case HISTOGRAM:
                analyzers.add(new StreamDateHistogramAnalyzer(columns, types, dateParser));
                analyzers.add(new StreamNumberHistogramAnalyzer(types));
                break;
            case QUALITY:
                final DataTypeQualityAnalyzer dataTypeQualityAnalyzer = new DataTypeQualityAnalyzer(types);
                columns.forEach(c -> dataTypeQualityAnalyzer.addCustomDateTimePattern(RowMetadataUtils.getMostUsedDatePattern(c)));
                analyzers.add(new ValueQualityAnalyzer(dataTypeQualityAnalyzer,
                        new SemanticQualityAnalyzer(dictionarySnapshot, domains, false), true)); // NOSONAR
                break;
            case CARDINALITY:
                analyzers.add(new CardinalityAnalyzer());
                break;
            case PATTERNS:
                analyzers.add(buildPatternAnalyzer(columns));
                break;
            case LENGTH:
                analyzers.add(new TextLengthAnalyzer());
                break;
            case QUANTILES:
                boolean acceptQuantiles = false;
                for (DataTypeEnum type : types) {
                    if (type == DataTypeEnum.INTEGER || type == DataTypeEnum.DOUBLE) {
                        acceptQuantiles = true;
                        break;
                    }
                }
                if (acceptQuantiles) {
                    analyzers.add(new QuantileAnalyzer(types));
                }
                break;
            case SUMMARY:
                analyzers.add(new SummaryAnalyzer(types));
                break;
            case TYPE:
                boolean shouldUseTypeAnalysis = true;
                for (Analysis analysis : settings) {
                    if (analysis == Analysis.QUALITY) {
                        shouldUseTypeAnalysis = false;
                        break;
                    }
                }
                if (shouldUseTypeAnalysis) {
                    final List<String> mostUsedDatePatterns = getMostUsedDatePatterns(columns);
                    analyzers.add(new DataTypeAnalyzer(mostUsedDatePatterns));
                } else {
                    LOGGER.warn("Disabled {} analysis (conflicts with {}).", setting, Analysis.QUALITY);
                }
                break;
            case FREQUENCY:
                analyzers.add(new DataTypeFrequencyAnalyzer());
                break;
            default:
                throw new IllegalArgumentException("Missing support for '" + setting + "'.");
        }
    }
    // Merge all analyzers into one
    final Analyzer<Analyzers.Result> analyzer = Analyzers.with(analyzers.toArray(new Analyzer[analyzers.size()]));
    analyzer.init();
    if (LOGGER.isDebugEnabled()) {
        // Wrap analyzer for usage monitoring (to diagnose non-closed analyzer issues).
        return new ResourceMonitoredAnalyzer(analyzer);
    } else {
        return analyzer;
    }
}
Also used : Analyzers(org.talend.dataquality.common.inference.Analyzers) java.util(java.util) StringUtils(org.apache.commons.lang.StringUtils) CardinalityStatistics(org.talend.dataquality.statistics.cardinality.CardinalityStatistics) TypeUtils(org.talend.dataprep.api.type.TypeUtils) Metadata(org.talend.dataquality.common.inference.Metadata) DateParser(org.talend.dataprep.transformation.actions.date.DateParser) DataTypeFrequencyStatistics(org.talend.dataquality.statistics.frequency.DataTypeFrequencyStatistics) LoggerFactory(org.slf4j.LoggerFactory) SemanticCategoryEnum(org.talend.dataquality.semantic.classifier.SemanticCategoryEnum) TextLengthAnalyzer(org.talend.dataquality.statistics.text.TextLengthAnalyzer) DateTimePatternRecognizer(org.talend.dataquality.statistics.frequency.recognition.DateTimePatternRecognizer) DataTypeEnum(org.talend.dataquality.statistics.type.DataTypeEnum) ValueQualityStatistics(org.talend.dataquality.common.inference.ValueQualityStatistics) ValueQualityAnalyzer(org.talend.dataquality.statistics.quality.ValueQualityAnalyzer) SummaryStatistics(org.talend.dataquality.statistics.numeric.summary.SummaryStatistics) AbstractFrequencyAnalyzer(org.talend.dataquality.statistics.frequency.AbstractFrequencyAnalyzer) SemanticQualityAnalyzer(org.talend.dataquality.semantic.statistics.SemanticQualityAnalyzer) StreamNumberHistogramAnalyzer(org.talend.dataprep.api.dataset.statistics.number.StreamNumberHistogramAnalyzer) EmptyPatternRecognizer(org.talend.dataquality.statistics.frequency.recognition.EmptyPatternRecognizer) SemanticType(org.talend.dataquality.semantic.statistics.SemanticType) PrintWriter(java.io.PrintWriter) DictionarySnapshotProvider(org.talend.dataquality.semantic.snapshot.DictionarySnapshotProvider) DataTypeFrequencyAnalyzer(org.talend.dataquality.statistics.frequency.DataTypeFrequencyAnalyzer) Logger(org.slf4j.Logger) 
LatinExtendedCharPatternRecognizer(org.talend.dataquality.statistics.frequency.recognition.LatinExtendedCharPatternRecognizer) QuantileAnalyzer(org.talend.dataquality.statistics.numeric.quantile.QuantileAnalyzer) StringWriter(java.io.StringWriter) RowMetadataUtils(org.talend.dataprep.api.dataset.row.RowMetadataUtils) StreamDateHistogramStatistics(org.talend.dataprep.api.dataset.statistics.date.StreamDateHistogramStatistics) StandardDictionarySnapshotProvider(org.talend.dataquality.semantic.snapshot.StandardDictionarySnapshotProvider) StreamDateHistogramAnalyzer(org.talend.dataprep.api.dataset.statistics.date.StreamDateHistogramAnalyzer) NullAnalyzer(org.talend.dataprep.transformation.api.transformer.json.NullAnalyzer) DictionarySnapshot(org.talend.dataquality.semantic.snapshot.DictionarySnapshot) CardinalityAnalyzer(org.talend.dataquality.statistics.cardinality.CardinalityAnalyzer) TextLengthStatistics(org.talend.dataquality.statistics.text.TextLengthStatistics) DataTypeAnalyzer(org.talend.dataquality.statistics.type.DataTypeAnalyzer) Collectors(java.util.stream.Collectors) SummaryAnalyzer(org.talend.dataquality.statistics.numeric.summary.SummaryAnalyzer) SemanticAnalyzer(org.talend.dataquality.semantic.statistics.SemanticAnalyzer) PatternFrequencyStatistics(org.talend.dataquality.statistics.frequency.pattern.PatternFrequencyStatistics) Analyzer(org.talend.dataquality.common.inference.Analyzer) QuantileStatistics(org.talend.dataquality.statistics.numeric.quantile.QuantileStatistics) AbstractPatternRecognizer(org.talend.dataquality.statistics.frequency.recognition.AbstractPatternRecognizer) DataTypeQualityAnalyzer(org.talend.dataquality.statistics.quality.DataTypeQualityAnalyzer) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) CompositePatternFrequencyAnalyzer(org.talend.dataquality.statistics.frequency.pattern.CompositePatternFrequencyAnalyzer) DataTypeOccurences(org.talend.dataquality.statistics.type.DataTypeOccurences) 
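
The first part of build collects the requested analyses and their dependencies into an EnumSet, which ignores null entries and duplicates. The sketch below isolates that step. The Analysis enum here is a hypothetical miniature with made-up dependencies; the real AnalyzerService.Analysis constants and their dependencies may differ.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

final class AnalysisSelectionSketch {

    /** Hypothetical analysis kinds; the dependencies are invented for this sketch. */
    enum Analysis {
        TYPE,
        QUALITY(TYPE),
        FREQUENCY,
        SUMMARY(TYPE);

        final Analysis[] dependencies;

        Analysis(Analysis... dependencies) {
            this.dependencies = dependencies;
        }
    }

    /** Collects each non-null setting plus its dependencies, each at most once. */
    static Set<Analysis> resolve(Analysis... settings) {
        final Set<Analysis> all = EnumSet.noneOf(Analysis.class);
        for (Analysis setting : settings) {
            if (setting != null) {
                all.add(setting);
                all.addAll(Arrays.asList(setting.dependencies));
            }
        }
        return all;
    }
}
```

With this sketch, resolve(Analysis.QUALITY, null, Analysis.QUALITY) yields a set holding TYPE and QUALITY, and resolve() yields an empty set, which corresponds to the case where build falls back to NullAnalyzer.INSTANCE.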

Example 4 with Analyzer

use of org.talend.dataquality.common.inference.Analyzer in project data-prep by Talend.

the class DataSetService method getDataSetColumnSemanticCategories.

/**
 * Return the semantic types for a given dataset / column.
 *
 * @param datasetId the dataset id.
 * @param columnId the column id.
 * @return the semantic types for a given dataset / column.
 */
@RequestMapping(value = "/datasets/{datasetId}/columns/{columnId}/types", method = GET)
@ApiOperation(value = "list the types of the wanted column", notes = "This list can be used by user to change the column type.")
@Timed
@PublicAPI
public List<SemanticDomain> getDataSetColumnSemanticCategories(@ApiParam(value = "The dataset id") @PathVariable String datasetId, @ApiParam(value = "The column id") @PathVariable String columnId) {
    LOG.debug("listing semantic categories for dataset #{} column #{}", datasetId, columnId);
    final DataSetMetadata metadata = dataSetMetadataRepository.get(datasetId);
    if (metadata == null) {
        throw new TDPException(DataSetErrorCodes.DATASET_DOES_NOT_EXIST, ExceptionContext.withBuilder().put("id", datasetId).build());
    } else {
        try (final Stream<DataSetRow> records = contentStore.stream(metadata)) {
            final ColumnMetadata columnMetadata = metadata.getRowMetadata().getById(columnId);
            final Analyzer<Analyzers.Result> analyzer = analyzerService.build(columnMetadata, SEMANTIC);
            analyzer.init();
            records.map(r -> r.get(columnId)).forEach(analyzer::analyze);
            analyzer.end();
            final List<Analyzers.Result> analyzerResult = analyzer.getResult();
            final StatisticsAdapter statisticsAdapter = new StatisticsAdapter(40);
            statisticsAdapter.adapt(singletonList(columnMetadata), analyzerResult);
            LOG.debug("found {} for dataset #{}, column #{}", columnMetadata.getSemanticDomains(), datasetId, columnId);
            return columnMetadata.getSemanticDomains();
        }
    }
}
Also used : TDPException(org.talend.dataprep.exception.TDPException) VolumeMetered(org.talend.dataprep.metrics.VolumeMetered) RequestParam(org.springframework.web.bind.annotation.RequestParam) ImportBuilder(org.talend.dataprep.api.dataset.Import.ImportBuilder) FormatFamilyFactory(org.talend.dataprep.schema.FormatFamilyFactory) Autowired(org.springframework.beans.factory.annotation.Autowired) ApiParam(io.swagger.annotations.ApiParam) StringUtils(org.apache.commons.lang3.StringUtils) TEXT_PLAIN_VALUE(org.springframework.http.MediaType.TEXT_PLAIN_VALUE) SortAndOrderHelper.getDataSetMetadataComparator(org.talend.dataprep.util.SortAndOrderHelper.getDataSetMetadataComparator) Collections.singletonList(java.util.Collections.singletonList) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) BeanConversionService(org.talend.dataprep.conversions.BeanConversionService) PipedInputStream(java.io.PipedInputStream) DistributedLock(org.talend.dataprep.lock.DistributedLock) Arrays.asList(java.util.Arrays.asList) Map(java.util.Map) DataprepBundle.message(org.talend.dataprep.i18n.DataprepBundle.message) UserData(org.talend.dataprep.api.user.UserData) TaskExecutor(org.springframework.core.task.TaskExecutor) MAX_STORAGE_MAY_BE_EXCEEDED(org.talend.dataprep.exception.error.DataSetErrorCodes.MAX_STORAGE_MAY_BE_EXCEEDED) DataSet(org.talend.dataprep.api.dataset.DataSet) LocalStoreLocation(org.talend.dataprep.api.dataset.location.LocalStoreLocation) FormatFamily(org.talend.dataprep.schema.FormatFamily) Resource(javax.annotation.Resource) Set(java.util.Set) DatasetUpdatedEvent(org.talend.dataprep.dataset.event.DatasetUpdatedEvent) RestController(org.springframework.web.bind.annotation.RestController) QuotaService(org.talend.dataprep.dataset.store.QuotaService) Stream(java.util.stream.Stream) StreamSupport.stream(java.util.stream.StreamSupport.stream) FlagNames(org.talend.dataprep.api.dataset.row.FlagNames) 
UNEXPECTED_CONTENT(org.talend.dataprep.exception.error.CommonErrorCodes.UNEXPECTED_CONTENT) Analyzers(org.talend.dataquality.common.inference.Analyzers) DataSetLocatorService(org.talend.dataprep.api.dataset.location.locator.DataSetLocatorService) Callable(java.util.concurrent.Callable) Schema(org.talend.dataprep.schema.Schema) ArrayList(java.util.ArrayList) Value(org.springframework.beans.factory.annotation.Value) RequestBody(org.springframework.web.bind.annotation.RequestBody) DataSetLocationService(org.talend.dataprep.api.dataset.location.DataSetLocationService) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) UserDataRepository(org.talend.dataprep.user.store.UserDataRepository) Markers(org.talend.dataprep.log.Markers) Api(io.swagger.annotations.Api) DraftValidator(org.talend.dataprep.schema.DraftValidator) HttpResponseContext(org.talend.dataprep.http.HttpResponseContext) Sort(org.talend.dataprep.util.SortAndOrderHelper.Sort) IOException(java.io.IOException) PipedOutputStream(java.io.PipedOutputStream) FormatAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.FormatAnalysis) ContentAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.ContentAnalysis) SchemaAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.SchemaAnalysis) HttpStatus(org.springframework.http.HttpStatus) FilterService(org.talend.dataprep.api.filter.FilterService) Marker(org.slf4j.Marker) NullOutputStream(org.apache.commons.io.output.NullOutputStream) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) Timed(org.talend.dataprep.metrics.Timed) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) PathVariable(org.springframework.web.bind.annotation.PathVariable) DataSetMetadataBuilder(org.talend.dataprep.dataset.DataSetMetadataBuilder) URLDecoder(java.net.URLDecoder) DataSetErrorCodes(org.talend.dataprep.exception.error.DataSetErrorCodes) PUT(org.springframework.web.bind.annotation.RequestMethod.PUT) 
LoggerFactory(org.slf4j.LoggerFactory) SEMANTIC(org.talend.dataprep.quality.AnalyzerService.Analysis.SEMANTIC) ApiOperation(io.swagger.annotations.ApiOperation) UNABLE_TO_CREATE_OR_UPDATE_DATASET(org.talend.dataprep.exception.error.DataSetErrorCodes.UNABLE_TO_CREATE_OR_UPDATE_DATASET) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) StrictlyBoundedInputStream(org.talend.dataprep.dataset.store.content.StrictlyBoundedInputStream) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) UNSUPPORTED_CONTENT(org.talend.dataprep.exception.error.DataSetErrorCodes.UNSUPPORTED_CONTENT) TimeToLive(org.talend.dataprep.cache.ContentCache.TimeToLive) Order(org.talend.dataprep.util.SortAndOrderHelper.Order) Collections.emptyList(java.util.Collections.emptyList) PublicAPI(org.talend.dataprep.security.PublicAPI) RequestMethod(org.springframework.web.bind.annotation.RequestMethod) UUID(java.util.UUID) Collectors(java.util.stream.Collectors) ContentCache(org.talend.dataprep.cache.ContentCache) INVALID_DATASET_NAME(org.talend.dataprep.exception.error.DataSetErrorCodes.INVALID_DATASET_NAME) List(java.util.List) Optional(java.util.Optional) Analyzer(org.talend.dataquality.common.inference.Analyzer) RequestHeader(org.springframework.web.bind.annotation.RequestHeader) Pattern(java.util.regex.Pattern) Security(org.talend.dataprep.security.Security) Spliterator(java.util.Spliterator) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) ComponentProperties(org.talend.dataprep.parameters.jsonschema.ComponentProperties) TDPException(org.talend.dataprep.exception.TDPException) JsonErrorCodeDescription(org.talend.dataprep.exception.json.JsonErrorCodeDescription) RequestMapping(org.springframework.web.bind.annotation.RequestMapping) UNABLE_CREATE_DATASET(org.talend.dataprep.exception.error.DataSetErrorCodes.UNABLE_CREATE_DATASET) HashMap(java.util.HashMap) GET(org.springframework.web.bind.annotation.RequestMethod.GET) Import(org.talend.dataprep.api.dataset.Import) 
ExceptionContext.build(org.talend.daikon.exception.ExceptionContext.build) ExceptionContext(org.talend.daikon.exception.ExceptionContext) Charset(java.nio.charset.Charset) UpdateColumnParameters(org.talend.dataprep.dataset.service.api.UpdateColumnParameters) VersionService(org.talend.dataprep.api.service.info.VersionService) POST(org.springframework.web.bind.annotation.RequestMethod.POST) OutputStream(java.io.OutputStream) DataSetLocation(org.talend.dataprep.api.dataset.DataSetLocation) Logger(org.slf4j.Logger) LocaleContextHolder.getLocale(org.springframework.context.i18n.LocaleContextHolder.getLocale) UpdateDataSetCacheKey(org.talend.dataprep.dataset.service.cache.UpdateDataSetCacheKey) IOUtils(org.apache.commons.compress.utils.IOUtils) APPLICATION_JSON_VALUE(org.springframework.http.MediaType.APPLICATION_JSON_VALUE) ResponseBody(org.springframework.web.bind.annotation.ResponseBody) Certification(org.talend.dataprep.api.dataset.DataSetGovernance.Certification) EncodingSupport(org.talend.dataprep.configuration.EncodingSupport) Comparator(java.util.Comparator) InputStream(java.io.InputStream)

Example 5 with Analyzer

use of org.talend.dataquality.common.inference.Analyzer in project data-prep by Talend.

the class QualityAnalysis method computeQuality.

/**
 * Compute the quality (count, valid, invalid and empty) of the given dataset.
 *
 * @param dataset the dataset metadata.
 * @param records the dataset records.
 * @param limit indicates how many records will be read from the stream. Use a number <= 0 to perform a full scan of the records.
 */
public void computeQuality(DataSetMetadata dataset, Stream<DataSetRow> records, long limit) {
    // Compute valid / invalid / empty count, need data types for analyzer first
    final List<ColumnMetadata> columns = dataset.getRowMetadata().getColumns();
    if (columns.isEmpty()) {
        LOGGER.debug("Skip analysis of {} (no column information).", dataset.getId());
        return;
    }
    try (Analyzer<Analyzers.Result> analyzer = analyzerService.qualityAnalysis(columns)) {
        if (limit > 0) {
            // Only limit the number of rows if limit > 0 (use the limit to speed up sync analysis).
            LOGGER.debug("Limit analysis to the first {}.", limit);
            records = records.limit(limit);
        } else {
            LOGGER.debug("Performing full analysis.");
        }
        records.map(row -> row.toArray(DataSetRow.SKIP_TDP_ID)).forEach(analyzer::analyze);
        // Collect the per-column analysis results and apply them to the column metadata
        final List<Analyzers.Result> result = analyzer.getResult();
        adapter.adapt(columns, result);
        // Remember the number of records
        if (!result.isEmpty()) {
            final long recordCount = result.get(0).get(ValueQualityStatistics.class).getCount();
            dataset.getContent().setNbRecords((int) recordCount);
        }
    } catch (Exception e) {
        throw new TDPException(CommonErrorCodes.UNEXPECTED_EXCEPTION, e);
    }
}
Also used : Analyzers(org.talend.dataquality.common.inference.Analyzers) StringUtils(org.apache.commons.lang.StringUtils) TDPException(org.talend.dataprep.exception.TDPException) Logger(org.slf4j.Logger) DataSetErrorCodes(org.talend.dataprep.exception.error.DataSetErrorCodes) DataSetMetadataRepository(org.talend.dataprep.dataset.store.metadata.DataSetMetadataRepository) LoggerFactory(org.slf4j.LoggerFactory) Autowired(org.springframework.beans.factory.annotation.Autowired) Value(org.springframework.beans.factory.annotation.Value) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) ValueQualityStatistics(org.talend.dataquality.common.inference.ValueQualityStatistics) List(java.util.List) Component(org.springframework.stereotype.Component) Stream(java.util.stream.Stream) DistributedLock(org.talend.dataprep.lock.DistributedLock) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) CommonErrorCodes(org.talend.dataprep.exception.error.CommonErrorCodes) Analyzer(org.talend.dataquality.common.inference.Analyzer) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) ContentStoreRouter(org.talend.dataprep.dataset.store.content.ContentStoreRouter) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata)
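
The pattern in computeQuality (open the analyzer in a try-with-resources block, optionally cap the stream, feed each row to the analyzer, then read the result) can be tried without the Talend libraries. The sketch below is only an illustration under stated assumptions: CountingAnalyzer and QualitySketch are hypothetical stand-ins, not part of data-prep or the dataquality library, and the analyzer merely counts rows and empty cells.

```java
import java.util.List;
import java.util.stream.Stream;

public class QualitySketch {

    /** Hypothetical stand-in for an analyzer (NOT the Talend API): counts rows and empty cells. */
    static final class CountingAnalyzer implements AutoCloseable {
        long rows;
        long emptyCells;

        void analyze(String... values) {
            rows++;
            for (String value : values) {
                if (value == null || value.trim().isEmpty()) {
                    emptyCells++;
                }
            }
        }

        @Override
        public void close() {
            // Nothing to release in this sketch; a real analyzer may hold resources.
        }
    }

    /**
     * Analyzes the records and returns {row count, empty cell count}.
     * As in computeQuality, the stream is only capped when limit > 0;
     * any other value triggers a full scan.
     */
    public static long[] analyze(Stream<String[]> records, long limit) {
        try (CountingAnalyzer analyzer = new CountingAnalyzer()) {
            Stream<String[]> source = limit > 0 ? records.limit(limit) : records;
            source.forEach(analyzer::analyze);
            return new long[] { analyzer.rows, analyzer.emptyCells };
        }
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[] { "1", "Alice" },
                new String[] { "2", "" },
                new String[] { "3", " " });

        long[] full = analyze(rows.stream(), -1);
        long[] limited = analyze(rows.stream(), 2);
        System.out.println("full scan: rows=" + full[0] + ", empty=" + full[1]);
        System.out.println("limit 2:   rows=" + limited[0] + ", empty=" + limited[1]);
    }
}
```

Reading the counters inside the try block, before the analyzer is closed, mirrors the original method, which calls getResult() before the resource is released.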

Aggregations

ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata) 6
Analyzer (org.talend.dataquality.common.inference.Analyzer) 6
Analyzers (org.talend.dataquality.common.inference.Analyzers) 6
Stream (java.util.stream.Stream) 5
Logger (org.slf4j.Logger) 5
LoggerFactory (org.slf4j.LoggerFactory) 5
Autowired (org.springframework.beans.factory.annotation.Autowired) 5
DataSetMetadata (org.talend.dataprep.api.dataset.DataSetMetadata) 5
TDPException (org.talend.dataprep.exception.TDPException) 5
AnalyzerService (org.talend.dataprep.quality.AnalyzerService) 5
List (java.util.List) 4
StringUtils (org.apache.commons.lang.StringUtils) 4
Collectors (java.util.stream.Collectors) 3
Value (org.springframework.beans.factory.annotation.Value) 3
DataSetRow (org.talend.dataprep.api.dataset.row.DataSetRow) 3
StatisticsAdapter (org.talend.dataprep.dataset.StatisticsAdapter) 3
DistributedLock (org.talend.dataprep.lock.DistributedLock) 3
Api (io.swagger.annotations.Api) 2
ApiOperation (io.swagger.annotations.ApiOperation) 2
ApiParam (io.swagger.annotations.ApiParam) 2