Example 1 with CategoryFrequency

Use of org.talend.dataquality.semantic.recognizer.CategoryFrequency in project data-prep by Talend.

Class StatisticsAdapter, method injectSemanticTypes.

private void injectSemanticTypes(final ColumnMetadata column, final Analyzers.Result result) {
    if (result.exist(SemanticType.class) && !column.isDomainForced()) {
        final SemanticType semanticType = result.get(SemanticType.class);
        final List<CategoryFrequency> suggestedTypes = semanticType.getSuggestedCategories();
        // TDP-471: Don't pick semantic type if lower than a threshold.
        final Optional<CategoryFrequency> bestMatch = suggestedTypes.stream()
                .filter(e -> !e.getCategoryName().isEmpty())
                .findFirst();
        if (bestMatch.isPresent()) {
            // TODO (TDP-734) Take into account limit of the semantic analyzer.
            final float score = bestMatch.get().getScore();
            if (score > semanticThreshold) {
                updateMetadataWithCategoryInfo(column, bestMatch.get());
            } else {
                // Ensure the domain is cleared if the score does not exceed the threshold (an earlier analysis,
                // e.g. on the first 20 lines, may be over the threshold, but a full scan may decide otherwise).
                resetDomain(column);
            }
        } else if (StringUtils.isNotEmpty(column.getDomain())) {
            // Column *had* a domain but seems like new analysis removed it.
            resetDomain(column);
        }
        // Keep all suggested semantic categories in the column metadata
        List<SemanticDomain> semanticDomains = suggestedTypes.stream()
                .map(this::toSemanticDomain)
                .filter(semanticDomain -> semanticDomain != null && semanticDomain.getScore() >= 1)
                .limit(10)
                .collect(Collectors.toList());
        column.setSemanticDomains(semanticDomains);
    }
}
Also used : Analyzers(org.talend.dataquality.common.inference.Analyzers) java.util(java.util) StringUtils(org.apache.commons.lang.StringUtils) DateHistogram(org.talend.dataprep.api.dataset.statistics.date.DateHistogram) CardinalityStatistics(org.talend.dataquality.statistics.cardinality.CardinalityStatistics) TypeUtils(org.talend.dataprep.api.type.TypeUtils) DataTypeFrequencyStatistics(org.talend.dataquality.statistics.frequency.DataTypeFrequencyStatistics) LoggerFactory(org.slf4j.LoggerFactory) Quality(org.talend.dataprep.api.dataset.Quality) StreamNumberHistogramStatistics(org.talend.dataprep.api.dataset.statistics.number.StreamNumberHistogramStatistics) NumberFormat(java.text.NumberFormat) org.talend.dataprep.api.dataset.statistics(org.talend.dataprep.api.dataset.statistics) DataTypeEnum(org.talend.dataquality.statistics.type.DataTypeEnum) ValueQualityStatistics(org.talend.dataquality.common.inference.ValueQualityStatistics) SummaryStatistics(org.talend.dataquality.statistics.numeric.summary.SummaryStatistics) CategoryFrequency(org.talend.dataquality.semantic.recognizer.CategoryFrequency) SemanticType(org.talend.dataquality.semantic.statistics.SemanticType) NumberHistogram(org.talend.dataprep.api.dataset.statistics.number.NumberHistogram) Logger(org.slf4j.Logger) Predicate(java.util.function.Predicate) DQCategory(org.talend.dataquality.semantic.model.DQCategory) DecimalFormat(java.text.DecimalFormat) StreamDateHistogramStatistics(org.talend.dataprep.api.dataset.statistics.date.StreamDateHistogramStatistics) TextLengthStatistics(org.talend.dataquality.statistics.text.TextLengthStatistics) Collectors(java.util.stream.Collectors) PatternFrequencyStatistics(org.talend.dataquality.statistics.frequency.pattern.PatternFrequencyStatistics) Type(org.talend.dataprep.api.type.Type) CategoryRegistryManager(org.talend.dataquality.semantic.api.CategoryRegistryManager) QuantileStatistics(org.talend.dataquality.statistics.numeric.quantile.QuantileStatistics) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) DataTypeOccurences(org.talend.dataquality.statistics.type.DataTypeOccurences)
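
The selection logic in injectSemanticTypes can be tried in isolation. The following is a minimal sketch, not the data-prep implementation: the Category class is a hypothetical stand-in for CategoryFrequency, and pickDomain and keepSuggestions are illustrative names that do not exist in the Talend code base. It assumes, as the method above appears to, that the first named entry of the suggestion list is the best candidate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class SemanticThresholdSketch {

    /** Hypothetical stand-in for CategoryFrequency: a category name plus a score. */
    static final class Category {
        final String name;
        final float score;

        Category(String name, float score) {
            this.name = name;
            this.score = score;
        }
    }

    /** Takes the first named category and keeps it only if its score is strictly above the threshold. */
    static Optional<Category> pickDomain(List<Category> suggested, float threshold) {
        return suggested.stream()
                .filter(c -> !c.name.isEmpty())
                .findFirst()
                .filter(c -> c.score > threshold);
    }

    /** Keeps at most `limit` category names whose score is at least 1, in their original order. */
    static List<String> keepSuggestions(List<Category> suggested, int limit) {
        return suggested.stream()
                .filter(c -> c.score >= 1)
                .limit(limit)
                .map(c -> c.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Category> suggested = Arrays.asList(
                new Category("category 1", 99f),
                new Category("category 2", 81f),
                new Category("category 3", 0.5f));
        // Threshold below the best score: the first named category is selected.
        System.out.println(pickDomain(suggested, 40f).map(c -> c.name).orElse("<none>"));
        // Threshold equal to the best score: nothing is selected, because the comparison is strict.
        System.out.println(pickDomain(suggested, 99f).map(c -> c.name).orElse("<none>"));
        // Entries scoring below 1 are dropped from the retained suggestions.
        System.out.println(keepSuggestions(suggested, 10));
    }
}
```

Because the comparison is strict, a score exactly equal to the threshold does not select a domain, which matches the else branch above that resets the column's domain.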

Example 2 with CategoryFrequency

Use of org.talend.dataquality.semantic.recognizer.CategoryFrequency in project data-prep by Talend.

Class TypeUtilsTest, method testSemanticDomainType.

@Test
public void testSemanticDomainType() throws Exception {
    final SemanticType semanticType = new SemanticType();
    semanticType.increment(new CategoryFrequency(SemanticCategoryEnum.AIRPORT.getId(), SemanticCategoryEnum.AIRPORT.getId()), 1);
    assertThat(TypeUtils.getDomainLabel(SemanticCategoryEnum.AIRPORT.getId()), is(SemanticCategoryEnum.AIRPORT.getDisplayName()));
}
Also used : SemanticType(org.talend.dataquality.semantic.statistics.SemanticType) CategoryFrequency(org.talend.dataquality.semantic.recognizer.CategoryFrequency) Test(org.junit.Test)

Example 3 with CategoryFrequency

Use of org.talend.dataquality.semantic.recognizer.CategoryFrequency in project data-prep by Talend.

Class StatisticsUtilsTest, method adaptColumn.

private void adaptColumn(final ColumnMetadata column, final DataTypeEnum type) {
    Analyzers.Result result = new Analyzers.Result();
    // Data type
    DataTypeOccurences dataType = new DataTypeOccurences();
    dataType.increment(type);
    result.add(dataType);
    // Semantic type
    SemanticType semanticType = new SemanticType();
    CategoryFrequency category1 = new CategoryFrequency("category 1", "category 1");
    category1.setScore(99);
    semanticType.increment(category1, 1);
    result.add(semanticType);
    // Suggested types
    CategoryFrequency category2 = new CategoryFrequency("category 2", "category 2");
    category2.setScore(81);
    semanticType.increment(category2, 1);
    CategoryFrequency category3 = new CategoryFrequency("category 3", "category 3");
    category3.setScore(50);
    semanticType.increment(category3, 1);
    // Value quality
    ValueQualityStatistics valueQualityStatistics = new ValueQualityStatistics();
    valueQualityStatistics.setEmptyCount(10);
    valueQualityStatistics.setInvalidCount(20);
    valueQualityStatistics.setValidCount(30);
    result.add(valueQualityStatistics);
    // Cardinality
    CardinalityStatistics cardinalityStatistics = new CardinalityStatistics();
    cardinalityStatistics.incrementCount();
    cardinalityStatistics.add("distinctValue");
    result.add(cardinalityStatistics);
    // Data frequency
    DataTypeFrequencyStatistics dataFrequencyStatistics = new DataTypeFrequencyStatistics();
    dataFrequencyStatistics.add("frequentValue1");
    dataFrequencyStatistics.add("frequentValue1");
    dataFrequencyStatistics.add("frequentValue2");
    dataFrequencyStatistics.add("frequentValue2");
    result.add(dataFrequencyStatistics);
    // Pattern frequency
    PatternFrequencyStatistics patternFrequencyStatistics = new PatternFrequencyStatistics();
    patternFrequencyStatistics.add("999a999");
    patternFrequencyStatistics.add("999a999");
    patternFrequencyStatistics.add("999aaaa");
    patternFrequencyStatistics.add("999aaaa");
    result.add(patternFrequencyStatistics);
    // Quantiles
    QuantileStatistics quantileStatistics = new QuantileStatistics();
    quantileStatistics.add(1d);
    quantileStatistics.add(2d);
    quantileStatistics.endAddValue();
    result.add(quantileStatistics);
    // Summary
    SummaryStatistics summaryStatistics = new SummaryStatistics();
    summaryStatistics.addData(1d);
    summaryStatistics.addData(2d);
    result.add(summaryStatistics);
    // Histogram
    StreamNumberHistogramStatistics histogramStatistics = new StreamNumberHistogramStatistics();
    histogramStatistics.setNumberOfBins(2);
    histogramStatistics.add(1);
    histogramStatistics.add(2);
    result.add(histogramStatistics);
    // Text length
    TextLengthStatistics textLengthStatistics = new TextLengthStatistics();
    textLengthStatistics.setMaxTextLength(30);
    textLengthStatistics.setMinTextLength(10);
    textLengthStatistics.setSumTextLength(40);
    textLengthStatistics.setCount(5);
    result.add(textLengthStatistics);
    StatisticsAdapter adapter = new StatisticsAdapter(40);
    adapter.adapt(Collections.singletonList(integerColumn), Collections.singletonList(result));
    adapter.adapt(Collections.singletonList(stringColumn), Collections.singletonList(result));
}
Also used : SemanticType(org.talend.dataquality.semantic.statistics.SemanticType) DataTypeFrequencyStatistics(org.talend.dataquality.statistics.frequency.DataTypeFrequencyStatistics) Analyzers(org.talend.dataquality.common.inference.Analyzers) CategoryFrequency(org.talend.dataquality.semantic.recognizer.CategoryFrequency) ValueQualityStatistics(org.talend.dataquality.common.inference.ValueQualityStatistics) SummaryStatistics(org.talend.dataquality.statistics.numeric.summary.SummaryStatistics) QuantileStatistics(org.talend.dataquality.statistics.numeric.quantile.QuantileStatistics) TextLengthStatistics(org.talend.dataquality.statistics.text.TextLengthStatistics) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) CardinalityStatistics(org.talend.dataquality.statistics.cardinality.CardinalityStatistics) StreamNumberHistogramStatistics(org.talend.dataprep.api.dataset.statistics.number.StreamNumberHistogramStatistics) DataTypeOccurences(org.talend.dataquality.statistics.type.DataTypeOccurences) PatternFrequencyStatistics(org.talend.dataquality.statistics.frequency.pattern.PatternFrequencyStatistics)
