Example 1 with Type

Use of org.talend.dataprep.api.type.Type in project data-prep by Talend.

Class StatisticsAdapter, method injectSemanticTypes.

private void injectSemanticTypes(final ColumnMetadata column, final Analyzers.Result result) {
    if (result.exist(SemanticType.class) && !column.isDomainForced()) {
        final SemanticType semanticType = result.get(SemanticType.class);
        final List<CategoryFrequency> suggestedTypes = semanticType.getSuggestedCategories();
        // TDP-471: Don't pick semantic type if lower than a threshold.
        final Optional<CategoryFrequency> bestMatch = suggestedTypes.stream()
                .filter(e -> !e.getCategoryName().isEmpty())
                .findFirst();
        if (bestMatch.isPresent()) {
            // TODO (TDP-734) Take into account limit of the semantic analyzer.
            final float score = bestMatch.get().getScore();
            if (score > semanticThreshold) {
                updateMetadataWithCategoryInfo(column, bestMatch.get());
            } else {
                // Ensure the domain is cleared if the score is lower than the threshold (an earlier analysis,
                // e.g. on the first 20 lines, may be over the threshold, but a full scan may decide otherwise).
                resetDomain(column);
            }
        } else if (StringUtils.isNotEmpty(column.getDomain())) {
            // Column *had* a domain but seems like new analysis removed it.
            resetDomain(column);
        }
        // Keep all suggested semantic categories in the column metadata
        List<SemanticDomain> semanticDomains = suggestedTypes.stream()
                .map(this::toSemanticDomain)
                .filter(semanticDomain -> semanticDomain != null && semanticDomain.getScore() >= 1)
                .limit(10)
                .collect(Collectors.toList());
        column.setSemanticDomains(semanticDomains);
    }
}
Also used : Analyzers(org.talend.dataquality.common.inference.Analyzers) java.util(java.util) StringUtils(org.apache.commons.lang.StringUtils) DateHistogram(org.talend.dataprep.api.dataset.statistics.date.DateHistogram) CardinalityStatistics(org.talend.dataquality.statistics.cardinality.CardinalityStatistics) TypeUtils(org.talend.dataprep.api.type.TypeUtils) DataTypeFrequencyStatistics(org.talend.dataquality.statistics.frequency.DataTypeFrequencyStatistics) LoggerFactory(org.slf4j.LoggerFactory) Quality(org.talend.dataprep.api.dataset.Quality) StreamNumberHistogramStatistics(org.talend.dataprep.api.dataset.statistics.number.StreamNumberHistogramStatistics) NumberFormat(java.text.NumberFormat) org.talend.dataprep.api.dataset.statistics(org.talend.dataprep.api.dataset.statistics) DataTypeEnum(org.talend.dataquality.statistics.type.DataTypeEnum) ValueQualityStatistics(org.talend.dataquality.common.inference.ValueQualityStatistics) SummaryStatistics(org.talend.dataquality.statistics.numeric.summary.SummaryStatistics) CategoryFrequency(org.talend.dataquality.semantic.recognizer.CategoryFrequency) SemanticType(org.talend.dataquality.semantic.statistics.SemanticType) NumberHistogram(org.talend.dataprep.api.dataset.statistics.number.NumberHistogram) Logger(org.slf4j.Logger) Predicate(java.util.function.Predicate) DQCategory(org.talend.dataquality.semantic.model.DQCategory) DecimalFormat(java.text.DecimalFormat) StreamDateHistogramStatistics(org.talend.dataprep.api.dataset.statistics.date.StreamDateHistogramStatistics) TextLengthStatistics(org.talend.dataquality.statistics.text.TextLengthStatistics) Collectors(java.util.stream.Collectors) PatternFrequencyStatistics(org.talend.dataquality.statistics.frequency.pattern.PatternFrequencyStatistics) Type(org.talend.dataprep.api.type.Type) CategoryRegistryManager(org.talend.dataquality.semantic.api.CategoryRegistryManager) QuantileStatistics(org.talend.dataquality.statistics.numeric.quantile.QuantileStatistics) 
ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) DataTypeOccurences(org.talend.dataquality.statistics.type.DataTypeOccurences)
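
The selection logic in injectSemanticTypes can be isolated as a small sketch. It is not the Talend implementation: Candidate and ThresholdPick are hypothetical stand-ins for CategoryFrequency and the adapter, and the scores are made-up values. It illustrates one detail of the method above: only the first named candidate is compared against the threshold, so a later candidate with a higher score is never considered.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for CategoryFrequency; not the Talend class.
final class Candidate {
    final String name;
    final float score;

    Candidate(String name, float score) {
        this.name = name;
        this.score = score;
    }
}

final class ThresholdPick {

    // Takes the first candidate with a non-empty name, then keeps it only
    // if its score is strictly greater than the threshold.
    static Optional<Candidate> pick(List<Candidate> candidates, float threshold) {
        return candidates.stream()
                .filter(c -> !c.name.isEmpty())
                .findFirst()
                .filter(c -> c.score > threshold);
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("", 95f),         // skipped: empty name
                new Candidate("CITY", 80f),     // first named candidate
                new Candidate("COUNTRY", 99f)); // never examined

        System.out.println(pick(candidates, 40f).map(c -> c.name).orElse("<none>")); // CITY
        System.out.println(pick(candidates, 90f).map(c -> c.name).orElse("<none>")); // <none>
    }
}
```

In the original method the empty result corresponds to the resetDomain(column) branches.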

Example 2 with Type

Use of org.talend.dataprep.api.type.Type in project data-prep by Talend.

Class ChangeDatePatternTest, method should_set_new_pattern_as_most_used_one_newcolumn.

@Test
public void should_set_new_pattern_as_most_used_one_newcolumn() throws Exception {
    // given
    final DataSetRow row = builder()
            .with(value("toto").type(Type.STRING).name("recipe"))
            .with(value("04/25/1999").type(Type.DATE).name("recipe")
                    .statistics(getDateTestJsonAsStream("statistics_MM_dd_yyyy.json")))
            .with(value("tata").type(Type.STRING).name("last update"))
            .build();
    parameters.put(CREATE_NEW_COLUMN, "true");
    // when
    ActionTestWorkbench.test(row, actionRegistry, factory.create(action, parameters));
    // then
    final List<PatternFrequency> patternFrequencies = row.getRowMetadata()
            .getById("0003")
            .getStatistics()
            .getPatternFrequencies();
    String newPattern = parameters.get("new_pattern");
    final Optional<PatternFrequency> newPatternSet = patternFrequencies.stream()
            .filter(p -> StringUtils.equals(newPattern, p.getPattern()))
            .findFirst();
    assertTrue(newPatternSet.isPresent());
    // JUnit expects (expected, actual), so the expected count comes first.
    assertEquals(48, newPatternSet.get().getOccurrences());
}
Also used : CoreMatchers.is(org.hamcrest.CoreMatchers.is) ImplicitParameters(org.talend.dataprep.transformation.actions.common.ImplicitParameters) StringUtils(org.apache.commons.lang.StringUtils) Arrays(java.util.Arrays) CREATE_NEW_COLUMN(org.talend.dataprep.transformation.actions.common.ActionsUtils.CREATE_NEW_COLUMN) HashMap(java.util.HashMap) ValueBuilder.value(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest.ValueBuilder.value) ActionMetadataTestUtils.getRow(org.talend.dataprep.transformation.actions.ActionMetadataTestUtils.getRow) ActionMetadataTestUtils.getColumn(org.talend.dataprep.transformation.actions.ActionMetadataTestUtils.getColumn) Assert.assertThat(org.junit.Assert.assertThat) ActionTestWorkbench(org.talend.dataprep.transformation.api.action.ActionTestWorkbench) ActionCategory(org.talend.dataprep.transformation.actions.category.ActionCategory) Locale(java.util.Locale) Map(java.util.Map) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) PatternFrequency(org.talend.dataprep.api.dataset.statistics.PatternFrequency) Before(org.junit.Before) TalendRuntimeException(org.talend.daikon.exception.TalendRuntimeException) ActionMetadataTestUtils(org.talend.dataprep.transformation.actions.ActionMetadataTestUtils) Assert.assertTrue(org.junit.Assert.assertTrue) Test(org.junit.Test) IOException(java.io.IOException) ValuesBuilder.builder(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest.ValuesBuilder.builder) Type(org.talend.dataprep.api.type.Type) List(java.util.List) Builder.column(org.talend.dataprep.api.dataset.ColumnMetadata.Builder.column) Assert.assertFalse(org.junit.Assert.assertFalse) Optional(java.util.Optional) ActionMetadataTestUtils.setStatistics(org.talend.dataprep.transformation.actions.ActionMetadataTestUtils.setStatistics) ActionDefinition(org.talend.dataprep.api.action.ActionDefinition) SelectParameter(org.talend.dataprep.parameters.SelectParameter) Collections(java.util.Collections) 
Assert.assertEquals(org.junit.Assert.assertEquals) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata)

Example 3 with Type

Use of org.talend.dataprep.api.type.Type in project data-prep by Talend.

Class ChangeDatePatternTest, method should_set_new_pattern_as_most_used_one.

@Test
public void should_set_new_pattern_as_most_used_one() throws Exception {
    // given
    final DataSetRow row = builder()
            .with(value("toto").type(Type.STRING).name("tips"))
            .with(value("04/25/1999").type(Type.DATE).name("date")
                    .statistics(getDateTestJsonAsStream("statistics_MM_dd_yyyy.json")))
            .with(value("tata").type(Type.STRING).name("test"))
            .build();
    // when
    ActionTestWorkbench.test(row, actionRegistry, factory.create(action, parameters));
    // then
    final List<PatternFrequency> patternFrequencies = row.getRowMetadata()
            .getById("0001")
            .getStatistics()
            .getPatternFrequencies();
    String newPattern = parameters.get("new_pattern");
    final Optional<PatternFrequency> newPatternSet = patternFrequencies.stream()
            .filter(p -> StringUtils.equals(newPattern, p.getPattern()))
            .findFirst();
    assertTrue(newPatternSet.isPresent());
    // JUnit expects (expected, actual), so the expected count comes first.
    assertEquals(48, newPatternSet.get().getOccurrences());
}
Also used : CoreMatchers.is(org.hamcrest.CoreMatchers.is) ImplicitParameters(org.talend.dataprep.transformation.actions.common.ImplicitParameters) StringUtils(org.apache.commons.lang.StringUtils) Arrays(java.util.Arrays) CREATE_NEW_COLUMN(org.talend.dataprep.transformation.actions.common.ActionsUtils.CREATE_NEW_COLUMN) HashMap(java.util.HashMap) ValueBuilder.value(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest.ValueBuilder.value) ActionMetadataTestUtils.getRow(org.talend.dataprep.transformation.actions.ActionMetadataTestUtils.getRow) ActionMetadataTestUtils.getColumn(org.talend.dataprep.transformation.actions.ActionMetadataTestUtils.getColumn) Assert.assertThat(org.junit.Assert.assertThat) ActionTestWorkbench(org.talend.dataprep.transformation.api.action.ActionTestWorkbench) ActionCategory(org.talend.dataprep.transformation.actions.category.ActionCategory) Locale(java.util.Locale) Map(java.util.Map) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) PatternFrequency(org.talend.dataprep.api.dataset.statistics.PatternFrequency) Before(org.junit.Before) TalendRuntimeException(org.talend.daikon.exception.TalendRuntimeException) ActionMetadataTestUtils(org.talend.dataprep.transformation.actions.ActionMetadataTestUtils) Assert.assertTrue(org.junit.Assert.assertTrue) Test(org.junit.Test) IOException(java.io.IOException) ValuesBuilder.builder(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest.ValuesBuilder.builder) Type(org.talend.dataprep.api.type.Type) List(java.util.List) Builder.column(org.talend.dataprep.api.dataset.ColumnMetadata.Builder.column) Assert.assertFalse(org.junit.Assert.assertFalse) Optional(java.util.Optional) ActionMetadataTestUtils.setStatistics(org.talend.dataprep.transformation.actions.ActionMetadataTestUtils.setStatistics) ActionDefinition(org.talend.dataprep.api.action.ActionDefinition) SelectParameter(org.talend.dataprep.parameters.SelectParameter) Collections(java.util.Collections) 
Assert.assertEquals(org.junit.Assert.assertEquals) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata)
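
Both ChangeDatePatternTest examples end with the same lookup: find the entry whose pattern equals the requested one, then check its occurrence count. The sketch below reproduces that lookup with plain JDK classes only. PatternCount and PatternLookup are hypothetical names, not the data-prep PatternFrequency API, and the counts are illustrative. Objects.equals stands in for StringUtils.equals because it is also null-safe.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Hypothetical stand-in for PatternFrequency; not the data-prep class.
final class PatternCount {
    final String pattern;
    final long occurrences;

    PatternCount(String pattern, long occurrences) {
        this.pattern = pattern;
        this.occurrences = occurrences;
    }
}

final class PatternLookup {

    // Returns the first entry whose pattern equals the wanted one.
    // Objects.equals tolerates a null argument on either side.
    static Optional<PatternCount> find(List<PatternCount> counts, String wanted) {
        return counts.stream()
                .filter(p -> Objects.equals(wanted, p.pattern))
                .findFirst();
    }

    public static void main(String[] args) {
        List<PatternCount> counts = Arrays.asList(
                new PatternCount("MM/dd/yyyy", 47L),
                new PatternCount("dd/MM/yyyy", 48L));

        Optional<PatternCount> match = find(counts, "dd/MM/yyyy");
        System.out.println(match.isPresent());       // true
        System.out.println(match.get().occurrences); // 48
    }
}
```

Checking isPresent() before calling get(), as the tests do with assertTrue, reports a missing pattern as a readable assertion failure instead of a NoSuchElementException.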

Example 4 with Type

Use of org.talend.dataprep.api.type.Type in project data-prep by Talend.

Class TypeChangeTest, method should_not_accept_any_type_to_avoid_transformation_to_be_in_transfo_list.

@Test
public void should_not_accept_any_type_to_avoid_transformation_to_be_in_transfo_list() {
    // given
    final DomainChange domainChange = new DomainChange();
    for (final Type type : Type.values()) {
        final ColumnMetadata columnMetadata = ColumnMetadata.Builder.column()
                .type(type)
                .computedId("0002")
                .domain("FR_BEER")
                .domainFrequency(1)
                .domainLabel("French Beer")
                .build();
        // when
        final boolean accepted = domainChange.acceptField(columnMetadata);
        // then
        assertThat(accepted).isTrue();
    }
}
Also used : Type(org.talend.dataprep.api.type.Type) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) Test(org.junit.Test) AbstractMetadataBaseTest(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest)
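
The test above loops over Type.values() so that every enum constant is exercised. The same idea is sketched below with a stream and allMatch. SketchType and the accept rule are hypothetical: the real DomainChange.acceptField logic is not shown in this example, so the rule here (accept any column that has a non-empty domain, whatever its type) is only an assumption consistent with the assertion in the test.

```java
import java.util.Arrays;

// Hypothetical enum standing in for org.talend.dataprep.api.type.Type.
enum SketchType { STRING, INTEGER, DATE, BOOLEAN }

final class AcceptAllTypes {

    // Assumed rule, for illustration only: the type is ignored and a
    // column is accepted as soon as it carries a non-empty domain.
    static boolean accept(SketchType type, String domain) {
        return domain != null && !domain.isEmpty();
    }

    // True only if the rule holds for every constant of the enum.
    static boolean acceptsEveryType(String domain) {
        return Arrays.stream(SketchType.values())
                .allMatch(type -> accept(type, domain));
    }

    public static void main(String[] args) {
        System.out.println(acceptsEveryType("FR_BEER")); // true
        System.out.println(acceptsEveryType(""));        // false
    }
}
```

Iterating over values() rather than listing constants by hand means a constant added to the enum later is covered without touching the test.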

Example 5 with Type

Use of org.talend.dataprep.api.type.Type in project data-prep by Talend.

Class QualityAnalysisTest, method TDP_1150_string_must_be_detected_as_so_if_even_if_subtype_is_integer.

/**
 * This test ensures that String is detected as the column type even when the subtype (integer) of the most
 * frequent type (String) is used to detect invalid values.
 *
 * See <a href="https://jira.talendforge.org/browse/TDP-1150">https://jira.talendforge.org/browse/TDP-1150</a>.
 */
@Test
public void TDP_1150_string_must_be_detected_as_so_if_even_if_subtype_is_integer() {
    final DataSetMetadata actual = initializeDataSetMetadata(DataSetServiceTest.class.getResourceAsStream("../valid_must_be_text1.csv"));
    assertThat(actual.getLifecycle().schemaAnalyzed(), is(true));
    String expectedName = "user_id";
    Type expectedType = Type.STRING;
    ColumnMetadata column = actual.getRowMetadata().getColumns().get(0);
    assertThat(column.getName(), is(expectedName));
    assertThat(column.getType(), is(expectedType.getName()));
}
Also used : Type(org.talend.dataprep.api.type.Type) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) DataSetServiceTest(org.talend.dataprep.dataset.service.DataSetServiceTest) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) Test(org.junit.Test) DataSetBaseTest(org.talend.dataprep.dataset.DataSetBaseTest)

Aggregations

Type (org.talend.dataprep.api.type.Type) 24
ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata) 21
Test (org.junit.Test) 17
DataSetMetadata (org.talend.dataprep.api.dataset.DataSetMetadata) 14
DataSetBaseTest (org.talend.dataprep.dataset.DataSetBaseTest) 13
DataSetServiceTest (org.talend.dataprep.dataset.service.DataSetServiceTest) 12
Arrays (java.util.Arrays) 4
List (java.util.List) 4
Optional (java.util.Optional) 3
StringUtils (org.apache.commons.lang.StringUtils) 3
Assert.assertEquals (org.junit.Assert.assertEquals) 3
Builder.column (org.talend.dataprep.api.dataset.ColumnMetadata.Builder.column) 3
RowMetadata (org.talend.dataprep.api.dataset.RowMetadata) 3
PatternFrequency (org.talend.dataprep.api.dataset.statistics.PatternFrequency) 3
IOException (java.io.IOException) 2
Collections (java.util.Collections) 2
HashMap (java.util.HashMap) 2
Locale (java.util.Locale) 2
Map (java.util.Map) 2
Collectors (java.util.stream.Collectors) 2