Example 1 with SemanticDomain

Use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

In the class StandardizeInvalidTest, method should_accept_column.

@Test
public void should_accept_column() {
    // a column with a semantic domain
    SemanticCategoryEnum semantic = SemanticCategoryEnum.COUNTRY;
    List<SemanticDomain> semanticDomainLs = new ArrayList<>();
    semanticDomainLs.add(new SemanticDomain("COUNTRY", "Country", 0.85f));
    ColumnMetadata column = ColumnMetadata.Builder.column()
            .id(0)
            .name("name")
            .type(Type.STRING)
            .semanticDomains(semanticDomainLs)
            .domain(semantic.name())
            .build();
    assertTrue(action.acceptField(column));
}
Also used : SemanticCategoryEnum(org.talend.dataquality.semantic.classifier.SemanticCategoryEnum) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) Test(org.junit.Test) AbstractMetadataBaseTest(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest)
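
In the test above, each SemanticDomain is constructed from what appear to be an identifier, a label, and a numeric score. The following is a minimal, self-contained sketch of that shape, not the data-prep class itself: the `Domain` class and the `best` helper are hypothetical names introduced only for illustration. It shows how one might pick the highest-scoring candidate from such a list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class DomainSketch {

    // Hypothetical stand-in for SemanticDomain: an id, a label, and a score.
    static final class Domain {
        final String id;
        final String label;
        final float score;

        Domain(String id, String label, float score) {
            this.id = id;
            this.label = label;
            this.score = score;
        }
    }

    // Returns the candidate with the highest score, or empty if the list has no candidates.
    static Optional<Domain> best(List<Domain> candidates) {
        return candidates.stream().max(Comparator.comparingDouble(d -> d.score));
    }

    public static void main(String[] args) {
        List<Domain> candidates = new ArrayList<>();
        candidates.add(new Domain("COUNTRY", "Country", 0.85f));
        candidates.add(new Domain("CITY", "City", 0.10f));
        best(candidates).ifPresent(d -> System.out.println(d.id + " (" + d.label + ")"));
    }
}
```

Returning an `Optional` keeps the "no candidate" case explicit, which roughly mirrors a column that has no detected domain.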

Example 2 with SemanticDomain

Use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

In the class DataSetServiceTest, method updateDatasetColumn_should_update_domain.

@Test
public void updateDatasetColumn_should_update_domain() throws Exception {
    // given
    final String dataSetId = given()
            .body(IOUtils.toString(this.getClass().getResourceAsStream(TAGADA_CSV), UTF_8))
            .queryParam(CONTENT_TYPE, "text/csv")
            .when()
            .post("/datasets")
            .asString();
    final ColumnMetadata column;
    // update the metadata in the repository (the lock is needed, otherwise the semantic domain
    // would be erased by the analysis)
    final DistributedLock lock = dataSetMetadataRepository.createDatasetMetadataLock(dataSetId);
    DataSetMetadata dataSetMetadata;
    RowMetadata row;
    lock.lock();
    try {
        dataSetMetadata = dataSetMetadataRepository.get(dataSetId);
        assertNotNull(dataSetMetadata);
        row = dataSetMetadata.getRowMetadata();
        assertNotNull(row);
        column = row.getById("0002");
        final SemanticDomain jsoDomain = new SemanticDomain("JSO", "JSO label", 1.0F);
        column.getSemanticDomains().add(jsoDomain);
        dataSetMetadataRepository.save(dataSetMetadata);
    } finally {
        lock.unlock();
    }
    assertThat(column.getDomain(), is("FIRST_NAME"));
    assertThat(column.getDomainLabel(), is("First Name"));
    assertThat(column.getDomainFrequency(), is(100.0F));
    // when
    final Response res = given()
            .body("{\"domain\": \"JSO\"}")
            .when()
            .contentType(JSON)
            .post("/datasets/{dataSetId}/column/{columnId}", dataSetId, "0002");
    // then
    res.then().statusCode(200);
    dataSetMetadata = dataSetMetadataRepository.get(dataSetId);
    assertNotNull(dataSetMetadata);
    row = dataSetMetadata.getRowMetadata();
    assertNotNull(row);
    final ColumnMetadata actual = row.getById("0002");
    assertThat(actual.getDomain(), is("JSO"));
    assertThat(actual.getDomainLabel(), is("JSO label"));
    assertThat(actual.getDomainFrequency(), is(1.0F));
}
Also used : Response(com.jayway.restassured.response.Response) DistributedLock(org.talend.dataprep.lock.DistributedLock) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) Matchers.containsString(org.hamcrest.Matchers.containsString) Matchers.isEmptyString(org.hamcrest.Matchers.isEmptyString) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) DataSetBaseTest(org.talend.dataprep.dataset.DataSetBaseTest) Test(org.junit.Test)
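
The test above wraps the metadata update in lock() / try / finally / unlock() so that the lock is released even if an assertion or the save fails. As a hedged illustration of that pattern only, the sketch below uses `java.util.concurrent.locks.ReentrantLock` as a local stand-in for the distributed lock; the `LockedUpdateSketch` class and its map-backed store are hypothetical and not part of data-prep.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

final class LockedUpdateSketch {

    // Hypothetical in-memory store standing in for a metadata repository: column id -> domain.
    private final Map<String, String> domains = new HashMap<>();

    // Local lock standing in for the distributed lock used in the test.
    private final Lock lock = new ReentrantLock();

    // Updates the domain of a column while holding the lock.
    void setDomain(String columnId, String domain) {
        lock.lock();
        try {
            domains.put(columnId, domain);
        } finally {
            // Always release the lock, even if the update throws.
            lock.unlock();
        }
    }

    // Reads the domain of a column while holding the lock; null if the column is unknown.
    String getDomain(String columnId) {
        lock.lock();
        try {
            return domains.get(columnId);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        LockedUpdateSketch repository = new LockedUpdateSketch();
        repository.setDomain("0002", "JSO");
        System.out.println(repository.getDomain("0002"));
    }
}
```

Calling `lock()` before the `try` rather than inside it avoids unlocking a lock that was never acquired.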

Example 3 with SemanticDomain

Use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

In the class SchemaAnalysisTest, method testTDP_279.

/**
 * See <a href="https://jira.talendforge.org/browse/TDP-279">https://jira.talendforge.org/browse/TDP-279</a>.
 *
 * @throws Exception
 */
@Test
public void testTDP_279() {
    final DataSetMetadata actual = initializeDataSetMetadata(DataSetServiceTest.class.getResourceAsStream("../post_code.xls"));
    assertThat(actual.getLifecycle().schemaAnalyzed(), is(true));
    String[] expectedNames = { "zip" };
    Type[] expectedTypes = { Type.INTEGER };
    String[] expectedDomains = { "FR_POSTAL_CODE" };
    int i = 0;
    for (ColumnMetadata column : actual.getRowMetadata().getColumns()) {
        assertThat(column.getName(), is(expectedNames[i]));
        assertThat(column.getType(), is(expectedTypes[i].getName()));
        assertThat(column.getDomain(), is(expectedDomains[i++]));
        assertThat(column.getSemanticDomains()).isNotNull().isNotEmpty().hasSize(4).contains(
                new SemanticDomain("FR_POSTAL_CODE", "FR Postal Code", (float) 58.33),
                new SemanticDomain("FR_CODE_COMMUNE_INSEE", "FR Insee Code", (float) 58.33),
                new SemanticDomain("DE_POSTAL_CODE", "DE Postal Code", (float) 58.33),
                new SemanticDomain("US_POSTAL_CODE", "US Postal Code", (float) 58.33));
    }
}
Also used : Type(org.talend.dataprep.api.type.Type) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) DataSetServiceTest(org.talend.dataprep.dataset.service.DataSetServiceTest) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) Test(org.junit.Test) DataSetBaseTest(org.talend.dataprep.dataset.DataSetBaseTest)

Example 4 with SemanticDomain

Use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

In the class SchemaAnalysisTest, method testTDP_471.

/**
 * See <a href="https://jira.talendforge.org/browse/TDP-471">https://jira.talendforge.org/browse/TDP-471</a>.
 *
 * @throws Exception
 */
@Test
public void testTDP_471() {
    final DataSetMetadata actual = initializeDataSetMetadata(DataSetServiceTest.class.getResourceAsStream("../semantic_type_threshold.csv"));
    assertThat(actual.getLifecycle().schemaAnalyzed(), is(true));
    String[] expectedNames = { "gender_column" };
    Type[] expectedTypes = { Type.INTEGER };
    String[] expectedDomains = { "" };
    int i = 0;
    for (ColumnMetadata column : actual.getRowMetadata().getColumns()) {
        assertThat(column.getName(), is(expectedNames[i]));
        assertThat(column.getType(), is(expectedTypes[i].getName()));
        assertThat(column.getDomain(), is(expectedDomains[i++]));
        assertThat(column.getSemanticDomains()).isNotNull().isNotEmpty().hasSize(2).contains(
                new SemanticDomain("GENDER", "Gender", (float) 35),
                new SemanticDomain("CIVILITY", "Civility", (float) 20.833334));
    }
}
Also used : Type(org.talend.dataprep.api.type.Type) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) DataSetServiceTest(org.talend.dataprep.dataset.service.DataSetServiceTest) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) Test(org.junit.Test) DataSetBaseTest(org.talend.dataprep.dataset.DataSetBaseTest)

Example 5 with SemanticDomain

Use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

In the class TransformationService, method getSemanticDomains.

/**
 * Return the semantic domains for the given parameters.
 *
 * @param metadata the dataset metadata.
 * @param columnId the column id to analyze.
 * @param records the dataset records.
 * @return the semantic domains for the given parameters.
 * @throws IOException if the records cannot be read or parsed.
 */
private List<SemanticDomain> getSemanticDomains(DataSetMetadata metadata, String columnId, InputStream records) throws IOException {
    // copy the column metadata and set the "semantic domain forced" flag to false so that the
    // statistics adapter sets all available domains
    final ColumnMetadata columnMetadata = column()
            .copy(metadata.getRowMetadata().getById(columnId))
            .semanticDomainForce(false)
            .build();
    final Analyzer<Analyzers.Result> analyzer = analyzerService.build(columnMetadata, SEMANTIC);
    analyzer.init();
    try (final JsonParser parser = mapper.getFactory().createParser(new InputStreamReader(records, UTF_8))) {
        final DataSet dataSet = mapper.readerFor(DataSet.class).readValue(parser);
        dataSet.getRecords()
                .map(r -> r.get(columnId))
                .forEach(analyzer::analyze);
        analyzer.end();
    }
    final List<Analyzers.Result> analyzerResult = analyzer.getResult();
    statisticsAdapter.adapt(singletonList(columnMetadata), analyzerResult);
    return columnMetadata.getSemanticDomains();
}
Also used : VolumeMetered(org.talend.dataprep.metrics.VolumeMetered) LocaleContextHolder(org.springframework.context.i18n.LocaleContextHolder) StringUtils(org.apache.commons.lang.StringUtils) ContentCacheKey(org.talend.dataprep.cache.ContentCacheKey) TdqCategories(org.talend.dataquality.semantic.broadcast.TdqCategories) Autowired(org.springframework.beans.factory.annotation.Autowired) ApiParam(io.swagger.annotations.ApiParam) PreviewParameters(org.talend.dataprep.transformation.preview.api.PreviewParameters) ExportFormatMessage(org.talend.dataprep.format.export.ExportFormatMessage) ActionContext(org.talend.dataprep.transformation.api.action.context.ActionContext) Collections.singletonList(java.util.Collections.singletonList) ScopeCategory(org.talend.dataprep.transformation.actions.category.ScopeCategory) Valid(javax.validation.Valid) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) BeanConversionService(org.talend.dataprep.conversions.BeanConversionService) GetPrepMetadataAsyncCondition(org.talend.dataprep.async.conditional.GetPrepMetadataAsyncCondition) TaskExecutor(org.springframework.core.task.TaskExecutor) DataSet(org.talend.dataprep.api.dataset.DataSet) ExportParametersUtil(org.talend.dataprep.api.export.ExportParametersUtil) StepDiff(org.talend.dataprep.api.preparation.StepDiff) PreparationDetailsGet(org.talend.dataprep.command.preparation.PreparationDetailsGet) HEAD(org.talend.dataprep.api.export.ExportParameters.SourceType.HEAD) APPLICATION_OCTET_STREAM_VALUE(org.springframework.http.MediaType.APPLICATION_OCTET_STREAM_VALUE) Resource(javax.annotation.Resource) StreamingResponseBody(org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody) JSON(org.talend.dataprep.transformation.format.JsonFormat.JSON) PreparationGetContentUrlGenerator(org.talend.dataprep.async.result.PreparationGetContentUrlGenerator) SecurityProxy(org.talend.dataprep.security.SecurityProxy) Stream(java.util.stream.Stream) 
Builder.column(org.talend.dataprep.api.dataset.ColumnMetadata.Builder.column) org.springframework.web.bind.annotation(org.springframework.web.bind.annotation) GZIPOutputStream(java.util.zip.GZIPOutputStream) DynamicType(org.talend.dataprep.transformation.api.action.dynamic.DynamicType) RunnableAction(org.talend.dataprep.transformation.actions.common.RunnableAction) Analyzers(org.talend.dataquality.common.inference.Analyzers) java.util(java.util) TransformationErrorCodes(org.talend.dataprep.exception.error.TransformationErrorCodes) GenericParameter(org.talend.dataprep.transformation.api.action.dynamic.GenericParameter) Configuration(org.talend.dataprep.transformation.api.transformer.configuration.Configuration) PreviewConfiguration(org.talend.dataprep.transformation.api.transformer.configuration.PreviewConfiguration) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) TransformationMetadataCacheKey(org.talend.dataprep.cache.TransformationMetadataCacheKey) PrepMetadataExecutionIdGenerator(org.talend.dataprep.async.generator.PrepMetadataExecutionIdGenerator) PREPARATION_DOES_NOT_EXIST(org.talend.dataprep.exception.error.PreparationErrorCodes.PREPARATION_DOES_NOT_EXIST) Api(io.swagger.annotations.Api) Preparation(org.talend.dataprep.api.preparation.Preparation) ActionRegistry(org.talend.dataprep.transformation.pipeline.ActionRegistry) GetPrepContentAsyncCondition(org.talend.dataprep.async.conditional.GetPrepContentAsyncCondition) TransformationContext(org.talend.dataprep.transformation.api.action.context.TransformationContext) AggregationService(org.talend.dataprep.transformation.aggregation.AggregationService) NullOutputStream(org.apache.commons.io.output.NullOutputStream) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) DataSetGetMetadata(org.talend.dataprep.command.dataset.DataSetGetMetadata) Timed(org.talend.dataprep.metrics.Timed) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) 
DataSetGet(org.talend.dataprep.command.dataset.DataSetGet) AggregationResult(org.talend.dataprep.transformation.aggregation.api.AggregationResult) LoggerFactory(org.slf4j.LoggerFactory) Flag(org.talend.dataprep.api.dataset.row.Flag) SEMANTIC(org.talend.dataprep.quality.AnalyzerService.Analysis.SEMANTIC) ActionParser(org.talend.dataprep.transformation.api.action.ActionParser) CacheKeyGenerator(org.talend.dataprep.cache.CacheKeyGenerator) ApiOperation(io.swagger.annotations.ApiOperation) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) ExportParameters(org.talend.dataprep.api.export.ExportParameters) PrepMetadataGetContentUrlGenerator(org.talend.dataprep.async.result.PrepMetadataGetContentUrlGenerator) MediaType(org.springframework.http.MediaType) PublicAPI(org.talend.dataprep.security.PublicAPI) RequestMethod(org.springframework.web.bind.annotation.RequestMethod) Collectors(java.util.stream.Collectors) ContentCache(org.talend.dataprep.cache.ContentCache) UNEXPECTED_EXCEPTION(org.talend.dataprep.exception.error.TransformationErrorCodes.UNEXPECTED_EXCEPTION) TransformerFactory(org.talend.dataprep.transformation.api.transformer.TransformerFactory) CommonErrorCodes(org.talend.dataprep.exception.error.CommonErrorCodes) Analyzer(org.talend.dataquality.common.inference.Analyzer) ActionDefinition(org.talend.dataprep.api.action.ActionDefinition) PreparationExportStrategy(org.talend.dataprep.transformation.service.export.PreparationExportStrategy) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) ExportFormat(org.talend.dataprep.format.export.ExportFormat) TDPException(org.talend.dataprep.exception.TDPException) JsonErrorCodeDescription(org.talend.dataprep.exception.json.JsonErrorCodeDescription) ExceptionContext.build(org.talend.daikon.exception.ExceptionContext.build) ExportParametersExecutionIdGenerator(org.talend.dataprep.async.generator.ExportParametersExecutionIdGenerator) ExceptionContext(org.talend.daikon.exception.ExceptionContext) 
org.talend.dataprep.async(org.talend.dataprep.async) Suggestion(org.talend.dataprep.transformation.api.transformer.suggestion.Suggestion) SuggestionEngine(org.talend.dataprep.transformation.api.transformer.suggestion.SuggestionEngine) Logger(org.slf4j.Logger) LocaleContextHolder.getLocale(org.springframework.context.i18n.LocaleContextHolder.getLocale) JsonParser(com.fasterxml.jackson.core.JsonParser) UTF_8(java.nio.charset.StandardCharsets.UTF_8) Step(org.talend.dataprep.api.preparation.Step) APPLICATION_JSON_VALUE(org.springframework.http.MediaType.APPLICATION_JSON_VALUE) ApplicationContext(org.springframework.context.ApplicationContext) ActionForm(org.talend.dataprep.api.action.ActionForm) AggregationParameters(org.talend.dataprep.transformation.aggregation.api.AggregationParameters) java.io(java.io) TdqCategoriesFactory(org.talend.dataquality.semantic.broadcast.TdqCategoriesFactory)
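
The method above drives its analyzer through a fixed sequence: init(), one analyze() call per value, end(), then getResult(). To make that lifecycle concrete, here is a small self-contained sketch. `FiveDigitAnalyzer` and `percentFiveDigit` are hypothetical names and do not reproduce the Talend `Analyzer` interface; the rule that getResult() requires a prior end() is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class AnalyzerLifecycleSketch {

    // Hypothetical analyzer: reports the percentage of values that are exactly five digits.
    static final class FiveDigitAnalyzer {
        private static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");
        private int total;
        private int matches;
        private boolean ended;

        // Resets all counters before a new run.
        void init() {
            total = 0;
            matches = 0;
            ended = false;
        }

        // Consumes one value and updates the counters.
        void analyze(String value) {
            total++;
            if (value != null && FIVE_DIGITS.matcher(value).matches()) {
                matches++;
            }
        }

        // Marks the end of the input.
        void end() {
            ended = true;
        }

        // Returns a single-element result list; this sketch assumes end() was called first.
        List<Float> getResult() {
            if (!ended) {
                throw new IllegalStateException("end() must be called before getResult()");
            }
            return Collections.singletonList(total == 0 ? 0f : 100f * matches / total);
        }
    }

    // Runs the full init / analyze / end / getResult sequence over the given values.
    static float percentFiveDigit(List<String> values) {
        FiveDigitAnalyzer analyzer = new FiveDigitAnalyzer();
        analyzer.init();
        values.forEach(analyzer::analyze);
        analyzer.end();
        return analyzer.getResult().get(0);
    }

    public static void main(String[] args) {
        System.out.println(percentFiveDigit(Arrays.asList("75001", "abc", "10115", "1234")));
    }
}
```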

Aggregations

SemanticDomain (org.talend.dataprep.api.dataset.statistics.SemanticDomain) 11
ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata) 10
Test (org.junit.Test) 7
DataSetMetadata (org.talend.dataprep.api.dataset.DataSetMetadata) 7
RowMetadata (org.talend.dataprep.api.dataset.RowMetadata) 6
ApiOperation (io.swagger.annotations.ApiOperation) 4
DataSetRow (org.talend.dataprep.api.dataset.row.DataSetRow) 4
Statistics (org.talend.dataprep.api.dataset.statistics.Statistics) 3
DataSetBaseTest (org.talend.dataprep.dataset.DataSetBaseTest) 3
TDPException (org.talend.dataprep.exception.TDPException) 3
Timed (org.talend.dataprep.metrics.Timed) 3
AbstractMetadataBaseTest (org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest) 3
Api (io.swagger.annotations.Api) 2
ApiParam (io.swagger.annotations.ApiParam) 2
Collections.singletonList (java.util.Collections.singletonList) 2
RequestMapping (org.springframework.web.bind.annotation.RequestMapping) 2
Type (org.talend.dataprep.api.type.Type) 2
DataSetServiceTest (org.talend.dataprep.dataset.service.DataSetServiceTest) 2
ContentAnalysis (org.talend.dataprep.dataset.service.analysis.synchronous.ContentAnalysis) 2
FormatAnalysis (org.talend.dataprep.dataset.service.analysis.synchronous.FormatAnalysis) 2