Example 6 with SemanticDomain

use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

the class CopyColumnTest method test_TDP_567_with_force_false.

@Test
public void test_TDP_567_with_force_false() throws Exception {
    List<ColumnMetadata> input = new ArrayList<>();
    final ColumnMetadata original = createMetadata("0001", "column");
    original.setStatistics(new Statistics());
    SemanticDomain semanticDomain = new SemanticDomain("mountain_goat", "Mountain goat pale pale", 1);
    original.setDomain("beer");
    original.setDomainFrequency(1);
    original.setDomainLabel("the best beer");
    original.setDomainForced(false);
    original.setTypeForced(false);
    original.setSemanticDomains(Collections.singletonList(semanticDomain));
    input.add(original);
    RowMetadata rowMetadata = new RowMetadata(input);
    assertThat(rowMetadata.getColumns()).isNotNull().isNotEmpty().hasSize(1);
    final DataSetRow row = new DataSetRow(rowMetadata);
    ActionTestWorkbench.test(row, actionRegistry, factory.create(action, parameters));
    List<ColumnMetadata> actual = row.getRowMetadata().getColumns();
    assertThat(actual).isNotNull().isNotEmpty().hasSize(2);
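    // the copy must have statistics equal to those of the original column
    assertEquals(original.getStatistics(), actual.get(1).getStatistics());
    // the domain fields and both "forced" flags (false here) must match the original
    assertThat(actual.get(1)).isEqualToComparingOnlyGivenFields(original, "domain", "domainLabel", "domainFrequency", "domainForced", "typeForced");
    // the candidate semantic domains must be present on the copy as well
    assertThat(actual.get(1).getSemanticDomains()).isNotNull().isNotEmpty().contains(semanticDomain);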
}
Also used : ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) Statistics(org.talend.dataprep.api.dataset.statistics.Statistics) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) Test(org.junit.Test) AbstractMetadataBaseTest(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest)

Example 7 with SemanticDomain

use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

the class CopyColumnTest method test_TDP_567_with_force_true.

@Test
public void test_TDP_567_with_force_true() throws Exception {
    List<ColumnMetadata> input = new ArrayList<>();
    final ColumnMetadata original = createMetadata("0001", "column");
    original.setStatistics(new Statistics());
    SemanticDomain semanticDomain = new SemanticDomain("mountain_goat", "Mountain goat pale pale", 1);
    original.setDomain("beer");
    original.setDomainFrequency(1);
    original.setDomainLabel("the best beer");
    original.setDomainForced(true);
    original.setTypeForced(true);
    original.setSemanticDomains(Collections.singletonList(semanticDomain));
    input.add(original);
    RowMetadata rowMetadata = new RowMetadata(input);
    assertThat(rowMetadata.getColumns()).isNotNull().isNotEmpty().hasSize(1);
    final DataSetRow row = new DataSetRow(rowMetadata);
    ActionTestWorkbench.test(row, actionRegistry, factory.create(action, parameters));
    List<ColumnMetadata> actual = row.getRowMetadata().getColumns();
    assertThat(actual).isNotNull().isNotEmpty().hasSize(2);
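    // the copy must have statistics equal to those of the original column
    assertEquals(original.getStatistics(), actual.get(1).getStatistics());
    // the domain fields and both "forced" flags (true here) must match the original
    assertThat(actual.get(1)).isEqualToComparingOnlyGivenFields(original, "domain", "domainLabel", "domainFrequency", "domainForced", "typeForced");
    // the candidate semantic domains must be present on the copy as well
    assertThat(actual.get(1).getSemanticDomains()).isNotNull().isNotEmpty().contains(semanticDomain);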
}
Also used : ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) Statistics(org.talend.dataprep.api.dataset.statistics.Statistics) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) Test(org.junit.Test) AbstractMetadataBaseTest(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest)

Example 8 with SemanticDomain

use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

the class CopyColumnTest method should_copy_semantic.

@Test
public void should_copy_semantic() throws Exception {
    List<ColumnMetadata> input = new ArrayList<>();
    final ColumnMetadata original = createMetadata("0001", "column");
    original.setStatistics(new Statistics());
    SemanticDomain semanticDomain = new SemanticDomain("mountain_goat", "Mountain goat pale pale", 1);
    original.setDomain("beer");
    original.setDomainFrequency(1);
    original.setDomainLabel("the best beer");
    original.setSemanticDomains(Collections.singletonList(semanticDomain));
    input.add(original);
    RowMetadata rowMetadata = new RowMetadata(input);
    assertThat(rowMetadata.getColumns()).isNotNull().isNotEmpty().hasSize(1);
    final DataSetRow row = new DataSetRow(rowMetadata);
    ActionTestWorkbench.test(row, actionRegistry, factory.create(action, parameters));
    List<ColumnMetadata> actual = row.getRowMetadata().getColumns();
    assertThat(actual).isNotNull().isNotEmpty().hasSize(2);
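    // the copy must have statistics equal to those of the original column
    assertEquals(original.getStatistics(), actual.get(1).getStatistics());
    // the domain, its label and its frequency must match the original
    assertThat(actual.get(1)).isEqualToComparingOnlyGivenFields(original, "domain", "domainLabel", "domainFrequency");
    // the candidate semantic domains must be present on the copy as well
    assertThat(actual.get(1).getSemanticDomains()).isNotNull().isNotEmpty().contains(semanticDomain);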
}
Also used : ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) Statistics(org.talend.dataprep.api.dataset.statistics.Statistics) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) Test(org.junit.Test) AbstractMetadataBaseTest(org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest)

Example 9 with SemanticDomain

use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

the class DataSetService method getDataSetColumnSemanticCategories.

/**
 * Return the semantic types for a given dataset / column.
 *
 * @param datasetId the dataset id.
 * @param columnId the column id.
 * @return the semantic types for a given dataset / column.
 */
@RequestMapping(value = "/datasets/{datasetId}/columns/{columnId}/types", method = GET)
@ApiOperation(value = "list the types of the wanted column", notes = "This list can be used by user to change the column type.")
@Timed
@PublicAPI
public List<SemanticDomain> getDataSetColumnSemanticCategories(@ApiParam(value = "The dataset id") @PathVariable String datasetId, @ApiParam(value = "The column id") @PathVariable String columnId) {
    LOG.debug("listing semantic categories for dataset #{} column #{}", datasetId, columnId);
    final DataSetMetadata metadata = dataSetMetadataRepository.get(datasetId);
    if (metadata == null) {
        throw new TDPException(DataSetErrorCodes.DATASET_DOES_NOT_EXIST, ExceptionContext.withBuilder().put("id", datasetId).build());
    } else {
        try (final Stream<DataSetRow> records = contentStore.stream(metadata)) {
            final ColumnMetadata columnMetadata = metadata.getRowMetadata().getById(columnId);
            final Analyzer<Analyzers.Result> analyzer = analyzerService.build(columnMetadata, SEMANTIC);
            analyzer.init();
            records.map(r -> r.get(columnId)).forEach(analyzer::analyze);
            analyzer.end();
            final List<Analyzers.Result> analyzerResult = analyzer.getResult();
            final StatisticsAdapter statisticsAdapter = new StatisticsAdapter(40);
            statisticsAdapter.adapt(singletonList(columnMetadata), analyzerResult);
            LOG.debug("found {} for dataset #{}, column #{}", columnMetadata.getSemanticDomains(), datasetId, columnId);
            return columnMetadata.getSemanticDomains();
        }
    }
}
Also used : TDPException(org.talend.dataprep.exception.TDPException) VolumeMetered(org.talend.dataprep.metrics.VolumeMetered) RequestParam(org.springframework.web.bind.annotation.RequestParam) ImportBuilder(org.talend.dataprep.api.dataset.Import.ImportBuilder) FormatFamilyFactory(org.talend.dataprep.schema.FormatFamilyFactory) Autowired(org.springframework.beans.factory.annotation.Autowired) ApiParam(io.swagger.annotations.ApiParam) StringUtils(org.apache.commons.lang3.StringUtils) TEXT_PLAIN_VALUE(org.springframework.http.MediaType.TEXT_PLAIN_VALUE) SortAndOrderHelper.getDataSetMetadataComparator(org.talend.dataprep.util.SortAndOrderHelper.getDataSetMetadataComparator) Collections.singletonList(java.util.Collections.singletonList) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) BeanConversionService(org.talend.dataprep.conversions.BeanConversionService) PipedInputStream(java.io.PipedInputStream) DistributedLock(org.talend.dataprep.lock.DistributedLock) Arrays.asList(java.util.Arrays.asList) Map(java.util.Map) DataprepBundle.message(org.talend.dataprep.i18n.DataprepBundle.message) UserData(org.talend.dataprep.api.user.UserData) TaskExecutor(org.springframework.core.task.TaskExecutor) MAX_STORAGE_MAY_BE_EXCEEDED(org.talend.dataprep.exception.error.DataSetErrorCodes.MAX_STORAGE_MAY_BE_EXCEEDED) DataSet(org.talend.dataprep.api.dataset.DataSet) LocalStoreLocation(org.talend.dataprep.api.dataset.location.LocalStoreLocation) FormatFamily(org.talend.dataprep.schema.FormatFamily) Resource(javax.annotation.Resource) Set(java.util.Set) DatasetUpdatedEvent(org.talend.dataprep.dataset.event.DatasetUpdatedEvent) RestController(org.springframework.web.bind.annotation.RestController) QuotaService(org.talend.dataprep.dataset.store.QuotaService) Stream(java.util.stream.Stream) StreamSupport.stream(java.util.stream.StreamSupport.stream) FlagNames(org.talend.dataprep.api.dataset.row.FlagNames) 
UNEXPECTED_CONTENT(org.talend.dataprep.exception.error.CommonErrorCodes.UNEXPECTED_CONTENT) Analyzers(org.talend.dataquality.common.inference.Analyzers) DataSetLocatorService(org.talend.dataprep.api.dataset.location.locator.DataSetLocatorService) Callable(java.util.concurrent.Callable) Schema(org.talend.dataprep.schema.Schema) ArrayList(java.util.ArrayList) Value(org.springframework.beans.factory.annotation.Value) RequestBody(org.springframework.web.bind.annotation.RequestBody) DataSetLocationService(org.talend.dataprep.api.dataset.location.DataSetLocationService) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) UserDataRepository(org.talend.dataprep.user.store.UserDataRepository) Markers(org.talend.dataprep.log.Markers) Api(io.swagger.annotations.Api) DraftValidator(org.talend.dataprep.schema.DraftValidator) HttpResponseContext(org.talend.dataprep.http.HttpResponseContext) Sort(org.talend.dataprep.util.SortAndOrderHelper.Sort) IOException(java.io.IOException) PipedOutputStream(java.io.PipedOutputStream) FormatAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.FormatAnalysis) ContentAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.ContentAnalysis) SchemaAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.SchemaAnalysis) HttpStatus(org.springframework.http.HttpStatus) FilterService(org.talend.dataprep.api.filter.FilterService) Marker(org.slf4j.Marker) NullOutputStream(org.apache.commons.io.output.NullOutputStream) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) Timed(org.talend.dataprep.metrics.Timed) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) PathVariable(org.springframework.web.bind.annotation.PathVariable) DataSetMetadataBuilder(org.talend.dataprep.dataset.DataSetMetadataBuilder) URLDecoder(java.net.URLDecoder) DataSetErrorCodes(org.talend.dataprep.exception.error.DataSetErrorCodes) PUT(org.springframework.web.bind.annotation.RequestMethod.PUT) 
LoggerFactory(org.slf4j.LoggerFactory) SEMANTIC(org.talend.dataprep.quality.AnalyzerService.Analysis.SEMANTIC) ApiOperation(io.swagger.annotations.ApiOperation) UNABLE_TO_CREATE_OR_UPDATE_DATASET(org.talend.dataprep.exception.error.DataSetErrorCodes.UNABLE_TO_CREATE_OR_UPDATE_DATASET) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) StrictlyBoundedInputStream(org.talend.dataprep.dataset.store.content.StrictlyBoundedInputStream) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) UNSUPPORTED_CONTENT(org.talend.dataprep.exception.error.DataSetErrorCodes.UNSUPPORTED_CONTENT) TimeToLive(org.talend.dataprep.cache.ContentCache.TimeToLive) Order(org.talend.dataprep.util.SortAndOrderHelper.Order) Collections.emptyList(java.util.Collections.emptyList) PublicAPI(org.talend.dataprep.security.PublicAPI) RequestMethod(org.springframework.web.bind.annotation.RequestMethod) UUID(java.util.UUID) Collectors(java.util.stream.Collectors) ContentCache(org.talend.dataprep.cache.ContentCache) INVALID_DATASET_NAME(org.talend.dataprep.exception.error.DataSetErrorCodes.INVALID_DATASET_NAME) List(java.util.List) Optional(java.util.Optional) Analyzer(org.talend.dataquality.common.inference.Analyzer) RequestHeader(org.springframework.web.bind.annotation.RequestHeader) Pattern(java.util.regex.Pattern) Security(org.talend.dataprep.security.Security) Spliterator(java.util.Spliterator) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) ComponentProperties(org.talend.dataprep.parameters.jsonschema.ComponentProperties) TDPException(org.talend.dataprep.exception.TDPException) JsonErrorCodeDescription(org.talend.dataprep.exception.json.JsonErrorCodeDescription) RequestMapping(org.springframework.web.bind.annotation.RequestMapping) UNABLE_CREATE_DATASET(org.talend.dataprep.exception.error.DataSetErrorCodes.UNABLE_CREATE_DATASET) HashMap(java.util.HashMap) GET(org.springframework.web.bind.annotation.RequestMethod.GET) Import(org.talend.dataprep.api.dataset.Import) 
ExceptionContext.build(org.talend.daikon.exception.ExceptionContext.build) ExceptionContext(org.talend.daikon.exception.ExceptionContext) Charset(java.nio.charset.Charset) UpdateColumnParameters(org.talend.dataprep.dataset.service.api.UpdateColumnParameters) VersionService(org.talend.dataprep.api.service.info.VersionService) POST(org.springframework.web.bind.annotation.RequestMethod.POST) OutputStream(java.io.OutputStream) DataSetLocation(org.talend.dataprep.api.dataset.DataSetLocation) Logger(org.slf4j.Logger) LocaleContextHolder.getLocale(org.springframework.context.i18n.LocaleContextHolder.getLocale) UpdateDataSetCacheKey(org.talend.dataprep.dataset.service.cache.UpdateDataSetCacheKey) IOUtils(org.apache.commons.compress.utils.IOUtils) APPLICATION_JSON_VALUE(org.springframework.http.MediaType.APPLICATION_JSON_VALUE) ResponseBody(org.springframework.web.bind.annotation.ResponseBody) Certification(org.talend.dataprep.api.dataset.DataSetGovernance.Certification) EncodingSupport(org.talend.dataprep.configuration.EncodingSupport) Comparator(java.util.Comparator) InputStream(java.io.InputStream)
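
The method in Example 9 drives its analyzer through a fixed sequence: init(), one analyze() call per record value, end(), then getResult(), all inside a try-with-resources that closes the record stream. The sketch below reproduces only that call order. ValueAnalyzer and NonEmptyCounter are hypothetical stand-ins invented for illustration; they are not the Talend Analyzer API, and the toy counter performs no semantic analysis.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

final class AnalyzerLifecycleSketch {

    /** Hypothetical interface mirroring only the four calls made in Example 9. */
    interface ValueAnalyzer<T> {
        void init();
        void analyze(String value);
        void end();
        List<T> getResult();
    }

    /** Toy analyzer (assumption, for illustration): counts non-empty values. */
    static final class NonEmptyCounter implements ValueAnalyzer<Long> {
        private long count;
        private boolean finished;

        @Override
        public void init() {
            count = 0;
            finished = false;
        }

        @Override
        public void analyze(String value) {
            if (value != null && !value.isEmpty()) {
                count++;
            }
        }

        @Override
        public void end() {
            finished = true;
        }

        @Override
        public List<Long> getResult() {
            if (!finished) {
                throw new IllegalStateException("end() must be called before getResult()");
            }
            return Collections.singletonList(count);
        }
    }

    /** Runs the init / analyze / end / getResult sequence and closes the stream afterwards. */
    static <T> List<T> run(ValueAnalyzer<T> analyzer, Stream<String> values) {
        try (Stream<String> records = values) {
            analyzer.init();
            records.forEach(analyzer::analyze);
            analyzer.end();
            return analyzer.getResult();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(new NonEmptyCounter(), Stream.of("IPA", "", "Stout")));
    }
}
```

Keeping the stream in a try-with-resources, as the original method does, means the underlying content store handle is released even if an analyze() call throws.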

Example 10 with SemanticDomain

use of org.talend.dataprep.api.dataset.statistics.SemanticDomain in project data-prep by Talend.

the class DataSetService method updateDatasetColumn.

/**
 * Update the type and/or domain of a column of the data set, then re-analyze the data set.
 *
 * @param dataSetId the dataset id.
 * @param columnId the column id.
 * @param parameters the new type and domain.
 */
@RequestMapping(value = "/datasets/{datasetId}/column/{columnId}", method = POST)
@ApiOperation(value = "Update a column type and/or domain")
@Timed
public void updateDatasetColumn(@PathVariable(value = "datasetId") @ApiParam(name = "datasetId", value = "Id of the dataset") final String dataSetId, @PathVariable(value = "columnId") @ApiParam(name = "columnId", value = "Id of the column") final String columnId, @RequestBody final UpdateColumnParameters parameters) {
    final DistributedLock lock = dataSetMetadataRepository.createDatasetMetadataLock(dataSetId);
    lock.lock();
    try {
        // check that dataset exists
        final DataSetMetadata dataSetMetadata = dataSetMetadataRepository.get(dataSetId);
        if (dataSetMetadata == null) {
            throw new TDPException(DataSetErrorCodes.DATASET_DOES_NOT_EXIST, build().put("id", dataSetId));
        }
        LOG.debug("update dataset column for #{} with type {} and/or domain {}", dataSetId, parameters.getType(), parameters.getDomain());
        // get the column
        final ColumnMetadata column = dataSetMetadata.getRowMetadata().getById(columnId);
        if (column == null) {
            throw new TDPException(DataSetErrorCodes.COLUMN_DOES_NOT_EXIST,
                    build().put("id", dataSetId).put("columnid", columnId));
        }
        // update type/domain
        if (parameters.getType() != null) {
            column.setType(parameters.getType());
        }
        if (parameters.getDomain() != null) {
            // erase domain to let only type
            if (parameters.getDomain().isEmpty()) {
                column.setDomain("");
                column.setDomainLabel("");
                column.setDomainFrequency(0);
            } else {
                // change domain
                final SemanticDomain semanticDomain = column.getSemanticDomains().stream()
                        .filter(dom -> StringUtils.equals(dom.getId(), parameters.getDomain()))
                        .findFirst().orElse(null);
                if (semanticDomain != null) {
                    column.setDomain(semanticDomain.getId());
                    column.setDomainLabel(semanticDomain.getLabel());
                    column.setDomainFrequency(semanticDomain.getScore());
                }
            }
        }
        // save
        dataSetMetadataRepository.save(dataSetMetadata);
        // analyze the updated dataset (not all analyses are performed)
        analyzeDataSet(dataSetId, false, asList(ContentAnalysis.class, FormatAnalysis.class, SchemaAnalysis.class));
    } finally {
        lock.unlock();
    }
}
Also used : TDPException(org.talend.dataprep.exception.TDPException) DistributedLock(org.talend.dataprep.lock.DistributedLock) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) FormatAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.FormatAnalysis) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) SchemaAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.SchemaAnalysis) ContentAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.ContentAnalysis) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) Timed(org.talend.dataprep.metrics.Timed) ApiOperation(io.swagger.annotations.ApiOperation) RequestMapping(org.springframework.web.bind.annotation.RequestMapping)
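
The domain switch in Example 10 comes down to a null-safe lookup by id over the column's candidate domains; when nothing matches, the column is left unchanged. A minimal sketch of that lookup follows, assuming a hypothetical Domain class that stands in for SemanticDomain (the field types, in particular the float score, are assumptions) and using Optional instead of orElse(null).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

final class DomainLookupSketch {

    /** Hypothetical stand-in for SemanticDomain: id, label and score only. */
    static final class Domain {
        final String id;
        final String label;
        final float score; // assumed type, for illustration

        Domain(String id, String label, float score) {
            this.id = id;
            this.label = label;
            this.score = score;
        }
    }

    /** Returns the first candidate whose id equals the requested id, if any. */
    static Optional<Domain> findDomain(List<Domain> candidates, String requestedId) {
        return candidates.stream()
                .filter(d -> Objects.equals(d.id, requestedId)) // null-safe comparison
                .findFirst();
    }

    public static void main(String[] args) {
        List<Domain> candidates = Arrays.asList(
                new Domain("beer", "the best beer", 1f),
                new Domain("mountain_goat", "Mountain goat pale pale", 1f));

        // A known id yields the matching domain; an unknown id yields an empty Optional.
        System.out.println(findDomain(candidates, "beer").map(d -> d.label).orElse("<unchanged>"));
        System.out.println(findDomain(candidates, "wine").map(d -> d.label).orElse("<unchanged>"));
    }
}
```

Returning an Optional makes the "no matching domain" case explicit at the call site, where the original code relies on a null check.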

Aggregations

SemanticDomain (org.talend.dataprep.api.dataset.statistics.SemanticDomain) 11
ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata) 10
Test (org.junit.Test) 7
DataSetMetadata (org.talend.dataprep.api.dataset.DataSetMetadata) 7
RowMetadata (org.talend.dataprep.api.dataset.RowMetadata) 6
ApiOperation (io.swagger.annotations.ApiOperation) 4
DataSetRow (org.talend.dataprep.api.dataset.row.DataSetRow) 4
Statistics (org.talend.dataprep.api.dataset.statistics.Statistics) 3
DataSetBaseTest (org.talend.dataprep.dataset.DataSetBaseTest) 3
TDPException (org.talend.dataprep.exception.TDPException) 3
Timed (org.talend.dataprep.metrics.Timed) 3
AbstractMetadataBaseTest (org.talend.dataprep.transformation.actions.AbstractMetadataBaseTest) 3
Api (io.swagger.annotations.Api) 2
ApiParam (io.swagger.annotations.ApiParam) 2
Collections.singletonList (java.util.Collections.singletonList) 2
RequestMapping (org.springframework.web.bind.annotation.RequestMapping) 2
Type (org.talend.dataprep.api.type.Type) 2
DataSetServiceTest (org.talend.dataprep.dataset.service.DataSetServiceTest) 2
ContentAnalysis (org.talend.dataprep.dataset.service.analysis.synchronous.ContentAnalysis) 2
FormatAnalysis (org.talend.dataprep.dataset.service.analysis.synchronous.FormatAnalysis) 2