Search in sources :

Example 6 with Schema

use of org.talend.dataprep.schema.Schema in project data-prep by Talend.

the class XlsSchemaParserTest method parse_should_extract_single_sheet_xls.

@Test
public void parse_should_extract_single_sheet_xls() throws Exception {
    // given
    final String fileName = "simple.xls";
    SchemaParser.Request request;
    try (InputStream inputStream = this.getClass().getResourceAsStream(fileName)) {
        request = getRequest(inputStream, "My Dataset");
        // when
        final Schema schema = parser.parse(request);
        // then
        assertThat(schema.getSheetContents(), is(notNullValue()));
        assertThat(schema.draft(), is(false));
        assertThat(schema.getSheetName(), is("Feuille1"));
    }
}
Also used : InputStream(java.io.InputStream) Schema(org.talend.dataprep.schema.Schema) SchemaParser(org.talend.dataprep.schema.SchemaParser) Test(org.junit.Test)

Example 7 with Schema

use of org.talend.dataprep.schema.Schema in project data-prep by Talend.

the class DataSetService method preview.

/**
 * Returns preview of the the data set content for given id (first 100 rows). Service might return
 * {@link org.apache.http.HttpStatus#SC_ACCEPTED} if the data set exists but analysis is not yet fully
 * completed so content is not yet ready to be served.
 *
 * @param metadata If <code>true</code>, includes data set metadata information.
 * @param sheetName the sheet name to preview
 * @param dataSetId A data set id.
 */
@RequestMapping(value = "/datasets/{id}/preview", method = RequestMethod.GET)
@ApiOperation(value = "Get a data preview set by id", notes = "Get a data set preview content based on provided id. Not valid or non existing data set id returns empty content. Data set not in drat status will return a redirect 301")
@Timed
@ResponseBody
public DataSet preview(@RequestParam(defaultValue = "true") @ApiParam(name = "metadata", value = "Include metadata information in the response") boolean metadata, @RequestParam(defaultValue = "") @ApiParam(name = "sheetName", value = "Sheet name to preview") String sheetName, @PathVariable(value = "id") @ApiParam(name = "id", value = "Id of the requested data set") String dataSetId) {
    DataSetMetadata dataSetMetadata = dataSetMetadataRepository.get(dataSetId);
    if (dataSetMetadata == null) {
        HttpResponseContext.status(HttpStatus.NO_CONTENT);
        // No data set, returns empty content.
        return DataSet.empty();
    }
    if (!dataSetMetadata.isDraft()) {
        // Moved to get data set content operation
        HttpResponseContext.status(HttpStatus.MOVED_PERMANENTLY);
        HttpResponseContext.header("Location", "/datasets/" + dataSetId + "/content");
        // dataset not anymore a draft so preview doesn't make sense.
        return DataSet.empty();
    }
    if (StringUtils.isNotEmpty(sheetName)) {
        dataSetMetadata.setSheetName(sheetName);
    }
    // take care of previous data without schema parser result
    if (dataSetMetadata.getSchemaParserResult() != null) {
        // sheet not yet set correctly so use the first one
        if (StringUtils.isEmpty(dataSetMetadata.getSheetName())) {
            String theSheetName = dataSetMetadata.getSchemaParserResult().getSheetContents().get(0).getName();
            LOG.debug("preview for dataSetMetadata: {} with sheetName: {}", dataSetId, theSheetName);
            dataSetMetadata.setSheetName(theSheetName);
        }
        String theSheetName = dataSetMetadata.getSheetName();
        Optional<Schema.SheetContent> sheetContentFound = dataSetMetadata.getSchemaParserResult().getSheetContents().stream().filter(sheetContent -> theSheetName.equals(sheetContent.getName())).findFirst();
        if (!sheetContentFound.isPresent()) {
            HttpResponseContext.status(HttpStatus.NO_CONTENT);
            // No sheet found, returns empty content.
            return DataSet.empty();
        }
        List<ColumnMetadata> columnMetadatas = sheetContentFound.get().getColumnMetadatas();
        if (dataSetMetadata.getRowMetadata() == null) {
            dataSetMetadata.setRowMetadata(new RowMetadata(emptyList()));
        }
        dataSetMetadata.getRowMetadata().setColumns(columnMetadatas);
    } else {
        LOG.warn("dataset#{} has draft status but any SchemaParserResult");
    }
    // Build the result
    DataSet dataSet = new DataSet();
    if (metadata) {
        dataSet.setMetadata(conversionService.convert(dataSetMetadata, UserDataSetMetadata.class));
    }
    dataSet.setRecords(contentStore.stream(dataSetMetadata).limit(100));
    return dataSet;
}
Also used : VolumeMetered(org.talend.dataprep.metrics.VolumeMetered) RequestParam(org.springframework.web.bind.annotation.RequestParam) ImportBuilder(org.talend.dataprep.api.dataset.Import.ImportBuilder) FormatFamilyFactory(org.talend.dataprep.schema.FormatFamilyFactory) Autowired(org.springframework.beans.factory.annotation.Autowired) ApiParam(io.swagger.annotations.ApiParam) StringUtils(org.apache.commons.lang3.StringUtils) TEXT_PLAIN_VALUE(org.springframework.http.MediaType.TEXT_PLAIN_VALUE) SortAndOrderHelper.getDataSetMetadataComparator(org.talend.dataprep.util.SortAndOrderHelper.getDataSetMetadataComparator) Collections.singletonList(java.util.Collections.singletonList) SemanticDomain(org.talend.dataprep.api.dataset.statistics.SemanticDomain) BeanConversionService(org.talend.dataprep.conversions.BeanConversionService) PipedInputStream(java.io.PipedInputStream) DistributedLock(org.talend.dataprep.lock.DistributedLock) Arrays.asList(java.util.Arrays.asList) Map(java.util.Map) DataprepBundle.message(org.talend.dataprep.i18n.DataprepBundle.message) UserData(org.talend.dataprep.api.user.UserData) TaskExecutor(org.springframework.core.task.TaskExecutor) MAX_STORAGE_MAY_BE_EXCEEDED(org.talend.dataprep.exception.error.DataSetErrorCodes.MAX_STORAGE_MAY_BE_EXCEEDED) DataSet(org.talend.dataprep.api.dataset.DataSet) LocalStoreLocation(org.talend.dataprep.api.dataset.location.LocalStoreLocation) FormatFamily(org.talend.dataprep.schema.FormatFamily) Resource(javax.annotation.Resource) Set(java.util.Set) DatasetUpdatedEvent(org.talend.dataprep.dataset.event.DatasetUpdatedEvent) RestController(org.springframework.web.bind.annotation.RestController) QuotaService(org.talend.dataprep.dataset.store.QuotaService) Stream(java.util.stream.Stream) StreamSupport.stream(java.util.stream.StreamSupport.stream) FlagNames(org.talend.dataprep.api.dataset.row.FlagNames) UNEXPECTED_CONTENT(org.talend.dataprep.exception.error.CommonErrorCodes.UNEXPECTED_CONTENT) Analyzers(org.talend.dataquality.common.inference.Analyzers) DataSetLocatorService(org.talend.dataprep.api.dataset.location.locator.DataSetLocatorService) Callable(java.util.concurrent.Callable) Schema(org.talend.dataprep.schema.Schema) ArrayList(java.util.ArrayList) Value(org.springframework.beans.factory.annotation.Value) RequestBody(org.springframework.web.bind.annotation.RequestBody) DataSetLocationService(org.talend.dataprep.api.dataset.location.DataSetLocationService) AnalyzerService(org.talend.dataprep.quality.AnalyzerService) UserDataRepository(org.talend.dataprep.user.store.UserDataRepository) Markers(org.talend.dataprep.log.Markers) Api(io.swagger.annotations.Api) DraftValidator(org.talend.dataprep.schema.DraftValidator) HttpResponseContext(org.talend.dataprep.http.HttpResponseContext) Sort(org.talend.dataprep.util.SortAndOrderHelper.Sort) IOException(java.io.IOException) PipedOutputStream(java.io.PipedOutputStream) FormatAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.FormatAnalysis) ContentAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.ContentAnalysis) SchemaAnalysis(org.talend.dataprep.dataset.service.analysis.synchronous.SchemaAnalysis) HttpStatus(org.springframework.http.HttpStatus) FilterService(org.talend.dataprep.api.filter.FilterService) Marker(org.slf4j.Marker) NullOutputStream(org.apache.commons.io.output.NullOutputStream) StatisticsAdapter(org.talend.dataprep.dataset.StatisticsAdapter) Timed(org.talend.dataprep.metrics.Timed) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) PathVariable(org.springframework.web.bind.annotation.PathVariable) DataSetMetadataBuilder(org.talend.dataprep.dataset.DataSetMetadataBuilder) URLDecoder(java.net.URLDecoder) DataSetErrorCodes(org.talend.dataprep.exception.error.DataSetErrorCodes) PUT(org.springframework.web.bind.annotation.RequestMethod.PUT) LoggerFactory(org.slf4j.LoggerFactory) SEMANTIC(org.talend.dataprep.quality.AnalyzerService.Analysis.SEMANTIC) ApiOperation(io.swagger.annotations.ApiOperation) UNABLE_TO_CREATE_OR_UPDATE_DATASET(org.talend.dataprep.exception.error.DataSetErrorCodes.UNABLE_TO_CREATE_OR_UPDATE_DATASET) DataSetRow(org.talend.dataprep.api.dataset.row.DataSetRow) StrictlyBoundedInputStream(org.talend.dataprep.dataset.store.content.StrictlyBoundedInputStream) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) UNSUPPORTED_CONTENT(org.talend.dataprep.exception.error.DataSetErrorCodes.UNSUPPORTED_CONTENT) TimeToLive(org.talend.dataprep.cache.ContentCache.TimeToLive) Order(org.talend.dataprep.util.SortAndOrderHelper.Order) Collections.emptyList(java.util.Collections.emptyList) PublicAPI(org.talend.dataprep.security.PublicAPI) RequestMethod(org.springframework.web.bind.annotation.RequestMethod) UUID(java.util.UUID) Collectors(java.util.stream.Collectors) ContentCache(org.talend.dataprep.cache.ContentCache) INVALID_DATASET_NAME(org.talend.dataprep.exception.error.DataSetErrorCodes.INVALID_DATASET_NAME) List(java.util.List) Optional(java.util.Optional) Analyzer(org.talend.dataquality.common.inference.Analyzer) RequestHeader(org.springframework.web.bind.annotation.RequestHeader) Pattern(java.util.regex.Pattern) Security(org.talend.dataprep.security.Security) Spliterator(java.util.Spliterator) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) ComponentProperties(org.talend.dataprep.parameters.jsonschema.ComponentProperties) TDPException(org.talend.dataprep.exception.TDPException) JsonErrorCodeDescription(org.talend.dataprep.exception.json.JsonErrorCodeDescription) RequestMapping(org.springframework.web.bind.annotation.RequestMapping) UNABLE_CREATE_DATASET(org.talend.dataprep.exception.error.DataSetErrorCodes.UNABLE_CREATE_DATASET) HashMap(java.util.HashMap) GET(org.springframework.web.bind.annotation.RequestMethod.GET) Import(org.talend.dataprep.api.dataset.Import) ExceptionContext.build(org.talend.daikon.exception.ExceptionContext.build) ExceptionContext(org.talend.daikon.exception.ExceptionContext) Charset(java.nio.charset.Charset) UpdateColumnParameters(org.talend.dataprep.dataset.service.api.UpdateColumnParameters) VersionService(org.talend.dataprep.api.service.info.VersionService) POST(org.springframework.web.bind.annotation.RequestMethod.POST) OutputStream(java.io.OutputStream) DataSetLocation(org.talend.dataprep.api.dataset.DataSetLocation) Logger(org.slf4j.Logger) LocaleContextHolder.getLocale(org.springframework.context.i18n.LocaleContextHolder.getLocale) UpdateDataSetCacheKey(org.talend.dataprep.dataset.service.cache.UpdateDataSetCacheKey) IOUtils(org.apache.commons.compress.utils.IOUtils) APPLICATION_JSON_VALUE(org.springframework.http.MediaType.APPLICATION_JSON_VALUE) ResponseBody(org.springframework.web.bind.annotation.ResponseBody) Certification(org.talend.dataprep.api.dataset.DataSetGovernance.Certification) EncodingSupport(org.talend.dataprep.configuration.EncodingSupport) Comparator(java.util.Comparator) InputStream(java.io.InputStream) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) DataSet(org.talend.dataprep.api.dataset.DataSet) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) Timed(org.talend.dataprep.metrics.Timed) ApiOperation(io.swagger.annotations.ApiOperation) RequestMapping(org.springframework.web.bind.annotation.RequestMapping) ResponseBody(org.springframework.web.bind.annotation.ResponseBody)

Example 8 with Schema

use of org.talend.dataprep.schema.Schema in project data-prep by Talend.

the class HtmlSchemaParser method parse.

/**
 * @see SchemaParser#parse(Request)
 */
@Override
public Schema parse(Request request) {
    try {
        SimpleHeadersContentHandler headersContentHandler = new SimpleHeadersContentHandler();
        InputStream inputStream = request.getContent();
        HtmlParser htmlParser = new HtmlParser();
        Metadata metadata = new Metadata();
        htmlParser.parse(inputStream, headersContentHandler, metadata, new ParseContext());
        List<ColumnMetadata> columns = new ArrayList<>(headersContentHandler.getHeaderValues().size());
        for (String headerValue : headersContentHandler.getHeaderValues()) {
            columns.add(// 
            ColumnMetadata.Builder.column().type(// ATM not doing any complicated type calculation
            Type.STRING).name(// 
            headerValue).id(// 
            columns.size()).build());
        }
        Schema.SheetContent sheetContent = new Schema.SheetContent();
        sheetContent.setColumnMetadatas(columns);
        return // 
        Schema.Builder.parserResult().sheetContents(// 
        Collections.singletonList(sheetContent)).draft(// 
        false).build();
    } catch (Exception e) {
        LOGGER.debug("Exception during parsing html request :" + e.getMessage(), e);
        throw new TDPException(CommonErrorCodes.UNEXPECTED_EXCEPTION, e);
    }
}
Also used : ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) InputStream(java.io.InputStream) Schema(org.talend.dataprep.schema.Schema) Metadata(org.apache.tika.metadata.Metadata) ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) TDPException(org.talend.dataprep.exception.TDPException) TDPException(org.talend.dataprep.exception.TDPException) HtmlParser(org.apache.tika.parser.html.HtmlParser) ParseContext(org.apache.tika.parser.ParseContext)

Example 9 with Schema

use of org.talend.dataprep.schema.Schema in project data-prep by Talend.

the class DataSetMetadataBuilderTest method createCompleteMetadata.

private DataSetMetadata createCompleteMetadata() {
    final Map<String, String> parameters = new HashMap<>(0);
    parameters.put("encoding", "UTF-8");
    final List<ColumnMetadata> columnsMetadata = new ArrayList<>(2);
    columnsMetadata.add(ColumnMetadata.Builder.column().id(0).name("id").type(INTEGER).build());
    columnsMetadata.add(ColumnMetadata.Builder.column().id(1).name("Name").type(STRING).build());
    final RowMetadata rowMetadata = new RowMetadata();
    rowMetadata.setColumns(columnsMetadata);
    final DataSetContent content = new DataSetContent();
    content.setNbRecords(1000);
    content.setLimit(1000L);
    content.setNbLinesInHeader(10);
    content.setNbLinesInFooter(10);
    content.setFormatFamilyId("formatGuess#csv");
    content.setMediaType("text/csv");
    content.setParameters(parameters);
    final Schema schemaParserResult = new Schema.Builder().draft(true).build();
    final DataSetMetadata metadata = new DataSetMetadata("18ba64c154d5", "Avengers stats", "Stan Lee", System.currentTimeMillis(), System.currentTimeMillis(), rowMetadata, "1.0");
    metadata.setLocation(new LocalStoreLocation());
    metadata.getGovernance().setCertificationStep(CERTIFIED);
    metadata.setSheetName("Sheet 1");
    metadata.setDraft(true);
    metadata.setContent(content);
    metadata.setEncoding("UTF-8");
    metadata.getLifecycle().contentIndexed(true);
    metadata.getLifecycle().qualityAnalyzed(true);
    metadata.getLifecycle().schemaAnalyzed(true);
    metadata.getLifecycle().setInProgress(true);
    metadata.getLifecycle().setImporting(true);
    metadata.setSchemaParserResult(schemaParserResult);
    metadata.setTag("MyTag");
    return metadata;
}
Also used : HashMap(java.util.HashMap) Schema(org.talend.dataprep.schema.Schema) DataSetMetadataBuilder(org.talend.dataprep.dataset.DataSetMetadataBuilder) ArrayList(java.util.ArrayList) LocalStoreLocation(org.talend.dataprep.api.dataset.location.LocalStoreLocation)

Example 10 with Schema

use of org.talend.dataprep.schema.Schema in project data-prep by Talend.

the class CSVSchemaParserTest method TDP_2171.

@Test
public void TDP_2171() throws IOException {
    try (InputStream inputStream = this.getClass().getResourceAsStream("TDP-2171.csv")) {
        // We do know the format and therefore we go directly to the CSV schema guessing
        SchemaParser.Request request = getRequest(inputStream, "#1");
        request.getMetadata().setEncoding("UTF-8");
        Schema schema = csvSchemaParser.parse(request);
        final Map<String, String> parameters = request.getMetadata().getContent().getParameters();
        char actual = parameters.get(CSVFormatFamily.SEPARATOR_PARAMETER).charAt(0);
        List<String> header = csvFormatUtils.retrieveHeader(parameters);
        assertEquals(',', actual);
        assertEquals(92, header.size());
    }
}
Also used : ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) Schema(org.talend.dataprep.schema.Schema) SchemaParser(org.talend.dataprep.schema.SchemaParser) Test(org.junit.Test)

Aggregations

Schema (org.talend.dataprep.schema.Schema)14 InputStream (java.io.InputStream)10 SchemaParser (org.talend.dataprep.schema.SchemaParser)8 Test (org.junit.Test)7 ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata)7 DataSetMetadata (org.talend.dataprep.api.dataset.DataSetMetadata)5 TDPException (org.talend.dataprep.exception.TDPException)5 Map (java.util.Map)4 IOException (java.io.IOException)3 ArrayList (java.util.ArrayList)3 HashMap (java.util.HashMap)3 ObjectMapper (com.fasterxml.jackson.databind.ObjectMapper)2 CollectionType (com.fasterxml.jackson.databind.type.CollectionType)2 Api (io.swagger.annotations.Api)2 ApiOperation (io.swagger.annotations.ApiOperation)2 ApiParam (io.swagger.annotations.ApiParam)2 ByteArrayInputStream (java.io.ByteArrayInputStream)2 OutputStream (java.io.OutputStream)2 PipedInputStream (java.io.PipedInputStream)2 PipedOutputStream (java.io.PipedOutputStream)2