Example 96 with DataSetMetadata

use of org.talend.dataprep.api.dataset.DataSetMetadata in project data-prep by Talend.

the class HtmlSerializer method deserialize.

private void deserialize(InputStream rawContent, DataSetMetadata dataSetMetadata, OutputStream jsonOutput, long limit) {
    try {
        List<ColumnMetadata> columns = dataSetMetadata.getRowMetadata().getColumns();
        SimpleValuesContentHandler valuesContentHandler = new SimpleValuesContentHandler(columns.size(), limit);
        HtmlParser htmlParser = new HtmlParser();
        Metadata metadata = new Metadata();
        htmlParser.parse(rawContent, valuesContentHandler, metadata, new ParseContext());
        JsonGenerator generator = new JsonFactory().createGenerator(jsonOutput);
        // start the record
        generator.writeStartArray();
        for (List<String> values : valuesContentHandler.getValues()) {
            if (values.isEmpty()) {
                // avoid empty record which can fail analysis
                continue;
            }
            generator.writeStartObject();
            int idx = 0;
            for (String value : values) {
                if (idx < columns.size()) {
                    ColumnMetadata columnMetadata = columns.get(idx);
                    generator.writeFieldName(columnMetadata.getId());
                    if (value != null) {
                        generator.writeString(value);
                    } else {
                        generator.writeNull();
                    }
                    idx++;
                }
            }
            generator.writeEndObject();
        }
        // end the record
        generator.writeEndArray();
        generator.flush();
    } catch (Exception e) {
        // The consumer may well interrupt consumption of the stream (e.g. when limit(n) is used for sampling).
        // This is not an issue: the consumer is allowed to consume results only partially, and it is up to the
        // consumer to ensure the data it consumed is consistent.
        LOGGER.debug("Unable to continue serialization for {}. Skipping remaining content.", dataSetMetadata.getId(), e);
    } finally {
        try {
            jsonOutput.close();
        } catch (IOException e) {
            LOGGER.error("Unable to close output", e);
        }
    }
}
Also used : ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata) Metadata(org.apache.tika.metadata.Metadata) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) JsonFactory(com.fasterxml.jackson.core.JsonFactory) TDPException(org.talend.dataprep.exception.TDPException) HtmlParser(org.apache.tika.parser.html.HtmlParser) ParseContext(org.apache.tika.parser.ParseContext) JsonGenerator(com.fasterxml.jackson.core.JsonGenerator)

Example 97 with DataSetMetadata

use of org.talend.dataprep.api.dataset.DataSetMetadata in project data-prep by Talend.

the class DatasetUpdateListener method onUpdate.

@EventListener
public void onUpdate(DatasetUpdatedEvent event) {
    // when we update a dataset we need to clean cache
    final DataSetMetadata dataSetMetadata = event.getSource();
    final ContentCacheKey sampleKey = () -> "dataset-sample_" + dataSetMetadata.getId();
    LOGGER.debug("Evicting sample cache entry for #{}", dataSetMetadata.getId());
    publisher.publishEvent(new CleanCacheEvent(sampleKey));
    LOGGER.debug("Evicting sample cache entry for #{} done.", dataSetMetadata.getId());
    LOGGER.debug("Evicting transformation cache entry for dataset #{}", dataSetMetadata.getId());
    publisher.publishEvent(new CleanCacheEvent(new ContentCacheKey() {

        @Override
        public String getKey() {
            return dataSetMetadata.getId();
        }

        @Override
        public Predicate<String> getMatcher() {
            String regex = ".*_" + getKey() + "_.*";
            // Build regular expression matcher
            final Pattern pattern = Pattern.compile(regex);
            return str -> pattern.matcher(str).matches();
        }
    }, Boolean.TRUE));
    LOGGER.debug("Evicting transformation cache entry for dataset #{} done.", dataSetMetadata.getId());
}
Also used : Component(org.springframework.stereotype.Component) Logger(org.slf4j.Logger) ContentCacheKey(org.talend.dataprep.cache.ContentCacheKey) Predicate(java.util.function.Predicate) LoggerFactory(org.slf4j.LoggerFactory) ApplicationEventPublisher(org.springframework.context.ApplicationEventPublisher) Autowired(org.springframework.beans.factory.annotation.Autowired) EventListener(org.springframework.context.event.EventListener) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) Pattern(java.util.regex.Pattern) DatasetUpdatedEvent(org.talend.dataprep.dataset.event.DatasetUpdatedEvent)
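
The matcher in Example 97 concatenates the dataset id directly into a regular expression. That is harmless when ids contain no regex metacharacters, but the snippet does not say which characters an id may contain. The sketch below is a hypothetical variant, not part of data-prep: the class name CacheKeyMatcherSketch and the method matcherFor are invented for illustration, and the sample cache keys are made up. It shows how Pattern.quote keeps the id literal.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the data-prep code base.
final class CacheKeyMatcherSketch {

    // Builds a predicate that accepts cache keys containing "_<datasetId>_".
    static Predicate<String> matcherFor(String datasetId) {
        // Pattern.quote treats the id as a literal, so characters such as '.' or '+'
        // inside the id are not interpreted as regex operators.
        final Pattern pattern = Pattern.compile(".*_" + Pattern.quote(datasetId) + "_.*");
        return key -> pattern.matcher(key).matches();
    }

    public static void main(String[] args) {
        Predicate<String> matcher = matcherFor("1.2");
        // The key contains the literal id "1.2" between underscores.
        System.out.println(matcher.test("transformation_1.2_head")); // prints true
        // With quoting, '.' no longer matches an arbitrary character.
        System.out.println(matcher.test("transformation_1x2_head")); // prints false
    }
}
```

Compiling the pattern once and capturing it in the lambda, as the original code also does, avoids recompiling the expression for every key tested.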

Example 98 with DataSetMetadata

use of org.talend.dataprep.api.dataset.DataSetMetadata in project data-prep by Talend.

the class XlsWriterTest method createSchemaParser.

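/**
 * Utility function: exports the JSON dataset resource with the given name to a temporary XLSX file
 * and wraps the resulting file content in a SchemaParser.Request.
 */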
public SchemaParser.Request createSchemaParser(String inputFileName) throws Exception {
    Path path = Files.createTempFile("datarep-foo", "xlsx");
    Files.deleteIfExists(path);
    try (final OutputStream outputStream = Files.newOutputStream(path)) {
        final Configuration configuration = Configuration.builder()
                .format(XlsFormat.XLSX)
                .output(outputStream)
                .actions("")
                .build();
        final Transformer exporter = factory.get(configuration);
        final InputStream inputStream = XlsWriterTest.class.getResourceAsStream(inputFileName);
        try (JsonParser parser = mapper.getFactory().createParser(inputStream)) {
            final DataSet dataSet = mapper.readerFor(DataSet.class).readValue(parser);
            exporter.buildExecutable(dataSet, configuration).execute();
        }
    }
    DataSetMetadata metadata = metadataBuilder.metadata().id("123").build();
    return new SchemaParser.Request(Files.newInputStream(path), metadata);
}
Also used : Path(java.nio.file.Path) Transformer(org.talend.dataprep.transformation.api.transformer.Transformer) Configuration(org.talend.dataprep.transformation.api.transformer.configuration.Configuration) DataSet(org.talend.dataprep.api.dataset.DataSet) InputStream(java.io.InputStream) ByteArrayOutputStream(java.io.ByteArrayOutputStream) OutputStream(java.io.OutputStream) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) JsonParser(com.fasterxml.jackson.core.JsonParser)
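
One detail in Example 98 is easy to miss: Files.createTempFile appends the suffix argument verbatim, so a suffix of "xlsx" produces a name ending in "xlsx" with no dot before it. This does not affect the test above, which reads the file back by path, but it matters if anything relies on the extension. The sketch below illustrates the behaviour; the class TempFileSuffixSketch, the method createTempWithSuffix and the prefix "dataprep-sketch" are hypothetical names introduced for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of the data-prep code base.
final class TempFileSuffixSketch {

    // Creates a temporary file whose name ends with the given suffix, used as-is.
    static Path createTempWithSuffix(String suffix) {
        try {
            Path path = Files.createTempFile("dataprep-sketch", suffix);
            // Ask the JVM to remove the file on exit so the sketch leaves nothing behind.
            path.toFile().deleteOnExit();
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The generated names differ per run; only the endings are predictable.
        System.out.println(createTempWithSuffix("xlsx").getFileName());  // ends with "xlsx", no dot
        System.out.println(createTempWithSuffix(".xlsx").getFileName()); // ends with ".xlsx"
    }
}
```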

Example 99 with DataSetMetadata

use of org.talend.dataprep.api.dataset.DataSetMetadata in project data-prep by Talend.

the class PreparationDatasetRowUpdaterTest method updatePreparations.

@Test
public void updatePreparations() throws Exception {
    // given
    String datasetId = "dataset id";
    Preparation prep = new Preparation("prepId", "123456");
    prep.setDataSetId(datasetId);
    final List<Preparation> preparations = singletonList(prep);
    when(preparationRepository.list(Preparation.class)).thenReturn(preparations.stream());
    DataSetMetadata datasetMetadata = new DataSetMetadata();
    datasetMetadata.setRowMetadata(new RowMetadata());
    when(dataSetMetadataRepository.get(datasetId)).thenReturn(datasetMetadata);
    // when
    updater.updatePreparations();
    // then
    verify(preparationRepository, times(1)).list(Preparation.class);
    verify(preparationRepository, times(1)).add(prep);
    verify(dataSetMetadataRepository, only()).get(datasetId);
}
Also used : Preparation(org.talend.dataprep.api.preparation.Preparation) RowMetadata(org.talend.dataprep.api.dataset.RowMetadata) DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata) Test(org.junit.Test)

Example 100 with DataSetMetadata

use of org.talend.dataprep.api.dataset.DataSetMetadata in project data-prep by Talend.

the class PreparationDatasetRowUpdater method addRowMetadata.

/**
 * Add the row metadata of the dataset to the preparation.
 *
 * @param preparation the preparation to update.
 * @return the updated preparation.
 */
private Preparation addRowMetadata(Preparation preparation) {
    LOGGER.debug("adding row metadata to preparation {}", preparation);
    DataSetMetadata dataSetMetadata = dataSetMetadataRepository.get(preparation.getDataSetId());
    if (dataSetMetadata != null) {
        preparation.setRowMetadata(dataSetMetadata.getRowMetadata());
    } else {
        LOGGER.debug("The metadata of dataset {} is null and will not be used to set the metadata of preparation {}.", preparation.getDataSetId(), preparation.getId());
    }
    return preparation;
}
Also used : DataSetMetadata(org.talend.dataprep.api.dataset.DataSetMetadata)

Aggregations

DataSetMetadata (org.talend.dataprep.api.dataset.DataSetMetadata): 192
Test (org.junit.Test): 126
DataSetBaseTest (org.talend.dataprep.dataset.DataSetBaseTest): 63
ColumnMetadata (org.talend.dataprep.api.dataset.ColumnMetadata): 48
InputStream (java.io.InputStream): 45
Matchers.containsString (org.hamcrest.Matchers.containsString): 28
Matchers.isEmptyString (org.hamcrest.Matchers.isEmptyString): 28
TDPException (org.talend.dataprep.exception.TDPException): 26
RowMetadata (org.talend.dataprep.api.dataset.RowMetadata): 20
DataSetServiceTest (org.talend.dataprep.dataset.service.DataSetServiceTest): 20
ApiOperation (io.swagger.annotations.ApiOperation): 18
DataSet (org.talend.dataprep.api.dataset.DataSet): 18
Type (org.talend.dataprep.api.type.Type): 17
Timed (org.talend.dataprep.metrics.Timed): 17
DistributedLock (org.talend.dataprep.lock.DistributedLock): 16
Autowired (org.springframework.beans.factory.annotation.Autowired): 14
DataSetRow (org.talend.dataprep.api.dataset.row.DataSetRow): 14
IOException (java.io.IOException): 13
RequestMapping (org.springframework.web.bind.annotation.RequestMapping): 13
ArrayList (java.util.ArrayList): 12