
Example 1 with PreprocessedReader

Use of com.bakdata.conquery.models.preproc.PreprocessedReader in the project conquery by bakdata.

From the class PreprocessorCommand, method requiresProcessing:

@SneakyThrows
public static boolean requiresProcessing(PreprocessingJob preprocessingJob) {
    ConqueryMDC.setLocation(preprocessingJob.toString());
    if (preprocessingJob.getPreprocessedFile().exists()) {
        log.info("EXISTS ALREADY");
        // Recompute the validity hash from the current CSV inputs and compare it to the hash stored in the file's header.
        int currentHash = preprocessingJob.getDescriptor().calculateValidityHash(preprocessingJob.getCsvDirectory(), preprocessingJob.getTag());
        try (final PreprocessedReader parser = new PreprocessedReader(new GZIPInputStream(new FileInputStream(preprocessingJob.getPreprocessedFile())))) {
            PreprocessedHeader header = parser.readHeader();
            if (header.getValidityHash() == currentHash) {
                // The existing preprocessed file is still up to date, nothing to do.
                log.info("\tHASH STILL VALID");
                return false;
            }
            log.info("\tHASH OUTDATED");
        } catch (Exception e) {
            // If the header cannot be read, the existing file is left untouched.
            log.error("\tHEADER READING FAILED", e);
            return false;
        }
    } else {
        log.info("DOES NOT EXIST");
    }
    return true;
}
Also used: GZIPInputStream (java.util.zip.GZIPInputStream), PreprocessedHeader (com.bakdata.conquery.models.preproc.PreprocessedHeader), PreprocessedReader (com.bakdata.conquery.models.preproc.PreprocessedReader), FileInputStream (java.io.FileInputStream), IOException (java.io.IOException), FileNotFoundException (java.io.FileNotFoundException), SneakyThrows (lombok.SneakyThrows)
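
A minimal usage sketch, assuming a caller that collects PreprocessingJob instances and wants to skip files whose validity hash is still current: the selectJobsToRun helper below is hypothetical and only illustrates the filtering step around requiresProcessing.

import java.util.List;
import java.util.stream.Collectors;

// conquery imports (PreprocessorCommand, PreprocessingJob) are omitted here,
// since the snippet above does not show the package of PreprocessingJob.

public class PreprocessingFilterExample {

    // Hypothetical helper: keep only the jobs whose preprocessed file is missing
    // or whose stored validity hash no longer matches the current CSV inputs.
    public static List<PreprocessingJob> selectJobsToRun(List<PreprocessingJob> candidates) {
        return candidates.stream()
                         .filter(PreprocessorCommand::requiresProcessing)
                         .collect(Collectors.toList());
    }
}

Because requiresProcessing is annotated with @SneakyThrows, it declares no checked exceptions and can be used directly as a method reference in the stream filter.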

Example 2 with PreprocessedReader

Use of com.bakdata.conquery.models.preproc.PreprocessedReader in the project conquery by bakdata.

From the class ImportJob, method createOrUpdate:

public static ImportJob createOrUpdate(Namespace namespace, InputStream inputStream, int entityBucketSize, IdMutex<DictionaryId> sharedDictionaryLocks, ConqueryConfig config, boolean update) throws IOException {
    try (PreprocessedReader parser = new PreprocessedReader(inputStream)) {
        final Dataset ds = namespace.getDataset();
        // We parse semi-manually, as the incoming file consists of multiple documents which we read progressively:
        // 1) the header, to check the metadata
        // 2) the dictionaries to be imported and transformed
        // 3) the ColumnStores themselves, which contain references to the previously imported dictionaries
        final PreprocessedHeader header = parser.readHeader();
        final TableId tableId = new TableId(ds.getId(), header.getTable());
        Table table = namespace.getStorage().getTable(tableId);
        if (table == null) {
            throw new BadRequestException(String.format("Table[%s] does not exist.", tableId));
        }
        // Ensure that Import and Table have the same schema
        header.assertMatch(table);
        final ImportId importId = new ImportId(table.getId(), header.getName());
        Import processedImport = namespace.getStorage().getImport(importId);
        if (update) {
            if (processedImport == null) {
                throw new WebApplicationException(String.format("Import[%s] is not present.", importId), Response.Status.NOT_FOUND);
            }
            // before updating the import, make sure that all workers removed the last import
            namespace.sendToAll(new RemoveImportJob(processedImport));
            namespace.getStorage().removeImport(importId);
        } else if (processedImport != null) {
            throw new WebApplicationException(String.format("Import[%s] is already present.", importId), Response.Status.CONFLICT);
        }
        log.trace("Begin reading Dictionaries");
        parser.addReplacement(Dataset.PLACEHOLDER.getId(), ds);
        PreprocessedDictionaries dictionaries = parser.readDictionaries();
        Map<DictionaryId, Dictionary> dictReplacements = createLocalIdReplacements(dictionaries.getDictionaries(), table, header.getName(), namespace.getStorage(), sharedDictionaryLocks);
        // We inject the mappings into the parser so that the incoming placeholder names are replaced with the new names of the dictionaries. This allows us to use NsIdRef in conjunction with shared dictionaries.
        parser.addAllReplacements(dictReplacements);
        log.trace("Begin reading data.");
        PreprocessedData container = parser.readData();
        log.debug("Done reading data. Contains {} Entities.", container.size());
        log.info("Importing {} into {}", header.getName(), tableId);
        return new ImportJob(namespace, table, entityBucketSize, header, dictionaries, container, config);
    }
}
Also used: PreprocessedDictionaries (com.bakdata.conquery.models.preproc.PreprocessedDictionaries), Dictionary (com.bakdata.conquery.models.dictionary.Dictionary), WebApplicationException (javax.ws.rs.WebApplicationException), PreprocessedData (com.bakdata.conquery.models.preproc.PreprocessedData), BadRequestException (javax.ws.rs.BadRequestException), PreprocessedHeader (com.bakdata.conquery.models.preproc.PreprocessedHeader), PreprocessedReader (com.bakdata.conquery.models.preproc.PreprocessedReader)
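
A minimal sketch of how createOrUpdate might be driven by a caller holding a preprocessed file on disk. The GZIP wrapping mirrors Example 1, and the getJobManager().addSlowJob scheduling call is an assumption about how the returned ImportJob would be executed; neither is confirmed by the snippet above.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// conquery imports (Namespace, ConqueryConfig, IdMutex, DictionaryId, ImportJob) are omitted here;
// their packages are as in the project.

public class ImportDriverExample {

    // Hypothetical driver: namespace, config, bucket size and the shared dictionary locks
    // would normally be provided by the running server.
    public static void importPreprocessedFile(Namespace namespace, File preprocessedFile, int entityBucketSize,
                                              IdMutex<DictionaryId> sharedDictionaryLocks, ConqueryConfig config) throws IOException {
        // Example 1 reads preprocessed files through a GZIPInputStream, so the stream is wrapped the same way here;
        // adjust this if the caller already receives a decompressed stream.
        try (InputStream in = new GZIPInputStream(new FileInputStream(preprocessedFile))) {
            ImportJob job = ImportJob.createOrUpdate(namespace, in, entityBucketSize, sharedDictionaryLocks, config, /* update = */ false);
            // Assumed scheduling step: hand the job to the namespace's job manager for asynchronous execution.
            namespace.getJobManager().addSlowJob(job);
        }
    }
}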

Aggregations

PreprocessedHeader (com.bakdata.conquery.models.preproc.PreprocessedHeader): 2
PreprocessedReader (com.bakdata.conquery.models.preproc.PreprocessedReader): 2
Dictionary (com.bakdata.conquery.models.dictionary.Dictionary): 1
PreprocessedData (com.bakdata.conquery.models.preproc.PreprocessedData): 1
PreprocessedDictionaries (com.bakdata.conquery.models.preproc.PreprocessedDictionaries): 1
FileInputStream (java.io.FileInputStream): 1
FileNotFoundException (java.io.FileNotFoundException): 1
IOException (java.io.IOException): 1
GZIPInputStream (java.util.zip.GZIPInputStream): 1
BadRequestException (javax.ws.rs.BadRequestException): 1
WebApplicationException (javax.ws.rs.WebApplicationException): 1
SneakyThrows (lombok.SneakyThrows): 1