
Example 1 with EDataCorruption

Use of org.finos.tracdap.common.exception.EDataCorruption in the tracdap project by finos.

Class ArrowStreamDecoder, method decodeChunk.

@Override
protected void decodeChunk(ByteBuf chunk) {
    try (var stream = new ByteSeekableChannel(chunk)) {
        // Arrow does not attempt to validate the stream before reading
        // This quick validation peeks at the start of the stream for a basic sanity check
        // It should be enough to flag e.g. if data has been sent in a totally different format
        // Make sure to do this check before setting up reader + root,
        // since that will trigger reading the initial schema message
        validateStartOfStream(stream);
        try (var reader = new ArrowStreamReader(stream, arrowAllocator);
            var root = reader.getVectorSchemaRoot()) {
            var schema = root.getSchema();
            emitBlock(DataBlock.forSchema(schema));
            var unloader = new VectorUnloader(root);
            while (reader.loadNextBatch()) {
                var batch = unloader.getRecordBatch();
                emitBlock(DataBlock.forRecords(batch));
                // Release memory retained in VSR (batch still has a reference)
                root.clear();
            }
        }
    } catch (NotAnArrowStream e) {
        // A nice clean validation exception
        var errorMessage = "Arrow stream decoding failed, content does not look like an Arrow stream";
        log.error(errorMessage, e);
        throw new EDataCorruption(errorMessage, e);
    } catch (IllegalArgumentException | IndexOutOfBoundsException | IOException e) {
        // These errors occur if the data stream contains bad values for vector sizes, offsets etc.
        // This may be as a result of a corrupt data stream, or a maliciously crafted message
        // Decoders work on a stream of buffers, "real" IO exceptions should not occur
        var errorMessage = "Arrow stream decoding failed, content is garbled";
        log.error(errorMessage, e);
        throw new EDataCorruption(errorMessage, e);
    } catch (Throwable e) {
        // Ensure unexpected errors are still reported to the Flow API
        log.error("Unexpected error in Arrow stream decoding", e);
        throw new EUnexpected(e);
    } finally {
        chunk.release();
    }
}
Also used: VectorUnloader (org.apache.arrow.vector.VectorUnloader), ByteSeekableChannel (org.finos.tracdap.common.util.ByteSeekableChannel), ArrowStreamReader (org.apache.arrow.vector.ipc.ArrowStreamReader), EDataCorruption (org.finos.tracdap.common.exception.EDataCorruption), IOException (java.io.IOException), EUnexpected (org.finos.tracdap.common.exception.EUnexpected)
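
The comments in decodeChunk describe peeking at the start of the stream before the reader is created, but the body of validateStartOfStream is not shown here. As a rough sketch of what such a check might look like, the following assumes the Arrow IPC streaming layout in which each encapsulated message begins with a 4-byte continuation marker (0xFFFFFFFF) followed by a little-endian int32 metadata length. The class ArrowStreamPeek and the method looksLikeArrowStream are hypothetical names for illustration and are not part of tracdap.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, NOT the tracdap implementation of validateStartOfStream
final class ArrowStreamPeek {

    // Assumed layout: continuation marker that prefixes each encapsulated IPC message
    private static final int CONTINUATION_MARKER = 0xFFFFFFFF;

    // Returns true if the first 8 bytes look like the start of an Arrow IPC stream
    static boolean looksLikeArrowStream(byte[] head) {
        if (head == null || head.length < 8)
            return false;
        var buffer = ByteBuffer.wrap(head).order(ByteOrder.LITTLE_ENDIAN);
        var marker = buffer.getInt();
        var metadataLength = buffer.getInt();
        // The first message carries the schema, so its metadata length should be positive
        return marker == CONTINUATION_MARKER && metadataLength > 0;
    }

    public static void main(String[] args) {
        byte[] plausible = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 0x10, 0x01, 0x00, 0x00};
        byte[] json = "{\"a\": 1}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeArrowStream(plausible));
        System.out.println(looksLikeArrowStream(json));
    }
}
```

A check like this only rules out content in a completely different format. Bad sizes or offsets later in the stream would still surface as the exceptions handled in the second catch block.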

Example 2 with EDataCorruption

Use of org.finos.tracdap.common.exception.EDataCorruption in the tracdap project by finos.

Class JsonDecoder, method decodeChunk.

@Override
protected void decodeChunk(ByteBuf chunk) {
    try {
        var bytes = new byte[chunk.readableBytes()];
        chunk.readBytes(bytes);
        parser.feedInput(bytes, 0, bytes.length);
        JsonToken token;
        while ((token = parser.nextToken()) != JsonToken.NOT_AVAILABLE) parser.acceptToken(token);
    } catch (JacksonException e) {
        // This exception is a "well-behaved" parse failure, parse location and message should be meaningful
        var errorMessage = String.format("JSON decoding failed on line %d: %s", e.getLocation().getLineNr(), e.getOriginalMessage());
        log.error(errorMessage, e);
        throw new EDataCorruption(errorMessage, e);
    } catch (IOException e) {
        // Decoders work on a stream of buffers, "real" IO exceptions should not occur
        // IO exceptions here indicate parse failures, not file/socket communication errors
        // This is likely to be a more "badly-behaved" failure, or at least one that was not anticipated
        var errorMessage = "JSON decoding failed, content is garbled: " + e.getMessage();
        log.error(errorMessage, e);
        throw new EDataCorruption(errorMessage, e);
    } catch (Throwable e) {
        // Ensure unexpected errors are still reported to the Flow API
        log.error("Unexpected error during decoding", e);
        throw new EUnexpected(e);
    } finally {
        chunk.release();
    }
}
Also used: JacksonException (com.fasterxml.jackson.core.JacksonException), EDataCorruption (org.finos.tracdap.common.exception.EDataCorruption), JsonToken (com.fasterxml.jackson.core.JsonToken), IOException (java.io.IOException), EUnexpected (org.finos.tracdap.common.exception.EUnexpected)
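
The decodeChunk methods in these examples share one error-handling shape: parse failures are translated into a data-corruption error, anything else is wrapped as unexpected, and the chunk is released in a finally block whatever happens. The sketch below isolates that shape using only the JDK. Every type in it (DataCorruption, Unexpected, Chunk, ChunkParser) is a simplified, hypothetical stand-in, not the tracdap or Netty class it resembles.

```java
import java.io.IOException;

// Illustrative only: simplified stand-ins for the exception and buffer types used above
final class DecodeErrorMapping {

    static class DataCorruption extends RuntimeException {
        DataCorruption(String message, Throwable cause) { super(message, cause); }
    }

    static class Unexpected extends RuntimeException {
        Unexpected(Throwable cause) { super(cause); }
    }

    // Stand-in for a reference-counted buffer
    static class Chunk {
        int refCount = 1;
        void release() { refCount--; }
    }

    interface ChunkParser {
        void parse(Chunk chunk) throws IOException;
    }

    static void decode(Chunk chunk, ChunkParser parser) {
        try {
            parser.parse(chunk);
        } catch (IOException e) {
            // Parse failure: report as corrupt data, keeping the original cause
            throw new DataCorruption("Decoding failed, content is garbled: " + e.getMessage(), e);
        } catch (Throwable e) {
            // Anything else is reported as unexpected rather than being lost
            throw new Unexpected(e);
        } finally {
            // Runs on success and on every failure path
            chunk.release();
        }
    }

    public static void main(String[] args) {
        var chunk = new Chunk();
        try {
            decode(chunk, c -> { throw new IOException("bad offset"); });
        } catch (DataCorruption e) {
            System.out.println(e.getMessage());
        }
        System.out.println(chunk.refCount);
    }
}
```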

Example 3 with EDataCorruption

Use of org.finos.tracdap.common.exception.EDataCorruption in the tracdap project by finos.

Class ArrowFileDecoder, method decodeChunk.

@Override
protected void decodeChunk(ByteBuf chunk) {
    try (var stream = new ByteSeekableChannel(chunk);
        var reader = new ArrowFileReader(stream, arrowAllocator);
        var root = reader.getVectorSchemaRoot()) {
        var schema = root.getSchema();
        emitBlock(DataBlock.forSchema(schema));
        var unloader = new VectorUnloader(root);
        while (reader.loadNextBatch()) {
            var batch = unloader.getRecordBatch();
            emitBlock(DataBlock.forRecords(batch));
        }
    } catch (InvalidArrowFileException e) {
        // A nice clean validation failure from the Arrow framework
        // E.g. missing / incorrect magic number at the start (or end) of the file
        var errorMessage = "Arrow file decoding failed, file is invalid: " + e.getMessage();
        log.error(errorMessage, e);
        throw new EDataCorruption(errorMessage, e);
    } catch (IllegalArgumentException | IndexOutOfBoundsException | IOException e) {
        // These errors occur if the data stream contains bad values for vector sizes, offsets etc.
        // This may be as a result of a corrupt data stream, or a maliciously crafted message
        // Decoders work on a stream of buffers, "real" IO exceptions should not occur
        var errorMessage = "Arrow file decoding failed, content is garbled";
        log.error(errorMessage, e);
        throw new EDataCorruption(errorMessage, e);
    } catch (Throwable e) {
        // Ensure unexpected errors are still reported to the Flow API
        log.error("Unexpected error in Arrow file decoding", e);
        throw new EUnexpected(e);
    } finally {
        chunk.release();
    }
}
Also used: VectorUnloader (org.apache.arrow.vector.VectorUnloader), ByteSeekableChannel (org.finos.tracdap.common.util.ByteSeekableChannel), ArrowFileReader (org.apache.arrow.vector.ipc.ArrowFileReader), InvalidArrowFileException (org.apache.arrow.vector.ipc.InvalidArrowFileException), EDataCorruption (org.finos.tracdap.common.exception.EDataCorruption), IOException (java.io.IOException), EUnexpected (org.finos.tracdap.common.exception.EUnexpected)
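
The first catch block mentions a missing or incorrect magic number at the start or end of the file. In the Arrow file format that magic number is the ASCII string "ARROW1", written at both ends of the file. The sketch below shows roughly what such a check involves. It is a hypothetical stand-alone helper (ArrowFileMagic.hasArrowMagic), not the validation that ArrowFileReader itself performs, and it ignores the padding, footer and footer length that a real file also contains.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration, not part of Arrow or tracdap
final class ArrowFileMagic {

    private static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the content starts and ends with the Arrow file magic string
    static boolean hasArrowMagic(byte[] content) {
        // Too short to hold a leading and a trailing magic without overlap
        if (content == null || content.length < 2 * MAGIC.length)
            return false;
        var head = Arrays.copyOfRange(content, 0, MAGIC.length);
        var tail = Arrays.copyOfRange(content, content.length - MAGIC.length, content.length);
        return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
    }

    public static void main(String[] args) {
        var framed = "ARROW1\0\0payloadARROW1".getBytes(StandardCharsets.US_ASCII);
        var csv = "id,name\n1,alice\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasArrowMagic(framed));
        System.out.println(hasArrowMagic(csv));
    }
}
```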

Example 4 with EDataCorruption

Use of org.finos.tracdap.common.exception.EDataCorruption in the tracdap project by finos.

Class CsvDecoder, method decodeChunk.

@Override
protected void decodeChunk(ByteBuf chunk) {
    var csvFactory = new CsvFactory()
            .enable(CsvParser.Feature.FAIL_ON_MISSING_COLUMNS)
            .enable(CsvParser.Feature.EMPTY_STRING_AS_NULL)
            .enable(CsvParser.Feature.TRIM_SPACES);
    try (var stream = new ByteBufInputStream(chunk);
        var parser = (CsvParser) csvFactory.createParser((InputStream) stream)) {
        var csvSchema = CsvSchemaMapping.arrowToCsv(this.arrowSchema).build();
        csvSchema = DEFAULT_HEADER_FLAG ? csvSchema.withHeader() : csvSchema.withoutHeader();
        parser.setSchema(csvSchema);
        var row = 0;
        var col = 0;
        JsonToken token;
        while ((token = parser.nextToken()) != null) {
            switch(token) {
                // For CSV files, a null field name is produced for every field
                case FIELD_NAME:
                    continue;
                case VALUE_NULL:
                    // Special handling to differentiate between null and empty strings
                    var nullVector = root.getVector(col);
                    var minorType = nullVector.getMinorType();
                    if (minorType == Types.MinorType.VARCHAR) {
                        // Null strings are encoded with no space between commas (or EOL): some_value,,next_value
                        // An empty string is encoded as "", i.e. token width = 2 (or more with padding)
                        // Using token end - token start, a gap between commas -> empty string instead of null
                        // It would be nicer to check the original bytes to see if there are quote chars in there
                        // But this is not possible with the current Jackson API
                        var tokenStart = parser.currentTokenLocation();
                        var tokenEnd = parser.currentLocation();
                        var tokenWidth = tokenEnd.getColumnNr() - tokenStart.getColumnNr();
                        if (tokenWidth > 1) {
                            JacksonValues.setEmptyString(nullVector, row);
                            col++;
                            continue;
                        }
                    }
                    // Falls through: a null that is not an empty string is set by the shared value handling below
                case VALUE_TRUE:
                case VALUE_FALSE:
                case VALUE_STRING:
                case VALUE_NUMBER_INT:
                case VALUE_NUMBER_FLOAT:
                    var vector = root.getVector(col);
                    JacksonValues.parseAndSet(vector, row, parser, token);
                    col++;
                    break;
                case START_OBJECT:
                    if (row == 0)
                        for (var vector_ : root.getFieldVectors()) vector_.allocateNew();
                    break;
                case END_OBJECT:
                    row++;
                    col = 0;
                    if (row == BATCH_SIZE) {
                        root.setRowCount(row);
                        dispatchBatch(root);
                        row = 0;
                    }
                    break;
                default:
                    var msg = String.format("Unexpected token %s", token.name());
                    throw new CsvReadException(parser, msg, csvSchema);
            }
        }
        if (row > 0 || col > 0) {
            root.setRowCount(row);
            dispatchBatch(root);
        }
    } catch (JacksonException e) {
        // This exception is a "well-behaved" parse failure, parse location and message should be meaningful
        var errorMessage = String.format("CSV decoding failed on line %d: %s", e.getLocation().getLineNr(), e.getOriginalMessage());
        log.error(errorMessage, e);
        throw new EDataCorruption(errorMessage, e);
    } catch (IOException e) {
        // Decoders work on a stream of buffers, "real" IO exceptions should not occur
        // IO exceptions here indicate parse failures, not file/socket communication errors
        // This is likely to be a more "badly-behaved" failure, or at least one that was not anticipated
        var errorMessage = "CSV decoding failed, content is garbled: " + e.getMessage();
        log.error(errorMessage, e);
        throw new EDataCorruption(errorMessage, e);
    } catch (Throwable e) {
        // Ensure unexpected errors are still reported to the Flow API
        log.error("Unexpected error in CSV decoding", e);
        throw new EUnexpected(e);
    } finally {
        chunk.release();
    }
}
Also used: CsvFactory (com.fasterxml.jackson.dataformat.csv.CsvFactory), JacksonException (com.fasterxml.jackson.core.JacksonException), CsvReadException (com.fasterxml.jackson.dataformat.csv.CsvReadException), ByteBufInputStream (io.netty.buffer.ByteBufInputStream), InputStream (java.io.InputStream), EDataCorruption (org.finos.tracdap.common.exception.EDataCorruption), CsvParser (com.fasterxml.jackson.dataformat.csv.CsvParser), JsonToken (com.fasterxml.jackson.core.JsonToken), IOException (java.io.IOException), EUnexpected (org.finos.tracdap.common.exception.EUnexpected)
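
The VALUE_NULL branch separates a true null (nothing between two commas) from an empty string (written as "") by looking at how wide the token was in the input. The sketch below illustrates that width heuristic on its own. It measures the raw field text directly instead of using Jackson token locations, and it assumes fields never contain commas, so it is an illustration of the idea and not a reproduction of the decoder. CsvNullVsEmpty and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the token-width heuristic, independent of Jackson
final class CsvNullVsEmpty {

    enum Kind { NULL, EMPTY_STRING, VALUE }

    // Classifies one raw field (the text between two separators, quotes included)
    static Kind classify(String rawField) {
        var parsed = unquote(rawField.trim());
        if (!parsed.isEmpty())
            return Kind.VALUE;
        // The parsed value is empty: a raw width above 1 means something such as "" was written
        return rawField.length() > 1 ? Kind.EMPTY_STRING : Kind.NULL;
    }

    // Classifies every field of a line; assumes no commas inside quoted fields
    static List<Kind> classifyLine(String line) {
        var kinds = new ArrayList<Kind>();
        for (var rawField : line.split(",", -1))
            kinds.add(classify(rawField));
        return kinds;
    }

    private static String unquote(String text) {
        var quoted = text.length() >= 2 && text.startsWith("\"") && text.endsWith("\"");
        return quoted ? text.substring(1, text.length() - 1) : text;
    }

    public static void main(String[] args) {
        System.out.println(classifyLine("some_value,,\"\",next_value"));
    }
}
```

One limitation of judging by width alone is that an unquoted field made up of two or more spaces is also classified as an empty string, since no quote characters are inspected.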

Aggregations

IOException (java.io.IOException): 4
EDataCorruption (org.finos.tracdap.common.exception.EDataCorruption): 4
EUnexpected (org.finos.tracdap.common.exception.EUnexpected): 4
JacksonException (com.fasterxml.jackson.core.JacksonException): 2
JsonToken (com.fasterxml.jackson.core.JsonToken): 2
VectorUnloader (org.apache.arrow.vector.VectorUnloader): 2
ByteSeekableChannel (org.finos.tracdap.common.util.ByteSeekableChannel): 2
CsvFactory (com.fasterxml.jackson.dataformat.csv.CsvFactory): 1
CsvParser (com.fasterxml.jackson.dataformat.csv.CsvParser): 1
CsvReadException (com.fasterxml.jackson.dataformat.csv.CsvReadException): 1
ByteBufInputStream (io.netty.buffer.ByteBufInputStream): 1
InputStream (java.io.InputStream): 1
ArrowFileReader (org.apache.arrow.vector.ipc.ArrowFileReader): 1
ArrowStreamReader (org.apache.arrow.vector.ipc.ArrowStreamReader): 1
InvalidArrowFileException (org.apache.arrow.vector.ipc.InvalidArrowFileException): 1