Example 21 with CompressionCodecFactory

Use of org.apache.hadoop.io.compress.CompressionCodecFactory in project jena by apache.

From class AbstractWholeFileNodeTupleReader, method initialize.

@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
    LOG.debug("initialize({}, {})", genericSplit, context);
    // Assuming file split
    if (!(genericSplit instanceof FileSplit))
        throw new IOException("This record reader only supports FileSplit inputs");
    FileSplit split = (FileSplit) genericSplit;
    // Configuration
    Configuration config = context.getConfiguration();
    this.ignoreBadTuples = config.getBoolean(RdfIOConstants.INPUT_IGNORE_BAD_TUPLES, true);
    if (this.ignoreBadTuples)
        LOG.warn("Configured to ignore bad tuples, parsing errors will be logged and further parsing aborted but no user visible errors will be thrown.  Consider setting {} to false to disable this behaviour", RdfIOConstants.INPUT_IGNORE_BAD_TUPLES);
    // Figure out what portion of the file to read
    if (split.getStart() > 0)
        throw new IOException("This record reader requires a file split which covers the entire file");
    final Path file = split.getPath();
    long totalLength = file.getFileSystem(context.getConfiguration()).getFileStatus(file).getLen();
    CompressionCodecFactory factory = new CompressionCodecFactory(config);
    this.compressionCodecs = factory.getCodec(file);
    LOG.info(String.format("Got split with start %d and length %d for file with total length of %d", new Object[] { split.getStart(), split.getLength(), totalLength }));
    if (totalLength > split.getLength())
        throw new IOException("This record reader requires a file split which covers the entire file");
    // Open the file and prepare the input stream
    FileSystem fs = file.getFileSystem(config);
    FSDataInputStream fileIn = fs.open(file);
    this.length = split.getLength();
    if (this.compressionCodecs != null) {
        // Compressed input
        input = new TrackedInputStream(this.compressionCodecs.createInputStream(fileIn));
    } else {
        // Uncompressed input
        input = new TrackedInputStream(fileIn);
    }
    // Set up background thread for parser
    iter = this.getPipedIterator();
    this.stream = this.getPipedStream(iter, this.input);
    RDFParserBuilder builder = RdfIOUtils.createRDFParserBuilder(context, file);
    Runnable parserRunnable = this.createRunnable(this, this.input, stream, this.getRdfLanguage(), builder);
    this.parserThread = new Thread(parserRunnable);
    this.parserThread.setDaemon(true);
    this.parserThread.start();
}
Also used: Path (org.apache.hadoop.fs.Path), Configuration (org.apache.hadoop.conf.Configuration), IOException (java.io.IOException), FileSplit (org.apache.hadoop.mapreduce.lib.input.FileSplit), CompressionCodecFactory (org.apache.hadoop.io.compress.CompressionCodecFactory), FileSystem (org.apache.hadoop.fs.FileSystem), RDFParserBuilder (org.apache.jena.riot.RDFParserBuilder), TrackedInputStream (org.apache.jena.hadoop.rdf.io.input.util.TrackedInputStream), FSDataInputStream (org.apache.hadoop.fs.FSDataInputStream)
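
The core pattern in this example is the null check on the codec: CompressionCodecFactory.getCodec(Path) resolves a codec from the file extension (for example .gz or .bz2) and returns null when no registered codec matches, which is how the reader decides between wrapping the stream for decompression and reading it raw. A minimal standalone sketch of that pattern follows; the class name CodecAwareOpen and the helper method are hypothetical, not part of the Jena code above.

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecAwareOpen {

    /** Opens a file, transparently decompressing it when its extension maps to a codec. */
    public static InputStream open(Configuration conf, Path file) throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream raw = fs.open(file);
        // getCodec() keys off the file name suffix and returns null for unknown extensions
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        return codec != null ? codec.createInputStream(raw) : raw;
    }
}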

Example 22 with CompressionCodecFactory

Use of org.apache.hadoop.io.compress.CompressionCodecFactory in project hadoop by apache.

From class ITestS3AInputStreamPerformance, method executeDecompression.

/**
   * Execute a decompression + line read with the given input policy.
   * @param readahead byte readahead
   * @param inputPolicy read policy
   * @throws IOException IO Problems
   */
private void executeDecompression(long readahead, S3AInputPolicy inputPolicy) throws IOException {
    CompressionCodecFactory factory = new CompressionCodecFactory(getConf());
    CompressionCodec codec = factory.getCodec(testData);
    long bytesRead = 0;
    int lines = 0;
    FSDataInputStream objectIn = openTestFile(inputPolicy, readahead);
    ContractTestUtils.NanoTimer timer = new ContractTestUtils.NanoTimer();
    try (LineReader lineReader = new LineReader(codec.createInputStream(objectIn), getConf())) {
        Text line = new Text();
        int read;
        while ((read = lineReader.readLine(line)) > 0) {
            bytesRead += read;
            lines++;
        }
    } catch (EOFException eof) {
    // done
    }
    timer.end("Time to read %d lines [%d bytes expanded, %d raw]" + " with readahead = %d", lines, bytesRead, testDataStatus.getLen(), readahead);
    logTimePerIOP("line read", timer, lines);
    logStreamStatistics();
}
Also used: CompressionCodecFactory (org.apache.hadoop.io.compress.CompressionCodecFactory), ContractTestUtils (org.apache.hadoop.fs.contract.ContractTestUtils), LineReader (org.apache.hadoop.util.LineReader), EOFException (java.io.EOFException), FSDataInputStream (org.apache.hadoop.fs.FSDataInputStream), Text (org.apache.hadoop.io.Text), CompressionCodec (org.apache.hadoop.io.compress.CompressionCodec)
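
Note that the test above assumes its input is compressed: if getCodec() returned null for an unrecognised extension, constructing the LineReader would throw a NullPointerException. A minimal sketch of the same decompress-and-count-lines pattern with that guard added follows; the class CompressedLineCount and method countLines are illustrative, not taken from the Hadoop test.

import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.util.LineReader;

public class CompressedLineCount {

    /** Counts lines in a compressed file, decompressing on the fly. */
    public static long countLines(Configuration conf, Path file) throws IOException {
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec == null) {
            throw new IOException("No compression codec registered for " + file);
        }
        FileSystem fs = file.getFileSystem(conf);
        long lines = 0;
        try (FSDataInputStream in = fs.open(file);
             LineReader reader = new LineReader(codec.createInputStream(in), conf)) {
            Text line = new Text();
            while (reader.readLine(line) > 0) {
                lines++;
            }
        } catch (EOFException eof) {
            // some codec streams signal end of input with EOFException, as in the test above
        }
        return lines;
    }
}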

Aggregations

CompressionCodecFactory (org.apache.hadoop.io.compress.CompressionCodecFactory): 22 usages
CompressionCodec (org.apache.hadoop.io.compress.CompressionCodec): 18 usages
FileSystem (org.apache.hadoop.fs.FileSystem): 14 usages
FSDataInputStream (org.apache.hadoop.fs.FSDataInputStream): 9 usages
Path (org.apache.hadoop.fs.Path): 9 usages
Configuration (org.apache.hadoop.conf.Configuration): 7 usages
IOException (java.io.IOException): 6 usages
DataInputStream (java.io.DataInputStream): 4 usages
Text (org.apache.hadoop.io.Text): 3 usages
FileSplit (org.apache.hadoop.mapreduce.lib.input.FileSplit): 3 usages
LineReader (org.apache.hadoop.util.LineReader): 3 usages
InputStream (java.io.InputStream): 2 usages
OutputStream (java.io.OutputStream): 2 usages
PcapReader (net.ripe.hadoop.pcap.PcapReader): 2 usages
CompressionInputStream (org.apache.hadoop.io.compress.CompressionInputStream): 2 usages
JobConf (org.apache.hadoop.mapred.JobConf): 2 usages
RDFParserBuilder (org.apache.jena.riot.RDFParserBuilder): 2 usages
JsonGenerator (com.fasterxml.jackson.core.JsonGenerator): 1 usage
Slice (io.airlift.slice.Slice): 1 usage
BufferedInputStream (java.io.BufferedInputStream): 1 usage