
Example 96 with CompressionCodec

Use of org.apache.hadoop.io.compress.CompressionCodec in project presto by prestodb.

The class HiveWriteUtils, method createRcFileWriter:

private static RecordWriter createRcFileWriter(Path target, JobConf conf, Properties properties, boolean compress) throws IOException {
    int columns = properties.getProperty(META_TABLE_COLUMNS).split(",").length;
    RCFileOutputFormat.setColumnNumber(conf, columns);
    CompressionCodec codec = null;
    if (compress) {
        codec = ReflectionUtil.newInstance(getOutputCompressorClass(conf, DefaultCodec.class), conf);
    }
    // Pass a no-op Progressable; a null codec means the writer produces uncompressed output.
    RCFile.Writer writer = new RCFile.Writer(target.getFileSystem(conf), conf, target, () -> {}, codec);
    return new ExtendedRecordWriter() {

        private long length;

        @Override
        public long getWrittenBytes() {
            return length;
        }

        @Override
        public void write(Writable value) throws IOException {
            writer.append(value);
            length = writer.getLength();
        }

        @Override
        public void close(boolean abort) throws IOException {
            writer.close();
            if (!abort) {
                length = target.getFileSystem(conf).getFileStatus(target).getLen();
            }
        }
    };
}
Also used: RCFile(org.apache.hadoop.hive.ql.io.RCFile), ExtendedRecordWriter(com.facebook.presto.hive.RecordFileWriter.ExtendedRecordWriter), DateWritable(org.apache.hadoop.hive.serde2.io.DateWritable), Writable(org.apache.hadoop.io.Writable), IntWritable(org.apache.hadoop.io.IntWritable), BooleanWritable(org.apache.hadoop.io.BooleanWritable), DoubleWritable(org.apache.hadoop.hive.serde2.io.DoubleWritable), FloatWritable(org.apache.hadoop.io.FloatWritable), LongWritable(org.apache.hadoop.io.LongWritable), ShortWritable(org.apache.hadoop.hive.serde2.io.ShortWritable), ByteWritable(org.apache.hadoop.io.ByteWritable), BytesWritable(org.apache.hadoop.io.BytesWritable), TimestampWritable(org.apache.hadoop.hive.serde2.io.TimestampWritable), HiveDecimalWritable(org.apache.hadoop.hive.serde2.io.HiveDecimalWritable), CompressionCodec(org.apache.hadoop.io.compress.CompressionCodec), RecordWriter(org.apache.hadoop.hive.ql.exec.FileSinkOperator.RecordWriter), ParquetRecordWriterUtil.createParquetWriter(com.facebook.presto.hive.ParquetRecordWriterUtil.createParquetWriter)
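
For reference, here is a minimal standalone sketch of the standard Hadoop idiom the compress branch above relies on: resolving the job's configured output codec, falling back to DefaultCodec, and instantiating it. The gzip request, the class name OutputCodecExample, and the use of Hadoop's ReflectionUtils (rather than Hive's ReflectionUtil) are illustrative assumptions, not part of the Presto code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ReflectionUtils;

public class OutputCodecExample {

    public static void main(String[] args) {
        JobConf conf = new JobConf(new Configuration());
        // Illustrative settings: ask for compressed output and pick gzip for this job.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        // Resolve the configured codec class (DefaultCodec if none is set) and instantiate it.
        Class<? extends CompressionCodec> codecClass = FileOutputFormat.getOutputCompressorClass(conf, DefaultCodec.class);
        CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
        System.out.println("Resolved output codec: " + codec.getClass().getName());
    }
}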

Example 97 with CompressionCodec

Use of org.apache.hadoop.io.compress.CompressionCodec in project shifu by ShifuML.

The class CombineRecordReader, method initializeOne:

public void initializeOne(FileSplit split, TaskAttemptContext context) throws IOException {
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null != codec) {
        isCompressedInput = true;
        decompressor = CodecPool.getDecompressor(codec);
        if (codec instanceof SplittableCompressionCodec) {
            final SplitCompressionInputStream cIn = ((SplittableCompressionCodec) codec).createInputStream(fileIn, decompressor, start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK);
            in = new CompressedSplitLineReader(cIn, job, this.recordDelimiterBytes);
            start = cIn.getAdjustedStart();
            end = cIn.getAdjustedEnd();
            filePosition = cIn;
        } else {
            in = new SplitLineReader(codec.createInputStream(fileIn, decompressor), job, this.recordDelimiterBytes);
            filePosition = fileIn;
        }
    } else {
        fileIn.seek(start);
        in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
        filePosition = fileIn;
    }
    // If this is not the first split, discard the first (possibly partial) line; the previous
    // split reads one extra line past its end, so no record is lost or duplicated.
    if (start != 0) {
        start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
}
Also used: Path(org.apache.hadoop.fs.Path), SplitLineReader(org.apache.hadoop.mapreduce.lib.input.SplitLineReader), CompressedSplitLineReader(org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader), SplittableCompressionCodec(org.apache.hadoop.io.compress.SplittableCompressionCodec), Configuration(org.apache.hadoop.conf.Configuration), CompressionCodecFactory(org.apache.hadoop.io.compress.CompressionCodecFactory), SplitCompressionInputStream(org.apache.hadoop.io.compress.SplitCompressionInputStream), FileSystem(org.apache.hadoop.fs.FileSystem), Text(org.apache.hadoop.io.Text), CompressionCodec(org.apache.hadoop.io.compress.CompressionCodec)
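
The splittable-versus-non-splittable branch above is what decides whether a compressed split can be read starting from the middle of a file. Here is a small self-contained sketch of that check, using hypothetical file names and the same extension-based CompressionCodecFactory lookup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecSplittabilityCheck {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Hypothetical file names; the factory resolves the codec from the file extension alone.
        for (String name : new String[] { "data.txt", "data.gz", "data.bz2" }) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            if (codec == null) {
                System.out.println(name + ": no codec, the input can be split at any offset");
            } else if (codec instanceof SplittableCompressionCodec) {
                System.out.println(name + ": " + codec.getClass().getSimpleName() + ", splittable while compressed");
            } else {
                System.out.println(name + ": " + codec.getClass().getSimpleName() + ", the whole file must go to a single reader");
            }
        }
    }
}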

Example 98 with CompressionCodec

Use of org.apache.hadoop.io.compress.CompressionCodec in project shifu by ShifuML.

The class ShifuFileUtils, method readFilePartsIntoList:

public static List<String> readFilePartsIntoList(String filePath, SourceType sourceType) throws IOException {
    List<String> lines = new ArrayList<String>();
    FileSystem fs = getFileSystemBySourceType(sourceType);
    FileStatus[] fileStatsArr = getFilePartStatus(filePath, sourceType);
    CompressionCodecFactory compressionFactory = new CompressionCodecFactory(new Configuration());
    for (FileStatus fileStatus : fileStatsArr) {
        InputStream is = null;
        CompressionCodec codec = compressionFactory.getCodec(fileStatus.getPath());
        if (codec != null) {
            is = codec.createInputStream(fs.open(fileStatus.getPath()));
        } else {
            is = fs.open(fileStatus.getPath());
        }
        lines.addAll(IOUtils.readLines(is));
        IOUtils.closeQuietly(is);
    }
    return lines;
}
Also used: FileStatus(org.apache.hadoop.fs.FileStatus), CompressionCodecFactory(org.apache.hadoop.io.compress.CompressionCodecFactory), Configuration(org.apache.hadoop.conf.Configuration), GZIPInputStream(java.util.zip.GZIPInputStream), BufferedInputStream(java.io.BufferedInputStream), BZip2CompressorInputStream(org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream), FSDataInputStream(org.apache.hadoop.fs.FSDataInputStream), InputStream(java.io.InputStream), FileSystem(org.apache.hadoop.fs.FileSystem), ArrayList(java.util.ArrayList), CompressionCodec(org.apache.hadoop.io.compress.CompressionCodec)
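
A minimal, self-contained sketch of the same detect-then-decompress read pattern for a single file, using plain JDK line reading instead of the commons-io IOUtils call above; the path argument is a placeholder supplied by the caller:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadMaybeCompressed {

    public static List<String> readLines(Path path, Configuration conf) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        // Wrap the raw stream in a decompressing stream only when the extension matched a codec.
        InputStream raw = fs.open(path);
        InputStream in = (codec != null) ? codec.createInputStream(raw) : raw;
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}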

Example 99 with CompressionCodec

Use of org.apache.hadoop.io.compress.CompressionCodec in project shifu by ShifuML.

The class HdfsPartFile, method openPartFileAsStream:

private InputStream openPartFileAsStream(FileStatus fileStatus) throws IOException {
    CompressionCodecFactory compressionFactory = new CompressionCodecFactory(new Configuration());
    InputStream is = null;
    FileSystem fs = ShifuFileUtils.getFileSystemBySourceType(sourceType);
    CompressionCodec codec = compressionFactory.getCodec(fileStatus.getPath());
    if (codec != null) {
        is = codec.createInputStream(fs.open(fileStatus.getPath()));
    } else {
        is = fs.open(fileStatus.getPath());
    }
    return is;
}
Also used: CompressionCodecFactory(org.apache.hadoop.io.compress.CompressionCodecFactory), Configuration(org.apache.hadoop.conf.Configuration), InputStream(java.io.InputStream), FileSystem(org.apache.hadoop.fs.FileSystem), CompressionCodec(org.apache.hadoop.io.compress.CompressionCodec)
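
The returned stream wraps fs.open(...), so the caller is responsible for closing it. As a variation, here is a sketch (with an assumed method name and signature) that also borrows a Decompressor from CodecPool, as Example 97 does, and hands it back after the read so pooled decompressors are reused:

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.Decompressor;

public class PooledDecompressorRead {

    public static long countUncompressedBytes(Path path, Configuration conf) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        // Borrow a decompressor from the pool only when the file is actually compressed.
        Decompressor decompressor = (codec != null) ? CodecPool.getDecompressor(codec) : null;
        try (InputStream in = (codec != null) ? codec.createInputStream(fs.open(path), decompressor) : fs.open(path)) {
            long total = 0;
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
            return total;
        } finally {
            // Always return the decompressor so its native resources can be reused.
            if (decompressor != null) {
                CodecPool.returnDecompressor(decompressor);
            }
        }
    }
}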

Example 100 with CompressionCodec

Use of org.apache.hadoop.io.compress.CompressionCodec in project flink by apache.

The class SequenceFileWriterFactory, method create:

@Override
public SequenceFileWriter<K, V> create(FSDataOutputStream out) throws IOException {
    org.apache.hadoop.fs.FSDataOutputStream stream = new org.apache.hadoop.fs.FSDataOutputStream(out, null);
    CompressionCodec compressionCodec = getCompressionCodec(serializableHadoopConfig.get(), compressionCodecName);
    SequenceFile.Writer writer = SequenceFile.createWriter(serializableHadoopConfig.get(), SequenceFile.Writer.stream(stream), SequenceFile.Writer.keyClass(keyClass), SequenceFile.Writer.valueClass(valueClass), SequenceFile.Writer.compression(compressionType, compressionCodec));
    return new SequenceFileWriter<>(writer);
}
Also used: SequenceFile(org.apache.hadoop.io.SequenceFile), FSDataOutputStream(org.apache.flink.core.fs.FSDataOutputStream), CompressionCodec(org.apache.hadoop.io.compress.CompressionCodec)
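
A minimal sketch of the underlying Hadoop call the factory wraps: creating a compressed SequenceFile writer directly. The local path, the Text key/value classes, the BLOCK compression type, and the "Gzip" codec name are placeholder choices for illustration, not Flink defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class SequenceFileWriteExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Look up a codec by name, the same kind of lookup a codec-name configuration implies.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodecByName("Gzip");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/example.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec))) {
            // Append one key/value pair; both must match the declared key and value classes.
            writer.append(new Text("key"), new Text("value"));
        }
    }
}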

Aggregations

CompressionCodec (org.apache.hadoop.io.compress.CompressionCodec): 111
Path (org.apache.hadoop.fs.Path): 54
FileSystem (org.apache.hadoop.fs.FileSystem): 41
Configuration (org.apache.hadoop.conf.Configuration): 38
CompressionCodecFactory (org.apache.hadoop.io.compress.CompressionCodecFactory): 37
InputStream (java.io.InputStream): 18
IOException (java.io.IOException): 17
Test (org.junit.Test): 17
FSDataInputStream (org.apache.hadoop.fs.FSDataInputStream): 15
Text (org.apache.hadoop.io.Text): 14
Configurable (org.apache.hadoop.conf.Configurable): 10
GzipCodec (org.apache.hadoop.io.compress.GzipCodec): 10
JobConf (org.apache.hadoop.mapred.JobConf): 10
SequenceFile (org.apache.hadoop.io.SequenceFile): 9
OutputStream (java.io.OutputStream): 8
DefaultCodec (org.apache.hadoop.io.compress.DefaultCodec): 8
FileInputStream (java.io.FileInputStream): 7
FSDataOutputStream (org.apache.hadoop.fs.FSDataOutputStream): 6
ByteString (com.google.protobuf.ByteString): 5
DataInputStream (java.io.DataInputStream): 5