Example 1 with FixedLengthCharsetTransformingCodec

Use of io.cdap.plugin.format.charset.fixedlength.FixedLengthCharsetTransformingCodec in project hydrator-plugins by cdapio.

From the class CharsetTransformingLineRecordReader, the initialize method:

/**
 * Initialize method, simplified from the base class for this use case.
 *
 * @param genericSplit File Split
 * @param context      Execution context
 * @throws IOException if the underlying file or decompression operations fail.
 */
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(file);
    SplittableCompressionCodec codec = new FixedLengthCharsetTransformingCodec(fixedLengthCharset);
    decompressor = codec.createDecompressor();
    final SplitCompressionInputStream cIn =
        codec.createInputStream(fileIn, decompressor, start, end,
                                SplittableCompressionCodec.READ_MODE.CONTINUOUS);
    in = new CompressedSplitLineReader(cIn, job, this.recordDelimiterBytes);
    start = cIn.getAdjustedStart();
    end = cIn.getAdjustedEnd();
    filePosition = cIn;
    // If this is not the first split, discard the first (possibly partial) line:
    // the reader for the previous split reads one extra line past its own end in
    // its next() method, so that line already belongs to the previous reader.
    if (start != 0) {
        Text t = new Text();
        start += in.readLine(t, 4096, Integer.MAX_VALUE);
        LOG.info("Discarded line: " + t.toString());
    }
    this.pos = start;
}
Also used:
    io.cdap.plugin.format.charset.fixedlength.FixedLengthCharsetTransformingCodec
    org.apache.hadoop.conf.Configuration
    org.apache.hadoop.fs.FSDataInputStream
    org.apache.hadoop.fs.FileSystem
    org.apache.hadoop.fs.Path
    org.apache.hadoop.io.Text
    org.apache.hadoop.io.compress.SplitCompressionInputStream
    org.apache.hadoop.io.compress.SplittableCompressionCodec
    org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader
    org.apache.hadoop.mapreduce.lib.input.FileSplit
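The split-boundary handling in initialize can be illustrated without Hadoop: when a split does not start at byte 0, the reader throws away everything up to the first newline, because the reader for the previous split consumes one extra line past its own end. A minimal sketch with plain java.io (the class and method names here are illustrative, not part of the plugin):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class SplitLineDiscard {

    // Returns the first full record visible to a reader whose split begins at
    // splitStart. If the split does not start at byte 0, the first (possibly
    // partial) line is discarded, mirroring the start != 0 branch above.
    public static String firstFullLine(String file, int splitStart) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(file.substring(splitStart)));
        if (splitStart != 0) {
            reader.readLine(); // discard the line the previous split's reader owns
        }
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        String file = "alpha\nbravo\ncharlie\n";
        // A split starting at offset 8 lands inside "bravo"; the partial "avo"
        // is discarded and the first record for this split is "charlie".
        System.out.println(firstFullLine(file, 8)); // prints "charlie"
    }
}
```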

Example 2 with FixedLengthCharsetTransformingCodec

Use of io.cdap.plugin.format.charset.fixedlength.FixedLengthCharsetTransformingCodec in project hydrator-plugins by cdapio.

From the class CharsetTransformingLineRecordReaderTest, the before method:

@Before
public void before() throws IOException {
    // Set up the Compressed Split Line Reader with a buffer size of 4096 bytes.
    // This ensures the buffer will consume all characters in the input stream if we allow it to.
    conf = new Configuration();
    conf.setInt("io.file.buffer.size", 4096);
    fixedLengthCharset = FixedLengthCharset.UTF_32;
    codec = new FixedLengthCharsetTransformingCodec(fixedLengthCharset);
    codec.setConf(conf);
    inputStream = new SeekableByteArrayInputStream(input.getBytes(fixedLengthCharset.getCharset()));
    availableBytes = inputStream.available();
}
Also used:
    io.cdap.plugin.format.charset.fixedlength.FixedLengthCharsetTransformingCodec
    org.apache.hadoop.conf.Configuration
    org.junit.Before
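The test fixes the charset to FixedLengthCharset.UTF_32 because UTF-32 is a fixed-length encoding: every character occupies exactly 4 bytes, so any byte offset can be snapped to a character boundary with integer arithmetic. A self-contained sketch of that property (the alignToCharBoundary helper is illustrative, not the codec's actual API):

```java
import java.nio.charset.Charset;

public class FixedLengthAlignment {

    // UTF-32 uses exactly 4 bytes per character, which is what lets the codec
    // start decoding at an arbitrary split offset after rounding it down to a
    // character boundary.
    static final int BYTES_PER_CHAR = 4;

    public static long alignToCharBoundary(long byteOffset) {
        return (byteOffset / BYTES_PER_CHAR) * BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        // UTF-32BE writes no byte-order mark, so the length is purely 4 * chars.
        byte[] encoded = "abc".getBytes(Charset.forName("UTF-32BE"));
        System.out.println(encoded.length);          // prints 12 (3 chars * 4 bytes)
        System.out.println(alignToCharBoundary(10)); // prints 8 (snapped to a boundary)
    }
}
```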

Aggregations

FixedLengthCharsetTransformingCodec (io.cdap.plugin.format.charset.fixedlength.FixedLengthCharsetTransformingCodec): 2
Configuration (org.apache.hadoop.conf.Configuration): 2
FSDataInputStream (org.apache.hadoop.fs.FSDataInputStream): 1
FileSystem (org.apache.hadoop.fs.FileSystem): 1
Path (org.apache.hadoop.fs.Path): 1
Text (org.apache.hadoop.io.Text): 1
SplitCompressionInputStream (org.apache.hadoop.io.compress.SplitCompressionInputStream): 1
SplittableCompressionCodec (org.apache.hadoop.io.compress.SplittableCompressionCodec): 1
CompressedSplitLineReader (org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader): 1
FileSplit (org.apache.hadoop.mapreduce.lib.input.FileSplit): 1
Before (org.junit.Before): 1