
Example 1 with Record

Use of org.apache.flink.orc.data.Record in project flink by apache.

The class OrcBulkWriterITCase, method testOrcBulkWriter:

@Test
public void testOrcBulkWriter() throws Exception {
    final File outDir = TEMPORARY_FOLDER.newFolder();
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Ask the ORC writer for LZ4 compression; validated later via Reader#getCompressionKind.
    final Properties writerProps = new Properties();
    writerProps.setProperty("orc.compress", "LZ4");

    final OrcBulkWriterFactory<Record> factory =
            new OrcBulkWriterFactory<>(new RecordVectorizer(schema), writerProps, new Configuration());

    env.setParallelism(1);
    // Bulk formats roll part files on checkpoints, so checkpointing must be enabled.
    env.enableCheckpointing(100);

    DataStream<Record> stream =
            env.addSource(new FiniteTestSource<>(testData), TypeInformation.of(Record.class));
    stream.map(str -> str)
            .addSink(
                    StreamingFileSink.forBulkFormat(new Path(outDir.toURI()), factory)
                            .withBucketAssigner(new UniqueBucketAssigner<>("test"))
                            .build());

    env.execute();

    OrcBulkWriterTestUtil.validate(outDir, testData);
}
Also used : Arrays(java.util.Arrays) Properties(java.util.Properties) FiniteTestSource(org.apache.flink.streaming.util.FiniteTestSource) Test(org.junit.Test) File(java.io.File) DataStream(org.apache.flink.streaming.api.datastream.DataStream) List(java.util.List) UniqueBucketAssigner(org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.UniqueBucketAssigner) Path(org.apache.flink.core.fs.Path) OrcBulkWriterTestUtil(org.apache.flink.orc.util.OrcBulkWriterTestUtil) Configuration(org.apache.hadoop.conf.Configuration) StreamingFileSink(org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink) TestLogger(org.apache.flink.util.TestLogger) Record(org.apache.flink.orc.data.Record) TypeInformation(org.apache.flink.api.common.typeinfo.TypeInformation) ClassRule(org.junit.ClassRule) RecordVectorizer(org.apache.flink.orc.vector.RecordVectorizer) TemporaryFolder(org.junit.rules.TemporaryFolder) StreamExecutionEnvironment(org.apache.flink.streaming.api.environment.StreamExecutionEnvironment)
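
The snippet references two class fields, schema and testData, defined elsewhere in OrcBulkWriterITCase. A minimal sketch of what they could look like, consistent with the two-column schema from Example 2 and the three rows checked by OrcBulkWriterTestUtil below; the record names and values are illustrative, not taken from this page:

    // Hypothetical field definitions; the actual values live in the test class.
    private final String schema = "struct<_col0:string,_col1:int>";

    private final List<Record> testData =
            Arrays.asList(new Record("Shiv", 44), new Record("Jesse", 23), new Record("Walt", 50));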

Example 2 with Record

Use of org.apache.flink.orc.data.Record in project flink by apache.

The class OrcBulkWriterFactoryTest, method testNotOverrideInMemoryManager:

@Test
public void testNotOverrideInMemoryManager() throws IOException {
    TestMemoryManager memoryManager = new TestMemoryManager();
    OrcBulkWriterFactory<Record> factory =
            new TestOrcBulkWriterFactory<>(
                    new RecordVectorizer("struct<_col0:string,_col1:int>"), memoryManager);

    // Two writers created from the same factory must each register their own path
    // with the shared memory manager rather than replacing the previous registration.
    factory.create(new LocalDataOutputStream(temporaryFolder.newFile()));
    factory.create(new LocalDataOutputStream(temporaryFolder.newFile()));

    List<Path> addedWriterPath = memoryManager.getAddedWriterPath();
    assertEquals(2, addedWriterPath.size());
    assertNotEquals(addedWriterPath.get(0), addedWriterPath.get(1));
}
Also used : LocalDataOutputStream(org.apache.flink.core.fs.local.LocalDataOutputStream) Path(org.apache.hadoop.fs.Path) RecordVectorizer(org.apache.flink.orc.vector.RecordVectorizer) Record(org.apache.flink.orc.data.Record) Test(org.junit.Test)
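
TestMemoryManager and TestOrcBulkWriterFactory are private helpers of OrcBulkWriterFactoryTest and are not shown on this page. A minimal sketch of the memory manager, assuming it implements org.apache.orc.MemoryManager and does nothing beyond recording the path handed to addWriter (the factory subclass would then be responsible for injecting it into the ORC writer options):

    // Hypothetical sketch; the real helper is defined inside OrcBulkWriterFactoryTest.
    private static class TestMemoryManager implements MemoryManager {

        private final List<Path> addedWriterPath = new ArrayList<>();

        @Override
        public void addWriter(Path path, long requestedAllocation, Callback callback) {
            // Record every writer registration so the test can assert on distinct paths.
            addedWriterPath.add(path);
        }

        @Override
        public void removeWriter(Path path) {}

        @Override
        public void addedRow(int rows) {}

        public List<Path> getAddedWriterPath() {
            return addedWriterPath;
        }
    }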

Example 3 with Record

Use of org.apache.flink.orc.data.Record in project flink by apache.

The class OrcBulkWriterTestUtil, method validate:

public static void validate(File files, List<Record> expected) throws IOException {
    final File[] buckets = files.listFiles();
    assertNotNull(buckets);
    assertEquals(1, buckets.length);

    final File[] partFiles = buckets[0].listFiles();
    assertNotNull(partFiles);

    for (File partFile : partFiles) {
        assertTrue(partFile.length() > 0);

        OrcFile.ReaderOptions readerOptions = OrcFile.readerOptions(new Configuration());
        Reader reader = OrcFile.createReader(new org.apache.hadoop.fs.Path(partFile.toURI()), readerOptions);

        // The test data has three rows and a two-column (string, int) schema.
        assertEquals(3, reader.getNumberOfRows());
        assertEquals(2, reader.getSchema().getFieldNames().size());
        assertSame(CompressionKind.LZ4, reader.getCompressionKind());

        // The vectorizer attaches user metadata that must survive the round trip.
        assertTrue(reader.hasMetadataValue(USER_METADATA_KEY));
        assertTrue(reader.getMetadataKeys().contains(USER_METADATA_KEY));

        List<Record> results = getResults(reader);
        assertEquals(3, results.size());
        // JUnit's assertEquals takes the expected value first.
        assertEquals(expected, results);
    }
}
Also used : Configuration(org.apache.hadoop.conf.Configuration) OrcFile(org.apache.orc.OrcFile) RecordReader(org.apache.orc.RecordReader) Reader(org.apache.orc.Reader) Record(org.apache.flink.orc.data.Record) File(java.io.File)
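
The final assertEquals only holds if Record implements value equality. The real POJO lives in org.apache.flink.orc.data and is not shown here; a minimal sketch with the name/age fields implied by the vectorized columns:

    import java.util.Objects;

    // Hypothetical sketch; equals/hashCode are what make the list comparison above work.
    public class Record {

        private final String name;
        private final int age;

        public Record(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() {
            return name;
        }

        public int getAge() {
            return age;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Record)) {
                return false;
            }
            Record other = (Record) o;
            return age == other.age && Objects.equals(name, other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, age);
        }
    }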

Example 4 with Record

Use of org.apache.flink.orc.data.Record in project flink by apache.

The class OrcBulkWriterTestUtil, method getResults:

private static List<Record> getResults(Reader reader) throws IOException {
    List<Record> results = new ArrayList<>();
    RecordReader recordReader = reader.rows();
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();

    while (recordReader.nextBatch(batch)) {
        BytesColumnVector stringVector = (BytesColumnVector) batch.cols[0];
        LongColumnVector intVector = (LongColumnVector) batch.cols[1];
        for (int r = 0; r < batch.size; r++) {
            // Column 0 holds the name as bytes, column 1 the age as a long.
            String name = new String(stringVector.vector[r], stringVector.start[r], stringVector.length[r]);
            int age = (int) intVector.vector[r];
            results.add(new Record(name, age));
        }
    }
    // Close only after the last batch has been consumed; closing inside the loop
    // would invalidate the reader before the next call to nextBatch().
    recordReader.close();

    return results;
}
Also used : VectorizedRowBatch(org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch) RecordReader(org.apache.orc.RecordReader) ArrayList(java.util.ArrayList) BytesColumnVector(org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector) Record(org.apache.flink.orc.data.Record) LongColumnVector(org.apache.hadoop.hive.ql.exec.vector.LongColumnVector)
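
getResults is the read-side mirror of the RecordVectorizer used throughout these examples: the vectorizer fills batch.cols[0] with name bytes and batch.cols[1] with ages. A minimal sketch of that write side, assuming Flink's Vectorizer base class; the real class presumably also registers the user metadata checked in validate, which is omitted here:

    import java.io.IOException;
    import java.io.Serializable;
    import java.nio.charset.StandardCharsets;

    import org.apache.flink.orc.data.Record;
    import org.apache.flink.orc.vector.Vectorizer;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

    // Hypothetical sketch of org.apache.flink.orc.vector.RecordVectorizer.
    public class RecordVectorizer extends Vectorizer<Record> implements Serializable {

        public RecordVectorizer(String schema) {
            super(schema);
        }

        @Override
        public void vectorize(Record element, VectorizedRowBatch batch) throws IOException {
            BytesColumnVector stringVector = (BytesColumnVector) batch.cols[0];
            LongColumnVector intVector = (LongColumnVector) batch.cols[1];
            // Append one row: name bytes into column 0, age into column 1.
            int row = batch.size++;
            stringVector.setVal(row, element.getName().getBytes(StandardCharsets.UTF_8));
            intVector.vector[row] = element.getAge();
        }
    }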

Example 5 with Record

Use of org.apache.flink.orc.data.Record in project flink by apache.

The class OrcBulkWriterTest, method testOrcBulkWriter:

@Test
public void testOrcBulkWriter() throws Exception {
    final File outDir = TEMPORARY_FOLDER.newFolder();

    final Properties writerProps = new Properties();
    writerProps.setProperty("orc.compress", "LZ4");

    final OrcBulkWriterFactory<Record> writer =
            new OrcBulkWriterFactory<>(new RecordVectorizer(schema), writerProps, new Configuration());

    StreamingFileSink<Record> sink =
            StreamingFileSink.forBulkFormat(new Path(outDir.toURI()), writer)
                    .withBucketAssigner(new UniqueBucketAssigner<>("test"))
                    .withBucketCheckInterval(10000)
                    .build();

    // Drive the sink directly with a one-input operator test harness (parallelism 1, subtask 0).
    try (OneInputStreamOperatorTestHarness<Record, Object> testHarness =
            new OneInputStreamOperatorTestHarness<>(new StreamSink<>(sink), 1, 1, 0)) {
        testHarness.setup();
        testHarness.open();

        int time = 0;
        for (final Record record : input) {
            testHarness.processElement(record, ++time);
        }

        // Part files are only finalized once a checkpoint completes.
        testHarness.snapshot(1, ++time);
        testHarness.notifyOfCompletedCheckpoint(1);

        OrcBulkWriterTestUtil.validate(outDir, input);
    }
}
Also used : Path(org.apache.flink.core.fs.Path) Configuration(org.apache.hadoop.conf.Configuration) OneInputStreamOperatorTestHarness(org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness) Properties(java.util.Properties) RecordVectorizer(org.apache.flink.orc.vector.RecordVectorizer) Record(org.apache.flink.orc.data.Record) File(java.io.File) Test(org.junit.Test)
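
Note the explicit snapshot and notifyOfCompletedCheckpoint calls before validation: StreamingFileSink only commits bulk-encoded part files when a checkpoint completes, so without them the ORC file would remain in progress and validate would find nothing finished to read.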

Aggregations

Record (org.apache.flink.orc.data.Record) 5
File (java.io.File) 3
RecordVectorizer (org.apache.flink.orc.vector.RecordVectorizer) 3
Configuration (org.apache.hadoop.conf.Configuration) 3
Test (org.junit.Test) 3
Properties (java.util.Properties) 2
Path (org.apache.flink.core.fs.Path) 2
RecordReader (org.apache.orc.RecordReader) 2
ArrayList (java.util.ArrayList) 1
Arrays (java.util.Arrays) 1
List (java.util.List) 1
TypeInformation (org.apache.flink.api.common.typeinfo.TypeInformation) 1
LocalDataOutputStream (org.apache.flink.core.fs.local.LocalDataOutputStream) 1
OrcBulkWriterTestUtil (org.apache.flink.orc.util.OrcBulkWriterTestUtil) 1
DataStream (org.apache.flink.streaming.api.datastream.DataStream) 1
StreamExecutionEnvironment (org.apache.flink.streaming.api.environment.StreamExecutionEnvironment) 1
StreamingFileSink (org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink) 1
UniqueBucketAssigner (org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.UniqueBucketAssigner) 1
FiniteTestSource (org.apache.flink.streaming.util.FiniteTestSource) 1
OneInputStreamOperatorTestHarness (org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness) 1