Search in sources:

Example 1 with RowDataVectorizer

Use of org.apache.flink.orc.vector.RowDataVectorizer in the Apache Flink project.

From the class OrcFileFormatFactory, method createEncodingFormat:

@Override
public EncodingFormat<BulkWriter.Factory<RowData>> createEncodingFormat(
        DynamicTableFactory.Context context, ReadableConfig formatOptions) {
    return new EncodingFormat<BulkWriter.Factory<RowData>>() {

        @Override
        public BulkWriter.Factory<RowData> createRuntimeEncoder(
                DynamicTableSink.Context sinkContext, DataType consumedDataType) {
            RowType formatRowType = (RowType) consumedDataType.getLogicalType();
            LogicalType[] orcTypes =
                    formatRowType.getChildren().toArray(new LogicalType[0]);
            // Translate the Flink row type into the equivalent ORC schema.
            TypeDescription typeDescription =
                    OrcSplitReaderUtil.logicalTypeToOrcType(formatRowType);
            return new OrcBulkWriterFactory<>(
                    new RowDataVectorizer(typeDescription.toString(), orcTypes),
                    getOrcProperties(formatOptions),
                    new Configuration());
        }

        @Override
        public ChangelogMode getChangelogMode() {
            // The ORC bulk writer only supports append-only (insert) streams.
            return ChangelogMode.insertOnly();
        }
    };
}
Also used: EncodingFormat (org.apache.flink.table.connector.format.EncodingFormat), OrcBulkWriterFactory (org.apache.flink.orc.writer.OrcBulkWriterFactory), RowData (org.apache.flink.table.data.RowData), RowDataVectorizer (org.apache.flink.orc.vector.RowDataVectorizer), Configuration (org.apache.hadoop.conf.Configuration), BulkWriter (org.apache.flink.api.common.serialization.BulkWriter), DataType (org.apache.flink.table.types.DataType), RowType (org.apache.flink.table.types.logical.RowType), LogicalType (org.apache.flink.table.types.logical.LogicalType), TypeDescription (org.apache.orc.TypeDescription)
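
For context, a minimal sketch of the same wiring outside a table factory is shown below: it builds a RowType by hand, derives the ORC schema from it, and constructs the OrcBulkWriterFactory directly. The two-column schema and the field names are illustrative assumptions, not code from the Flink sources.

// A minimal sketch, assuming a hypothetical two-column schema; only the
// wiring mirrors createRuntimeEncoder above.
RowType rowType =
        RowType.of(
                new LogicalType[] {new VarCharType(), new IntType()},
                new String[] {"name", "age"});
// Translate the Flink row type into the equivalent ORC TypeDescription.
TypeDescription typeDescription = OrcSplitReaderUtil.logicalTypeToOrcType(rowType);
LogicalType[] orcTypes = rowType.getChildren().toArray(new LogicalType[0]);
OrcBulkWriterFactory<RowData> factory =
        new OrcBulkWriterFactory<>(
                new RowDataVectorizer(typeDescription.toString(), orcTypes),
                new Properties(), // no extra ORC writer properties
                new Configuration());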

Example 2 with RowDataVectorizer

Use of org.apache.flink.orc.vector.RowDataVectorizer in the Apache Flink project.

From the class OrcBulkRowDataWriterTest, method testOrcBulkWriterWithRowData:

@Test
public void testOrcBulkWriterWithRowData() throws Exception {
    final File outDir = TEMPORARY_FOLDER.newFolder();
    final Properties writerProps = new Properties();
    writerProps.setProperty("orc.compress", "LZ4");
    final OrcBulkWriterFactory<RowData> writer =
            new OrcBulkWriterFactory<>(
                    new RowDataVectorizer(schema, fieldTypes),
                    writerProps,
                    new Configuration());
    StreamingFileSink<RowData> sink =
            StreamingFileSink.forBulkFormat(new Path(outDir.toURI()), writer)
                    .withBucketAssigner(new UniqueBucketAssigner<>("test"))
                    .withBucketCheckInterval(10000)
                    .build();
    try (OneInputStreamOperatorTestHarness<RowData, Object> testHarness =
            new OneInputStreamOperatorTestHarness<>(new StreamSink<>(sink), 1, 1, 0)) {
        testHarness.setup();
        testHarness.open();
        int time = 0;
        for (final RowData record : input) {
            testHarness.processElement(record, ++time);
        }
        // Trigger and complete a checkpoint so the in-progress part file is committed.
        testHarness.snapshot(1, ++time);
        testHarness.notifyOfCompletedCheckpoint(1);
        validate(outDir, input);
    }
}
Also used: Path (org.apache.flink.core.fs.Path), Configuration (org.apache.hadoop.conf.Configuration), OneInputStreamOperatorTestHarness (org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness), Properties (java.util.Properties), GenericRowData (org.apache.flink.table.data.GenericRowData), RowData (org.apache.flink.table.data.RowData), RowDataVectorizer (org.apache.flink.orc.vector.RowDataVectorizer), OrcFile (org.apache.orc.OrcFile), File (java.io.File), Test (org.junit.Test)
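
The test references schema, fieldTypes, and input fixtures that are defined elsewhere in OrcBulkRowDataWriterTest and not shown on this page. A plausible minimal set of fixtures is sketched below; the concrete columns and values are assumptions for illustration, not the actual test data. StringData here is org.apache.flink.table.data.StringData, and List/Collections come from java.util.

// Hypothetical fixtures matching the shapes the test expects.
private static final LogicalType[] fieldTypes =
        new LogicalType[] {new VarCharType(), new IntType()};
private static final String schema = "struct<_col0:string,_col1:int>";
private static final List<RowData> input =
        Collections.singletonList(GenericRowData.of(StringData.fromString("flink"), 1));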

Example 3 with RowDataVectorizer

Use of org.apache.flink.orc.vector.RowDataVectorizer in the Apache Flink project.

From the class OrcFileSystemITCase, method initNestedTypesFile:

private String initNestedTypesFile(List<RowData> data) throws Exception {
    LogicalType[] fieldTypes = new LogicalType[4];
    fieldTypes[0] = new VarCharType();
    fieldTypes[1] = new IntType();
    List<RowType.RowField> arrayRowFieldList =
            Collections.singletonList(new RowType.RowField("_col2_col0", new VarCharType()));
    fieldTypes[2] = new ArrayType(new RowType(arrayRowFieldList));
    List<RowType.RowField> mapRowFieldList =
            Arrays.asList(
                    new RowType.RowField("_col3_col0", new VarCharType()),
                    new RowType.RowField("_col3_col1", new TimestampType()));
    fieldTypes[3] = new MapType(new VarCharType(), new RowType(mapRowFieldList));
    // ORC schema string matching the four logical types above.
    String schema =
            "struct<_col0:string,_col1:int,_col2:array<struct<_col2_col0:string>>,"
                    + "_col3:map<string,struct<_col3_col0:string,_col3_col1:timestamp>>>";
    File outDir = TEMPORARY_FOLDER.newFolder();
    Properties writerProps = new Properties();
    writerProps.setProperty("orc.compress", "LZ4");
    final OrcBulkWriterFactory<RowData> writer =
            new OrcBulkWriterFactory<>(
                    new RowDataVectorizer(schema, fieldTypes),
                    writerProps,
                    new Configuration());
    StreamingFileSink<RowData> sink =
            StreamingFileSink.forBulkFormat(
                            new org.apache.flink.core.fs.Path(outDir.toURI()), writer)
                    .withBucketCheckInterval(10000)
                    .build();
    try (OneInputStreamOperatorTestHarness<RowData, Object> testHarness =
            new OneInputStreamOperatorTestHarness<>(new StreamSink<>(sink), 1, 1, 0)) {
        testHarness.setup();
        testHarness.open();
        int time = 0;
        for (final RowData record : data) {
            testHarness.processElement(record, ++time);
        }
        // Checkpoint and commit so the ORC file is finalized before it is read back.
        testHarness.snapshot(1, ++time);
        testHarness.notifyOfCompletedCheckpoint(1);
    }
    return outDir.getAbsolutePath();
}
Also used: Configuration (org.apache.hadoop.conf.Configuration), LogicalType (org.apache.flink.table.types.logical.LogicalType), RowType (org.apache.flink.table.types.logical.RowType), Properties (java.util.Properties), MapType (org.apache.flink.table.types.logical.MapType), IntType (org.apache.flink.table.types.logical.IntType), ArrayType (org.apache.flink.table.types.logical.ArrayType), OrcBulkWriterFactory (org.apache.flink.orc.writer.OrcBulkWriterFactory), GenericRowData (org.apache.flink.table.data.GenericRowData), RowData (org.apache.flink.table.data.RowData), RowDataVectorizer (org.apache.flink.orc.vector.RowDataVectorizer), TimestampType (org.apache.flink.table.types.logical.TimestampType), VarCharType (org.apache.flink.table.types.logical.VarCharType), OneInputStreamOperatorTestHarness (org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness), OrcFile (org.apache.orc.OrcFile), File (java.io.File)
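
The data argument passed to initNestedTypesFile must contain rows shaped like the nested schema above. A sketch of building one such row with Flink's generic data classes follows; the concrete values are illustrative assumptions. GenericArrayData, GenericMapData, and TimestampData are from org.apache.flink.table.data, and Map/HashMap from java.util.

// One hypothetical row for the schema
// struct<_col0:string,_col1:int,_col2:array<struct<...>>,_col3:map<...>>.
GenericRowData row = new GenericRowData(4);
row.setField(0, StringData.fromString("a"));
row.setField(1, 1);
// _col2: array with a single nested struct element.
row.setField(
        2,
        new GenericArrayData(
                new Object[] {GenericRowData.of(StringData.fromString("b"))}));
// _col3: map from string to a struct of (string, timestamp).
Map<StringData, RowData> map = new HashMap<>();
map.put(
        StringData.fromString("key"),
        GenericRowData.of(
                StringData.fromString("value"),
                TimestampData.fromTimestamp(new java.sql.Timestamp(0))));
row.setField(3, new GenericMapData(map));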

Aggregations

RowDataVectorizer (org.apache.flink.orc.vector.RowDataVectorizer): 3
RowData (org.apache.flink.table.data.RowData): 3
Configuration (org.apache.hadoop.conf.Configuration): 3
File (java.io.File): 2
Properties (java.util.Properties): 2
OrcBulkWriterFactory (org.apache.flink.orc.writer.OrcBulkWriterFactory): 2
OneInputStreamOperatorTestHarness (org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness): 2
GenericRowData (org.apache.flink.table.data.GenericRowData): 2
LogicalType (org.apache.flink.table.types.logical.LogicalType): 2
RowType (org.apache.flink.table.types.logical.RowType): 2
OrcFile (org.apache.orc.OrcFile): 2
BulkWriter (org.apache.flink.api.common.serialization.BulkWriter): 1
Path (org.apache.flink.core.fs.Path): 1
EncodingFormat (org.apache.flink.table.connector.format.EncodingFormat): 1
DataType (org.apache.flink.table.types.DataType): 1
ArrayType (org.apache.flink.table.types.logical.ArrayType): 1
IntType (org.apache.flink.table.types.logical.IntType): 1
MapType (org.apache.flink.table.types.logical.MapType): 1
TimestampType (org.apache.flink.table.types.logical.TimestampType): 1
VarCharType (org.apache.flink.table.types.logical.VarCharType): 1