Example 1 with AvroParquetWriter

Use of org.apache.parquet.avro.AvroParquetWriter in project nifi by apache.

From the class FetchParquetTest, method writeParquetUsers.

private void writeParquetUsers(final File parquetFile, int numUsers) throws IOException {
    if (parquetFile.exists()) {
        Assert.assertTrue(parquetFile.delete());
    }
    final Path parquetPath = new Path(parquetFile.getPath());
    final AvroParquetWriter.Builder<GenericRecord> writerBuilder = AvroParquetWriter.<GenericRecord>builder(parquetPath)
            .withSchema(schema)
            .withConf(testConf);
    try (final ParquetWriter<GenericRecord> writer = writerBuilder.build()) {
        for (int i = 0; i < numUsers; i++) {
            final GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Bob" + i);
            user.put("favorite_number", i);
            user.put("favorite_color", "blue" + i);
            writer.write(user);
        }
    }
}
Also used : Path(org.apache.hadoop.fs.Path) AvroParquetWriter(org.apache.parquet.avro.AvroParquetWriter) Record(org.apache.nifi.serialization.record.Record) GenericRecord(org.apache.avro.generic.GenericRecord)

Example 2 with AvroParquetWriter

Use of org.apache.parquet.avro.AvroParquetWriter in project alluxio by Alluxio.

From the class ParquetWriter, method create.

/**
 * Creates a Parquet writer specifying a row group size.
 *
 * @param schema the schema
 * @param uri the URI to the output
 * @param rowGroupSize the row group size
 * @param enableDictionary whether to enable dictionary
 * @param compressionCodec the compression codec name
 * @return the writer
 */
public static ParquetWriter create(TableSchema schema, AlluxioURI uri, int rowGroupSize, boolean enableDictionary, String compressionCodec) throws IOException {
    Configuration conf = ReadWriterUtils.writeThroughConf();
    ParquetSchema parquetSchema = schema.toParquet();
    return new ParquetWriter(AvroParquetWriter.<Record>builder(
            HadoopOutputFile.fromPath(new JobPath(uri.getScheme(), uri.getAuthority().toString(), uri.getPath()), conf))
        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
        .withConf(conf)
        .withCompressionCodec(CompressionCodecName.fromConf(compressionCodec))
        .withRowGroupSize(rowGroupSize)
        .withDictionaryPageSize(org.apache.parquet.hadoop.ParquetWriter.DEFAULT_PAGE_SIZE)
        .withDictionaryEncoding(enableDictionary)
        .withPageSize(org.apache.parquet.hadoop.ParquetWriter.DEFAULT_PAGE_SIZE)
        .withDataModel(GenericData.get())
        .withSchema(parquetSchema.getSchema())
        .build());
}
Also used : JobPath(alluxio.job.plan.transform.format.JobPath) Configuration(org.apache.hadoop.conf.Configuration) AvroParquetWriter(org.apache.parquet.avro.AvroParquetWriter)

Example 3 with AvroParquetWriter

Use of org.apache.parquet.avro.AvroParquetWriter in project h2o-3 by h2oai.

From the class ParquetFileGenerator, method generateAvroPrimitiveTypes.

static File generateAvroPrimitiveTypes(File parentDir, String filename, int nrows, Date date) throws IOException {
    File f = new File(parentDir, filename);
    Schema schema = new Schema.Parser().parse(Resources.getResource("PrimitiveAvro.avsc").openStream());
    AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(new Path(f.getPath()), schema);
    try {
        DateFormat format = new SimpleDateFormat("yy-MMM-dd:hh.mm.ss.SSS aaa");
        for (int i = 0; i < nrows; i++) {
            GenericData.Record record = new GenericRecordBuilder(schema)
                    .set("mynull", null)
                    .set("myboolean", i % 2 == 0)
                    .set("myint", 1 + i)
                    .set("mylong", 2L + i)
                    .set("myfloat", 3.1f + i)
                    .set("mydouble", 4.1 + i)
                    .set("mydate", format.format(new Date(date.getTime() - (i * 1000 * 3600))))
                    .set("myuuid", UUID.randomUUID())
                    .set("mystring", "hello world: " + i)
                    .set("myenum", i % 2 == 0 ? "a" : "b")
                    .build();
            writer.write(record);
        }
    } finally {
        writer.close();
    }
    return f;
}
Also used : Path(org.apache.hadoop.fs.Path) Schema(org.apache.avro.Schema) AvroParquetWriter(org.apache.parquet.avro.AvroParquetWriter) GenericData(org.apache.avro.generic.GenericData) SimpleDateFormat(java.text.SimpleDateFormat) DateFormat(java.text.DateFormat) GenericRecordBuilder(org.apache.avro.generic.GenericRecordBuilder) GenericRecord(org.apache.avro.generic.GenericRecord) File(java.io.File)

Example 4 with AvroParquetWriter

Use of org.apache.parquet.avro.AvroParquetWriter in project nifi by apache.

From the class PutParquet, method createHDFSRecordWriter.

@Override
public HDFSRecordWriter createHDFSRecordWriter(final ProcessContext context, final FlowFile flowFile, final Configuration conf, final Path path, final RecordSchema schema) throws IOException, SchemaNotFoundException {
    final Schema avroSchema = AvroTypeUtil.extractAvroSchema(schema);
    final AvroParquetWriter.Builder<GenericRecord> parquetWriter = AvroParquetWriter.<GenericRecord>builder(path).withSchema(avroSchema);
    applyCommonConfig(parquetWriter, context, flowFile, conf);
    return new AvroParquetHDFSRecordWriter(parquetWriter.build(), avroSchema);
}
Also used : RecordSchema(org.apache.nifi.serialization.record.RecordSchema) Schema(org.apache.avro.Schema) AvroParquetWriter(org.apache.parquet.avro.AvroParquetWriter) AvroParquetHDFSRecordWriter(org.apache.nifi.processors.parquet.record.AvroParquetHDFSRecordWriter) GenericRecord(org.apache.avro.generic.GenericRecord)

Aggregations

AvroParquetWriter (org.apache.parquet.avro.AvroParquetWriter): 4
GenericRecord (org.apache.avro.generic.GenericRecord): 3
Schema (org.apache.avro.Schema): 2
Path (org.apache.hadoop.fs.Path): 2
JobPath (alluxio.job.plan.transform.format.JobPath): 1
File (java.io.File): 1
DateFormat (java.text.DateFormat): 1
SimpleDateFormat (java.text.SimpleDateFormat): 1
GenericData (org.apache.avro.generic.GenericData): 1
GenericRecordBuilder (org.apache.avro.generic.GenericRecordBuilder): 1
Configuration (org.apache.hadoop.conf.Configuration): 1
AvroParquetHDFSRecordWriter (org.apache.nifi.processors.parquet.record.AvroParquetHDFSRecordWriter): 1
Record (org.apache.nifi.serialization.record.Record): 1
RecordSchema (org.apache.nifi.serialization.record.RecordSchema): 1