
Example 1 with CSVProperties

Use of org.apache.parquet.cli.csv.CSVProperties in the parquet-mr project (Apache).

From the class ConvertCSVCommand, the run method:

@Override
@SuppressWarnings("unchecked")
public int run() throws IOException {
    Preconditions.checkArgument(targets != null && targets.size() == 1, "CSV path is required.");
    if (header != null) {
        // if a header is given on the command line, don't assume one is in the file
        noHeader = true;
    }
    CSVProperties props = new CSVProperties.Builder()
        .delimiter(delimiter)
        .escape(escape)
        .quote(quote)
        .header(header)
        .hasHeader(!noHeader)
        .linesToSkip(linesToSkip)
        .charset(charsetName)
        .build();
    String source = targets.get(0);
    Schema csvSchema;
    if (avroSchemaFile != null) {
        csvSchema = Schemas.fromAvsc(open(avroSchemaFile));
    } else {
        Set<String> required = ImmutableSet.of();
        if (requiredFields != null) {
            required = ImmutableSet.copyOf(requiredFields);
        }
        String filename = new File(source).getName();
        String recordName;
        if (filename.contains(".")) {
            recordName = filename.substring(0, filename.indexOf("."));
        } else {
            recordName = filename;
        }
        csvSchema = AvroCSV.inferNullableSchema(recordName, open(source), props, required);
    }
    long count = 0;
    try (AvroCSVReader<Record> reader = new AvroCSVReader<>(open(source), props, csvSchema, Record.class, true)) {
        CompressionCodecName codec = Codecs.parquetCodec(compressionCodecName);
        try (ParquetWriter<Record> writer = AvroParquetWriter.<Record>builder(qualifiedPath(outputPath))
                .withWriterVersion(v2 ? PARQUET_2_0 : PARQUET_1_0)
                .withWriteMode(overwrite ? ParquetFileWriter.Mode.OVERWRITE : ParquetFileWriter.Mode.CREATE)
                .withCompressionCodec(codec)
                .withDictionaryEncoding(true)
                .withDictionaryPageSize(dictionaryPageSize)
                .withPageSize(pageSize)
                .withRowGroupSize(rowGroupSize)
                .withDataModel(GenericData.get())
                .withConf(getConf())
                .withSchema(csvSchema)
                .build()) {
            for (Record record : reader) {
                writer.write(record);
                // track position so a failure can be reported by record number
                count += 1;
            }
        } catch (RuntimeException e) {
            throw new RuntimeException("Failed on record " + count, e);
        }
    }
    return 0;
}
Also used: Schema (org.apache.avro.Schema), CSVProperties (org.apache.parquet.cli.csv.CSVProperties), AvroCSVReader (org.apache.parquet.cli.csv.AvroCSVReader), CompressionCodecName (org.apache.parquet.hadoop.metadata.CompressionCodecName), Record (org.apache.avro.generic.GenericData.Record), File (java.io.File)
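When no Avro schema file is supplied, the command derives the record name from the sample filename, cutting at the first dot. The standalone helper below is not part of parquet-mr; it is a minimal sketch that reproduces that inline logic for illustration:

```java
import java.io.File;

public class RecordNameDemo {

    // Hypothetical helper mirroring the inline logic in ConvertCSVCommand.run():
    // the record name is the sample filename up to (but excluding) the first dot.
    static String recordName(String path) {
        String filename = new File(path).getName();
        return filename.contains(".")
            ? filename.substring(0, filename.indexOf("."))
            : filename;
    }

    public static void main(String[] args) {
        System.out.println(recordName("/data/users.2024.csv")); // prints "users"
        System.out.println(recordName("events"));               // prints "events"
    }
}
```

Note that `indexOf(".")` (not `lastIndexOf`) is used, so a filename like `users.2024.csv` yields the record name `users`, not `users.2024`.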

Example 2 with CSVProperties

Use of org.apache.parquet.cli.csv.CSVProperties in the parquet-mr project (Apache).

From the class CSVSchemaCommand, the run method:

@Override
public int run() throws IOException {
    Preconditions.checkArgument(samplePaths != null && !samplePaths.isEmpty(), "Sample CSV path is required");
    Preconditions.checkArgument(samplePaths.size() == 1, "Only one CSV sample can be given");
    if (header != null) {
        // if a header is given on the command line, don't assume one is in the file
        noHeader = true;
    }
    CSVProperties props = new CSVProperties.Builder()
        .delimiter(delimiter)
        .escape(escape)
        .quote(quote)
        .header(header)
        .hasHeader(!noHeader)
        .linesToSkip(linesToSkip)
        .charset(charsetName)
        .build();
    Set<String> required = ImmutableSet.of();
    if (requiredFields != null) {
        required = ImmutableSet.copyOf(requiredFields);
    }
    // assume fields are nullable by default, users can easily change this
    String sampleSchema = AvroCSV.inferNullableSchema(recordName, open(samplePaths.get(0)), props, required).toString(!minimize);
    output(sampleSchema, console, outputPath);
    return 0;
}
Also used: CSVProperties (org.apache.parquet.cli.csv.CSVProperties)
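As the comment in the example notes, inferred fields are nullable by default. Assuming a hypothetical sample with columns `id` and `name`, the emitted schema would look roughly like the following Avro JSON, with each field wrapped in a nullable union and defaulted to null:

```json
{
  "type": "record",
  "name": "sample",
  "fields": [
    {"name": "id", "type": ["null", "long"], "default": null},
    {"name": "name", "type": ["null", "string"], "default": null}
  ]
}
```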

Aggregations

CSVProperties (org.apache.parquet.cli.csv.CSVProperties): 2 uses
File (java.io.File): 1 use
Schema (org.apache.avro.Schema): 1 use
Record (org.apache.avro.generic.GenericData.Record): 1 use
AvroCSVReader (org.apache.parquet.cli.csv.AvroCSVReader): 1 use
CompressionCodecName (org.apache.parquet.hadoop.metadata.CompressionCodecName): 1 use