
Example 6 with SeekableInput

Use of org.apache.avro.file.SeekableInput in project parquet-mr by apache.

Class SchemaCommand, method getParquetSchema:

private String getParquetSchema(String source) throws IOException {
    Formats.Format format;
    try (SeekableInput in = openSeekable(source)) {
        format = Formats.detectFormat((InputStream) in);
        // rewind so the format-specific reader starts from the beginning of the file
        in.seek(0);
        switch (format) {
            case PARQUET:
                try (ParquetFileReader reader = new ParquetFileReader(getConf(), qualifiedPath(source), ParquetMetadataConverter.NO_FILTER)) {
                    return reader.getFileMetaData().getSchema().toString();
                }
            default:
                throw new IllegalArgumentException(String.format("Could not get a Parquet schema for format %s: %s", format, source));
        }
    }
}
Also used: InputStream (java.io.InputStream), ParquetFileReader (org.apache.parquet.hadoop.ParquetFileReader), SeekableInput (org.apache.avro.file.SeekableInput), Formats (org.apache.parquet.cli.util.Formats)
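
For reference, here is a minimal sketch of the detect-then-rewind pattern above against a plain local file, using Avro's SeekableFileInput instead of the CLI's openSeekable helper. SeekableFileInput extends FileInputStream, so it is both a SeekableInput and an InputStream (the cast in getParquetSchema likewise assumes the SeekableInput it receives is also an InputStream); the class name and command-line argument here are illustrative only.

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.SeekableFileInput;

public class DetectThenRewind {
    public static void main(String[] args) throws IOException {
        try (SeekableFileInput in = new SeekableFileInput(new File(args[0]))) {
            byte[] magic = new byte[4];
            // peek at the first few bytes (real format detection would inspect them)
            int read = in.read(magic, 0, magic.length);
            // rewind so a format-specific reader can start from offset 0
            in.seek(0);
            System.out.println("peeked " + read + " bytes, now at offset " + in.tell());
        }
    }
}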

Example 7 with SeekableInput

Use of org.apache.avro.file.SeekableInput in project parquet-mr by apache.

Class BaseCommand, method getAvroSchema:

protected Schema getAvroSchema(String source) throws IOException {
    Formats.Format format;
    try (SeekableInput in = openSeekable(source)) {
        format = Formats.detectFormat((InputStream) in);
        // rewind so the format-specific reader starts from the beginning of the file
        in.seek(0);
        switch (format) {
            case PARQUET:
                return Schemas.fromParquet(getConf(), qualifiedURI(source));
            case AVRO:
                return Schemas.fromAvro(open(source));
            case TEXT:
                if (source.endsWith("avsc")) {
                    return Schemas.fromAvsc(open(source));
                } else if (source.endsWith("json")) {
                    return Schemas.fromJSON("json", open(source));
                }
                // fall through: any other text file is reported by the error below
            default:
        }
        throw new IllegalArgumentException(String.format("Could not determine file format of %s.", source));
    }
}
Also used: SeekableFSDataInputStream (org.apache.parquet.cli.util.SeekableFSDataInputStream), InputStream (java.io.InputStream), SeekableInput (org.apache.avro.file.SeekableInput), Formats (org.apache.parquet.cli.util.Formats)
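
Schemas.fromAvsc is a parquet-cli helper; as a rough standalone equivalent of the .avsc branch above, Avro's own Schema.Parser can read a schema definition from a local file (the class name and the local-File assumption are illustrative only):

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;

public class AvscSchema {
    // Schema.Parser understands the JSON schema syntax stored in .avsc files.
    public static Schema fromAvsc(File avscFile) throws IOException {
        return new Schema.Parser().parse(avscFile);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fromAvsc(new File(args[0])).toString(true)); // pretty-printed JSON
    }
}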

Example 8 with SeekableInput

Use of org.apache.avro.file.SeekableInput in project incubator-gobblin by apache.

Class AvroExternalTable, method getSchemaFromAvroDataFile:

private Schema getSchemaFromAvroDataFile() throws IOException {
    String firstDataFilePath = HdfsReader.getFirstDataFilePathInDir(this.dataLocationInHdfs);
    LOG.info("Extracting schema for table " + this.name + " from avro data file " + firstDataFilePath);
    SeekableInput sin = new HdfsReader(firstDataFilePath).getFsInput();
    try (DataFileReader<Void> dfr = new DataFileReader<>(sin, new GenericDatumReader<Void>())) {
        return dfr.getSchema();
    }
}
Also used: DataFileReader (org.apache.avro.file.DataFileReader), Schema (org.apache.avro.Schema), SeekableInput (org.apache.avro.file.SeekableInput)
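
The Gobblin HdfsReader wrapper aside, the core of this example is the FsInput + DataFileReader pairing. A minimal sketch of the same idea, assuming a Hadoop Path and Configuration are already in hand (the class and method names are illustrative):

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class AvroFileSchema {
    public static Schema readSchema(Path path, Configuration conf) throws IOException {
        // FsInput adapts a Hadoop FSDataInputStream to Avro's SeekableInput.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new FsInput(path, conf), new GenericDatumReader<GenericRecord>())) {
            // getSchema() returns the writer schema recorded in the container file header.
            return reader.getSchema();
        }
    }
}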

Example 9 with SeekableInput

Use of org.apache.avro.file.SeekableInput in project crunch by cloudera.

Class AvroRecordReader, method initialize:

@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException, InterruptedException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration conf = context.getConfiguration();
    SeekableInput in = new FsInput(split.getPath(), conf);
    DatumReader<T> datumReader;
    if (conf.getBoolean(AvroJob.INPUT_IS_REFLECT, true)) {
        ReflectDataFactory factory = Avros.getReflectDataFactory(conf);
        datumReader = factory.getReader(schema);
    } else {
        datumReader = new SpecificDatumReader<T>(schema);
    }
    }
    this.reader = DataFileReader.openReader(in, datumReader);
    // move to the first sync marker at or after the start of this split
    reader.sync(split.getStart());
    this.start = reader.tell();
    this.end = split.getStart() + split.getLength();
}
Also used: Configuration (org.apache.hadoop.conf.Configuration), FsInput (org.apache.avro.mapred.FsInput), SeekableInput (org.apache.avro.file.SeekableInput), FileSplit (org.apache.hadoop.mapreduce.lib.input.FileSplit)
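
The sync/tell calls at the end of initialize() are what make the reader split-aware: sync() jumps to the first sync marker at or after the split start, and records are then consumed until the reader moves past the split end. A minimal local-file sketch of that read loop, with illustrative names and with the split modeled as a (start, length) byte range:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SplitReaderSketch {
    public static long countRecordsInRange(File file, long start, long length) throws IOException {
        long count = 0;
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            reader.sync(start);                     // jump to the first sync marker at or after 'start'
            long end = start + length;
            while (!reader.pastSync(end) && reader.hasNext()) {
                reader.next();                      // records up to the first sync past 'end' belong to this range
                count++;
            }
        }
        return count;
    }
}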

Example 10 with SeekableInput

Use of org.apache.avro.file.SeekableInput in project flink by apache.

Class AvroInputFormat, method initReader:

private DataFileReader<E> initReader(FileInputSplit split) throws IOException {
    DatumReader<E> datumReader;
    if (org.apache.avro.generic.GenericRecord.class == avroValueType) {
        datumReader = new GenericDatumReader<E>();
    } else {
        datumReader = org.apache.avro.specific.SpecificRecordBase.class.isAssignableFrom(avroValueType)
            ? new SpecificDatumReader<E>(avroValueType)
            : new ReflectDatumReader<E>(avroValueType);
    }
    if (LOG.isInfoEnabled()) {
        LOG.info("Opening split {}", split);
    }
    SeekableInput in = new FSDataInputStreamWrapper(
        stream, split.getPath().getFileSystem().getFileStatus(split.getPath()).getLen());
    DataFileReader<E> dataFileReader = (DataFileReader) DataFileReader.openReader(in, datumReader);
    if (LOG.isDebugEnabled()) {
        LOG.debug("Loaded SCHEMA: {}", dataFileReader.getSchema());
    }
    end = split.getStart() + split.getLength();
    recordsReadSinceLastSync = 0;
    return dataFileReader;
}
Also used: DataFileReader (org.apache.avro.file.DataFileReader), SeekableInput (org.apache.avro.file.SeekableInput), FSDataInputStreamWrapper (org.apache.flink.formats.avro.utils.FSDataInputStreamWrapper), SpecificDatumReader (org.apache.avro.specific.SpecificDatumReader), ReflectDatumReader (org.apache.avro.reflect.ReflectDatumReader)
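
The branch that picks the DatumReader is independent of Flink; below is a small standalone sketch of the same selection logic, with the avroValueType field modeled as a Class<E> parameter (class and method names are illustrative):

import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.reflect.ReflectDatumReader;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificRecordBase;

public class DatumReaderChooser {
    @SuppressWarnings("unchecked")
    public static <E> DatumReader<E> forType(Class<E> avroValueType) {
        if (GenericRecord.class == avroValueType) {
            // Generic records carry their schema at runtime.
            return (DatumReader<E>) new GenericDatumReader<GenericRecord>();
        }
        return SpecificRecordBase.class.isAssignableFrom(avroValueType)
            ? new SpecificDatumReader<E>(avroValueType)   // Avro-generated classes
            : new ReflectDatumReader<E>(avroValueType);   // plain Java types via reflection
    }
}

SpecificDatumReader binds to generated classes, ReflectDatumReader uses runtime reflection on plain Java types, and GenericDatumReader is only chosen when the value type is exactly GenericRecord, mirroring the original branch.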

Aggregations

SeekableInput (org.apache.avro.file.SeekableInput): 11
DataFileReader (org.apache.avro.file.DataFileReader): 6
GenericRecord (org.apache.avro.generic.GenericRecord): 5
GenericDatumReader (org.apache.avro.generic.GenericDatumReader): 4
FsInput (org.apache.avro.mapred.FsInput): 4
Schema (org.apache.avro.Schema): 3
Configuration (org.apache.hadoop.conf.Configuration): 3
InputStream (java.io.InputStream): 2
ArrayList (java.util.ArrayList): 2
SeekableByteArrayInput (org.apache.avro.file.SeekableByteArrayInput): 2
ReflectDatumReader (org.apache.avro.reflect.ReflectDatumReader): 2
SpecificDatumReader (org.apache.avro.specific.SpecificDatumReader): 2
Utf8 (org.apache.avro.util.Utf8): 2
Path (org.apache.hadoop.fs.Path): 2
Formats (org.apache.parquet.cli.util.Formats): 2
Test (org.junit.Test): 2
AbstractIterator (com.google.common.collect.AbstractIterator): 1
RowVisitor (com.thinkbiganalytics.nifi.thrift.api.RowVisitor): 1
ByteArrayOutputStream (java.io.ByteArrayOutputStream): 1
IOException (java.io.IOException): 1