
Example 6 with SyntheticFileId

use of org.apache.hadoop.hive.ql.io.SyntheticFileId in project hive by apache.

the class LlapCacheMetadataSerializer method decodeFileKey.

/**
 *  If the underlying filesystem supports it, the file key can be a unique file/inode ID represented by a long,
 *  otherwise it is a combination of the path hash, the modification time and the length of the file.
 *
 *  @see org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader#determineFileId
 */
@VisibleForTesting
static Object decodeFileKey(ByteString encodedFileKey) throws IOException {
    byte[] bytes = encodedFileKey.toByteArray();
    DataInput in = new DataInputStream(new ByteArrayInputStream(bytes));
    Object fileKey;
    if (bytes.length == Long.BYTES) {
        fileKey = in.readLong();
    } else {
        SyntheticFileId fileId = new SyntheticFileId();
        fileId.readFields(in);
        fileKey = fileId;
    }
    return fileKey;
}
Also used : DataInput(java.io.DataInput) ByteArrayInputStream(java.io.ByteArrayInputStream) SyntheticFileId(org.apache.hadoop.hive.ql.io.SyntheticFileId) DataInputStream(java.io.DataInputStream) VisibleForTesting(com.google.common.annotations.VisibleForTesting)
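
For orientation, a minimal sketch of what the matching encode side could look like. The name encodeFileKey is illustrative rather than the project's actual method, and the sketch assumes SyntheticFileId implements Hadoop's Writable (its readFields is used above), so a bare long key serializes to exactly Long.BYTES bytes while a synthetic id serializes to more:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import com.google.protobuf.ByteString;
import org.apache.hadoop.hive.ql.io.SyntheticFileId;

// Hypothetical counterpart to decodeFileKey: writes either the raw long or the
// Writable fields of a SyntheticFileId, so the length check on the decode side
// (bytes.length == Long.BYTES) can tell the two encodings apart.
static ByteString encodeFileKey(Object fileKey) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(baos);
    if (fileKey instanceof Long) {
        // exactly Long.BYTES bytes
        out.writeLong((Long) fileKey);
    } else {
        // assumed Writable serialization: path hash, modification time, length
        ((SyntheticFileId) fileKey).write(out);
    }
    out.flush();
    return ByteString.copyFrom(baos.toByteArray());
}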

Example 7 with SyntheticFileId

use of org.apache.hadoop.hive.ql.io.SyntheticFileId in project hive by apache.

the class TestOrcMetadataCache method testGetOrcTailForPathWithFileId.

@Test
public void testGetOrcTailForPathWithFileId() throws Exception {
    DummyMemoryManager mm = new DummyMemoryManager();
    DummyCachePolicy cp = new DummyCachePolicy();
    final int MAX_ALLOC = 64;
    LlapDaemonCacheMetrics metrics = LlapDaemonCacheMetrics.create("", "");
    BuddyAllocator alloc = new BuddyAllocator(false, false, 8, MAX_ALLOC, 1, 4 * 4096, 0, null, mm, metrics, null, true);
    MetadataCache cache = new MetadataCache(alloc, mm, cp, true, metrics);
    Path path = new Path("../data/files/alltypesorc");
    Configuration jobConf = new Configuration();
    Configuration daemonConf = new Configuration();
    CacheTag tag = CacheTag.build("test-table");
    FileSystem fs = FileSystem.get(daemonConf);
    FileStatus fileStatus = fs.getFileStatus(path);
    OrcTail uncached = OrcEncodedDataReader.getOrcTailForPath(fileStatus.getPath(), jobConf, tag, daemonConf, cache, new SyntheticFileId(fileStatus));
    jobConf.set(HiveConf.ConfVars.LLAP_IO_CACHE_ONLY.varname, "true");
    // this should work from the cache, by recalculating the same fileId
    OrcTail cached = OrcEncodedDataReader.getOrcTailForPath(fileStatus.getPath(), jobConf, tag, daemonConf, cache, null);
    assertEquals(uncached.getSerializedTail(), cached.getSerializedTail());
    assertEquals(uncached.getFileTail(), cached.getFileTail());
}
Also used : Path(org.apache.hadoop.fs.Path) FileStatus(org.apache.hadoop.fs.FileStatus) Configuration(org.apache.hadoop.conf.Configuration) SyntheticFileId(org.apache.hadoop.hive.ql.io.SyntheticFileId) MetadataCache(org.apache.hadoop.hive.llap.io.metadata.MetadataCache) LlapDaemonCacheMetrics(org.apache.hadoop.hive.llap.metrics.LlapDaemonCacheMetrics) FileSystem(org.apache.hadoop.fs.FileSystem) CacheTag(org.apache.hadoop.hive.common.io.CacheTag) OrcTail(org.apache.orc.impl.OrcTail) Test(org.junit.Test)
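
The second getOrcTailForPath call passes null as the file key and still hits the cache because the synthetic id can be recomputed from file metadata alone. A hypothetical test sketch of that property, using the SyntheticFileId(FileStatus) and SyntheticFileId(Path, long, long) constructors seen in these examples and assuming value-based equals/hashCode:

import static org.junit.Assert.assertEquals;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.SyntheticFileId;
import org.junit.Test;

// Hypothetical test: an id built from a FileStatus should equal one built from
// the same (path, length, modification time) triplet, assuming SyntheticFileId
// derives its state from exactly those fields.
@Test
public void testSyntheticFileIdIsReproducible() throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("../data/files/alltypesorc");
    FileStatus status = FileSystem.get(conf).getFileStatus(path);
    SyntheticFileId fromStatus = new SyntheticFileId(status);
    SyntheticFileId fromTriplet =
        new SyntheticFileId(status.getPath(), status.getLen(), status.getModificationTime());
    assertEquals(fromStatus, fromTriplet);
    assertEquals(fromStatus.hashCode(), fromTriplet.hashCode());
}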

Example 8 with SyntheticFileId

use of org.apache.hadoop.hive.ql.io.SyntheticFileId in project hive by apache.

the class HiveVectorizedReader method reader.

public static <D> CloseableIterable<D> reader(InputFile inputFile, FileScanTask task, Map<Integer, ?> idToConstant, TaskAttemptContext context) {
    // Tweaks on jobConf here are relevant to this task only, so we need to copy it first, as the context's conf is reused.
    JobConf job = new JobConf((JobConf) context.getConfiguration());
    Path path = new Path(inputFile.location());
    FileFormat format = task.file().format();
    Reporter reporter = ((MapredIcebergInputFormat.CompatibilityTaskAttemptContextImpl) context).getLegacyReporter();
    // Hive by default requires partition columns to be read too. This is not required for identity partition
    // columns, as we will add this as constants later.
    int[] partitionColIndices = null;
    Object[] partitionValues = null;
    PartitionSpec partitionSpec = task.spec();
    List<Integer> readColumnIds = ColumnProjectionUtils.getReadColumnIDs(job);
    if (!partitionSpec.isUnpartitioned()) {
        List<PartitionField> fields = partitionSpec.fields();
        List<Integer> partitionColIndicesList = Lists.newLinkedList();
        List<Object> partitionValuesList = Lists.newLinkedList();
        for (PartitionField partitionField : fields) {
            if (partitionField.transform().isIdentity()) {
                // Get columns in read schema order (which matches those of readColumnIds) to find partition column indices
                List<Types.NestedField> columns = task.spec().schema().columns();
                for (int colIdx = 0; colIdx < columns.size(); ++colIdx) {
                    if (columns.get(colIdx).fieldId() == partitionField.sourceId()) {
                        // Skip reading identity partition columns from source file...
                        readColumnIds.remove((Integer) colIdx);
                        // ...and use the corresponding constant value instead
                        partitionColIndicesList.add(colIdx);
                        partitionValuesList.add(idToConstant.get(partitionField.sourceId()));
                        break;
                    }
                }
            }
        }
        partitionColIndices = ArrayUtils.toPrimitive(partitionColIndicesList.toArray(new Integer[0]));
        partitionValues = partitionValuesList.toArray(new Object[0]);
        ColumnProjectionUtils.setReadColumns(job, readColumnIds);
    }
    try {
        long start = task.start();
        long length = task.length();
        // TODO: Iceberg currently does not track the last modification time of a file. Until that's added,
        // we need to set Long.MIN_VALUE as last modification time in the fileId triplet.
        SyntheticFileId fileId = new SyntheticFileId(path, task.file().fileSizeInBytes(), Long.MIN_VALUE);
        RecordReader<NullWritable, VectorizedRowBatch> recordReader = null;
        switch (format) {
            case ORC:
                recordReader = orcRecordReader(job, reporter, task, inputFile, path, start, length, readColumnIds, fileId);
                break;
            case PARQUET:
                recordReader = parquetRecordReader(job, reporter, task, path, start, length);
                break;
            default:
                throw new UnsupportedOperationException("Vectorized Hive reading unimplemented for format: " + format);
        }
        return createVectorizedRowBatchIterable(recordReader, job, partitionColIndices, partitionValues);
    } catch (IOException ioe) {
        throw new RuntimeException("Error creating vectorized record reader for " + inputFile, ioe);
    }
}
Also used : SyntheticFileId(org.apache.hadoop.hive.ql.io.SyntheticFileId) FileFormat(org.apache.iceberg.FileFormat) VectorizedRowBatch(org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch) PartitionField(org.apache.iceberg.PartitionField) JobConf(org.apache.hadoop.mapred.JobConf) Path(org.apache.hadoop.fs.Path) Reporter(org.apache.hadoop.mapred.Reporter) IOException(java.io.IOException) PartitionSpec(org.apache.iceberg.PartitionSpec) NullWritable(org.apache.hadoop.io.NullWritable)
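
For completeness, a minimal sketch of how a RecordReader<NullWritable, VectorizedRowBatch> like the ones built above is typically drained with the classic mapred API; the drain method name and the row-processing comment are illustrative only:

import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.RecordReader;

// Drains a vectorized record reader batch by batch. Each call to next() refills
// the same VectorizedRowBatch instance; batch.size holds the number of valid
// rows in the current batch.
static void drain(RecordReader<NullWritable, VectorizedRowBatch> reader) throws IOException {
    NullWritable key = reader.createKey();
    VectorizedRowBatch batch = reader.createValue();
    try {
        while (reader.next(key, batch)) {
            // process batch.cols[...] for the first batch.size rows here
        }
    } finally {
        reader.close();
    }
}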

Aggregations

SyntheticFileId (org.apache.hadoop.hive.ql.io.SyntheticFileId) 8
Path (org.apache.hadoop.fs.Path) 6
IOException (java.io.IOException) 3
Test (org.junit.Test) 3
VisibleForTesting (com.google.common.annotations.VisibleForTesting) 2
ByteString (com.google.protobuf.ByteString) 2
Configuration (org.apache.hadoop.conf.Configuration) 2
CacheTag (org.apache.hadoop.hive.common.io.CacheTag) 2
MetadataCache (org.apache.hadoop.hive.llap.io.metadata.MetadataCache) 2
LlapDaemonCacheMetrics (org.apache.hadoop.hive.llap.metrics.LlapDaemonCacheMetrics) 2
OrcTail (org.apache.orc.impl.OrcTail) 2
ByteArrayInputStream (java.io.ByteArrayInputStream) 1
ByteArrayOutputStream (java.io.ByteArrayOutputStream) 1
DataInput (java.io.DataInput) 1
DataInputStream (java.io.DataInputStream) 1
DataOutputStream (java.io.DataOutputStream) 1
FileStatus (org.apache.hadoop.fs.FileStatus) 1
FileSystem (org.apache.hadoop.fs.FileSystem) 1
IllegalCacheConfigurationException (org.apache.hadoop.hive.llap.IllegalCacheConfigurationException) 1
LlapDaemonProtocolProtos (org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos) 1