use of com.facebook.presto.hive.util.InternalHiveSplitFactory in project presto by prestodb.
In class ManifestPartitionLoader, method createInternalHiveSplitFactory:
private InternalHiveSplitFactory createInternalHiveSplitFactory(
        Table table,
        HivePartitionMetadata partition,
        ConnectorSession session,
        Optional<Domain> pathDomain,
        HdfsEnvironment hdfsEnvironment,
        HdfsContext hdfsContext,
        boolean schedulerUsesHostAddresses)
        throws IOException
{
    String partitionName = partition.getHivePartition().getPartitionId();
    Storage storage = partition.getPartition().map(Partition::getStorage).orElse(table.getStorage());
    String inputFormatName = storage.getStorageFormat().getInputFormat();
    int partitionDataColumnCount = partition.getPartition()
            .map(p -> p.getColumns().size())
            .orElse(table.getDataColumns().size());
    List<HivePartitionKey> partitionKeys = getPartitionKeys(table, partition.getPartition(), partitionName);
    Path path = new Path(getPartitionLocation(table, partition.getPartition()));
    Configuration configuration = hdfsEnvironment.getConfiguration(hdfsContext, path);
    InputFormat<?, ?> inputFormat = getInputFormat(configuration, inputFormatName, false);
    ExtendedFileSystem fileSystem = hdfsEnvironment.getFileSystem(hdfsContext, path);

    return new InternalHiveSplitFactory(
            fileSystem,
            inputFormat,
            pathDomain,
            getNodeSelectionStrategy(session),
            getMaxInitialSplitSize(session),
            false,
            new HiveSplitPartitionInfo(
                    storage,
                    path.toUri(),
                    partitionKeys,
                    partitionName,
                    partitionDataColumnCount,
                    partition.getTableToPartitionMapping(),
                    Optional.empty(),
                    partition.getRedundantColumnDomains()),
            schedulerUsesHostAddresses,
            partition.getEncryptionInformation());
}
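Note: the factory returned here is applied file by file. A condensed usage sketch (listedFiles is a hypothetical stand-in for the caller's file listing, not part of the snippet above; the real call site appears in ManifestPartitionLoader.loadPartition at the end of this page):

InternalHiveSplitFactory splitFactory = createInternalHiveSplitFactory(table, partition, session, pathDomain, hdfsEnvironment, hdfsContext, schedulerUsesHostAddresses);
// "listedFiles" is an assumed List<HiveFileInfo>; each file yields at most one split
List<InternalHiveSplit> splits = listedFiles.stream()
        .map(fileInfo -> splitFactory.createInternalHiveSplit(fileInfo, true))
        .filter(Optional::isPresent)
        .map(Optional::get)
        .collect(toImmutableList());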
use of com.facebook.presto.hive.util.InternalHiveSplitFactory in project presto by prestodb.
In class StoragePartitionLoader, method getBucketedSplits:
private List<InternalHiveSplit> getBucketedSplits(
        Path path,
        ExtendedFileSystem fileSystem,
        InternalHiveSplitFactory splitFactory,
        BucketSplitInfo bucketSplitInfo,
        Optional<HiveSplit.BucketConversion> bucketConversion,
        String partitionName,
        boolean splittable,
        PathFilter pathFilter)
{
    int readBucketCount = bucketSplitInfo.getReadBucketCount();
    int tableBucketCount = bucketSplitInfo.getTableBucketCount();
    int partitionBucketCount = bucketConversion.map(HiveSplit.BucketConversion::getPartitionBucketCount).orElse(tableBucketCount);
    int bucketCount = max(readBucketCount, partitionBucketCount);
    checkState(readBucketCount <= tableBucketCount, "readBucketCount(%s) should be less than or equal to tableBucketCount(%s)", readBucketCount, tableBucketCount);

    // list all files in the partition
    List<HiveFileInfo> fileInfos = new ArrayList<>(partitionBucketCount);
    try {
        Iterators.addAll(fileInfos, directoryLister.list(fileSystem, table, path, namenodeStats, pathFilter, new HiveDirectoryContext(FAIL, isUseListDirectoryCache(session))));
    }
    catch (HiveFileIterator.NestedDirectoryNotAllowedException e) {
        // Fail here to be on the safe side. This seems to be the same as what Hive does
        throw new PrestoException(HIVE_INVALID_BUCKET_FILES, format("Hive table '%s' is corrupt. Found sub-directory in bucket directory for partition: %s", table.getSchemaTableName(), partitionName));
    }
    ListMultimap<Integer, HiveFileInfo> bucketToFileInfo = ArrayListMultimap.create();
    if (!shouldCreateFilesForMissingBuckets(table, session)) {
        fileInfos.stream().forEach(fileInfo -> {
            String fileName = fileInfo.getPath().getName();
            OptionalInt bucket = getBucketNumber(fileName);
            if (bucket.isPresent()) {
                bucketToFileInfo.put(bucket.getAsInt(), fileInfo);
            }
            else {
                throw new PrestoException(HIVE_INVALID_BUCKET_FILES, format("invalid hive bucket file name: %s", fileName));
            }
        });
    }
    else {
        // build mapping of file name to bucket
        for (HiveFileInfo file : fileInfos) {
            String fileName = file.getPath().getName();
            OptionalInt bucket = getBucketNumber(fileName);
            if (bucket.isPresent()) {
                bucketToFileInfo.put(bucket.getAsInt(), file);
                continue;
            }
            // legacy mode requires exactly one file per bucket
            if (fileInfos.size() != partitionBucketCount) {
                throw new PrestoException(HIVE_INVALID_BUCKET_FILES, format(
                        "Hive table '%s' is corrupt. File '%s' does not match the standard naming pattern, and the number " +
                        "of files in the directory (%s) does not match the declared bucket count (%s) for partition: %s",
                        table.getSchemaTableName(), fileName, fileInfos.size(), partitionBucketCount, partitionName));
            }
            if (fileInfos.get(0).getPath().getName().matches("\\d+")) {
                try {
                    // File names are integers if they were created with file_renaming_enabled set to true
                    fileInfos.sort(Comparator.comparingInt(fileInfo -> Integer.parseInt(fileInfo.getPath().getName())));
                }
                catch (NumberFormatException e) {
                    throw new PrestoException(HIVE_INVALID_FILE_NAMES, format("Hive table '%s' is corrupt. Some of the filenames in the partition: %s are not integers", new SchemaTableName(table.getDatabaseName(), table.getTableName()), partitionName));
                }
            }
            else {
                // Sort FileStatus objects (instead of, e.g., fileStatus.getPath().toString). This matches org.apache.hadoop.hive.ql.metadata.Table.getSortedPaths
                fileInfos.sort(null);
            }
            // Use position in sorted list as the bucket number
            bucketToFileInfo.clear();
            for (int i = 0; i < fileInfos.size(); i++) {
                bucketToFileInfo.put(i, fileInfos.get(i));
            }
            break;
        }
    }
    // convert files to internal splits
    List<InternalHiveSplit> splitList = new ArrayList<>();
    for (int bucketNumber = 0; bucketNumber < bucketCount; bucketNumber++) {
        // Physical bucket #. This determines the file name. It also determines the order of splits in the result.
        int partitionBucketNumber = bucketNumber % partitionBucketCount;
        if (!bucketToFileInfo.containsKey(partitionBucketNumber)) {
            continue;
        }
        // Logical bucket #. Each logical bucket corresponds to a "bucket" from the engine's perspective.
        int readBucketNumber = bucketNumber % readBucketCount;
        boolean containsIneligibleTableBucket = false;
        List<Integer> eligibleTableBucketNumbers = new ArrayList<>();
        for (int tableBucketNumber = bucketNumber % tableBucketCount; tableBucketNumber < tableBucketCount; tableBucketNumber += bucketCount) {
            // table bucket number: this is used for evaluating "$bucket" filters.
            if (bucketSplitInfo.isTableBucketEnabled(tableBucketNumber)) {
                eligibleTableBucketNumbers.add(tableBucketNumber);
            }
            else {
                containsIneligibleTableBucket = true;
            }
        }
        if (!eligibleTableBucketNumbers.isEmpty() && containsIneligibleTableBucket) {
            throw new PrestoException(NOT_SUPPORTED,
                    "The bucket filter cannot be satisfied. There are restrictions on the bucket filter when all the following is true: " +
                    "1. a table has a different buckets count as at least one of its partitions that is read in this query; " +
                    "2. the table has a different but compatible bucket number with another table in the query; " +
                    "3. some buckets of the table is filtered out from the query, most likely using a filter on \"$bucket\". " +
                    "(table name: " + table.getTableName() + ", table bucket count: " + tableBucketCount + ", " +
                    "partition bucket count: " + partitionBucketCount + ", effective reading bucket count: " + readBucketCount + ")");
        }
        if (!eligibleTableBucketNumbers.isEmpty()) {
            for (HiveFileInfo fileInfo : bucketToFileInfo.get(partitionBucketNumber)) {
                eligibleTableBucketNumbers.stream()
                        .map(tableBucketNumber -> splitFactory.createInternalHiveSplit(fileInfo, readBucketNumber, tableBucketNumber, splittable))
                        .forEach(optionalSplit -> optionalSplit.ifPresent(splitList::add));
            }
        }
    }
    return splitList;
}
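Note: the method above leans on getBucketNumber, which is not shown. A minimal sketch of such a parser, assuming the standard Hive bucket-file naming scheme (e.g. "000000_0"); Presto's actual implementation handles more naming variants:

import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BucketFileNames
{
    // Standard Hive bucket files begin with a zero-padded bucket id, e.g. "000000_0" or "000001_0_copy_1".
    private static final Pattern BUCKET_FILE = Pattern.compile("(\\d+)_\\d+.*");

    private BucketFileNames() {}

    static OptionalInt getBucketNumber(String fileName)
    {
        Matcher matcher = BUCKET_FILE.matcher(fileName);
        if (matcher.matches()) {
            return OptionalInt.of(Integer.parseInt(matcher.group(1)));
        }
        return OptionalInt.empty();
    }
}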
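Note: the interaction of readBucketCount, tableBucketCount, and partitionBucketCount is easiest to see with concrete numbers. A self-contained demo of the index arithmetic above (the counts are illustrative assumptions, not values from the snippet):

public final class BucketArithmeticDemo
{
    public static void main(String[] args)
    {
        int tableBucketCount = 8;     // buckets declared on the table
        int partitionBucketCount = 4; // buckets this partition was written with
        int readBucketCount = 4;      // logical buckets the engine reads
        int bucketCount = Math.max(readBucketCount, partitionBucketCount); // 4

        for (int bucketNumber = 0; bucketNumber < bucketCount; bucketNumber++) {
            int partitionBucketNumber = bucketNumber % partitionBucketCount; // which file(s) to read
            int readBucketNumber = bucketNumber % readBucketCount;           // logical bucket they feed
            StringBuilder tableBuckets = new StringBuilder();                // targets of "$bucket" filters
            for (int t = bucketNumber % tableBucketCount; t < tableBucketCount; t += bucketCount) {
                tableBuckets.append(t).append(' ');
            }
            System.out.printf("partition bucket %d -> logical bucket %d -> table buckets: %s%n", partitionBucketNumber, readBucketNumber, tableBuckets.toString().trim());
        }
    }
}

With these numbers, each partition bucket serves two table buckets (for example, partition bucket 1 covers table buckets 1 and 5), which is why the eligibility loop above must accept either all or none of a file's table buckets.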
use of com.facebook.presto.hive.util.InternalHiveSplitFactory in project presto by prestodb.
In class StoragePartitionLoader, method loadPartition:
@Override
public ListenableFuture<?> loadPartition(HivePartitionMetadata partition, HiveSplitSource hiveSplitSource, boolean stopped)
        throws IOException
{
    String partitionName = partition.getHivePartition().getPartitionId();
    Storage storage = partition.getPartition().map(Partition::getStorage).orElse(table.getStorage());
    Properties schema = getPartitionSchema(table, partition.getPartition());
    String inputFormatName = storage.getStorageFormat().getInputFormat();
    int partitionDataColumnCount = partition.getPartition()
            .map(p -> p.getColumns().size())
            .orElse(table.getDataColumns().size());
    List<HivePartitionKey> partitionKeys = getPartitionKeys(table, partition.getPartition(), partitionName);
    String location = getPartitionLocation(table, partition.getPartition());
    if (location.isEmpty()) {
        checkState(!shouldCreateFilesForMissingBuckets(table, session), "Empty location is only allowed for empty temporary table when zero-row file is not created");
        return COMPLETED_FUTURE;
    }
    Path path = new Path(location);
    Configuration configuration = hdfsEnvironment.getConfiguration(hdfsContext, path);
    InputFormat<?, ?> inputFormat = getInputFormat(configuration, inputFormatName, false);
    ExtendedFileSystem fs = hdfsEnvironment.getFileSystem(hdfsContext, path);
    boolean s3SelectPushdownEnabled = shouldEnablePushdownForTable(session, table, path.toString(), partition.getPartition());
    if (inputFormat instanceof SymlinkTextInputFormat) {
        if (tableBucketInfo.isPresent()) {
            throw new PrestoException(NOT_SUPPORTED, "Bucketed table in SymlinkTextInputFormat is not yet supported");
        }
        // TODO: This should use an iterator like the HiveFileIterator
        ListenableFuture<?> lastResult = COMPLETED_FUTURE;
        for (Path targetPath : getTargetPathsFromSymlink(fs, path)) {
            // The input should be in TextInputFormat.
            TextInputFormat targetInputFormat = new TextInputFormat();
            // the splits must be generated using the file system for the target path
            // get the configuration for the target path -- it may be a different hdfs instance
            ExtendedFileSystem targetFilesystem = hdfsEnvironment.getFileSystem(hdfsContext, targetPath);
            JobConf targetJob = toJobConf(targetFilesystem.getConf());
            targetJob.setInputFormat(TextInputFormat.class);
            targetInputFormat.configure(targetJob);
            FileInputFormat.setInputPaths(targetJob, targetPath);
            InputSplit[] targetSplits = targetInputFormat.getSplits(targetJob, 0);
            InternalHiveSplitFactory splitFactory = new InternalHiveSplitFactory(
                    targetFilesystem,
                    inputFormat,
                    pathDomain,
                    getNodeSelectionStrategy(session),
                    getMaxInitialSplitSize(session),
                    s3SelectPushdownEnabled,
                    new HiveSplitPartitionInfo(
                            storage,
                            path.toUri(),
                            partitionKeys,
                            partitionName,
                            partitionDataColumnCount,
                            partition.getTableToPartitionMapping(),
                            Optional.empty(),
                            partition.getRedundantColumnDomains()),
                    schedulerUsesHostAddresses,
                    partition.getEncryptionInformation());
            lastResult = addSplitsToSource(targetSplits, splitFactory, hiveSplitSource, stopped);
            if (stopped) {
                return COMPLETED_FUTURE;
            }
        }
        return lastResult;
    }
    Optional<HiveSplit.BucketConversion> bucketConversion = Optional.empty();
    boolean bucketConversionRequiresWorkerParticipation = false;
    if (partition.getPartition().isPresent()) {
        Optional<HiveBucketProperty> partitionBucketProperty = partition.getPartition().get().getStorage().getBucketProperty();
        if (tableBucketInfo.isPresent() && partitionBucketProperty.isPresent()) {
            int tableBucketCount = tableBucketInfo.get().getTableBucketCount();
            int partitionBucketCount = partitionBucketProperty.get().getBucketCount();
            // Here, it is just checking whether a BucketConversion is needed.
            if (tableBucketCount != partitionBucketCount) {
                bucketConversion = Optional.of(new HiveSplit.BucketConversion(tableBucketCount, partitionBucketCount, tableBucketInfo.get().getBucketColumns()));
                if (tableBucketCount > partitionBucketCount) {
                    bucketConversionRequiresWorkerParticipation = true;
                }
            }
        }
    }
    InternalHiveSplitFactory splitFactory = new InternalHiveSplitFactory(
            fs,
            inputFormat,
            pathDomain,
            getNodeSelectionStrategy(session),
            getMaxInitialSplitSize(session),
            s3SelectPushdownEnabled,
            new HiveSplitPartitionInfo(
                    storage,
                    path.toUri(),
                    partitionKeys,
                    partitionName,
                    partitionDataColumnCount,
                    partition.getTableToPartitionMapping(),
                    bucketConversionRequiresWorkerParticipation ? bucketConversion : Optional.empty(),
                    partition.getRedundantColumnDomains()),
            schedulerUsesHostAddresses,
            partition.getEncryptionInformation());

    if (shouldUseFileSplitsFromInputFormat(inputFormat, configuration, table.getStorage().getLocation())) {
        if (tableBucketInfo.isPresent()) {
            throw new PrestoException(NOT_SUPPORTED, "Presto cannot read bucketed partition in an input format with UseFileSplitsFromInputFormat annotation: " + inputFormat.getClass().getSimpleName());
        }
        JobConf jobConf = toJobConf(configuration);
        FileInputFormat.setInputPaths(jobConf, path);
        // pass SerDe parameters and table parameters to the input format
        fromProperties(schema).forEach(jobConf::set);
        InputSplit[] splits = inputFormat.getSplits(jobConf, 0);
        return addSplitsToSource(splits, splitFactory, hiveSplitSource, stopped);
    }

    PathFilter pathFilter = isHudiParquetInputFormat(inputFormat) ? hoodiePathFilterLoadingCache.getUnchecked(configuration) : path1 -> true;

    // Streaming aggregation, S3 Select pushdown, and partial aggregation pushdown all work at the
    // granularity of individual files (or S3 objects), so files must not be split when any of them
    // is enabled. Files with skipped header/footer lines are not splittable either, except for the
    // special case skip.header.line.count=1.
    boolean splittable = isFileSplittable(session) &&
            !isStreamingAggregationEnabled(session) &&
            !s3SelectPushdownEnabled &&
            !partialAggregationsPushedDown &&
            getFooterCount(schema) == 0 &&
            getHeaderCount(schema) <= 1;

    // Bucketed partitions are fully loaded immediately since all files must be loaded to determine the file-to-bucket mapping
    if (tableBucketInfo.isPresent()) {
        if (tableBucketInfo.get().isVirtuallyBucketed()) {
            // For a virtually bucketed table, bucket conversion must not be present because there is no physical partition bucket count
            checkState(!bucketConversion.isPresent(), "Virtually bucketed table must not have partitions that are physically bucketed");
            checkState(tableBucketInfo.get().getTableBucketCount() == tableBucketInfo.get().getReadBucketCount(), "Table and read bucket count should be the same for virtual bucket");
            return hiveSplitSource.addToQueue(getVirtuallyBucketedSplits(path, fs, splitFactory, tableBucketInfo.get().getReadBucketCount(), splittable, pathFilter));
        }
        return hiveSplitSource.addToQueue(getBucketedSplits(path, fs, splitFactory, tableBucketInfo.get(), bucketConversion, partitionName, splittable, pathFilter));
    }

    fileIterators.addLast(createInternalHiveSplitIterator(path, fs, splitFactory, splittable, pathFilter, partition.getPartition()));
    return COMPLETED_FUTURE;
}
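Note: getTargetPathsFromSymlink is elided above. A minimal sketch of what such a helper could look like, assuming the usual SymlinkTextInputFormat layout in which every manifest file under the symlink directory lists one target path per line (a hypothetical helper, not Presto's actual implementation):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class SymlinkTargets
{
    private SymlinkTargets() {}

    static List<Path> getTargetPathsFromSymlink(FileSystem fileSystem, Path symlinkDir)
            throws IOException
    {
        List<Path> targets = new ArrayList<>();
        for (FileStatus manifest : fileSystem.listStatus(symlinkDir)) {
            // each symlink manifest file holds one target path per line
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fileSystem.open(manifest.getPath()), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (!line.isEmpty()) {
                        targets.add(new Path(line));
                    }
                }
            }
        }
        return targets;
    }
}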
use of com.facebook.presto.hive.util.InternalHiveSplitFactory in project presto by prestodb.
In class ManifestPartitionLoader, method loadPartition:
public ListenableFuture<?> loadPartition(HivePartitionMetadata partition, HiveSplitSource hiveSplitSource, boolean stopped)
        throws IOException
{
    Path path = new Path(getPartitionLocation(table, partition.getPartition()));
    Map<String, String> parameters = partition.getPartition().get().getParameters();

    // TODO: Add support for more manifest versions
    // Verify the manifest version
    verify(VERSION_1.equals(parameters.get(MANIFEST_VERSION)), "Manifest version is not equal to %s", VERSION_1);

    List<String> fileNames = decompressFileNames(parameters.get(FILE_NAMES));
    List<Long> fileSizes = decompressFileSizes(parameters.get(FILE_SIZES));

    // Verify that fileNames and fileSizes have the same length
    verify(fileNames.size() == fileSizes.size(), "List of fileNames and fileSizes differ in length");

    if (isManifestVerificationEnabled(session)) {
        // Verify that the file names and sizes in the manifest match those listed by the directory lister
        validateManifest(partition, path, fileNames, fileSizes);
    }

    ImmutableList.Builder<HiveFileInfo> fileListBuilder = ImmutableList.builder();
    for (int i = 0; i < fileNames.size(); i++) {
        Path filePath = new Path(path, fileNames.get(i));
        FileStatus fileStatus = new FileStatus(fileSizes.get(i), false, 1, getMaxSplitSize(session).toBytes(), 0, filePath);
        try {
            BlockLocation[] locations = new BlockLocation[] {new BlockLocation(BLOCK_LOCATION_NAMES, BLOCK_LOCATION_HOSTS, 0, fileSizes.get(i))};
            // It is safe to set extraFileContext to empty because downstream code always checks whether it is present before proceeding.
            fileListBuilder.add(HiveFileInfo.createHiveFileInfo(new LocatedFileStatus(fileStatus, locations), Optional.empty()));
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    InternalHiveSplitFactory splitFactory = createInternalHiveSplitFactory(table, partition, session, pathDomain, hdfsEnvironment, hdfsContext, schedulerUsesHostAddresses);
    return hiveSplitSource.addToQueue(fileListBuilder.build().stream()
            .map(status -> splitFactory.createInternalHiveSplit(status, true))
            .filter(Optional::isPresent)
            .map(Optional::get)
            .collect(toImmutableList()));
}
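Note: validateManifest is elided above. A minimal sketch of the check it implies, assuming the caller can obtain the directory listing as a name-to-size map (names and signature here are hypothetical, not Presto's actual implementation):

import java.util.List;
import java.util.Map;

final class ManifestValidation
{
    private ManifestValidation() {}

    // Hypothetical check: every manifest entry must match a listed file's name and size.
    static void validateManifest(Map<String, Long> listedFileSizesByName, List<String> fileNames, List<Long> fileSizes)
    {
        if (listedFileSizesByName.size() != fileNames.size()) {
            throw new IllegalStateException("Manifest file count does not match the directory listing");
        }
        for (int i = 0; i < fileNames.size(); i++) {
            Long listedSize = listedFileSizesByName.get(fileNames.get(i));
            if (listedSize == null || listedSize != fileSizes.get(i).longValue()) {
                throw new IllegalStateException("Manifest entry does not match the directory listing: " + fileNames.get(i));
            }
        }
    }
}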