
Example 1 with RecordScannable

Use of co.cask.cdap.api.data.batch.RecordScannable in project cdap by caskdata.

From the class DatasetInputFormat, the method getSplits:

@Override
public InputSplit[] getSplits(JobConf jobConf, int numSplits) throws IOException {
    try (DatasetAccessor datasetAccessor = new DatasetAccessor(jobConf)) {
        try {
            datasetAccessor.initialize();
        } catch (Exception e) {
            throw new IOException("Could not get dataset", e);
        }
        try (RecordScannable recordScannable = datasetAccessor.getDataset()) {
            Job job = new Job(jobConf);
            JobContext jobContext = ShimLoader.getHadoopShims().newJobContext(job);
            Path[] tablePaths = FileInputFormat.getInputPaths(jobContext);
            List<Split> dsSplits = recordScannable.getSplits();
            InputSplit[] inputSplits = new InputSplit[dsSplits.size()];
            for (int i = 0; i < dsSplits.size(); i++) {
                inputSplits[i] = new DatasetInputSplit(dsSplits.get(i), tablePaths[0]);
            }
            return inputSplits;
        }
    }
}
Also used: Path(org.apache.hadoop.fs.Path) IOException(java.io.IOException) RecordScannable(co.cask.cdap.api.data.batch.RecordScannable) JobContext(org.apache.hadoop.mapreduce.JobContext) Job(org.apache.hadoop.mapreduce.Job) Split(co.cask.cdap.api.data.batch.Split) FileSplit(org.apache.hadoop.mapred.FileSplit) InputSplit(org.apache.hadoop.mapred.InputSplit)
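
getSplits covers only the planning half of the InputFormat contract; at read time the same RecordScannable supplies the records for each split through createSplitRecordScanner. The sketch below is not the actual CDAP reader code; it only shows how a split can be driven this way, and the RecordScanner method names initialize, nextRecord, getCurrentRecord and close are assumed from the CDAP batch API.

import co.cask.cdap.api.data.batch.RecordScannable;
import co.cask.cdap.api.data.batch.RecordScanner;
import co.cask.cdap.api.data.batch.Split;

import java.io.IOException;

// Minimal sketch, not part of CDAP: iterates the records of one split of a RecordScannable dataset.
public final class SplitScanSketch {

    private SplitScanSketch() {
    }

    static <RECORD> long countRecords(RecordScannable<RECORD> scannable, Split split)
        throws IOException, InterruptedException {
        // obtain a scanner for this split and drive it the way a record reader would
        RecordScanner<RECORD> scanner = scannable.createSplitRecordScanner(split);
        try {
            scanner.initialize(split);
            long count = 0;
            while (scanner.nextRecord()) {
                // a real reader would hand scanner.getCurrentRecord() to its consumer
                // (for example the Hive SerDe); here we only count the records
                count++;
            }
            return count;
        } finally {
            scanner.close();
        }
    }
}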

Example 2 with RecordScannable

Use of co.cask.cdap.api.data.batch.RecordScannable in project cdap by caskdata.

From the class DatasetSerDe, the method getDatasetSchema:

private void getDatasetSchema(Configuration conf, DatasetId datasetId) throws SerDeException {
    try (ContextManager.Context hiveContext = ContextManager.getContext(conf)) {
        // Hive sometimes calls initialize with a null conf just to get the object inspector;
        // in that case there is no context and the dataset schema cannot be looked up
        if (hiveContext == null) {
            LOG.info("Hive provided a null conf, will not be able to get dataset schema.");
            return;
        }
        // some datasets like Table and ObjectMappedTable have schema in the dataset properties
        try {
            DatasetSpecification datasetSpec = hiveContext.getDatasetSpec(datasetId);
            String schemaStr = datasetSpec.getProperty("schema");
            if (schemaStr != null) {
                schema = Schema.parseJson(schemaStr);
                return;
            }
        } catch (DatasetManagementException | ServiceUnavailableException e) {
            throw new SerDeException("Could not instantiate dataset " + datasetId, e);
        } catch (IOException e) {
            throw new SerDeException("Exception getting schema for dataset " + datasetId, e);
        }
        // other datasets must be instantiated to get their schema
        // conf is null if this is a query that writes to a dataset
        ClassLoader parentClassLoader = conf == null ? null : conf.getClassLoader();
        try (SystemDatasetInstantiator datasetInstantiator = hiveContext.createDatasetInstantiator(parentClassLoader)) {
            Dataset dataset = datasetInstantiator.getDataset(datasetId);
            if (dataset == null) {
                throw new SerDeException("Could not find dataset " + datasetId);
            }
            Type recordType;
            if (dataset instanceof RecordScannable) {
                recordType = ((RecordScannable) dataset).getRecordType();
            } else if (dataset instanceof RecordWritable) {
                recordType = ((RecordWritable) dataset).getRecordType();
            } else {
                throw new SerDeException("Dataset " + datasetId + " is not explorable.");
            }
            schema = schemaGenerator.generate(recordType);
        } catch (UnsupportedTypeException e) {
            throw new SerDeException("Dataset " + datasetId + " has an unsupported schema.", e);
        } catch (IOException e) {
            throw new SerDeException("Exception while trying to instantiate dataset " + datasetId, e);
        }
    } catch (IOException e) {
        throw new SerDeException("Could not get hive context from configuration.", e);
    }
}
Also used: RecordWritable(co.cask.cdap.api.data.batch.RecordWritable) Dataset(co.cask.cdap.api.dataset.Dataset) DatasetSpecification(co.cask.cdap.api.dataset.DatasetSpecification) ServiceUnavailableException(co.cask.cdap.common.ServiceUnavailableException) IOException(java.io.IOException) RecordScannable(co.cask.cdap.api.data.batch.RecordScannable) DatasetManagementException(co.cask.cdap.api.dataset.DatasetManagementException) Type(java.lang.reflect.Type) SystemDatasetInstantiator(co.cask.cdap.data.dataset.SystemDatasetInstantiator) ContextManager(co.cask.cdap.hive.context.ContextManager) UnsupportedTypeException(co.cask.cdap.api.data.schema.UnsupportedTypeException) SerDeException(org.apache.hadoop.hive.serde2.SerDeException)
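
The instanceof checks on RecordScannable and RecordWritable are the recurring test for whether a dataset is explorable at all; the record type they expose is what drives schema generation. A hypothetical helper (not part of CDAP) that isolates just that dispatch:

import co.cask.cdap.api.data.batch.RecordScannable;
import co.cask.cdap.api.data.batch.RecordWritable;
import co.cask.cdap.api.dataset.Dataset;

import java.lang.reflect.Type;
import javax.annotation.Nullable;

// Hypothetical helper, not part of CDAP: mirrors the dispatch used by DatasetSerDe and ExploreTableManager.
public final class RecordTypes {

    private RecordTypes() {
    }

    @Nullable
    static Type recordTypeOf(Dataset dataset) {
        if (dataset instanceof RecordScannable) {
            return ((RecordScannable<?>) dataset).getRecordType();
        }
        if (dataset instanceof RecordWritable) {
            return ((RecordWritable<?>) dataset).getRecordType();
        }
        // neither record-scannable nor record-writable: the dataset is not explorable
        return null;
    }
}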

Example 3 with RecordScannable

Use of co.cask.cdap.api.data.batch.RecordScannable in project cdap by caskdata.

From the class ExploreTableManager, the method generateEnableStatement:

/**
 * Generate a Hive DDL statement to create a Hive table for the given dataset.
 *
 * @param dataset the instantiated dataset
 * @param spec the dataset specification
 * @param datasetId the dataset id
 * @param tableName the name of the Hive table to create
 * @param truncating whether this call to create() is part of a truncate() operation, which is in some
 *                   cases implemented using disableExplore() followed by enableExplore()
 *
 * @return a CREATE TABLE statement, or null if the dataset is not explorable
 * @throws UnsupportedTypeException if the dataset is a RecordScannable of a type that is not supported by Hive
 */
@Nullable
private String generateEnableStatement(Dataset dataset, DatasetSpecification spec, DatasetId datasetId, String tableName, boolean truncating) throws UnsupportedTypeException, ExploreException {
    String datasetName = datasetId.getDataset();
    Map<String, String> serdeProperties = ImmutableMap.of(Constants.Explore.DATASET_NAME, datasetId.getDataset(), Constants.Explore.DATASET_NAMESPACE, datasetId.getNamespace());
    // to be explorable, a dataset must be a Table, an ObjectMappedTable, a RecordScannable or a RecordWritable,
    // or it must be a FileSet or a PartitionedFileSet with explore enabled in its properties.
    if (dataset instanceof Table) {
        // valid for a table not to have a schema property. this logic should really be in Table
        return generateCreateStatementFromSchemaProperty(spec, datasetId, tableName, serdeProperties, false);
    }
    if (dataset instanceof ObjectMappedTable) {
        return generateCreateStatementFromSchemaProperty(spec, datasetId, tableName, serdeProperties, true);
    }
    boolean isRecordScannable = dataset instanceof RecordScannable;
    boolean isRecordWritable = dataset instanceof RecordWritable;
    if (isRecordScannable || isRecordWritable) {
        Type recordType = isRecordScannable ? ((RecordScannable) dataset).getRecordType() : ((RecordWritable) dataset).getRecordType();
        // use == on the Class objects: the record type must be exactly StructuredRecord, not a subclass
        if (StructuredRecord.class == recordType) {
            return generateCreateStatementFromSchemaProperty(spec, datasetId, tableName, serdeProperties, true);
        }
        // otherwise, derive the schema from the record type
        LOG.debug("Enabling explore for dataset instance {}", datasetName);
        String databaseName = ExploreProperties.getExploreDatabaseName(spec.getProperties());
        return new CreateStatementBuilder(datasetName, databaseName, tableName, shouldEscapeColumns)
            .setSchema(hiveSchemaFor(recordType))
            .setTableComment("CDAP Dataset")
            .buildWithStorageHandler(DatasetStorageHandler.class.getName(), serdeProperties);
    } else if (dataset instanceof FileSet || dataset instanceof PartitionedFileSet) {
        Map<String, String> properties = spec.getProperties();
        if (FileSetProperties.isExploreEnabled(properties)) {
            LOG.debug("Enabling explore for dataset instance {}", datasetName);
            return generateFileSetCreateStatement(datasetId, dataset, properties, truncating);
        }
    }
    // dataset is not explorable
    return null;
}
Also used: Table(co.cask.cdap.api.dataset.table.Table) ObjectMappedTable(co.cask.cdap.api.dataset.lib.ObjectMappedTable) RecordWritable(co.cask.cdap.api.data.batch.RecordWritable) FileSet(co.cask.cdap.api.dataset.lib.FileSet) PartitionedFileSet(co.cask.cdap.api.dataset.lib.PartitionedFileSet) CreateStatementBuilder(co.cask.cdap.explore.table.CreateStatementBuilder) RecordScannable(co.cask.cdap.api.data.batch.RecordScannable) Type(java.lang.reflect.Type) DatasetStorageHandler(co.cask.cdap.hive.datasets.DatasetStorageHandler) Map(java.util.Map) ImmutableMap(com.google.common.collect.ImmutableMap) Nullable(javax.annotation.Nullable)
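
In the RecordScannable/RecordWritable branch, recordType is whatever the dataset reports through getRecordType(); for a custom dataset this is typically the plain record class, and hiveSchemaFor derives the Hive columns from it. A hypothetical record class for illustration (the Purchase name and its fields are assumptions, not taken from CDAP):

import java.lang.reflect.Type;

// Hypothetical record class: a custom RecordScannable<Purchase> or RecordWritable<Purchase>
// dataset would return Purchase.class from getRecordType(), and the branch above would
// derive a Hive schema from its fields (for example customer string, item string, price double).
public class Purchase {
    public String customer;
    public String item;
    public double price;

    // illustrates what such a dataset's getRecordType() typically reports
    public static Type recordType() {
        return Purchase.class;
    }
}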

Aggregations

RecordScannable (co.cask.cdap.api.data.batch.RecordScannable): 3
RecordWritable (co.cask.cdap.api.data.batch.RecordWritable): 2
IOException (java.io.IOException): 2
Type (java.lang.reflect.Type): 2
Split (co.cask.cdap.api.data.batch.Split): 1
UnsupportedTypeException (co.cask.cdap.api.data.schema.UnsupportedTypeException): 1
Dataset (co.cask.cdap.api.dataset.Dataset): 1
DatasetManagementException (co.cask.cdap.api.dataset.DatasetManagementException): 1
DatasetSpecification (co.cask.cdap.api.dataset.DatasetSpecification): 1
FileSet (co.cask.cdap.api.dataset.lib.FileSet): 1
ObjectMappedTable (co.cask.cdap.api.dataset.lib.ObjectMappedTable): 1
PartitionedFileSet (co.cask.cdap.api.dataset.lib.PartitionedFileSet): 1
Table (co.cask.cdap.api.dataset.table.Table): 1
ServiceUnavailableException (co.cask.cdap.common.ServiceUnavailableException): 1
SystemDatasetInstantiator (co.cask.cdap.data.dataset.SystemDatasetInstantiator): 1
CreateStatementBuilder (co.cask.cdap.explore.table.CreateStatementBuilder): 1
ContextManager (co.cask.cdap.hive.context.ContextManager): 1
DatasetStorageHandler (co.cask.cdap.hive.datasets.DatasetStorageHandler): 1
ImmutableMap (com.google.common.collect.ImmutableMap): 1
Map (java.util.Map): 1