Example 1 with SparkRecordCollectionImpl

Use of io.cdap.cdap.etl.api.sql.engine.dataset.SparkRecordCollectionImpl in project cdap by caskdata.

From class BatchSQLEngineAdapter, method pushInternal:

/**
 * Push implementation. This method has blocking calls and should be executed in a separate thread.
 *
 * @param datasetName name of the dataset to push.
 * @param schema      the record schema.
 * @param collection  the collection containing the records to push.
 * @return {@link SQLDataset} instance representing the pushed records.
 * @throws SQLEngineException if the push operation fails.
 */
@SuppressWarnings("unchecked")
public SQLDataset pushInternal(String datasetName, Schema schema, SparkCollection<?> collection) throws SQLEngineException {
    // Create the push request.
    SQLPushRequest pushRequest = new SQLPushRequest(datasetName, schema);
    // Check whether the SQL engine exposes a push capability that can consume this request.
    // If so, we will process this request using a consumer.
    for (PushCapability capability : sqlEngine.getPushCapabilities()) {
        SQLDatasetConsumer consumer = sqlEngine.getConsumer(pushRequest, capability);
        // If a consumer is able to consume this request, we delegate the execution to the consumer.
        if (consumer != null) {
            StructType sparkSchema = DataFrames.toDataType(schema);
            JavaRDD<Row> rowRDD = ((JavaRDD<StructuredRecord>) collection.getUnderlying()).map(r -> DataFrames.toRow(r, sparkSchema));
            Dataset<Row> ds = sqlContext.createDataFrame(rowRDD, sparkSchema);
            RecordCollection recordCollection = new SparkRecordCollectionImpl(ds);
            return consumer.consume(recordCollection);
        }
    }
    // If no push capability could consume the records, fall back to the Push Provider.
    SQLPushDataset<StructuredRecord, ?, ?> pushDataset = sqlEngine.getPushProvider(pushRequest);
    // Write records using the Push Provider's key-value transform and output format.
    JavaPairRDD<?, ?> pairRdd = ((JavaRDD<StructuredRecord>) collection.getUnderlying()).flatMapToPair(new TransformToPairFunction<>(pushDataset.toKeyValue()));
    RDDUtils.saveUsingOutputFormat(pushDataset, pairRdd);
    return pushDataset;
}
Also used:
StructType (org.apache.spark.sql.types.StructType)
SQLPushRequest (io.cdap.cdap.etl.api.engine.sql.request.SQLPushRequest)
StructuredRecord (io.cdap.cdap.api.data.format.StructuredRecord)
JavaRDD (org.apache.spark.api.java.JavaRDD)
PushCapability (io.cdap.cdap.etl.api.engine.sql.capability.PushCapability)
SQLDatasetConsumer (io.cdap.cdap.etl.api.engine.sql.dataset.SQLDatasetConsumer)
RecordCollection (io.cdap.cdap.etl.api.engine.sql.dataset.RecordCollection)
SparkRecordCollection (io.cdap.cdap.etl.api.sql.engine.dataset.SparkRecordCollection)
SparkRecordCollectionImpl (io.cdap.cdap.etl.api.sql.engine.dataset.SparkRecordCollectionImpl)
Row (org.apache.spark.sql.Row)
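The javadoc above notes that pushInternal makes blocking calls and should be executed in a separate thread. A minimal sketch of how a caller might arrange that with a CompletableFuture; the pushAsync helper is hypothetical, and SQLEngineException is assumed to be unchecked so it can propagate out of the lambda:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

import io.cdap.cdap.api.data.schema.Schema;
import io.cdap.cdap.etl.api.engine.sql.dataset.SQLDataset;
// Imports for BatchSQLEngineAdapter and SparkCollection omitted; their packages
// vary by CDAP version.

// Hypothetical helper: run the blocking push on a worker thread and hand back
// a future the pipeline driver can join on later. A failed push surfaces as a
// CompletionException wrapping the SQLEngineException.
public static CompletableFuture<SQLDataset> pushAsync(BatchSQLEngineAdapter adapter,
                                                      String datasetName,
                                                      Schema schema,
                                                      SparkCollection<?> collection,
                                                      ExecutorService executor) {
    return CompletableFuture.supplyAsync(
        () -> adapter.pushInternal(datasetName, schema, collection), executor);
}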

Example 2 with SparkRecordCollectionImpl

Use of io.cdap.cdap.etl.api.sql.engine.dataset.SparkRecordCollectionImpl in project cdap by caskdata.

From class MockPullProducer, method produce:

@Override
public RecordCollection produce(SQLDataset dataset) {
    // Build a TypeToken so Gson can deserialize the expected JSON into a set of StructuredRecords.
    TypeToken<HashSet<StructuredRecord>> typeToken = new TypeToken<HashSet<StructuredRecord>>() {
    };
    Type setOfStructuredRecordType = typeToken.getType();
    // Read records from JSON and adjust data types
    Set<StructuredRecord> jsonRecords = GSON.fromJson(expected, setOfStructuredRecordType);
    Set<StructuredRecord> records = new HashSet<>();
    for (StructuredRecord jsonRecord : jsonRecords) {
        records.add(transform(jsonRecord, jsonRecord.getSchema()));
    }
    // Build an RDD from the records and wrap it in a new RecordCollection.
    SparkContext sc = SparkContext.getOrCreate();
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
    SQLContext sqlContext = new SQLContext(sc);
    StructType sparkSchema = DataFrames.toDataType(this.datasetDescription.getSchema());
    JavaRDD<Row> rdd = jsc.parallelize(new ArrayList<>(records)).map(sr -> DataFrames.toRow(sr, sparkSchema));
    Dataset<Row> ds = sqlContext.createDataFrame(rdd.rdd(), sparkSchema);
    return new SparkRecordCollectionImpl(ds);
}
Also used:
StructType (org.apache.spark.sql.types.StructType)
ArrayList (java.util.ArrayList)
StructuredRecord (io.cdap.cdap.api.data.format.StructuredRecord)
Type (java.lang.reflect.Type)
SparkContext (org.apache.spark.SparkContext)
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext)
TypeToken (com.google.common.reflect.TypeToken)
SparkRecordCollectionImpl (io.cdap.cdap.etl.api.sql.engine.dataset.SparkRecordCollectionImpl)
Row (org.apache.spark.sql.Row)
SQLContext (org.apache.spark.sql.SQLContext)
HashSet (java.util.HashSet)
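On the pull side, whatever receives the returned RecordCollection must unwrap the Spark DataFrame that produce() packed into it. A minimal sketch of that consuming step, assuming the SparkRecordCollection interface exposes the DataFrame through a getDataFrame() accessor (verify against your CDAP version); the countProducedRows helper is hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import io.cdap.cdap.etl.api.engine.sql.dataset.RecordCollection;
import io.cdap.cdap.etl.api.sql.engine.dataset.SparkRecordCollection;

// Hypothetical consuming side: unwrap the DataFrame held by the
// SparkRecordCollectionImpl. getDataFrame() is assumed to be the accessor
// declared on the SparkRecordCollection interface.
public static long countProducedRows(RecordCollection collection) {
    if (collection instanceof SparkRecordCollection) {
        Dataset<Row> df = ((SparkRecordCollection) collection).getDataFrame();
        return df.count();  // triggers the Spark job that materializes the records
    }
    throw new IllegalArgumentException("Expected a Spark-backed RecordCollection");
}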

Aggregations

StructuredRecord (io.cdap.cdap.api.data.format.StructuredRecord): 2 usages
SparkRecordCollectionImpl (io.cdap.cdap.etl.api.sql.engine.dataset.SparkRecordCollectionImpl): 2 usages
Row (org.apache.spark.sql.Row): 2 usages
StructType (org.apache.spark.sql.types.StructType): 2 usages
TypeToken (com.google.common.reflect.TypeToken): 1 usage
PushCapability (io.cdap.cdap.etl.api.engine.sql.capability.PushCapability): 1 usage
RecordCollection (io.cdap.cdap.etl.api.engine.sql.dataset.RecordCollection): 1 usage
SQLDatasetConsumer (io.cdap.cdap.etl.api.engine.sql.dataset.SQLDatasetConsumer): 1 usage
SQLPushRequest (io.cdap.cdap.etl.api.engine.sql.request.SQLPushRequest): 1 usage
SparkRecordCollection (io.cdap.cdap.etl.api.sql.engine.dataset.SparkRecordCollection): 1 usage
Type (java.lang.reflect.Type): 1 usage
ArrayList (java.util.ArrayList): 1 usage
HashSet (java.util.HashSet): 1 usage
SparkContext (org.apache.spark.SparkContext): 1 usage
JavaRDD (org.apache.spark.api.java.JavaRDD): 1 usage
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext): 1 usage
SQLContext (org.apache.spark.sql.SQLContext): 1 usage