
Example 1 with LimitingInputFormat

Use of io.cdap.cdap.etl.batch.preview.LimitingInputFormat in project cdap by caskdata.

From the class LimitingConnector, the sample method:

@Override
public List<StructuredRecord> sample(ConnectorContext context, SampleRequest request) throws IOException {
    InputFormatProvider inputFormatProvider = batchConnector.getInputFormatProvider(context, request);
    // use limiting format to read from the input format
    Map<String, String> configs = LimitingInputFormatProvider.getConfiguration(inputFormatProvider, request.getLimit());
    Configuration hConf = new Configuration();
    hConf.setClassLoader(pluginConfigurer.createClassLoader());
    configs.forEach(hConf::set);
    Job job = Job.getInstance(hConf);
    job.setJobID(new JobID("sample", 0));
    LimitingInputFormat<?, ?> inputFormat = new LimitingInputFormat<>();
    List<InputSplit> splits;
    try {
        splits = inputFormat.getSplits(job);
    } catch (InterruptedException e) {
        // chain the cause so the original failure is not lost
        throw new IOException(String.format("Unable to get the splits from the input format %s",
                                            inputFormatProvider.getInputFormatClassName()), e);
    }
    List<StructuredRecord> sample = new ArrayList<>();
    // limiting format only has 1 split
    InputSplit split = splits.get(0);
    TaskID taskId = new TaskID(job.getJobID(), TaskType.MAP, 0);
    TaskAttemptContext taskContext = new TaskAttemptContextImpl(hConf, new TaskAttemptID(taskId, 0));
    // create record reader to read the results
    try (RecordReader<?, ?> reader = inputFormat.createRecordReader(split, taskContext)) {
        reader.initialize(split, taskContext);
        while (reader.nextKeyValue()) {
            sample.add(batchConnector.transform(reader.getCurrentKey(), reader.getCurrentValue()));
        }
    } catch (InterruptedException e) {
        // chain the cause so the original failure is not lost
        throw new IOException(String.format("Unable to read the values from the input format %s",
                                            inputFormatProvider.getInputFormatClassName()), e);
    }
    return sample;
}
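
Each key/value pair emitted by the record reader is handed to batchConnector.transform, which converts it into a StructuredRecord. The connector's actual conversion is not shown on this page; as a minimal sketch, assuming a text-based delegate format that yields LongWritable offsets and Text lines (the schema, field names, and this particular transform signature are invented for illustration), it could look like:

import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.api.data.schema.Schema;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// hypothetical schema for this sketch: a byte offset and the line text
private static final Schema TEXT_SCHEMA = Schema.recordOf(
    "textRecord",
    Schema.Field.of("offset", Schema.of(Schema.Type.LONG)),
    Schema.Field.of("body", Schema.of(Schema.Type.STRING)));

public StructuredRecord transform(LongWritable key, Text value) {
    // map the Hadoop key/value pair onto the record schema
    return StructuredRecord.builder(TEXT_SCHEMA)
        .set("offset", key.get())
        .set("body", value.toString())
        .build();
}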
Also used:

InputFormatProvider (io.cdap.cdap.api.data.batch.InputFormatProvider)
LimitingInputFormatProvider (io.cdap.cdap.etl.batch.preview.LimitingInputFormatProvider)
TaskID (org.apache.hadoop.mapreduce.TaskID)
Configuration (org.apache.hadoop.conf.Configuration)
TaskAttemptID (org.apache.hadoop.mapreduce.TaskAttemptID)
ArrayList (java.util.ArrayList)
LimitingInputFormat (io.cdap.cdap.etl.batch.preview.LimitingInputFormat)
TaskAttemptContext (org.apache.hadoop.mapreduce.TaskAttemptContext)
IOException (java.io.IOException)
StructuredRecord (io.cdap.cdap.api.data.format.StructuredRecord)
TaskAttemptContextImpl (org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl)
Job (org.apache.hadoop.mapreduce.Job)
InputSplit (org.apache.hadoop.mapreduce.InputSplit)
JobID (org.apache.hadoop.mapreduce.JobID)

Aggregations

InputFormatProvider (io.cdap.cdap.api.data.batch.InputFormatProvider): 1
StructuredRecord (io.cdap.cdap.api.data.format.StructuredRecord): 1
LimitingInputFormat (io.cdap.cdap.etl.batch.preview.LimitingInputFormat): 1
LimitingInputFormatProvider (io.cdap.cdap.etl.batch.preview.LimitingInputFormatProvider): 1
IOException (java.io.IOException): 1
ArrayList (java.util.ArrayList): 1
Configuration (org.apache.hadoop.conf.Configuration): 1
InputSplit (org.apache.hadoop.mapreduce.InputSplit): 1
Job (org.apache.hadoop.mapreduce.Job): 1
JobID (org.apache.hadoop.mapreduce.JobID): 1
TaskAttemptContext (org.apache.hadoop.mapreduce.TaskAttemptContext): 1
TaskAttemptID (org.apache.hadoop.mapreduce.TaskAttemptID): 1
TaskID (org.apache.hadoop.mapreduce.TaskID): 1
TaskAttemptContextImpl (org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl): 1
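
The source of LimitingInputFormat itself is not listed on this page. Its core idea is to wrap a delegate input format and stop after a fixed number of records, which is why the sample loop above can read to exhaustion without checking the limit itself. A minimal, hypothetical record-reader wrapper illustrating that idea (CappedRecordReader and its fields are names invented for this sketch, not cdap's implementation):

import java.io.IOException;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical wrapper: delegates to another RecordReader and stops after 'limit' records.
public class CappedRecordReader<K, V> extends RecordReader<K, V> {

    private final RecordReader<K, V> delegate;
    private final int limit;
    private int read;

    public CappedRecordReader(RecordReader<K, V> delegate, int limit) {
        this.delegate = delegate;
        this.limit = limit;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // report end-of-input once the cap is reached, even if the delegate has more records
        if (read >= limit) {
            return false;
        }
        boolean hasNext = delegate.nextKeyValue();
        if (hasNext) {
            read++;
        }
        return hasNext;
    }

    @Override
    public K getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();
    }

    @Override
    public V getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        // progress against the cap rather than the underlying split
        return limit == 0 ? 1.0f : Math.min(1.0f, (float) read / limit);
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}

An input format built on this pattern would also collapse the delegate's splits into a single split, which matches the "limiting format only has 1 split" comment in the sample method above.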