
Example 76 with JobContext

Use of org.apache.hadoop.mapreduce.JobContext in project druid by druid-io.

The class DatasourceInputFormat, method getSplits:

@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    JobConf conf = new JobConf(context.getConfiguration());
    List<String> dataSources = getDataSources(conf);
    List<InputSplit> splits = new ArrayList<>();
    for (String dataSource : dataSources) {
        List<WindowedDataSegment> segments = getSegments(conf, dataSource);
        if (segments == null || segments.size() == 0) {
            throw new ISE("No segments found to read for dataSource[%s]", dataSource);
        }
        // Note: Each segment is logged separately to avoid creating a huge String if we are loading lots of segments.
        for (int i = 0; i < segments.size(); i++) {
            final WindowedDataSegment segment = segments.get(i);
            logger.info("Segment %,d/%,d for dataSource[%s] has identifier[%s], interval[%s]", i, segments.size(), dataSource, segment.getSegment().getId(), segment.getInterval());
        }
        long maxSize = getMaxSplitSize(conf, dataSource);
        if (maxSize < 0) {
            long totalSize = 0;
            for (WindowedDataSegment segment : segments) {
                totalSize += segment.getSegment().getSize();
            }
            int mapTask = conf.getNumMapTasks();
            if (mapTask > 0) {
                maxSize = totalSize / mapTask;
            }
        }
        if (maxSize > 0) {
            // Combining will happen, so sort the segments by size so that they
            // are combined appropriately.
            segments.sort(Comparator.comparingLong(s -> s.getSegment().getSize()));
        }
        List<WindowedDataSegment> list = new ArrayList<>();
        long size = 0;
        org.apache.hadoop.mapred.InputFormat fio = supplier.get();
        for (WindowedDataSegment segment : segments) {
            if (size + segment.getSegment().getSize() > maxSize && size > 0) {
                splits.add(toDataSourceSplit(list, fio, conf));
                list = new ArrayList<>();
                size = 0;
            }
            list.add(segment);
            size += segment.getSegment().getSize();
        }
        if (list.size() > 0) {
            splits.add(toDataSourceSplit(list, fio, conf));
        }
    }
    logger.info("Number of splits [%d]", splits.size());
    return splits;
}
Also used : Logger(org.apache.druid.java.util.common.logger.Logger) TextInputFormat(org.apache.hadoop.mapred.TextInputFormat) Arrays(java.util.Arrays) NullWritable(org.apache.hadoop.io.NullWritable) FileSystem(org.apache.hadoop.fs.FileSystem) Supplier(com.google.common.base.Supplier) FileStatus(org.apache.hadoop.fs.FileStatus) ArrayList(java.util.ArrayList) Configuration(org.apache.hadoop.conf.Configuration) Map(java.util.Map) Path(org.apache.hadoop.fs.Path) TypeReference(com.fasterxml.jackson.core.type.TypeReference) JobHelper(org.apache.druid.indexer.JobHelper) TaskAttemptContext(org.apache.hadoop.mapreduce.TaskAttemptContext) HadoopDruidIndexerConfig(org.apache.druid.indexer.HadoopDruidIndexerConfig) InputSplit(org.apache.hadoop.mapreduce.InputSplit) FileInputFormat(org.apache.hadoop.mapred.FileInputFormat) InputFormat(org.apache.hadoop.mapreduce.InputFormat) StringUtils(org.apache.druid.java.util.common.StringUtils) ISE(org.apache.druid.java.util.common.ISE) IOException(java.io.IOException) Collectors(java.util.stream.Collectors) RecordReader(org.apache.hadoop.mapreduce.RecordReader) JobConf(org.apache.hadoop.mapred.JobConf) InputRow(org.apache.druid.data.input.InputRow) List(java.util.List) Stream(java.util.stream.Stream) JobContext(org.apache.hadoop.mapreduce.JobContext) VisibleForTesting(com.google.common.annotations.VisibleForTesting) Comparator(java.util.Comparator) Collections(java.util.Collections)
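
For context, a minimal sketch, not taken from the Druid sources, of how a getSplits(JobContext) implementation like the one above can be driven outside a running MapReduce job. It assumes Hadoop's JobContextImpl(Configuration, JobID) constructor from org.apache.hadoop.mapreduce.task; the class name SplitDriver and the "local" job identifier are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.task.JobContextImpl;

import java.io.IOException;
import java.util.List;

public class SplitDriver {

    // getSplits implementations such as DatasourceInputFormat above only read the
    // configuration from the context, so a bare JobContextImpl with a placeholder
    // JobID is enough to compute splits without a submitted job.
    public static List<InputSplit> computeSplits(InputFormat<?, ?> format, Configuration conf)
            throws IOException, InterruptedException {
        JobContext context = new JobContextImpl(conf, new JobID("local", 0));
        return format.getSplits(context);
    }
}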

Example 77 with JobContext

Use of org.apache.hadoop.mapreduce.JobContext in project mongo-hadoop by mongodb.

The class GridFSInputFormatTest, method mockJobContext:

private static JobContext mockJobContext(final Configuration conf) {
    JobContext context = mock(JobContext.class);
    when(context.getConfiguration()).thenReturn(conf);
    return context;
}
Also used : JobContext(org.apache.hadoop.mapreduce.JobContext)
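
The test in the next example also relies on a mockTaskAttemptContext helper. Below is a plausible sketch of such a helper in the same Mockito style; the real method in mongo-hadoop may stub additional calls. It assumes static imports of org.mockito.Mockito.mock and org.mockito.Mockito.when, plus org.apache.hadoop.mapreduce.TaskAttemptContext.

// Hypothetical companion helper: stub only the method the readers under test touch.
private static TaskAttemptContext mockTaskAttemptContext(final Configuration conf) {
    TaskAttemptContext context = mock(TaskAttemptContext.class);
    when(context.getConfiguration()).thenReturn(conf);
    return context;
}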

Example 78 with JobContext

Use of org.apache.hadoop.mapreduce.JobContext in project mongo-hadoop by mongodb.

The class GridFSInputFormatTest, method testReadBinaryFiles:

@Test
public void testReadBinaryFiles() throws IOException, InterruptedException, URISyntaxException {
    Configuration conf = getConfiguration();
    MongoConfigUtil.setQuery(conf, new BasicDBObject("filename", "orders.bson"));
    MongoConfigUtil.setGridFSWholeFileSplit(conf, true);
    MongoConfigUtil.setGridFSReadBinary(conf, true);
    JobContext context = mockJobContext(conf);
    TaskAttemptContext taskContext = mockTaskAttemptContext(conf);
    List<InputSplit> splits = inputFormat.getSplits(context);
    assertEquals(1, splits.size());
    int i = 0;
    byte[] buff = null;
    for (InputSplit split : splits) {
        GridFSInputFormat.GridFSBinaryRecordReader reader = new GridFSInputFormat.GridFSBinaryRecordReader();
        reader.initialize(split, taskContext);
        for (; reader.nextKeyValue(); ++i) {
            buff = new byte[reader.getCurrentValue().getLength()];
            // BytesWritable.copyBytes does not exist in Hadoop 1.2
            System.arraycopy(reader.getCurrentValue().getBytes(), 0, buff, 0, buff.length);
        }
    }
    // Only one record to read on the split.
    assertEquals(1, i);
    assertNotNull(buff);
    assertEquals(bson.getLength(), buff.length);
}
Also used : BasicDBObject(com.mongodb.BasicDBObject) Configuration(org.apache.hadoop.conf.Configuration) TaskAttemptContext(org.apache.hadoop.mapreduce.TaskAttemptContext) JobContext(org.apache.hadoop.mapreduce.JobContext) InputSplit(org.apache.hadoop.mapreduce.InputSplit) Test(org.junit.Test) BaseHadoopTest(com.mongodb.hadoop.testutils.BaseHadoopTest)

Example 79 with JobContext

Use of org.apache.hadoop.mapreduce.JobContext in project hive by apache.

The class KuduInputFormat, method computeSplits:

private List<KuduInputSplit> computeSplits(Configuration conf) throws IOException {
    try (KuduClient client = KuduHiveUtils.getKuduClient(conf)) {
        // Hive depends on FileSplits so we get the dummy Path for the Splits.
        Job job = Job.getInstance(conf);
        JobContext jobContext = ShimLoader.getHadoopShims().newJobContext(job);
        Path[] paths = FileInputFormat.getInputPaths(jobContext);
        Path dummyPath = paths[0];
        String tableName = conf.get(KUDU_TABLE_NAME_KEY);
        if (StringUtils.isEmpty(tableName)) {
            throw new IllegalArgumentException(KUDU_TABLE_NAME_KEY + " is not set.");
        }
        if (!client.tableExists(tableName)) {
            throw new IllegalArgumentException("Kudu table does not exist: " + tableName);
        }
        KuduTable table = client.openTable(tableName);
        List<KuduPredicate> predicates = KuduPredicateHandler.getPredicates(conf, table.getSchema());
        KuduScanToken.KuduScanTokenBuilder tokenBuilder = client.newScanTokenBuilder(table).setProjectedColumnNames(getProjectedColumns(conf));
        for (KuduPredicate predicate : predicates) {
            tokenBuilder.addPredicate(predicate);
        }
        List<KuduScanToken> tokens = tokenBuilder.build();
        List<KuduInputSplit> splits = new ArrayList<>(tokens.size());
        for (KuduScanToken token : tokens) {
            List<String> locations = new ArrayList<>(token.getTablet().getReplicas().size());
            for (LocatedTablet.Replica replica : token.getTablet().getReplicas()) {
                locations.add(replica.getRpcHost());
            }
            splits.add(new KuduInputSplit(token, dummyPath, locations.toArray(new String[0])));
        }
        return splits;
    }
}
Also used : Path(org.apache.hadoop.fs.Path) KuduScanToken(org.apache.kudu.client.KuduScanToken) ArrayList(java.util.ArrayList) KuduTable(org.apache.kudu.client.KuduTable) LocatedTablet(org.apache.kudu.client.LocatedTablet) KuduPredicate(org.apache.kudu.client.KuduPredicate) KuduClient(org.apache.kudu.client.KuduClient) JobContext(org.apache.hadoop.mapreduce.JobContext) Job(org.apache.hadoop.mapreduce.Job)
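
As a follow-up, a hedged sketch, not taken from the Hive sources, of how a task-side reader typically turns one of those scan tokens back into rows using the standard Kudu client API (KuduScanToken.serialize, KuduScanToken.deserializeIntoScanner, KuduScanner); the masterAddresses parameter is a placeholder, and the imports come from org.apache.kudu.client.

// Serialize a token on the planning side (this is what travels inside a split)
// and re-open it as a scanner on the task side.
static void scanToken(KuduScanToken token, String masterAddresses) throws IOException {
    byte[] serializedToken = token.serialize();
    try (KuduClient client = new KuduClient.KuduClientBuilder(masterAddresses).build()) {
        KuduScanner scanner = KuduScanToken.deserializeIntoScanner(serializedToken, client);
        while (scanner.hasMoreRows()) {
            RowResultIterator rows = scanner.nextRows();
            while (rows.hasNext()) {
                RowResult row = rows.next();
                // Consume each projected row here, e.g. hand it to a record reader.
            }
        }
    }
}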

Example 80 with JobContext

Use of org.apache.hadoop.mapreduce.JobContext in project hive by apache.

The class HiveHFileOutputFormat, method checkOutputSpecs:

@Override
public void checkOutputSpecs(FileSystem ignored, JobConf jc) throws IOException {
    // delegate to the new api
    Job job = new Job(jc);
    JobContext jobContext = ShimLoader.getHadoopShims().newJobContext(job);
    checkOutputSpecs(jobContext);
}
Also used : JobContext(org.apache.hadoop.mapreduce.JobContext) Job(org.apache.hadoop.mapreduce.Job)
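
A minimal sketch, not from the Hive sources, of the same old-API-to-new-API bridge using only Hadoop classes, i.e. JobContextImpl from org.apache.hadoop.mapreduce.task instead of Hive's ShimLoader; the "local" JobID is a placeholder, since checkOutputSpecs only consults the configuration.

// Inside a method that receives the old-API JobConf jc:
Job job = Job.getInstance(jc); // JobConf extends Configuration, so it can be passed directly
JobContext jobContext = new JobContextImpl(job.getConfiguration(), new JobID("local", 0));
checkOutputSpecs(jobContext);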

Aggregations

JobContext (org.apache.hadoop.mapreduce.JobContext): 85
Configuration (org.apache.hadoop.conf.Configuration): 41
Job (org.apache.hadoop.mapreduce.Job): 35
TaskAttemptContext (org.apache.hadoop.mapreduce.TaskAttemptContext): 34
Test (org.junit.Test): 31
JobContextImpl (org.apache.hadoop.mapreduce.task.JobContextImpl): 29
InputSplit (org.apache.hadoop.mapreduce.InputSplit): 28
TaskAttemptContextImpl (org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl): 25
Path (org.apache.hadoop.fs.Path): 24
IOException (java.io.IOException): 22
File (java.io.File): 19
TaskAttemptID (org.apache.hadoop.mapreduce.TaskAttemptID): 16
ArrayList (java.util.ArrayList): 13
RecordWriter (org.apache.hadoop.mapreduce.RecordWriter): 11
JobConf (org.apache.hadoop.mapred.JobConf): 10
OutputCommitter (org.apache.hadoop.mapreduce.OutputCommitter): 10
LongWritable (org.apache.hadoop.io.LongWritable): 9
MapFile (org.apache.hadoop.io.MapFile): 9
JobID (org.apache.hadoop.mapreduce.JobID): 7
FileSystem (org.apache.hadoop.fs.FileSystem): 6