
Example 11 with InputFormat

Use of org.apache.hadoop.mapreduce.InputFormat in project hadoop by apache.

From the class TestCombineFileInputFormat, method testRecordReaderInit:

@Test
public void testRecordReaderInit() throws InterruptedException, IOException {
    // Test that we properly initialize the child recordreader when
    // CombineFileInputFormat and CombineFileRecordReader are used.
    TaskAttemptID taskId = new TaskAttemptID("jt", 0, TaskType.MAP, 0, 0);
    Configuration conf1 = new Configuration();
    conf1.set(DUMMY_KEY, "STATE1");
    TaskAttemptContext context1 = new TaskAttemptContextImpl(conf1, taskId);
    // This will create a CombineFileRecordReader that itself contains a
    // DummyRecordReader.
    InputFormat inputFormat = new ChildRRInputFormat();
    Path[] files = { new Path("file1") };
    long[] lengths = { 1 };
    CombineFileSplit split = new CombineFileSplit(files, lengths);
    RecordReader rr = inputFormat.createRecordReader(split, context1);
    assertTrue("Unexpected RR type!", rr instanceof CombineFileRecordReader);
    // Verify that the initial configuration is the one being used.
    // Right after construction the dummy key should have value "STATE1"
    assertEquals("Invalid initial dummy key value", "STATE1", rr.getCurrentKey().toString());
    // Switch the active context for the RecordReader...
    Configuration conf2 = new Configuration();
    conf2.set(DUMMY_KEY, "STATE2");
    TaskAttemptContext context2 = new TaskAttemptContextImpl(conf2, taskId);
    rr.initialize(split, context2);
    // And verify that the new context is updated into the child record reader.
    assertEquals("Invalid secondary dummy key value", "STATE2", rr.getCurrentKey().toString());
}
Also used: Path (org.apache.hadoop.fs.Path), Configuration (org.apache.hadoop.conf.Configuration), TaskAttemptID (org.apache.hadoop.mapreduce.TaskAttemptID), InputFormat (org.apache.hadoop.mapreduce.InputFormat), TaskAttemptContextImpl (org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl), RecordReader (org.apache.hadoop.mapreduce.RecordReader), TaskAttemptContext (org.apache.hadoop.mapreduce.TaskAttemptContext), Test (org.junit.Test)
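The ChildRRInputFormat and DummyRecordReader fixtures that this test relies on are not shown on this page. Below is a minimal sketch of what they might look like, inferred from the assertions above; the names, constant value, and constructor details are assumptions, not the actual Hadoop test fixtures, and imports are omitted as in the listings above.

// Sketch only: a CombineFileInputFormat whose per-file reader exposes the value of
// DUMMY_KEY from the currently active Configuration as its current key.
private static final String DUMMY_KEY = "dummy.key";  // assumed constant of the test class

private static class ChildRRInputFormat extends CombineFileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // CombineFileRecordReader instantiates one DummyRecordReader per file in the split.
        return new CombineFileRecordReader<Text, Text>((CombineFileSplit) split, context, DummyRecordReader.class);
    }
}

private static class DummyRecordReader extends RecordReader<Text, Text> {
    private final Text key = new Text();

    // CombineFileRecordReader requires this (split, context, index) constructor.
    public DummyRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
        key.set(context.getConfiguration().get(DUMMY_KEY));   // "STATE1" right after construction
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        // Re-read the key from whichever context is now active ("STATE2" after rr.initialize()).
        key.set(context.getConfiguration().get(DUMMY_KEY));
    }

    @Override
    public boolean nextKeyValue() { return false; }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return null; }

    @Override
    public float getProgress() { return 0.0f; }

    @Override
    public void close() { }
}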

Example 12 with InputFormat

Use of org.apache.hadoop.mapreduce.InputFormat in project hadoop by apache.

From the class TestMultipleInputs, method testAddInputPathWithMapper:

@Test
public void testAddInputPathWithMapper() throws IOException {
    final Job conf = Job.getInstance();
    MultipleInputs.addInputPath(conf, new Path("/foo"), TextInputFormat.class, MapClass.class);
    MultipleInputs.addInputPath(conf, new Path("/bar"), KeyValueTextInputFormat.class, KeyValueMapClass.class);
    final Map<Path, InputFormat> inputs = MultipleInputs.getInputFormatMap(conf);
    final Map<Path, Class<? extends Mapper>> maps = MultipleInputs.getMapperTypeMap(conf);
    assertEquals(TextInputFormat.class, inputs.get(new Path("/foo")).getClass());
    assertEquals(KeyValueTextInputFormat.class, inputs.get(new Path("/bar")).getClass());
    assertEquals(MapClass.class, maps.get(new Path("/foo")));
    assertEquals(KeyValueMapClass.class, maps.get(new Path("/bar")));
}
Also used: Path (org.apache.hadoop.fs.Path), Mapper (org.apache.hadoop.mapreduce.Mapper), InputFormat (org.apache.hadoop.mapreduce.InputFormat), KeyValueTextInputFormat (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat), Job (org.apache.hadoop.mapreduce.Job), Test (org.junit.Test)
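For context, here is a minimal sketch of how a driver might finish configuring and submit a Job wired up with MultipleInputs in the same way. The driver class, reducer, and output path are hypothetical placeholders; MapClass and KeyValueMapClass are the mapper fixtures from the test above, and the snippet assumes a main method that declares throws Exception.

Job job = Job.getInstance(new Configuration(), "multiple-inputs-example");
job.setJarByClass(MultipleInputsDriver.class);   // hypothetical driver class

// Each input path gets its own InputFormat and its own Mapper.
MultipleInputs.addInputPath(job, new Path("/foo"), TextInputFormat.class, MapClass.class);
MultipleInputs.addInputPath(job, new Path("/bar"), KeyValueTextInputFormat.class, KeyValueMapClass.class);

job.setReducerClass(JoinReducer.class);          // hypothetical reducer
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileOutputFormat.setOutputPath(job, new Path("/out"));   // hypothetical output path

System.exit(job.waitForCompletion(true) ? 0 : 1);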

Example 13 with InputFormat

Use of org.apache.hadoop.mapreduce.InputFormat in project asterixdb by apache.

From the class HDFSReadOperatorDescriptor, method createPushRuntime:

@Override
public IOperatorNodePushable createPushRuntime(final IHyracksTaskContext ctx, IRecordDescriptorProvider recordDescProvider, final int partition, final int nPartitions) throws HyracksDataException {
    final List<FileSplit> inputSplits = splitsFactory.getSplits();
    return new AbstractUnaryOutputSourceOperatorNodePushable() {

        private String nodeName = ctx.getJobletContext().getServiceContext().getNodeId();

        private ContextFactory ctxFactory = new ContextFactory();

        @SuppressWarnings("unchecked")
        @Override
        public void initialize() throws HyracksDataException {
            ClassLoader ctxCL = Thread.currentThread().getContextClassLoader();
            try {
                writer.open();
                Thread.currentThread().setContextClassLoader(ctx.getJobletContext().getClassLoader());
                Job job = confFactory.getConf();
                job.getConfiguration().setClassLoader(ctx.getJobletContext().getClassLoader());
                IKeyValueParser parser = tupleParserFactory.createKeyValueParser(ctx);
                InputFormat inputFormat = ReflectionUtils.newInstance(job.getInputFormatClass(), job.getConfiguration());
                int size = inputSplits.size();
                for (int i = 0; i < size; i++) {
                    // Read all the partitions scheduled to the current node.
                    if (scheduledLocations[i].equals(nodeName)) {
                        // Pick an unread split to read; synchronize among simultaneous
                        // partitions on the same machine.
                        synchronized (executed) {
                            if (!executed[i]) {
                                executed[i] = true;
                            } else {
                                continue;
                            }
                        }
                        // Read the split.
                        TaskAttemptContext context = ctxFactory.createContext(job.getConfiguration(), i);
                        context.getConfiguration().setClassLoader(ctx.getJobletContext().getClassLoader());
                        RecordReader reader = inputFormat.createRecordReader(inputSplits.get(i), context);
                        reader.initialize(inputSplits.get(i), context);
                        while (reader.nextKeyValue()) {
                            parser.parse(reader.getCurrentKey(), reader.getCurrentValue(), writer, inputSplits.get(i).toString());
                        }
                    }
                }
                parser.close(writer);
            } catch (Throwable th) {
                writer.fail();
                throw new HyracksDataException(th);
            } finally {
                writer.close();
                Thread.currentThread().setContextClassLoader(ctxCL);
            }
        }
    };
}
Also used: AbstractUnaryOutputSourceOperatorNodePushable (org.apache.hyracks.dataflow.std.base.AbstractUnaryOutputSourceOperatorNodePushable), RecordReader (org.apache.hadoop.mapreduce.RecordReader), TaskAttemptContext (org.apache.hadoop.mapreduce.TaskAttemptContext), FileSplit (org.apache.hadoop.mapreduce.lib.input.FileSplit), HyracksDataException (org.apache.hyracks.api.exceptions.HyracksDataException), ContextFactory (org.apache.hyracks.hdfs.ContextFactory), IKeyValueParser (org.apache.hyracks.hdfs.api.IKeyValueParser), InputFormat (org.apache.hadoop.mapreduce.InputFormat), Job (org.apache.hadoop.mapreduce.Job)
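Stripped of the Hyracks-specific plumbing (ContextFactory, IKeyValueParser, writer), the per-split read in initialize() above follows the standard new-API sequence: instantiate the InputFormat reflectively, create and initialize a RecordReader for the split, then drain it with nextKeyValue(). A minimal sketch under that assumption; handleRecord is a hypothetical stand-in for the parser callback, and the plain TaskAttemptContextImpl replaces the ContextFactory used above.

@SuppressWarnings("unchecked")
static <K, V> void readSplit(Job job, InputSplit split) throws Exception {
    // Instantiate the configured InputFormat reflectively, as the operator above does.
    InputFormat<K, V> inputFormat = (InputFormat<K, V>)
            ReflectionUtils.newInstance(job.getInputFormatClass(), job.getConfiguration());

    TaskAttemptContext context =
            new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
    RecordReader<K, V> reader = inputFormat.createRecordReader(split, context);
    try {
        reader.initialize(split, context);
        while (reader.nextKeyValue()) {
            handleRecord(reader.getCurrentKey(), reader.getCurrentValue()); // hypothetical callback
        }
    } finally {
        reader.close();
    }
}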

Example 14 with InputFormat

Use of org.apache.hadoop.mapreduce.InputFormat in project beam by apache.

From the class HadoopInputFormatIOTest, method testReadersStartWhenZeroRecords:

/**
   * This test validates the behavior of the
   * {@link HadoopInputFormatBoundedSource.HadoopInputFormatReader#start() start()} method when the
   * InputFormat's {@link InputFormat#getSplits() getSplits()} returns a list of InputSplits with
   * zero records.
   */
@Test
public void testReadersStartWhenZeroRecords() throws Exception {
    InputFormat mockInputFormat = Mockito.mock(EmployeeInputFormat.class);
    EmployeeRecordReader mockReader = Mockito.mock(EmployeeRecordReader.class);
    Mockito.when(mockInputFormat.createRecordReader(Mockito.any(InputSplit.class), Mockito.any(TaskAttemptContext.class))).thenReturn(mockReader);
    Mockito.when(mockReader.nextKeyValue()).thenReturn(false);
    InputSplit mockInputSplit = Mockito.mock(NewObjectsEmployeeInputSplit.class);
    HadoopInputFormatBoundedSource<Text, Employee> boundedSource = new HadoopInputFormatBoundedSource<Text, Employee>(
            serConf,
            WritableCoder.of(Text.class),
            AvroCoder.of(Employee.class),
            null, // No key translation required.
            null, // No value translation required.
            new SerializableSplit(mockInputSplit));
    boundedSource.setInputFormatObj(mockInputFormat);
    BoundedReader<KV<Text, Employee>> reader = boundedSource.createReader(p.getOptions());
    assertEquals(false, reader.start());
    assertEquals(Double.valueOf(1), reader.getFractionConsumed());
    reader.close();
}
Also used: EmployeeRecordReader (org.apache.beam.sdk.io.hadoop.inputformat.EmployeeInputFormat.EmployeeRecordReader), InputFormat (org.apache.hadoop.mapreduce.InputFormat), HadoopInputFormatBoundedSource (org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO.HadoopInputFormatBoundedSource), SerializableSplit (org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO.SerializableSplit), TaskAttemptContext (org.apache.hadoop.mapreduce.TaskAttemptContext), Text (org.apache.hadoop.io.Text), KV (org.apache.beam.sdk.values.KV), InputSplit (org.apache.hadoop.mapreduce.InputSplit), NewObjectsEmployeeInputSplit (org.apache.beam.sdk.io.hadoop.inputformat.EmployeeInputFormat.NewObjectsEmployeeInputSplit), Test (org.junit.Test)
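The contract this test pins down can be sketched as a simplified reader wrapper; this is an illustrative sketch of the asserted semantics, not Beam's actual HadoopInputFormatReader. start() simply reflects the underlying RecordReader's nextKeyValue(), and once the reader reports no records the source counts as fully consumed.

// Illustrative sketch only: mirrors the behavior asserted by the test above.
class SketchReader {
    private final RecordReader<?, ?> recordReader;
    private boolean doneReading = false;

    SketchReader(RecordReader<?, ?> recordReader) {
        this.recordReader = recordReader;
    }

    boolean start() throws IOException, InterruptedException {
        boolean hasRecord = recordReader.nextKeyValue();
        doneReading = !hasRecord;
        return hasRecord;   // false for the zero-record split mocked above
    }

    Double getFractionConsumed() {
        return doneReading ? 1.0 : 0.0;   // 1.0 expected by the assertion above
    }
}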

Example 15 with InputFormat

Use of org.apache.hadoop.mapreduce.InputFormat in project cdap by caskdata.

From the class MultiInputFormat, method createRecordReader:

@Override
public RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    MultiInputTaggedSplit taggedInputSplit = (MultiInputTaggedSplit) split;
    ConfigurationUtil.setAll(taggedInputSplit.getInputConfigs(), context.getConfiguration());
    // Find the InputFormat named by the MultiInputTaggedSplit, then delegate creation of the
    // RecordReader to it for the underlying split.
    InputFormat<K, V> inputFormat = (InputFormat<K, V>) ReflectionUtils.newInstance(taggedInputSplit.getInputFormatClass(), context.getConfiguration());
    InputSplit inputSplit = taggedInputSplit.getInputSplit();
    return new DelegatingRecordReader<>(inputFormat.createRecordReader(inputSplit, context));
}
Also used: InputFormat (org.apache.hadoop.mapreduce.InputFormat), InputSplit (org.apache.hadoop.mapreduce.InputSplit)
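The DelegatingRecordReader returned above is not shown here. Below is a minimal sketch of what such a forwarding reader typically looks like: a sketch of the delegation pattern, not CDAP's actual class, and the unwrapping of the tagged split in initialize() is an assumption about why the wrapper is needed.

// Sketch: forwards every RecordReader call to the reader created for the underlying split.
class DelegatingRecordReaderSketch<K, V> extends RecordReader<K, V> {
    private final RecordReader<K, V> delegate;

    DelegatingRecordReaderSketch(RecordReader<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Unwrap the tagged split so the delegate sees the split type it expects.
        InputSplit inputSplit = (split instanceof MultiInputTaggedSplit)
                ? ((MultiInputTaggedSplit) split).getInputSplit() : split;
        delegate.initialize(inputSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return delegate.nextKeyValue();
    }

    @Override
    public K getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();
    }

    @Override
    public V getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}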

Aggregations

InputFormat (org.apache.hadoop.mapreduce.InputFormat): 16
Path (org.apache.hadoop.fs.Path): 9
Configuration (org.apache.hadoop.conf.Configuration): 8
InputSplit (org.apache.hadoop.mapreduce.InputSplit): 7
Job (org.apache.hadoop.mapreduce.Job): 6
Test (org.junit.Test): 6
RecordReader (org.apache.hadoop.mapreduce.RecordReader): 5
TaskAttemptContext (org.apache.hadoop.mapreduce.TaskAttemptContext): 5
HashMap (java.util.HashMap): 3
Map (java.util.Map): 3
Mapper (org.apache.hadoop.mapreduce.Mapper): 3
TaskAttemptID (org.apache.hadoop.mapreduce.TaskAttemptID): 3
FileInputFormat (org.apache.hadoop.mapreduce.lib.input.FileInputFormat): 3
TaskAttemptContextImpl (org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl): 3
ArrayList (java.util.ArrayList): 2
List (java.util.List): 2
KeyValueTextInputFormat (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat): 2
Input (co.cask.cdap.api.data.batch.Input): 1
InputFormatProvider (co.cask.cdap.api.data.batch.InputFormatProvider): 1
FormatSpecification (co.cask.cdap.api.data.format.FormatSpecification): 1