
Example 6 with DataSourceDescriptor

Use of org.apache.tez.dag.api.DataSourceDescriptor in project tez by apache.

From the class TestMRInputHelpers, the method testInputSplitLocalResourceCreationWithDifferentFS:

@Test(timeout = 5000)
public void testInputSplitLocalResourceCreationWithDifferentFS() throws Exception {
    FileSystem localFs = FileSystem.getLocal(conf);
    Path LOCAL_TEST_ROOT_DIR = new Path("target" + Path.SEPARATOR + TestMRHelpers.class.getName() + "-localtmpDir");
    try {
        localFs.mkdirs(LOCAL_TEST_ROOT_DIR);
        Path splitsDir = localFs.resolvePath(LOCAL_TEST_ROOT_DIR);
        DataSourceDescriptor dataSource = generateDataSourceDescriptorMapRed(splitsDir);
        Map<String, LocalResource> localResources = dataSource.getAdditionalLocalFiles();
        Assert.assertEquals(2, localResources.size());
        Assert.assertTrue(localResources.containsKey(MRInputHelpers.JOB_SPLIT_RESOURCE_NAME));
        Assert.assertTrue(localResources.containsKey(MRInputHelpers.JOB_SPLIT_METAINFO_RESOURCE_NAME));
        for (LocalResource lr : localResources.values()) {
            Assert.assertFalse(lr.getResource().getScheme().contains(remoteFs.getScheme()));
        }
    } finally {
        localFs.delete(LOCAL_TEST_ROOT_DIR, true);
    }
}
Also used: Path(org.apache.hadoop.fs.Path) FileSystem(org.apache.hadoop.fs.FileSystem) DataSourceDescriptor(org.apache.tez.dag.api.DataSourceDescriptor) LocalResource(org.apache.hadoop.yarn.api.records.LocalResource) Test(org.junit.Test)
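
The helper generateDataSourceDescriptorMapRed is defined elsewhere in the test class and is not shown here. A plausible sketch of it, assuming it simply wraps the legacy client-side split-generation API from Example 8 with a mapred-API JobConf (testFilePath is a hypothetical input-path field, not from the original):

private DataSourceDescriptor generateDataSourceDescriptorMapRed(Path splitsDir) throws Exception {
    // Hypothetical reconstruction: configure a mapred-API job over a test input
    // path, then delegate to the legacy client-side split-generation helper.
    JobConf jobConf = new JobConf(conf);
    jobConf.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);
    org.apache.hadoop.mapred.FileInputFormat.setInputPaths(jobConf, testFilePath);
    return MRInputHelpers.configureMRInputWithLegacySplitGeneration(jobConf, splitsDir, true);
}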

Example 7 with DataSourceDescriptor

Use of org.apache.tez.dag.api.DataSourceDescriptor in project tez by apache.

From the class TestMRInput, the method test0PhysicalInputs:

@Test(timeout = 5000)
public void test0PhysicalInputs() throws IOException {
    InputContext inputContext = mock(InputContext.class);
    DataSourceDescriptor dsd = MRInput.createConfigBuilder(new Configuration(false), FileInputFormat.class, "testPath").build();
    ApplicationId applicationId = ApplicationId.newInstance(1000, 1);
    doReturn(dsd.getInputDescriptor().getUserPayload()).when(inputContext).getUserPayload();
    doReturn(applicationId).when(inputContext).getApplicationId();
    doReturn("dagName").when(inputContext).getDAGName();
    doReturn("vertexName").when(inputContext).getTaskVertexName();
    doReturn("inputName").when(inputContext).getSourceVertexName();
    doReturn("uniqueIdentifier").when(inputContext).getUniqueIdentifier();
    doReturn(1).when(inputContext).getTaskIndex();
    doReturn(1).when(inputContext).getTaskAttemptNumber();
    doReturn(new TezCounters()).when(inputContext).getCounters();
    MRInput mrInput = new MRInput(inputContext, 0);
    mrInput.initialize();
    mrInput.start();
    assertFalse(mrInput.getReader().next());
    verify(inputContext, times(1)).notifyProgress();
    List<Event> events = new LinkedList<>();
    try {
        mrInput.handleEvents(events);
        fail("HandleEvents should cause an input with 0 physical inputs to fail");
    } catch (Exception e) {
        assertTrue(e instanceof IllegalStateException);
    }
}
Also used: Configuration(org.apache.hadoop.conf.Configuration) InputContext(org.apache.tez.runtime.api.InputContext) Event(org.apache.tez.runtime.api.Event) InputDataInformationEvent(org.apache.tez.runtime.api.events.InputDataInformationEvent) ApplicationId(org.apache.hadoop.yarn.api.records.ApplicationId) FileInputFormat(org.apache.hadoop.mapreduce.lib.input.FileInputFormat) TezCounters(org.apache.tez.common.counters.TezCounters) LinkedList(java.util.LinkedList) IOException(java.io.IOException) DataSourceDescriptor(org.apache.tez.dag.api.DataSourceDescriptor) Test(org.junit.Test)
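
As a side note, on JUnit 4.13 or newer the try/fail/catch pattern above could be written with assertThrows. A minimal sketch, assuming that JUnit version is available:

// Equivalent assertion with JUnit 4.13+: handleEvents on an input with
// 0 physical inputs is expected to throw IllegalStateException.
Assert.assertThrows(IllegalStateException.class, () -> mrInput.handleEvents(new LinkedList<>()));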

Example 8 with DataSourceDescriptor

Use of org.apache.tez.dag.api.DataSourceDescriptor in project tez by apache.

From the class MRInputHelpers, the method configureMRInputWithLegacySplitGeneration:

/**
 * Setup split generation on the client, with splits being distributed via the traditional
 * MapReduce mechanism of distributing splits via the Distributed Cache.
 * <p/>
 * Usage of this technique for handling splits is not advised. Instead, splits should be either
 * generated in the AM, or generated in the client and distributed via the AM. See {@link
 * org.apache.tez.mapreduce.input.MRInput.MRInputConfigBuilder}
 * <p/>
 * Note: Attempting to use this method to add multiple Inputs to a Vertex is not supported.
 *
 * This mechanism of propagating splits may be removed in a subsequent release, and is not recommended.
 *
 * @param conf           configuration to be used by {@link org.apache.tez.mapreduce.input.MRInput}.
 *                       This is expected to be fully configured.
 * @param splitsDir      the path to which splits will be generated.
 * @param useLegacyInput whether to use {@link org.apache.tez.mapreduce.input.MRInputLegacy} or
 *                       {@link org.apache.tez.mapreduce.input.MRInput}
 * @return an instance of {@link org.apache.tez.dag.api.DataSourceDescriptor} which can be added
 * as a data source to a {@link org.apache.tez.dag.api.Vertex}
 */
@InterfaceStability.Unstable
@InterfaceAudience.LimitedPrivate({ "hive", "pig" })
public static DataSourceDescriptor configureMRInputWithLegacySplitGeneration(Configuration conf, Path splitsDir, boolean useLegacyInput) {
    InputSplitInfo inputSplitInfo = null;
    try {
        inputSplitInfo = generateInputSplits(conf, splitsDir);
        InputDescriptor inputDescriptor = InputDescriptor.create(useLegacyInput ? MRInputLegacy.class.getName() : MRInput.class.getName()).setUserPayload(createMRInputPayload(conf, null, false, true));
        Map<String, LocalResource> additionalLocalResources = new HashMap<String, LocalResource>();
        updateLocalResourcesForInputSplits(conf, inputSplitInfo, additionalLocalResources);
        DataSourceDescriptor dsd = DataSourceDescriptor.create(inputDescriptor, null, inputSplitInfo.getNumTasks(), inputSplitInfo.getCredentials(), VertexLocationHint.create(inputSplitInfo.getTaskLocationHints()), additionalLocalResources);
        return dsd;
    } catch (IOException | InterruptedException | ClassNotFoundException e) {
        throw new TezUncheckedException("Failed to generate InputSplits", e);
    }
}
Also used: InputDescriptor(org.apache.tez.dag.api.InputDescriptor) MRInput(org.apache.tez.mapreduce.input.MRInput) TezUncheckedException(org.apache.tez.dag.api.TezUncheckedException) HashMap(java.util.HashMap) ByteString(com.google.protobuf.ByteString) IOException(java.io.IOException) LocalResource(org.apache.hadoop.yarn.api.records.LocalResource) MRInputLegacy(org.apache.tez.mapreduce.input.MRInputLegacy) DataSourceDescriptor(org.apache.tez.dag.api.DataSourceDescriptor) Unstable(org.apache.hadoop.classification.InterfaceStability.Unstable)
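
The javadoc above steers callers toward AM-side split generation instead. A minimal sketch of that recommended path, using the MRInputConfigBuilder API that appears in Examples 9 and 10 (conf, inputPath, and vertex are placeholders, not from the original):

// Recommended alternative: generate splits in the Tez AM rather than on the
// client, so no split files or extra local resources need to be distributed.
DataSourceDescriptor dataSource = MRInput.createConfigBuilder(new Configuration(conf), TextInputFormat.class, inputPath).generateSplitsInAM(true).build();
vertex.addDataSource("MRInput", dataSource);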

Example 9 with DataSourceDescriptor

Use of org.apache.tez.dag.api.DataSourceDescriptor in project tez by apache.

From the class OrderedWordCount, the method createDAG:

public static DAG createDAG(TezConfiguration tezConf, String inputPath, String outputPath, int numPartitions, boolean disableSplitGrouping, boolean isGenerateSplitInClient, String dagName) throws IOException {
    DataSourceDescriptor dataSource = MRInput.createConfigBuilder(new Configuration(tezConf), TextInputFormat.class, inputPath).groupSplits(!disableSplitGrouping).generateSplitsInAM(!isGenerateSplitInClient).build();
    DataSinkDescriptor dataSink = MROutput.createConfigBuilder(new Configuration(tezConf), TextOutputFormat.class, outputPath).build();
    Vertex tokenizerVertex = Vertex.create(TOKENIZER, ProcessorDescriptor.create(TokenProcessor.class.getName()));
    tokenizerVertex.addDataSource(INPUT, dataSource);
    // Use a Text key and an IntWritable value to bring the counts for each word into the same partition.
    // The setFromConfiguration call is optional and allows overriding the config options with
    // command line parameters.
    OrderedPartitionedKVEdgeConfig summationEdgeConf = OrderedPartitionedKVEdgeConfig.newBuilder(Text.class.getName(), IntWritable.class.getName(), HashPartitioner.class.getName()).setFromConfiguration(tezConf).build();
    // This vertex will be reading intermediate data via an input edge and writing intermediate data
    // via an output edge.
    Vertex summationVertex = Vertex.create(SUMMATION, ProcessorDescriptor.create(SumProcessor.class.getName()), numPartitions);
    // Use IntWritable key and Text value to bring all words with the same count in the same
    // partition. The data will be ordered by count and words grouped by count. The
    // setFromConfiguration call is optional and allows overriding the config options with
    // command line parameters.
    OrderedPartitionedKVEdgeConfig sorterEdgeConf = OrderedPartitionedKVEdgeConfig.newBuilder(IntWritable.class.getName(), Text.class.getName(), HashPartitioner.class.getName()).setFromConfiguration(tezConf).build();
    // Use 1 task to bring all the data to one place for globally sorted order. Essentially the number
    // of partitions is 1, so the NoOpSorter can be used to produce the globally ordered output.
    Vertex sorterVertex = Vertex.create(SORTER, ProcessorDescriptor.create(NoOpSorter.class.getName()), 1);
    sorterVertex.addDataSink(OUTPUT, dataSink);
    // No need to add the jar containing this class, as it is assumed to be part of the Tez jars.
    DAG dag = DAG.create(dagName);
    dag.addVertex(tokenizerVertex).addVertex(summationVertex).addVertex(sorterVertex).addEdge(Edge.create(tokenizerVertex, summationVertex, summationEdgeConf.createDefaultEdgeProperty())).addEdge(Edge.create(summationVertex, sorterVertex, sorterEdgeConf.createDefaultEdgeProperty()));
    return dag;
}
Also used: OrderedPartitionedKVEdgeConfig(org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig) Vertex(org.apache.tez.dag.api.Vertex) Configuration(org.apache.hadoop.conf.Configuration) TezConfiguration(org.apache.tez.dag.api.TezConfiguration) TextInputFormat(org.apache.hadoop.mapreduce.lib.input.TextInputFormat) TextOutputFormat(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat) HashPartitioner(org.apache.tez.runtime.library.partitioner.HashPartitioner) Text(org.apache.hadoop.io.Text) DAG(org.apache.tez.dag.api.DAG) DataSinkDescriptor(org.apache.tez.dag.api.DataSinkDescriptor) IntWritable(org.apache.hadoop.io.IntWritable) DataSourceDescriptor(org.apache.tez.dag.api.DataSourceDescriptor)
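
For context, a minimal sketch of how the DAG returned by createDAG would typically be submitted. TezClient is listed in the aggregations below; tezConf, inputPath, and outputPath are placeholders, and the usual client imports (org.apache.tez.dag.api.client.DAGClient, DAGStatus) are assumed:

// Build the DAG, submit it with a TezClient, and wait for completion.
TezClient tezClient = TezClient.create("OrderedWordCount", tezConf);
tezClient.start();
try {
    DAG dag = createDAG(tezConf, inputPath, outputPath, 2, false, false, "OrderedWordCount");
    DAGClient dagClient = tezClient.submitDAG(dag);
    DAGStatus status = dagClient.waitForCompletion();
    System.out.println("DAG completed with state: " + status.getState());
} finally {
    tezClient.stop();
}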

Example 10 with DataSourceDescriptor

Use of org.apache.tez.dag.api.DataSourceDescriptor in project tez by apache.

From the class WordCount, the method createDAG:

private DAG createDAG(TezConfiguration tezConf, String inputPath, String outputPath, int numPartitions) throws IOException {
    // Create the descriptor that describes the input data to Tez. Using MRInput to read text
    // data from the given input path. The TextInputFormat is used to read the text data.
    DataSourceDescriptor dataSource = MRInput.createConfigBuilder(new Configuration(tezConf), TextInputFormat.class, inputPath).groupSplits(!isDisableSplitGrouping()).generateSplitsInAM(!isGenerateSplitInClient()).build();
    // Create a descriptor that describes the output data to Tez. Using MROutput to write text
    // data to the given output path. The TextOutputFormat is used to write the text data.
    DataSinkDescriptor dataSink = MROutput.createConfigBuilder(new Configuration(tezConf), TextOutputFormat.class, outputPath).build();
    // Create a vertex that reads the data from the data source and tokenizes it using the
    // TokenProcessor. The number of tasks that will do the work for this vertex will be decided
    // using the information provided by the data source descriptor.
    Vertex tokenizerVertex = Vertex.create(TOKENIZER, ProcessorDescriptor.create(TokenProcessor.class.getName())).addDataSource(INPUT, dataSource);
    // Create the edge that represents the movement and semantics of data between the producer
    // Tokenizer vertex and the consumer Summation vertex. In order to perform the summation in
    // parallel, the tokenized data will be partitioned by word such that a given word goes to the
    // same partition. The counts for the words should be grouped together per word. To achieve this
    // we can use an edge that contains an input/output pair that handles the partitioning and grouping
    // of key-value data. We use the helper OrderedPartitionedKVEdgeConfig to create such an
    // edge. Internally, it sets up matching Tez inputs and outputs that can perform this logic.
    // We specify the key, value and partitioner types. Here the key type is Text (for the word), the
    // value type is IntWritable (for the count), and we use a hash-based partitioner. This is a helper
    // object; the edge can also be configured by setting up the input, output, etc. individually without
    // using this helper. The setFromConfiguration call is optional and allows overriding the config
    // options with command line parameters.
    OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig.newBuilder(Text.class.getName(), IntWritable.class.getName(), HashPartitioner.class.getName()).setFromConfiguration(tezConf).build();
    // Create a vertex that reads the tokenized data and calculates the sum using the SumProcessor.
    // The number of tasks that do the work of this vertex depends on the number of partitions used
    // to distribute the sum processing. In this case, it's been made configurable via the
    // numPartitions parameter.
    Vertex summationVertex = Vertex.create(SUMMATION, ProcessorDescriptor.create(SumProcessor.class.getName()), numPartitions).addDataSink(OUTPUT, dataSink);
    // No need to add the jar containing this class, as it is assumed to be part of the Tez jars. Otherwise
    // we would have to add the jars for this code as local files to the vertices.
    // Create the DAG and add the vertices. Connect the producer and consumer vertices via the edge.
    DAG dag = DAG.create("WordCount");
    dag.addVertex(tokenizerVertex).addVertex(summationVertex).addEdge(Edge.create(tokenizerVertex, summationVertex, edgeConf.createDefaultEdgeProperty()));
    return dag;
}
Also used: OrderedPartitionedKVEdgeConfig(org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig) Vertex(org.apache.tez.dag.api.Vertex) Configuration(org.apache.hadoop.conf.Configuration) TezConfiguration(org.apache.tez.dag.api.TezConfiguration) Text(org.apache.hadoop.io.Text) DAG(org.apache.tez.dag.api.DAG) DataSinkDescriptor(org.apache.tez.dag.api.DataSinkDescriptor) TextInputFormat(org.apache.hadoop.mapreduce.lib.input.TextInputFormat) TextOutputFormat(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat) HashPartitioner(org.apache.tez.runtime.library.partitioner.HashPartitioner) IntWritable(org.apache.hadoop.io.IntWritable) DataSourceDescriptor(org.apache.tez.dag.api.DataSourceDescriptor)
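
The comments above note that the edge could also be assembled without the OrderedPartitionedKVEdgeConfig helper. A rough sketch of that manual wiring via EdgeProperty, assuming the runtime library's OrderedPartitionedKVOutput and OrderedGroupedKVInput classes; the user payloads carrying the key, value, and partitioner settings are deliberately elided, which is exactly the bookkeeping the helper does for you:

// Manual edge wiring without the helper. The output and input descriptors
// would still need user payloads configuring key, value, and partitioner
// classes; this sketch omits that setup for brevity.
EdgeProperty edgeProperty = EdgeProperty.create(
    EdgeProperty.DataMovementType.SCATTER_GATHER,  // partitioned, shuffled data movement
    EdgeProperty.DataSourceType.PERSISTED,         // intermediate data persisted to disk
    EdgeProperty.SchedulingType.SEQUENTIAL,        // consumers start after producers
    OutputDescriptor.create(OrderedPartitionedKVOutput.class.getName()),
    InputDescriptor.create(OrderedGroupedKVInput.class.getName()));
dag.addEdge(Edge.create(tokenizerVertex, summationVertex, edgeProperty));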

Aggregations

DataSourceDescriptor (org.apache.tez.dag.api.DataSourceDescriptor): 24
Vertex (org.apache.tez.dag.api.Vertex): 14
Configuration (org.apache.hadoop.conf.Configuration): 10
Path (org.apache.hadoop.fs.Path): 10
DAG (org.apache.tez.dag.api.DAG): 10
UserPayload (org.apache.tez.dag.api.UserPayload): 10
LocalResource (org.apache.hadoop.yarn.api.records.LocalResource): 8
IOException (java.io.IOException): 7
FileSystem (org.apache.hadoop.fs.FileSystem): 7
DataSinkDescriptor (org.apache.tez.dag.api.DataSinkDescriptor): 7
TezConfiguration (org.apache.tez.dag.api.TezConfiguration): 7
Test (org.junit.Test): 7
IntWritable (org.apache.hadoop.io.IntWritable): 5
Text (org.apache.hadoop.io.Text): 5
JobConf (org.apache.hadoop.mapred.JobConf): 5
InputDescriptor (org.apache.tez.dag.api.InputDescriptor): 5
InputInitializerDescriptor (org.apache.tez.dag.api.InputInitializerDescriptor): 5
TezUncheckedException (org.apache.tez.dag.api.TezUncheckedException): 5
OrderedPartitionedKVEdgeConfig (org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig): 5
TezClient (org.apache.tez.client.TezClient): 4