How to set up Map and Reduce Tasks

Question

I have set up a Map tasks as 20 for a particular job to be executed and also I have set the Reduce task to 0. The issue is, my Mapper count and Reducer counts are irregular, the Mappers are more than 20 and Reducers are also more than. the MapReduce task got executed successfully but the execution time is not being displayed. I don't know where the issue is lying. Please help me out on this one.

This is my code:

hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0

Output:

11/07/30 19:48:56 INFO mapred.JobClient: Job complete: job_201107291018_0164
11/07/30 19:48:56 INFO mapred.JobClient: Counters: 18
11/07/30 19:48:56 INFO mapred.JobClient:   Job Counters 
11/07/30 19:48:56 INFO mapred.JobClient:     Launched reduce tasks=13
11/07/30 19:48:56 INFO mapred.JobClient:     Rack-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient:     Launched map tasks=24
11/07/30 19:48:56 INFO mapred.JobClient:     Data-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient:   FileSystemCounters
11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_READ=4020792636
11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_READ=1556534680
11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6026699058
11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1928893942
11/07/30 19:48:56 INFO mapred.JobClient:   Map-Reduce Framework
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input groups=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Combine output records=0
11/07/30 19:48:56 INFO mapred.JobClient:     Map input records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce shuffle bytes=1974162269
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Spilled Records=120000000
11/07/30 19:48:56 INFO mapred.JobClient:     Map output bytes=1928893942
11/07/30 19:48:56 INFO mapred.JobClient:     Combine input records=0
11/07/30 19:48:56 INFO mapred.JobClient:     Map output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input records=40000000
[hcrc1425n30]s0907855:

ravikiran · Answer 1 · Jul 12, 2019

when using the basic FileInputFormat classes is just the number of input splits that constitute the data. The number of reducers is controlled by MapRed.reduce.tasksspecified in the way you have it: -D MapRed.reduce.tasks=10 would specify 10 reducers. Note that space after -D is required; if you omit the space, the configuration property is passed along to the relevant JVM, not to Hadoop

If you are specifying Reducers to 0, it means you might not have the requirement for Reducers in your task. In that case, if you're having trouble with the run-time parameter, you can also set the value directly in code. Given a JobConf instance job, call

job.setNumReduceTasks(0);

inside, say, your implementation of Tool.run. That should produce output directly from the mappers. If your job actually produces no output whatsoever (because you're using the framework just for side-effects like network calls or image processing, or if the results are entirely accounted for in Counter values), you can disable output by also calling

job.setOutputFormat(NullOutputFormat.class);

Hope this helps.

answered Jul 12, 2019 by ravikiran
• 4,620 points

score 0 · Answer 2 · Aug 5, 2019

Hi,

The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a mapreduce job the number of map tasks is equal to the number of input splits. mapred.map.tasks is just a hint to the InputFormat for the number of maps.

In your example Hadoop has determined there are 24 input splits and will spawn 24 map tasks in total. But, you can control how many map tasks can be executed in parallel by each of the task tracker.

Also, removing a space after -D might solve the problem for reduce.