How does the number of groups affect the cost of the shuffling phase?


In the Coursera course Hadoop Platform and Application Framework (week 4, lesson 2), it is stated that the cost of shuffling depends on the number of groups, i.e. the number of distinct keys. Therefore, it was suggested that keys be merged into bins to reduce the cost of the shuffling phase. I would like to understand how exactly the cost of shuffling depends on the number of groups. 

It would seem that the amount of data that needs to be transferred to the reducers is the same with or without bins. I realize that the number of transfers is reduced by using bins, so some components of the network-transfer cost are paid fewer times. But isn't Hadoop smart enough to transfer all the keys destined for the same reducer at once?
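For concreteness, here is a minimal sketch of what "binning" could look like on the map side, assuming a word-count-style job (the class name and the bin count NUM_BINS are illustrative, not taken from the course):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative binning mapper (hypothetical, not from the course):
// instead of emitting each word as its own key, hash every word into
// one of NUM_BINS coarser keys, so the shuffle sees at most NUM_BINS
// distinct groups instead of one group per distinct word.
public class BinningMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final int NUM_BINS = 100;  // assumed bin count, tune per job
    private final IntWritable bin = new IntWritable();
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Mask the sign bit rather than calling Math.abs (which
            // overflows for Integer.MIN_VALUE), then fold into a bin.
            bin.set((token.hashCode() & Integer.MAX_VALUE) % NUM_BINS);
            word.set(token);
            context.write(bin, word);  // the bin id is now the shuffle key
        }
    }
}
```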

Jan 22 in Big Data Hadoop by anonymous


1 answer to this question.


Shuffle happens on key-value pairs, and every distinct key forms its own group that has to be partitioned, sorted, merged, and handed to a separate reduce() call. So when you merge keys into bins, the number of groups the framework has to manage decreases, and with it the cost. Merging a small number of keys might not produce a noticeable change, but merging a large number of keys into a few bins can reduce the cost drastically.
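On the reduce side, a companion sketch (same assumptions as the mapper above: hypothetical class, word-count-style job) shows where the saving comes from: reduce() now runs once per bin rather than once per distinct word, and the per-word counting happens in a local in-memory map:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative companion to the binning mapper: the framework invokes
// reduce() at most NUM_BINS times (once per bin) instead of once per
// distinct word; the words inside a bin are separated locally.
public class BinningReducer extends Reducer<IntWritable, Text, Text, IntWritable> {

    @Override
    protected void reduce(IntWritable bin, Iterable<Text> words, Context context)
            throws IOException, InterruptedException {
        // Count each word inside this bin in local memory.
        Map<String, Integer> counts = new HashMap<>();
        for (Text w : words) {
            counts.merge(w.toString(), 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}
```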

To your second point: Hadoop does transfer in bulk at the partition level. Each map task writes one partition per reducer, and a reducer fetches its whole partition from each map in one go, so the raw bytes moved are roughly the same with or without bins. What scales with the number of groups is the work around that transfer: sorting and merging more distinct keys, and invoking reduce() once per group. Binning cuts that per-group overhead, which is where the saving comes from.
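For reference, the default partitioner really does route every key with the same hash-modulo value to the same reducer; this sketch mirrors the logic of Hadoop's built-in org.apache.hadoop.mapreduce.lib.partition.HashPartitioner:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the logic of Hadoop's default HashPartitioner: every key
// whose hash maps to the same partition lands in the same map-side
// output segment for that reducer.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

All the keys in one partition travel together when the reducer fetches that map's output, which is why binning saves per-group processing rather than raw transfer volume.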

answered Jan 22 by Omkar
