How does the number of groups affect the cost of the shuffling phase?

0 votes
In the Coursera course Hadoop Platform and Application Framework (week 4, lesson 2), it is stated that the cost of shuffling depends on the number of groups, i.e. the number of distinct keys. Therefore, it was suggested that keys be merged into bins to reduce the cost of the shuffling phase. I would like to understand how exactly the cost of shuffling depends on the number of groups.

It would seem that the amount of data that needs to be transferred to the reducers is the same with or without bins. I realize that the number of transfers is reduced by using bins, so that some components of the cost of the network transfers are paid fewer times. But isn't Hadoop smart enough to transfer all the keys destined for the same reducer at once?
Jan 22, 2019 in Big Data Hadoop by anonymous

edited Jan 22, 2019 754 views

1 answer to this question.

0 votes

The shuffle operates on key-value pairs. When you merge keys into bins, fewer distinct pairs have to be shuffled (particularly when the map side aggregates values per key, for example with a combiner), and the cost falls accordingly. Merging only a few keys may make no noticeable difference, but merging a large number of keys can reduce the cost substantially.
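To make the effect concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes a counting job, a hypothetical input of 1000 distinct values, bins of width 100, and a map-side combiner that sums counts per key; all of these are illustrative choices, not details from the course. It counts how many records one map task would hand to the shuffle.

```python
from collections import Counter

def map_output(values, key_fn):
    """Emit one (key, 1) pair per input value, like a counting mapper."""
    return [(key_fn(v), 1) for v in values]

def combine(pairs):
    """Map-side combiner: sum the counts per key before the shuffle."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

# Hypothetical input: 1000 distinct values seen by a single map task.
values = list(range(1000))

# Without a combiner, binning only renames the keys: the record count is unchanged.
raw_exact = map_output(values, key_fn=lambda v: v)
raw_binned = map_output(values, key_fn=lambda v: v // 100)

# With a combiner, each distinct key collapses to a single record.
combined_exact = combine(raw_exact)
combined_binned = combine(raw_binned)

print("no combiner:  ", len(raw_exact), "vs", len(raw_binned))
print("with combiner:", len(combined_exact), "vs", len(combined_binned))
```

In this sketch the saving only appears once values are aggregated per key before the shuffle. Without that step, binning leaves the number of shuffled records unchanged, which matches the observation in the question.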

[Figure in the original answer: how the number of tasks is reduced when the keys are merged.] During the shuffle, Hadoop works on individual key-value pairs: it routes each pair to a reducer by its key, but it does not merge distinct keys on its own, so any saving from having fewer keys has to come from binning them yourself.
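How keys are routed to reducers can be sketched as follows. This is an illustration in plain Python, modelled on the formula used by Hadoop's default `HashPartitioner`, `(hash & Integer.MAX_VALUE) % numReduceTasks`. The use of `zlib.crc32` as a stand-in for Java's `hashCode()`, the four reducers, and the bin width of 100 are assumptions made for the example.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer for a key: stable hash, masked to non-negative, modulo reducers."""
    h = zlib.crc32(str(key).encode("utf-8"))  # stand-in for Java's hashCode()
    return (h & 0x7FFFFFFF) % num_reducers

def keys_per_partition(keys, num_reducers):
    """Group the distinct keys by the reducer partition they are routed to."""
    parts = {r: set() for r in range(num_reducers)}
    for key in keys:
        parts[partition(key, num_reducers)].add(key)
    return parts

NUM_REDUCERS = 4                                 # hypothetical reducer count
exact_keys = range(1000)                         # 1000 distinct keys
binned_keys = {k // 100 for k in exact_keys}     # 10 bins of width 100

exact_parts = keys_per_partition(exact_keys, NUM_REDUCERS)
binned_parts = keys_per_partition(binned_keys, NUM_REDUCERS)

for r in range(NUM_REDUCERS):
    print(f"reducer {r}: {len(exact_parts[r])} exact keys, "
          f"{len(binned_parts[r])} binned keys")
```

The number of partitions is fixed by the number of reducers in both cases. Binning changes only how many distinct keys land in each partition, and therefore how many groups each reducer has to process.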

answered Jan 22, 2019 by Omkar
• 69,220 points
