How does the number of groups affect the cost of the shuffling phase?

0 votes

In the Coursera course Hadoop Platform and Application Framework (week 4, lesson 2), it is stated that the cost of shuffling depends on the number of groups, i.e. the number of distinct keys. Therefore, it was suggested that keys be merged into bins to reduce the cost of the shuffling phase. I would like to understand how exactly the cost of shuffling depends on the number of groups. 

It would seem that the amount of data that needs to be transferred to the reducers is the same with or without bins. I realize that the number of transfers is reduced by using bins, so that some components of the cost of the network transfers are paid fewer times. But isn't Hadoop smart enough to transfer at once all the keys destined for the same reducer?

Jan 22 in Big Data Hadoop by anonymous

edited Jan 22 24 views

1 answer to this question.

0 votes

The shuffle operates on key-value pairs. So, when you merge the keys, the number of shuffle operations required also decreases, which lowers the cost. If you merge only a small number of keys, the change may not be noticeable, but when you merge a large number of keys, the cost drops drastically.

The amount of work is reduced when the keys are merged. During the shuffle, Hadoop works at the granularity of individual key-value pairs: each pair is serialized, sorted, and merged on its own, so the per-pair work is not avoided just because several keys are destined for the same reducer.
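
One way to make this concrete, as a minimal sketch rather than anything taken from the course or from Hadoop's API: assume each map task applies a combiner (or in-mapper aggregation) before the shuffle, so it emits at most one pair per distinct key it sees. Under that assumption, the number of pairs entering the shuffle is bounded by the number of groups per map task, and binning shrinks it even though the input is unchanged. The function names, the four-way split, and the bin width of 100 are all hypothetical.

```python
from collections import Counter


def map_output_after_combine(records, key_fn):
    """Simulate one map task: emit (key, 1) per record, then combine by key.

    Returns the combined (key, count) pairs, i.e. what this map task
    would hand to the shuffle if a combiner ran (an assumption here).
    """
    combined = Counter(key_fn(r) for r in records)
    return sorted(combined.items())


def shuffled_pairs(splits, key_fn):
    """Total number of key-value pairs leaving all simulated map tasks."""
    return sum(len(map_output_after_combine(s, key_fn)) for s in splits)


def identity_key(x):
    # One group per distinct value: no binning.
    return x


def bin_of_100(x):
    # Merge keys into bins of width 100 (hypothetical bin size).
    return x // 100


# Hypothetical input: 4 map tasks, each seeing the integers 0..999.
splits = [list(range(1000)) for _ in range(4)]

print(shuffled_pairs(splits, identity_key))  # 4000 pairs without bins
print(shuffled_pairs(splits, bin_of_100))    # 40 pairs with bins
```

The input records are identical in both runs; only the key function changes. Without map-side combining, the pair count would be the same in both cases, so this sketch says nothing about that scenario.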

answered Jan 22 by Omkar
• 65,850 points

