How does the number of groups affect the cost of the shuffling phase?


In the Coursera course Hadoop Platform and Application Framework (week 4, lesson 2), it is stated that the cost of shuffling depends on the number of groups, i.e. the number of distinct keys. Therefore, it was suggested that keys be merged into bins to reduce the cost of the shuffling phase. I would like to understand how exactly the cost of shuffling depends on the number of groups. 

It would seem that the amount of data that needs to be transferred to the reducers is the same with or without bins. I realize that the number of transfers is reduced by using bins, so some components of the network-transfer cost are paid fewer times. But isn't Hadoop smart enough to transfer all the keys destined for the same reducer at once?
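For concreteness, here is a minimal sketch of what "binning" could look like on the map side, assuming a word-count-style job (the class name and the bin count NUM_BINS are illustrative, not taken from the course):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative binning mapper (hypothetical, not from the course):
// instead of emitting each word as its own key, hash every word into
// one of NUM_BINS coarser keys, so the shuffle sees at most NUM_BINS
// distinct groups instead of one group per distinct word.
public class BinningMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final int NUM_BINS = 100;  // assumed bin count, tune per job
    private final IntWritable bin = new IntWritable();
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Mask the sign bit rather than calling Math.abs (which
            // overflows for Integer.MIN_VALUE), then fold into a bin.
            bin.set((token.hashCode() & Integer.MAX_VALUE) % NUM_BINS);
            word.set(token);
            context.write(bin, word);  // the bin id is now the shuffle key
        }
    }
}
```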

Jan 22 in Big Data Hadoop by anonymous


1 answer to this question.


Shuffle happens on key-value pairs, and every distinct key forms its own group that has to be partitioned, sorted, merged, and handed to a separate reduce() call. So when you merge keys into bins, the number of groups the framework has to manage decreases, and with it the cost. Merging a small number of keys might not produce a noticeable change, but merging a large number of keys into a few bins can reduce the cost drastically.
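On the reduce side, a companion sketch (same assumptions as the mapper above: hypothetical class, word-count-style job) shows where the saving comes from: reduce() now runs once per bin rather than once per distinct word, and the per-word counting happens in a local in-memory map:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative companion to the binning mapper: the framework invokes
// reduce() at most NUM_BINS times (once per bin) instead of once per
// distinct word; the words inside a bin are separated locally.
public class BinningReducer extends Reducer<IntWritable, Text, Text, IntWritable> {

    @Override
    protected void reduce(IntWritable bin, Iterable<Text> words, Context context)
            throws IOException, InterruptedException {
        // Count each word inside this bin in local memory.
        Map<String, Integer> counts = new HashMap<>();
        for (Text w : words) {
            counts.merge(w.toString(), 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}
```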

To your second point: Hadoop does transfer in bulk at the partition level. Each map task writes one partition per reducer, and a reducer fetches its whole partition from each map in one go, so the raw bytes moved are roughly the same with or without bins. What scales with the number of groups is the work around that transfer: sorting and merging more distinct keys, and invoking reduce() once per group. Binning cuts that per-group overhead, which is where the saving comes from.
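For reference, the default partitioner really does route every key with the same hash-modulo value to the same reducer; this sketch mirrors the logic of Hadoop's built-in org.apache.hadoop.mapreduce.lib.partition.HashPartitioner:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the logic of Hadoop's default HashPartitioner: every key
// whose hash maps to the same partition lands in the same map-side
// output segment for that reducer.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

All the keys in one partition travel together when the reducer fetches that map's output, which is why binning saves per-group processing rather than raw transfer volume.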

answered Jan 22 by Omkar
