Spark - repartition vs coalesce

Question

One difference I know is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

What if the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?

Can someone help!

nitinrawat895 · Answer 1 · Oct 11, 2018

It avoids a full shuffle. If it's known that the number is decreasing then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes, onto the nodes that we kept.

So, it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12
Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)
Notice that Node 1 and Node 3 did not require its original data to move.