Spark - repartition() vs coalesce()

One difference I know is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

If the partitions are spread across multiple machines when coalesce() runs, how does it avoid data movement?

Can someone help?
Oct 11, 2018 in Apache Spark by Meci Matt

1 answer to this question.

coalesce() avoids a full shuffle rather than all data movement. Because the number of partitions is known to be decreasing, Spark can leave the data on the partitions it keeps where it is, and move only the data from the partitions being dropped onto the ones that remain.

So, it would go something like this. Start with four partitions, one per node:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not need to move their original data; only the data from Node 2 and Node 4 moved.
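
To make the difference concrete, here is a small plain-Python toy model of the example above. It does not use Spark and is not Spark's actual algorithm: the function names (coalesce_merge, full_shuffle, moved_records) are made up for illustration, and the "full shuffle" is assumed to place each record by its value modulo the number of target partitions. It only counts how many records end up on a different node in each case.

```python
from typing import Dict, List

# Toy model: each node holds exactly one partition (a list of records).
Layout = Dict[str, List[int]]


def coalesce_merge(layout: Layout, merge_into: Dict[str, str]) -> Layout:
    """Append each dropped node's whole partition to a kept node.

    merge_into maps a dropped node to the kept node that receives its data.
    Records that already live on a kept node are never touched.
    """
    result = {node: list(recs) for node, recs in layout.items()
              if node not in merge_into}
    for src, dst in merge_into.items():
        result[dst].extend(layout[src])
    return result


def full_shuffle(layout: Layout, targets: List[str]) -> Layout:
    """Redistribute every record across the target nodes.

    Assumption for this sketch: a record goes to targets[record % len(targets)].
    """
    result: Layout = {node: [] for node in targets}
    for recs in layout.values():
        for rec in recs:
            result[targets[rec % len(targets)]].append(rec)
    return result


def moved_records(before: Layout, after: Layout) -> int:
    """Count records that sit on a different node than they started on."""
    home = {rec: node for node, recs in before.items() for rec in recs}
    return sum(1 for node, recs in after.items()
               for rec in recs if home[rec] != node)


before: Layout = {
    "Node 1": [1, 2, 3],
    "Node 2": [4, 5, 6],
    "Node 3": [7, 8, 9],
    "Node 4": [10, 11, 12],
}

# Same merge as in the example: Node 4 -> Node 1, Node 2 -> Node 3.
merged = coalesce_merge(before, {"Node 4": "Node 1", "Node 2": "Node 3"})
shuffled = full_shuffle(before, ["Node 1", "Node 3"])

print(merged)                           # Node 1 and Node 3 keep 1,2,3 and 7,8,9
print(moved_records(before, merged))    # 6
print(moved_records(before, shuffled))  # 9
```

In this toy run the merge moves only the 6 records from the dropped nodes, while the shuffle-style redistribution moves 9 of the 12. In real Spark code the corresponding calls are df.coalesce(2) and df.repartition(2); which partitions actually get merged is decided by Spark, not by an explicit mapping like the one used here.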
answered Oct 11, 2018 by nitinrawat895
