reduceByKey or reduceByKeyLocally: which should be preferred?

Below are the definitions:

def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]

def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]

The two look almost identical. Which one should I use?

Apr 20, 2018 in Apache Spark by Ashish
• 2,650 points • 3,023 views

1 answer to this question.

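Yes, both merge the values for each key using an associative (and commutative) reduce function. The difference is in what they return: reduceByKey is a transformation that produces another RDD, while reduceByKeyLocally is an action that immediately returns the result to the driver program (the "master" in the Spark docs) as a Map.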

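From a project perspective: with reduceByKey, the result is an RDD, so the data stays distributed across the cluster and can be processed further in parallel. reduceByKeyLocally merges all of the output into a single Map on one machine (the driver). That defeats the purpose of distributed data, which is what you need when working at large scale. Prefer reduceByKey in general, and use reduceByKeyLocally only when the number of distinct keys is small enough for the whole result to fit comfortably in driver memory.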
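
To make the difference concrete, here is a minimal sketch, assuming a local-mode SparkContext and a tiny made-up dataset. The object name, the `sales` data, and the `local[2]` master are illustrative assumptions, not something from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical demo object; names and data are for illustration only.
object ReduceByKeyComparison {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two threads instead of on a real cluster.
    val conf = new SparkConf()
      .setAppName("ReduceByKeyComparison")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    try {
      val sales: RDD[(String, Int)] = sc.parallelize(
        Seq(("apples", 3), ("pears", 2), ("apples", 4), ("pears", 5))
      )

      // reduceByKey: the result is still an RDD, so it stays distributed
      // and can feed further transformations without leaving the cluster.
      val totals: RDD[(String, Int)] = sales.reduceByKey(_ + _)
      val bigSellers: RDD[(String, Int)] = totals.filter { case (_, n) => n > 6 }
      bigSellers.collect().foreach(println)

      // reduceByKeyLocally: the whole result is pulled into the driver
      // as a scala.collection.Map. Only safe when there are few keys.
      val localTotals: scala.collection.Map[String, Int] =
        sales.reduceByKeyLocally(_ + _)
      println(localTotals("apples")) // 7

      // Same driver-side result, built from the distributed version.
      val viaCollect: scala.collection.Map[String, Int] =
        sales.reduceByKey(_ + _).collectAsMap()
      println(viaCollect == localTotals) // true
    } finally {
      sc.stop()
    }
  }
}
```

If you do need the result on the driver, `reduceByKey(...).collectAsMap()` gives the same Map as `reduceByKeyLocally(...)`, so starting with reduceByKey keeps both options open.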

answered Apr 20, 2018 by kurt_cobain
• 9,350 points

