reduceByKey or reduceByKeyLocally: which should be preferred?

Below are the definitions:

def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]


def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]

Both look almost the same.

Apr 20, 2018 in Apache Spark by Ashish

1 answer to this question.

Yes, both merge the values for each key using an associative and commutative reduce function. The difference is where the result ends up: reduceByKeyLocally returns it to the master (the driver program) as a Map.

From a project perspective: with reduceByKey, the result stays distributed across the cluster because it is represented as an RDD. reduceByKeyLocally merges all the output onto a single machine (the master) as a Map. This defeats the purpose of working with distributed data, which is necessary at large scale, so reduceByKey should generally be preferred.
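To make the difference in return types concrete, here is a minimal sketch that runs both calls on the same small pair RDD. It assumes Spark is on the classpath and uses a local master. The object name, helper names and sample data are made up for illustration, and it uses the reduceByKey overload without an explicit partitioner.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object ReduceByKeyComparison {

  // Result stays an RDD: the reduced pairs remain partitioned across the cluster.
  def sumDistributed(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _)

  // Result is a plain Map held in the driver's memory.
  def sumLocally(pairs: RDD[(String, Int)]): scala.collection.Map[String, Int] =
    pairs.reduceByKeyLocally(_ + _)

  def main(args: Array[String]): Unit = {
    // Local master with two threads, an assumption so the sketch runs on one machine.
    val conf = new SparkConf().setAppName("reduce-by-key-comparison").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)))

      val distributed = sumDistributed(pairs) // still distributed, can feed further transformations
      val local = sumLocally(pairs)           // already materialized on the driver

      // collect() is used here only to print the small sample result.
      println(distributed.collect().sorted.mkString(", ")) // (a,4), (b,6), (c,5)
      println(local.toSeq.sorted.mkString(", "))           // (a,4), (b,6), (c,5)
    } finally {
      sc.stop()
    }
  }
}
```

Both calls compute the same sums. With many distinct keys, the Map returned by reduceByKeyLocally has to fit in the driver's memory, while the RDD returned by reduceByKey can be transformed further or written out without being gathered on one machine.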
answered Apr 20, 2018 by kurt_cobain
