groupByKey vs reduceByKey in Apache Spark

+1 vote
Which is better, groupByKey or reduceByKey?
Jul 27, 2018 in Apache Spark by shams
• 3,670 points
76,910 views

6 answers to this question.

0 votes

When groupByKey() is applied to a dataset of (K, V) pairs, the data is shuffled by the key K into a new RDD. This transformation moves a lot of unnecessary data across the network, because every pair is transferred as-is.

Spark can spill data to disk when more data is shuffled onto a single executor machine than fits in memory.

Example:

val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)

val group = data.groupByKey().collect()

group.foreach(println)

When reduceByKey is applied to a dataset of (K, V) pairs, pairs on the same machine with the same key are combined before the data is shuffled.

Example:

val words = Array("one","two","two","four","five","six","six","eight","nine","ten")

val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)

data.collect.foreach(println)

answered Jul 27, 2018 by zombie
• 3,790 points
+1 vote

groupByKey:

Syntax:

sparkContext.textFile("hdfs://")
                    .flatMap(line => line.split(" ") )
                    .map(word => (word,1))
                    .groupByKey()
                    .map { case (word, ones) => (word, ones.sum) }  // pattern-match the (key, values) tuple and sum the values

groupByKey can cause out-of-disk problems, because all the data is sent over the network and collected on the reduce workers.

reduceByKey:

Syntax:

sparkContext.textFile("hdfs://")
                    .flatMap(line => line.split(" "))
                    .map(word => (word,1))
                    .reduceByKey((x,y)=> (x+y))

Data is combined within each partition, so only one output per key per partition is sent over the network. reduceByKey requires combining all your values into another value of exactly the same type.
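The same-type restriction matters when the result per key is not of the value type, for example an average. One option in that case is aggregateByKey, which also combines within each partition before the shuffle. The following is only a minimal sketch, assuming a local SparkSession; the sample pairs and the names `seqOp`, `combOp`, and `AggregateByKeySketch` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object AggregateByKeySketch {
  // Fold one Int value into a partition-local (sum, count) accumulator.
  val seqOp: ((Int, Int), Int) => (Int, Int) =
    (acc, v) => (acc._1 + v, acc._2 + 1)

  // Merge two (sum, count) accumulators coming from different partitions.
  val combOp: ((Int, Int), (Int, Int)) => (Int, Int) =
    (a, b) => (a._1 + b._1, a._2 + b._2)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; a real job gets its master from spark-submit.
    val spark = SparkSession.builder()
      .appName("aggregateByKey-sketch")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample data: values are Int.
    val pairs = sc.parallelize(
      Seq(("k", 5), ("s", 3), ("s", 4), ("p", 7), ("p", 5), ("t", 8), ("k", 6)), 3)

    // The result per key is (sum, count), a different type from the Int values.
    val sumAndCount = pairs.aggregateByKey((0, 0))(seqOp, combOp)

    // Turn (sum, count) into an average per key.
    val averages = sumAndCount.mapValues { case (sum, count) => sum.toDouble / count }
    averages.collect().foreach(println)

    spark.stop()
  }
}
```

The zero value `(0, 0)` fixes the accumulator type, so the two functions can work with a (sum, count) pair even though the input values are plain Int.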

answered Aug 3, 2018 by nitinrawat895
• 11,380 points
+1 vote

There are two different ways to compute counts:

val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _).collect()
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect()

reduceByKey aggregates by key before shuffling, whereas groupByKey shuffles all the key-value pairs. On large data sets the difference is obvious.
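The difference in shuffle volume can be illustrated without a cluster. This is a rough sketch in plain Scala collections, not Spark itself: the two inner `Seq`s standing in for partitions, and the names `shuffledWithGroup`, `shuffledWithReduce`, and `merge`, are assumptions made for illustration.

```scala
object ShuffleSketch {
  // Two hypothetical partitions holding the (word, 1) pairs from the example above.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("one", 1), ("two", 1), ("two", 1)),
    Seq(("three", 1), ("three", 1), ("three", 1))
  )

  // groupByKey-style: every pair leaves its partition unchanged.
  val shuffledWithGroup: Seq[(String, Int)] = partitions.flatten

  // reduceByKey-style: pairs are summed per key inside each partition first,
  // so each partition sends at most one pair per key.
  val shuffledWithReduce: Seq[(String, Int)] =
    partitions.flatMap { part =>
      part.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }.toSeq
    }

  // Final merge on the "reduce side"; both routes end with the same counts.
  def merge(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(s"records shuffled with grouping: ${shuffledWithGroup.size}")
    println(s"records shuffled with reducing: ${shuffledWithReduce.size}")
    println(s"same result: ${merge(shuffledWithGroup) == merge(shuffledWithReduce)}")
  }
}
```

With this split, the grouping route moves 6 records and the reducing route moves 3, while the final counts are identical. The gap widens as keys repeat more often within a partition.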

answered Aug 23, 2018 by samarth295
• 2,220 points
+1 vote
ReduceByKey is the best for production.
answered Mar 3, 2019 by anonymous
Could you please explain why?
0 votes

The images below are self-explanatory for reduceByKey and groupByKey.

answered Apr 23, 2019 by Gunjan Kumar
Thanks @Gunjan. Could you please tell me when it is better to use ReduceByKey and GroupByKey?
0 votes

Hi,

groupByKey can cause out-of-disk problems, because all the data is sent over the network and collected on the reduce workers. See the example below.

sparkContext.textFile("hdfs://")
                    .flatMap(line => line.split(" ") )
                    .map(word => (word,1))
                    .groupByKey()
                    .map { case (word, ones) => (word, ones.sum) }  // pattern-match the (key, values) tuple and sum the values

With reduceByKey, by contrast, data is combined within each partition, so only one output per key per partition is sent over the network. reduceByKey requires combining all your values into another value of exactly the same type.

sparkContext.textFile("hdfs://")
                    .flatMap(line => line.split(" "))
                    .map(word => (word,1))
                    .reduceByKey((x,y)=> (x+y))
answered Dec 15, 2020 by MD
• 95,460 points
