groupByKey vs reduceByKey in Apache Spark.

0 votes
Which is better: groupByKey or reduceByKey?
Jul 26, 2018 in Apache Spark by shams
• 3,580 points
13,489 views

5 answers to this question.

0 votes

On applying groupByKey() to a dataset of (K, V) pairs, the data is shuffled according to the key K into another RDD. In this transformation, a lot of unnecessary data gets transferred over the network.

Spark will spill data to disk when more data is shuffled onto a single executor machine than can fit in memory.

Example:

val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)

val group = data.groupByKey().collect()

group.foreach(println)
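
Running this, the output will look something like the following (ordering across partitions may vary); note that every value for a key is now collected on a single node:

(t,CompactBuffer(8))
(p,CompactBuffer(7, 5))
(s,CompactBuffer(3, 4))
(k,CompactBuffer(5, 6))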

On applying reduceByKey to a dataset of (K, V) pairs, pairs with the same key on the same machine are combined (using the supplied reduce function) before the data is shuffled.

Example:

val words = Array("one","two","two","four","five","six","six","eight","nine","ten")

val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)

data.collect.foreach(println)
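
The output should contain one pair per distinct word (ordering may vary):

(one,1)
(two,2)
(four,1)
(five,1)
(six,2)
(eight,1)
(nine,1)
(ten,1)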

answered Jul 26, 2018 by zombie
• 3,710 points
0 votes

groupByKey:

Syntax:

sparkContext.textFile("hdfs://")
                    .flatMap(line => line.split(" "))
                    .map(word => (word, 1))
                    .groupByKey()
                    .map { case (word, counts) => (word, counts.sum) }

groupByKey can cause out-of-memory (and disk spill) problems, because every raw (word, 1) pair is sent over the network and collected on the reduce workers.

reduceByKey:

Syntax:

sparkContext.textFile("hdfs://")
                    .flatMap(line => line.split(" "))
                    .map(word => (word,1))
                    .reduceByKey((x, y) => x + y)

Data is combined at each partition, so only one output per key per partition is sent over the network. Note that reduceByKey requires combining all your values into another value of the exact same type.
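
If your result type has to differ from the value type, reduceByKey alone will not fit; aggregateByKey (or combineByKey) can be used instead. A minimal sketch, assuming an active SparkContext sc (the data here is purely illustrative):

val pairs = sc.parallelize(Seq(("fruit", "apple"), ("fruit", "pear"), ("veg", "leek")))

// zero value: an empty Set[String]; the first function adds a value into a
// partition-local set (map-side, like reduceByKey's combine), the second
// merges the per-partition sets after the shuffle
val distinctPerKey = pairs.aggregateByKey(Set.empty[String])(
  (set, v) => set + v,
  (s1, s2) => s1 ++ s2
)

distinctPerKey.collect().foreach(println)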

answered Aug 2, 2018 by nitinrawat895
• 10,760 points
0 votes

There are two different ways to compute the counts:

val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _).collect()
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect()

reduceByKey will aggregate by key before shuffling, while groupByKey will shuffle all the (key, value) pairs. On large datasets the difference is significant.
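
Both versions produce the same counts; a quick sanity check, assuming the two arrays above are still in scope:

assert(wordCountsWithReduce.toMap == wordCountsWithGroup.toMap)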

answered Aug 22, 2018 by samarth295
• 2,190 points
0 votes
reduceByKey is generally the better choice for production workloads.
answered Mar 3 by anonymous
Could you please explain why?
0 votes

The images below are self-explanatory for reduceByKey and groupByKey.

answered Apr 22 by Gunjan Kumar
Thanks @Gunjan. Could you please tell me when it is better to use ReduceByKey and GroupByKey?
