Can anyone explain fold() operation in Spark?

0 votes
Jul 26, 2018 in Apache Spark by shams
• 3,580 points
3,662 views

3 answers to this question.

0 votes
  • fold() is an action. It is wide operation (i.e. shuffle data across multiple partitions and output a single value)
  • It takes function as an input which has two parameters of the same type and outputs a single value of the input type.
  • It is similar to reduce but has one more argument 'ZERO VALUE' (say initial value) which will be used in the initial call on each partition.

def fold(zeroValue: T)(op: (T, T) ⇒ T): T

Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.

zeroValue: The initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
Op: an operator used to both accumulate results within a partition and combine results from different partitions

Example :
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
rdd1.fold(5)(_+_)

Output :
Int = 35

answered Jul 26, 2018 by zombie
• 3,690 points
0 votes

Fold in spark:

Fold is a very powerful operation in spark which allows you to calculate many important values in O(n) time. If you are familiar with Scala collection it will be like using fold operation on a collection. Even if you not used fold in Scala, this post will make you comfortable in using fold.

Syntax
def fold[T](acc:T)((acc,value) => acc)
The above is kind of high-level view of fold API. It has the following three things:

  • T is the data type of RDD
  • acc is accumulator of type T which will be return value of the fold operation
  • A function , which will be called for each element in rdd with previous accumulator

Examples of fold:

Finding max in a given RDD


Let’s first build a RDD

val sparkContext = new SparkContext("local", "functional")
val employeeData = List(("Jack",1000.0),("Bob",2000.0),("Carl",7000.0))
val employeeRDD = sparkContext.makeRDD(employeeData)


Now we want to find an employee, with maximum salary. We can do that using fold.

To use fold we need a start value. The following code defines a dummy employee as starting accumulator.

val dummyEmployee = ("dummy",0.0);


Now using fold, we can find the employee with maximum salary.

val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc,employee) => { 
if(acc._2 < employee._2) employee else acc})
println("employee with maximum salary is"+maxSalaryEmployee)

answered Aug 2, 2018 by nitinrawat895
• 10,710 points
0 votes

Fold in spark

  • Fold is a very powerful operation in spark which allows you to calculate many important values in O(n) time. 
  • If you are familiar with Scala collection it will be like using fold operation on collection. 
  • Even if you not used fold in Scala, this post will make you comfortable in using fold.

Syntax

def fold[T](acc:T)((acc,value) => acc)
The above is kind of high level view of fold api. It has following three things

  • T is the data type of RDD
  • acc is accumulator of type T which will be return value of the fold operation
  • A function , which will be called for each element in rdd with previous accumulator.

Example:

val sparkContext = new SparkContext("local", "functional")
val employeeData = List(("Jack",1000.0),("Bob",2000.0),("Carl",7000.0))
val employeeRDD = sparkContext.makeRDD(employeeData)
val dummyEmployee = ("dummy",0.0)
val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc,employee) => { 
if(acc._2 < employee._2) employee else acc})
println("employee with maximum salary is"+maxSalaryEmployee)​

answered Aug 22, 2018 by samarth295
• 2,190 points

Related Questions In Apache Spark

0 votes
1 answer

Can anyone explain what is RDD in Spark?

RDD is a fundamental data structure of ...READ MORE

answered May 24, 2018 in Apache Spark by Shubham
• 13,300 points
626 views
0 votes
1 answer

Can anyone explain the sparse vector in Spark?

Hey, A sparse vector is used for storing ...READ MORE

answered Aug 2 in Apache Spark by Gitika
• 25,340 points
115 views
0 votes
2 answers

In a Spark DataFrame how can I flatten the struct?

// Collect data from input avro file ...READ MORE

answered Jul 4 in Apache Spark by Dhara dhruve
1,180 views
0 votes
1 answer

How can I write a text file in HDFS not from an RDD, in Spark program?

Yes, you can go ahead and write ...READ MORE

answered May 29, 2018 in Apache Spark by Shubham
• 13,300 points
1,533 views
0 votes
1 answer

where can i get spark-terasort.jar and not .scala file, to do spark terasort in windows.

Hi! I found 2 links on github where ...READ MORE

answered Feb 13 in Apache Spark by Omkar
• 67,660 points
132 views
0 votes
1 answer

Can I set different protocol for SSL in Spark?

There is no protocol set by default. ...READ MORE

answered Mar 15 in Apache Spark by Karan
60 views
0 votes
1 answer

Spark: How can i create temp views in user defined database instead of default database?

You can try the below code: df.registerTempTable(“airports”) sqlContext.sql(" create ...READ MORE

answered Jul 14 in Apache Spark by Ishan
98 views
0 votes
1 answer

Explain the for loop for printing the Map values in Scala in Apache Spark?

Hey, You can see this following code to ...READ MORE

answered Jul 22 in Apache Spark by Gitika
• 25,340 points
51 views
0 votes
1 answer

How is RDD in Spark different from Distributed Storage Management? Can anyone help me with this ?

Some of the key differences between an RDD and ...READ MORE

answered Jul 26, 2018 in Apache Spark by zombie
• 3,690 points
176 views
0 votes
3 answers

Lineage Graph in Spark

Whenever a series of transformations are performed ...READ MORE

answered Aug 27, 2018 in Apache Spark by shams
• 3,580 points
2,095 views