Fold in spark:
Fold is a very powerful operation in spark which allows you to calculate many important values in O(n) time. If you are familiar with Scala collection it will be like using fold operation on a collection. Even if you not used fold in Scala, this post will make you comfortable in using fold.
Syntax
def fold[T](acc:T)((acc,value) => acc)
The above is kind of high-level view of fold API. It has the following three things:
- T is the data type of RDD
- acc is accumulator of type T which will be return value of the fold operation
- A function , which will be called for each element in rdd with previous accumulator
Examples of fold:
Finding max in a given RDD
Let’s first build a RDD
val sparkContext = new SparkContext("local", "functional")
val employeeData = List(("Jack",1000.0),("Bob",2000.0),("Carl",7000.0))
val employeeRDD = sparkContext.makeRDD(employeeData)
Now we want to find an employee, with maximum salary. We can do that using fold.
To use fold we need a start value. The following code defines a dummy employee as starting accumulator.
val dummyEmployee = ("dummy",0.0);
Now using fold, we can find the employee with maximum salary.
val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc,employee) => {
if(acc._2 < employee._2) employee else acc})
println("employee with maximum salary is"+maxSalaryEmployee)