Can anyone explain the fold() operation in Spark?

Question

zombie · Answer

fold() is an action. It does not shuffle data: each partition is folded locally, and the per-partition results are then sent to the driver and folded into a single value. It takes a function of two parameters of the same type that returns a value of that type. It is similar to reduce() but takes one extra argument, the zero value (an initial value), which is used as the starting accumulator on each partition and again when the partition results are combined.

Signature:

    def fold(zeroValue: T)(op: (T, T) => T): T

From the API documentation:

Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.

Parameters:

zeroValue: the initial value for the accumulated result of each partition for the op operator, and also the initial value for combining the results from different partitions for the op operator. This will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation).

op: an operator used both to accumulate results within a partition and to combine results from different partitions.

Example:

    val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 3)
    rdd1.fold(5)(_ + _)

Output:

    Int = 35

The result is 35 rather than 15 + 5 = 20 because 5 is not a neutral element for addition: the zero value is added once in each of the 3 partitions and once more when the partition results are merged, giving 15 + (3 * 5) + 5 = 35. With a zero value of 0 the result would be 15.
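To make the arithmetic behind 35 concrete, here is a minimal sketch in plain Scala that models the two stages described above without needing Spark. FoldModel, partitionedFold and the partition layout are hypothetical names and assumptions for illustration; this is not Spark's implementation, and the real split of the five elements across partitions may differ (the total does not depend on it as long as there are 3 partitions).

```scala
// Plain-Scala model of how RDD.fold applies the zero value.
// Hypothetical helper, not part of the Spark API.
object FoldModel {
  def partitionedFold[T](partitions: Seq[Seq[T]], zeroValue: T)(op: (T, T) => T): T = {
    // Stage 1: fold every partition on its own, starting from zeroValue.
    val perPartition = partitions.map(_.foldLeft(zeroValue)(op))
    // Stage 2: merge the partition results, again starting from zeroValue.
    perPartition.foldLeft(zeroValue)(op)
  }

  def main(args: Array[String]): Unit = {
    // Assumed layout of List(1, 2, 3, 4, 5) over 3 partitions.
    val partitions = Seq(Seq(1), Seq(2, 3), Seq(4, 5))
    println(partitionedFold(partitions, 5)(_ + _)) // 35 = 15 + 3 * 5 + 5
    println(partitionedFold(partitions, 0)(_ + _)) // 15, since 0 is neutral for +
  }
}
```

The model uses a fixed left-to-right merge order. In Spark the order in which partition results are merged is not guaranteed, which is why op should be associative and, to get a deterministic result, commutative.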

nitinrawat895 · Answer

Fold in Spark:

Fold is a very powerful operation in Spark which allows you to calculate many important values in O(n) time. If you are familiar with Scala collections, it is like using the fold operation on a collection. Even if you have not used fold in Scala, this post will make you comfortable with it.

Syntax:

    def fold[T](acc: T)((acc, value) => acc)

The above is a high-level view of the fold API rather than the exact signature. It has the following three parts:

- T is the data type of the RDD.
- acc is an accumulator of type T, which will be the return value of the fold operation.
- A function, which is called for each element in the RDD together with the previous accumulator.

Examples of fold:

Finding the max in a given RDD

Let's first build an RDD:

    import org.apache.spark.SparkContext

    val sparkContext = new SparkContext("local", "functional")
    val employeeData = List(("Jack", 1000.0), ("Bob", 2000.0), ("Carl", 7000.0))
    val employeeRDD = sparkContext.makeRDD(employeeData)

Now we want to find the employee with the maximum salary. We can do that using fold.

To use fold we need a start value. The following code defines a dummy employee as the starting accumulator. A salary of 0.0 works as a start value here because no real salary is below it.

    val dummyEmployee = ("dummy", 0.0)

Now, using fold, we can find the employee with the maximum salary:

    val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc, employee) => {
      // keep whichever of the two has the higher salary
      if (acc._2 < employee._2) employee else acc
    })
    println("employee with maximum salary is " + maxSalaryEmployee)
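As a rough check that the max-salary fold gives the same answer however the data is partitioned, here is a small plain-Scala sketch that models fold as a per-partition fold followed by a merge. MaxSalaryModel, maxBySalary and the two partition layouts are assumptions made up for illustration; the code does not use Spark.

```scala
// Plain-Scala model of the max-salary fold; names are hypothetical.
object MaxSalaryModel {
  type Employee = (String, Double)

  // Same comparison as in the answer: keep the employee with the higher salary.
  def maxBySalary(acc: Employee, employee: Employee): Employee =
    if (acc._2 < employee._2) employee else acc

  // Fold each partition from `zero`, then merge the partition results from `zero`.
  def partitionedFold(partitions: Seq[Seq[Employee]], zero: Employee): Employee =
    partitions.map(_.foldLeft(zero)(maxBySalary)).foldLeft(zero)(maxBySalary)

  def main(args: Array[String]): Unit = {
    val dummy: Employee = ("dummy", 0.0)
    val onePartition = Seq(Seq(("Jack", 1000.0), ("Bob", 2000.0), ("Carl", 7000.0)))
    val threePartitions =
      Seq(Seq(("Jack", 1000.0)), Seq(("Bob", 2000.0)), Seq(("Carl", 7000.0)))
    println(partitionedFold(onePartition, dummy))    // (Carl,7000.0)
    println(partitionedFold(threePartitions, dummy)) // (Carl,7000.0)
  }
}
```

Both layouts agree because taking the maximum is associative and commutative, and the dummy employee never wins against a positive salary. If every partition were empty, the dummy employee itself would be returned.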

samarth295 · Answer

Fold in Spark

Fold is a very powerful operation in Spark which allows you to calculate many important values in O(n) time. If you are familiar with Scala collections, it is like using the fold operation on a collection. Even if you have not used fold in Scala, this post will make you comfortable with it.

Syntax:

    def fold[T](acc: T)((acc, value) => acc)

The above is a high-level view of the fold API. It has the following three parts:

- T is the data type of the RDD.
- acc is an accumulator of type T, which will be the return value of the fold operation.
- A function, which is called for each element in the RDD together with the previous accumulator.

Example:

    val sparkContext = new SparkContext("local", "functional")
    val employeeData = List(("Jack", 1000.0), ("Bob", 2000.0), ("Carl", 7000.0))
    val employeeRDD = sparkContext.makeRDD(employeeData)
    val dummyEmployee = ("dummy", 0.0)
    val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc, employee) => {
      if (acc._2 < employee._2) employee else acc
    })
    println("employee with maximum salary is " + maxSalaryEmployee)

