Can anyone explain fold() operation in Spark?

Asked Jul 26, 2018 in Apache Spark by shams

2 answers to this question.

  • fold() is an action. It is a wide operation (i.e., it combines data across multiple partitions and outputs a single value).
  • It takes as input a function with two parameters of the same type and returns a single value of that type.
  • It is similar to reduce(), but takes one extra argument, a 'zero value' (an initial value of sorts), which is used in the initial call on each partition.

def fold(zeroValue: T)(op: (T, T) ⇒ T): T

Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.

zeroValue: The initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
op: an operator used to both accumulate results within a partition and combine results from different partitions
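
To illustrate the note above about non-commutative functions, here is a minimal sketch (assuming an existing SparkContext named sc): string concatenation is associative but not commutative, so the order in which Spark merges the partition results can affect the output, unlike a sequential fold over a local collection.

val words = sc.parallelize(Seq("a", "b", "c", "d"), 2)
// Elements are folded in order within each partition, but Spark does not
// guarantee the order in which the partition results themselves are merged,
// so this may return "abcd" or, e.g., "cdab".
words.fold("")(_ + _)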

Example:
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 3)
rdd1.fold(5)(_ + _)

Output:
Int = 35

Note that the result is 35 rather than 20: with 3 partitions, the zero value 5 is applied once in each partition and once more when the partition results are combined, so the total is (1+2+3+4+5) + 5 × (3+1) = 35. This is why the zero value should normally be the neutral element of the operation (0 for summation, Nil for list concatenation).
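
For comparison, a minimal sketch using the neutral element 0 as the zero value, which is the usual way to sum with fold (same sc and rdd1 as above):

rdd1.fold(0)(_ + _)
// Int = 15, the same as rdd1.reduce(_ + _), regardless of the partition count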

answered Jul 26, 2018 by zombie

Fold in Spark:

Fold is a powerful operation in Spark that lets you calculate many important values in O(n) time. If you are familiar with Scala collections, it works much like fold on a local collection. Even if you have never used fold in Scala, this answer should make you comfortable using it.

Syntax
def fold(zeroValue: T)(op: (T, T) => T): T
The above is the actual signature of the fold API. It has the following three parts (a local Scala-collection analogy is sketched just after this list):

  • T is the type of the RDD's elements, and also the type of the result
  • zeroValue is the initial accumulator of type T; it is applied once per partition and once more when the partition results are merged
  • op is a function called for each element together with the previous accumulator, and also used to merge the per-partition results
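
As the analogy mentioned above, here is the corresponding fold on a plain Scala collection (no Spark involved); note that a local fold applies the zero value exactly once, while Spark's fold applies it once per partition plus once when merging:

val xs = List(1, 2, 3, 4, 5)
xs.fold(0)(_ + _) // Int = 15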

Examples of fold:

Finding the max in a given RDD

Let's first build an RDD:

import org.apache.spark.SparkContext

// A local SparkContext for experimentation ("local" master, "functional" app name)
val sparkContext = new SparkContext("local", "functional")
val employeeData = List(("Jack", 1000.0), ("Bob", 2000.0), ("Carl", 7000.0))
val employeeRDD = sparkContext.makeRDD(employeeData)


Now we want to find the employee with the maximum salary. We can do that using fold.

To use fold we need a start value. The following code defines a dummy employee as the starting accumulator:

val dummyEmployee = ("dummy",0.0);


Now, using fold, we can find the employee with the maximum salary:

val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc, employee) =>
  if (acc._2 < employee._2) employee else acc)
println("employee with maximum salary is " + maxSalaryEmployee)

answered Aug 2, 2018 by nitinrawat895

