How can I minimize data transfers when working with Spark?

0 votes
Sep 19, 2018 in Apache Spark by shams
• 3,580 points
246 views

1 answer to this question.

0 votes

Minimizing data transfers and avoiding shuffling helps in writing spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:

  1. Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.
  2. Using Accumulators – Accumulators help update the values of variables in parallel while executing.
  3. The most common way is to avoid operations ByKey, repartition or any other operations which trigger shuffles.

I hope this will help!!

answered Sep 19, 2018 by zombie
• 3,710 points
Difference Between rdd dataframe dataset

Related Questions In Apache Spark

0 votes
0 answers
0 votes
1 answer

How can we optimize and minimize the memory when work with scala use case?

Hi, There is a term in Scala that is ...READ MORE

answered Jul 5 in Apache Spark by Gitika
• 25,420 points
47 views
0 votes
2 answers

In a Spark DataFrame how can I flatten the struct?

// Collect data from input avro file ...READ MORE

answered Jul 4 in Apache Spark by Dhara dhruve
1,356 views
0 votes
1 answer

How can I write a text file in HDFS not from an RDD, in Spark program?

Yes, you can go ahead and write ...READ MORE

answered May 29, 2018 in Apache Spark by Shubham
• 13,350 points
1,760 views
0 votes
1 answer

Spark: How can i create temp views in user defined database instead of default database?

You can try the below code: df.registerTempTable(“airports”) sqlContext.sql(" create ...READ MORE

answered Jul 14 in Apache Spark by Ishan
189 views
0 votes
1 answer

Can I read a CSV represented as a string into Apache Spark?

You can use the following command. This ...READ MORE

answered May 3, 2018 in Apache Spark by kurt_cobain
• 9,280 points
75 views
0 votes
1 answer

How can I compare the elements of the RDD using MapReduce?

You have to use the comparison operator ...READ MORE

answered May 24, 2018 in Apache Spark by Shubham
• 13,350 points
507 views
+1 vote
2 answers

How can I convert Spark Dataframe to Spark RDD?

Assuming your RDD[row] is called rdd, you ...READ MORE

answered Jul 9, 2018 in Apache Spark by zombie
• 3,710 points
2,467 views
0 votes
1 answer

How is RDD in Spark different from Distributed Storage Management? Can anyone help me with this ?

Some of the key differences between an RDD and ...READ MORE

answered Jul 26, 2018 in Apache Spark by zombie
• 3,710 points
202 views