Why is collect in SparkR slow?

0 votes
I have a large dataframe with more than 300k rows, stored in a parquet file. My machine has 4 cores and 8GB of RAM.

I am using R 3.0 and Spark 2.0. To bring the dataset into R, I used the collect() function. It took around 3-4 minutes for the data to load into R.

Does the collect method usually take this much time?
May 3, 2018 in Apache Spark by shams

1 answer to this question.

0 votes
It's not collect() itself that is slow. Spark uses lazy evaluation: transformations only build up a DAG of operations, and nothing is actually computed until an action (here, collect()) is triggered. So the time you see for collect() includes reading the parquet file, running every queued transformation, and then serializing the result and transferring it into the R process.

On top of that, transferring 300K rows to a single R process will always take some time, so it helps to filter or aggregate in Spark first and collect only the data you actually need.
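For example, here is a minimal sketch of that idea using the SparkR 2.x API. The file path and column names (year, id, value) are made up for illustration:

```r
library(SparkR)
sparkR.session()

# Reading is lazy: this only records the source, nothing is computed yet
df <- read.df("path/to/data.parquet", source = "parquet")

# Push filtering and projection into Spark so less data crosses into R
small <- select(filter(df, df$year == 2018), "id", "value")

# collect() triggers the whole DAG; only the reduced result is transferred
local_df <- collect(small)
```

If you time just the collect() call, you are really timing the full read-filter-transfer pipeline, which is why it can look slow even when each piece is reasonable.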
answered May 3, 2018 by Data_Nerd
