Why is collect in SparkR slow?

0 votes
I have a huge dataframe which has more than 300k rows. The data is in a parquet file. My machine has 4 cores and 8GB of RAM.

I am using R 3.0 and Spark 2.0. To bring the dataset into R, I used the collect() function. It took around 3-4 minutes for the data to load into R.

Does the collect method usually take this much time?
May 3, 2018 in Apache Spark by shams
• 3,580 points
69 views

1 answer to this question.

0 votes
It's not collect() itself that is slow. Spark uses lazy evaluation: transformations are only recorded as a DAG (the execution plan), and nothing runs until an action is called. Here the action is collect(), so the time you measure includes reading the original data from the parquet file and running every pending transformation, and only then bringing the result back into R.

Even so, loading 300K rows into R will take some time.
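To make the lazy evaluation point concrete, here is a minimal toy sketch in plain Python. It is not Spark's or SparkR's API: the LazyFrame class, the read_source function, and the reads counter are all hypothetical names invented for this illustration. It only shows the general pattern, where transformations are recorded and the source is read once an action such as collect() is called.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated data frame (hypothetical, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that produces the rows
        self._steps = tuple(steps)   # recorded transformations, i.e. the "plan"

    def filter(self, predicate):
        # Transformation: only extends the plan, no rows are read.
        return LazyFrame(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: only extends the plan, no rows are read.
        return LazyFrame(self._source, self._steps + (("map", func),))

    def collect(self):
        # Action: read the source and run every recorded step now.
        rows = self._source()
        for kind, fn in self._steps:
            # The built-in filter/map bind fn immediately, one step at a time.
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


reads = []  # records each time the source is actually read


def read_source():
    # Hypothetical stand-in for reading the 300K-row parquet file.
    reads.append(1)
    return iter(range(300_000))


df = LazyFrame(read_source)
plan = df.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
reads_before_collect = len(reads)   # still 0: nothing has been read yet

result = plan.collect()             # the read and both steps happen here
reads_after_collect = len(reads)
```

In this toy model, all of the work lands inside collect(), which is why timing collect() alone can make it look like the slow step.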
answered May 3, 2018 by Data_Nerd
• 2,360 points
