Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
An RDD can be created in two ways:
The first is to reference a dataset in an external storage system, such as HDFS or the local file system.
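As a minimal sketch, a text file can be loaded with sc.textFile; the path "data.txt" below is just an illustrative placeholder, and it could equally be an hdfs:// URI:

val distFile = sc.textFile("data.txt")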
The second is to parallelize an existing collection in the driver program using sc.parallelize, which is useful for user-created data that does not have to come from a file or directory. For example:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
So when we call sc.parallelize, we are simply creating an RDD from an in-memory collection.
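Once created, distData can be operated on in parallel like any other RDD; for instance, a standard reduce action (shown here purely as a quick illustration) sums its elements:

distData.reduce((a, b) => a + b)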