Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
An RDD can be created in two ways:
The first is to reference a dataset in an external storage system, such as HDFS or the local file system.
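As a minimal sketch, a text file can be loaded with sc.textFile; the path "data.txt" below is just an illustrative placeholder, and it could equally be an hdfs:// URI:

val distFile = sc.textFile("data.txt")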
The second is to parallelize an existing collection in the driver program using sc.parallelize, which is useful for user-created data that does not have to come from a file or directory. For example:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
So when we call sc.parallelize, we are simply creating an RDD from an in-memory collection.
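Once created, distData can be operated on in parallel like any other RDD; for instance, a standard reduce action (shown here purely as a quick illustration) sums its elements:

distData.reduce((a, b) => a + b)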