An RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala object, including user-defined classes.
There are two ways to create RDDs: by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
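A minimal Scala sketch of both creation methods is shown below. It assumes a local Spark installation; the application name, the `local[*]` master setting, and the HDFS path are illustrative placeholders, not values from this text.

```scala
import org.apache.spark.sql.SparkSession

object RDDCreationExample {
  def main(args: Array[String]): Unit = {
    // Illustrative session setup; master("local[*]") runs Spark locally.
    val spark = SparkSession.builder()
      .appName("RDDCreationExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 1. Parallelize an existing collection in the driver program.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    println(numbers.count()) // 5

    // 2. Reference a dataset in external storage (hypothetical path).
    val lines = sc.textFile("hdfs:///data/input.txt")
    println(lines.count())

    spark.stop()
  }
}
```

In both cases Spark splits the resulting RDD into partitions (the number can be controlled via an optional second argument to `parallelize` and `textFile`), which is what allows the dataset to be processed in parallel across the cluster.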