I don't understand the reason behind Spark RDD being immutable.

0 votes
Jul 26, 2018 in Apache Spark by shams
• 3,580 points
1,878 views

3 answers to this question.

0 votes
Following are the reasons:
- Immutable data is always safe to share across multiple processes as well as multiple threads.
- Since an RDD is immutable, it can be recreated at any time from its lineage graph.
- If a computation is time-consuming, we can cache the RDD, which results in a performance improvement (see the sketch below).
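
As an illustration of the caching point, here is a minimal sketch in Scala (the app name and dataset are made up for the example) showing how the result of an expensive transformation can be cached so that later actions reuse it instead of recomputing the whole lineage:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RddCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddCacheSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations only record lineage; nothing executes yet.
    val numbers   = sc.parallelize(1L to 1000000L)
    val expensive = numbers.map(n => math.pow(n.toDouble, 2))

    // Because the RDD is immutable, its cached copy can never go stale.
    expensive.persist(StorageLevel.MEMORY_ONLY)

    println(expensive.sum())   // first action: computes and caches
    println(expensive.count()) // second action: served from the cache

    spark.stop()
  }
}
```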

I hope this helps you!
answered Jul 26, 2018 by zombie
• 3,690 points
0 votes
  • Apache Spark, whether running on HDFS, Mesos, or in local mode, distributes and stores transformed data in the form of RDDs (Resilient Distributed Datasets).
  • RDDs are not just immutable but a deterministic function of their input, which means an RDD can be recreated at any time. This helps in taking advantage of caching, sharing, and replication. An RDD isn't really a collection of data, but just a recipe for making data from other data (see the sketch below).
  • Immutability rules out a large class of potential problems caused by updates from multiple threads at once. Immutable data is definitely safe to share across processes.
  • Immutable data can live in memory as easily as on disk. This makes it reasonable to move operations that hit disk to instead use data in memory, and adding memory is much easier than adding I/O bandwidth.
  • These are significant design wins for RDDs, at the cost of having to copy data rather than mutate it in place. Generally, that's a decent tradeoff: the fault tolerance and correctness gained with no developer effort are worth the extra disk, memory, and CPU.
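
To make the "recipe" point concrete, here is a minimal sketch (assuming an existing SparkContext named sc): each transformation returns a new RDD, and the parent is never modified, so every RDD in the chain stays valid and independently recomputable:

```scala
// Each transformation returns a NEW RDD; the parent is never modified.
val base     = sc.parallelize(Seq("spark", "rdd", "immutable"))
val upper    = base.map(_.toUpperCase)          // new RDD derived from base
val filtered = upper.filter(_.startsWith("S"))  // yet another new RDD

// base is untouched; any of the three can be rebuilt from its lineage
// if a partition is lost, because each is a deterministic function of its input.
println(base.collect().mkString(", "))      // spark, rdd, immutable
println(filtered.collect().mkString(", "))  // SPARK
```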
answered Aug 23, 2018 by samarth295
• 2,190 points
0 votes
There are a few reasons for keeping RDDs immutable:

1- Immutable data can be shared easily.

2- An RDD can be recreated at any point in time from its lineage (see the sketch below).

3- Immutable data can live in memory as easily as on disk.
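
As a quick illustration of point 2, the lineage Spark would use to recreate an RDD can be inspected with toDebugString (a small sketch, assuming an existing SparkContext named sc):

```scala
val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map((_, 1)).reduceByKey(_ + _)

// Prints the lineage graph Spark uses to recreate the RDD
// if a partition is lost or evicted from the cache.
println(counts.toDebugString)
```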

Hope the answer is helpful.
answered Apr 18 by santlal561987@gmail.com
