Is caching the only advantage of Spark compared to Hadoop?

Question

I am a beginner in Apache Spark. I see a lot of focus on RDDs in Spark, and the faster execution is said to be possible because of the added caching layer.

Is it really worth creating a whole new framework like Spark just to add a cache to MapReduce tasks?

Since I am a learner, I know I have a lot to learn, but can anyone clarify this doubt of mine?

ravikiran · Answer 1 · Jul 31, 2019

Spark has much lower per-job and per-task overhead than Hadoop MapReduce (MR). That lets it handle cases where Hadoop MR is not practical, such as when a response is needed within 1-30 seconds.
Low per-task overhead also makes Spark more efficient for big jobs made up of many short tasks. As a very rough estimate, when each task takes about 1 second, Spark will be about 2 times more efficient than Hadoop MR.
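
To make that rough estimate concrete, here is a small back-of-the-envelope model in plain Python. The overhead figures are assumptions picked only to reproduce the 2x number for a 1-second task (roughly 1 s of per-task startup for MR, negligible for Spark); they are not measurements, and `efficiency` and `speedup` are hypothetical helper names.

```python
def efficiency(task_seconds: float, overhead_seconds: float) -> float:
    """Fraction of a task's wall-clock time that is useful work."""
    return task_seconds / (task_seconds + overhead_seconds)


# Assumed per-task overheads (illustrative only, not measured values).
MR_OVERHEAD_S = 1.0     # assumption: about 1 s to start each MR task
SPARK_OVERHEAD_S = 0.0  # assumption: negligible next to a 1 s task


def speedup(task_seconds: float) -> float:
    """How many times more efficient Spark is than MR under the assumptions above."""
    return efficiency(task_seconds, SPARK_OVERHEAD_S) / efficiency(task_seconds, MR_OVERHEAD_S)


for t in (1.0, 10.0, 100.0):
    print(f"task = {t:>5.0f} s  ->  Spark/MR efficiency ratio = {speedup(t):.2f}x")
```

Under these assumptions the ratio is 2x for 1-second tasks and shrinks toward 1x as tasks get longer, which is why the advantage matters most for jobs with many short tasks.
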
Spark offers a lower-level abstraction than MR: a job is a graph of computations. As a result, it is possible to implement more efficient processing than in MR, specifically in cases where sorting is not needed. In other words, in MR we always pay for the sort, but in Spark we do not have to.
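
The cost of an unnecessary sort can be seen in a toy, framework-free sketch. This is plain Python, not the Spark or Hadoop API, and it does not claim to mirror how either engine is implemented. It only contrasts a sort-then-reduce count with a hash-based count that gives the same answer without sorting. The function names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def count_with_sort(words):
    """MR-style: emit (key, 1) pairs, sort them by key, then reduce each run of equal keys."""
    pairs = sorted((w, 1) for w in words)  # the sort is paid for whether or not it is needed
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    }


def count_with_hash(words):
    """No sort: aggregate counts directly in a hash table."""
    return dict(Counter(words))


words = "a b a c b a".split()
print(count_with_sort(words))
print(count_with_hash(words))
```

Both functions return the same counts, but the first does O(n log n) work for the sort while the second is a single O(n) pass. Avoiding that extra work is the kind of saving a more flexible computation graph makes possible.
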