What makes Spark faster than MapReduce?

Understanding the code design and logic differences between Apache Spark and Hadoop MapReduce.
Jul 26, 2018 in Apache Spark by Neha

Let's first look at the map-side differences.

Map side of Hadoop MapReduce

  • Each map task outputs its data as key-value pairs.
  • The output is stored in a CIRCULAR BUFFER instead of being written directly to disk.
  • The circular buffer is about 100 MB in size. Once it is 80% full (by default), the data is spilled to disk; these files are called shuffle spill files.
  • Many map tasks run on a given node, so many spill files are created. Hadoop merges each map task's spill files into one big file that is SORTED and PARTITIONED based on the number of reducers.
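The spill rule above can be sketched in a few lines. This is a toy illustration of the 80%-full trigger, not Hadoop's actual implementation; the buffer size and threshold simply mirror the defaults mentioned above.

```python
# Toy sketch of the map-side spill rule described above (not Hadoop's
# real code): records accumulate in a fixed-size buffer, and once the
# buffer is 80% full the contents are "spilled" to disk as a sorted run.

BUFFER_MB = 100          # mirrors the ~100 MB circular buffer default
SPILL_THRESHOLD = 0.80   # mirrors the 80% spill trigger default

def simulate_map_output(record_sizes_mb):
    """Return the number of spill files a stream of map outputs would create."""
    buffered = 0.0
    spills = 0
    for size in record_sizes_mb:
        buffered += size
        if buffered >= BUFFER_MB * SPILL_THRESHOLD:
            spills += 1      # write one sorted, partitioned spill file
            buffered = 0.0   # buffer drained to disk
    return spills

# 300 MB of 1 MB records against an 80 MB trigger -> 3 spill files
print(simulate_map_output([1.0] * 300))
```

In real Hadoop the spill happens on a background thread while the map task keeps writing, and the spill files are later merged into the single sorted, partitioned output file described above.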

Map side of Spark 

  • Initial design:
    • The map-side output is written to the OS BUFFER CACHE.
    • The operating system decides whether the data stays in the OS buffer cache or is spilled to DISK.
    • Each map task creates as many shuffle spill files as there are reducers.
    • Unlike Apache Hadoop, SPARK doesn't merge and partition the shuffle spill files into one big file.
    • Example: with 2,000 map tasks (M) and 6,000 reducers (R), there will be M*R = 2000*6000 = 12 million shuffle files, because each map task creates one file per reducer. This caused performance degradation.
    • This was the initial design of Apache Spark.
Reduce side of Hadoop MapReduce

  • Reducers FETCH (pull) the intermediate files (shuffle files) created on the map side, and the data is loaded into memory.
  • If the buffer reaches 70% of its limit, the data is spilled to disk.
  • The spills are then merged to form bigger files.
  • Finally, the reduce method is invoked.
Reduce side of Apache Spark

  • PULLS the intermediate files (shuffle files) to the reduce side.
  • The data is written directly to memory.
  • If the data doesn't fit in memory, it is spilled to disk from Spark 0.9 onwards; before that, an OOM (out of memory) exception would be thrown.
  • Finally, the reducer functionality is invoked.
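The file-count arithmetic behind the 12-million figure above is worth making explicit. A small sketch, using the M and R values from the example:

```python
# Shuffle-file arithmetic from the example above: Spark's initial
# hash-based shuffle wrote one file per (map task, reducer) pair,
# while Hadoop merges each map task's spills into a single sorted,
# partitioned file.

def spark_initial_shuffle_files(map_tasks, reducers):
    return map_tasks * reducers   # M * R files

def hadoop_shuffle_files(map_tasks, reducers):
    return map_tasks              # one merged, partitioned file per map task

M, R = 2000, 6000
print(spark_initial_shuffle_files(M, R))  # 12000000
print(hadoop_shuffle_files(M, R))         # 2000
```

This is why later Spark releases moved to a sort-based shuffle that, like Hadoop, writes one partitioned output file per map task.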
Other important factors are as follows:
  1. Spark uses "lazy evaluation" to build a directed acyclic graph (DAG) of consecutive computation stages. This lets the execution plan be optimized, e.g. to minimize shuffling data around. In MapReduce, by contrast, this must be done manually by tuning each MR step.
  2. The Spark ecosystem has established a versatile stack of components to handle SQL, ML, streaming, and graph-mining tasks, whereas in the Hadoop ecosystem you have to install separate packages for each of these.
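Lazy evaluation can be demonstrated with a minimal, library-free sketch. The class below only mimics Spark's RDD API (the names `map`, `filter`, and `collect` are borrowed for illustration); Spark's real planner is far more sophisticated, but the key idea is the same: transformations only record themselves in a plan, and nothing executes until an action runs.

```python
# Minimal sketch of lazy evaluation: transformations append to a plan
# (a tiny linear DAG), and only an action like collect() executes it.

class LazyDataset:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # recorded transformations, unevaluated

    def map(self, fn):                   # lazy: just extend the plan
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def filter(self, pred):              # lazy: just extend the plan
        return LazyDataset(self._data, self._plan + [("filter", pred)])

    def collect(self):                   # action: run the whole plan now
        out = self._data
        for op, fn in self._plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# At this point no computation has happened; only the plan exists.
print(ds.collect())  # [20, 30, 40]
```

Because the full plan is visible before anything runs, an engine like Spark can reorder, pipeline, or fuse stages; a chain of hand-written MapReduce jobs offers no such global view.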
answered Jul 26, 2018 by Neha
