Spark Processing Internals

Hi Team,

I would like to know what happens internally when a job is submitted to Spark: how the driver submits tasks to the executors, how the executors report back to the driver that they are alive, and what the fault-tolerance mechanism is if an executor fails. Please share the overall details of Spark processing in depth.
Jul 15 in Apache Spark by John

1 answer to this question.


[Figure: Spark cluster components]
Spark uses a master/slave architecture. As you can see in the figure, it has one central coordinator (Driver) that communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.
 
DRIVER
 
The driver is the process where the main method runs. First, it converts the user program into tasks, and then it schedules those tasks on the executors.
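To illustrate (this is my own sketch, not part of the original answer; the input path is a made-up example), transformations only build up the RDD lineage on the driver, and it is the action that makes the driver create and schedule tasks:

```scala
// Assuming `sc` is the SparkContext created in the driver's main method;
// the input path is hypothetical.
val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// The transformations above are lazy: they only record the RDD lineage (DAG).
// The action below is what makes the driver split the DAG into stages and
// tasks and schedule those tasks on the executors.
counts.take(10).foreach(println)
```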
 
EXECUTORS
 
Executors are worker-node processes in charge of running individual tasks in a given Spark job. They are launched at the beginning of a Spark application and typically run for its entire lifetime. Once they have run a task, they send the results to the driver. They also provide in-memory storage for RDDs that are cached by user programs, through the Block Manager.
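For example (again a sketch of my own, with a hypothetical path), calling persist() or cache() on an RDD is what makes the executors keep its partitions in memory via their Block Managers:

```scala
import org.apache.spark.storage.StorageLevel

// Assuming `sc` is an existing SparkContext; the log path is a made-up example.
val logs = sc.textFile("hdfs:///tmp/app.log")

// Ask the executors to keep this RDD's partitions in memory; the cached
// blocks are stored and tracked by each executor's Block Manager.
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)

// The first action computes the RDD and caches it on the executors ...
println(errors.count())
// ... later actions reuse the cached blocks instead of re-reading the file.
println(errors.filter(_.contains("timeout")).count())
```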
 
APPLICATION EXECUTION FLOW
 
With this in mind, when you submit an application to the cluster with spark-submit, this is what happens internally (a minimal end-to-end sketch follows the list):
 
=> A standalone application starts and instantiates a SparkContext instance (only then can you call the application a driver).
=> The driver program asks the cluster manager for resources to launch executors.
=> The cluster manager launches the executors.
=> The driver process runs through the user application. Depending on the actions and transformations over RDDs, tasks are sent to the executors.
=> Executors run the tasks and save the results.
=> If any worker crashes, its tasks will be sent to different executors to be processed again.
=> When SparkContext.stop() is called from the driver, or if the main method exits or crashes, all the executors are terminated and the cluster resources are released by the cluster manager.
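To make the flow concrete, here is a minimal end-to-end sketch (the class name, jar name and master URL are placeholders of my own, not from the question):

```scala
// Packaged into a jar and submitted with something like:
//   spark-submit --class FlowSketch --master spark://<master-host>:7077 flow-sketch.jar
import org.apache.spark.{SparkConf, SparkContext}

object FlowSketch {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext turns this process into the driver and
    // makes it ask the cluster manager to launch executors.
    val sc = new SparkContext(new SparkConf().setAppName("flow-sketch"))

    // The driver turns this work into tasks and sends them to the executors,
    // which run them and return the results.
    val total = sc.parallelize(1 to 1000000).map(_.toLong * 2).sum()
    println(s"total = $total")

    // Stopping the context terminates the executors and releases the
    // cluster resources held by the cluster manager.
    sc.stop()
  }
}
```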
In case of failures, Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than the other nodes, Spark can preemptively launch a "speculative" copy of the task on another node and take its result if that copy finishes first.
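Speculative execution is off by default; here is a sketch (my own, with illustrative values) of how it could be enabled through SparkConf:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("speculation-sketch")
  // Re-launch a copy of a task on another executor when it runs much
  // slower than the other tasks in the same stage.
  .set("spark.speculation", "true")
  // Illustrative tuning knobs: how often to check for slow tasks, how much
  // slower than the median a task must be, and what fraction of tasks must
  // have finished before speculation kicks in.
  .set("spark.speculation.interval", "100ms")
  .set("spark.speculation.multiplier", "1.5")
  .set("spark.speculation.quantile", "0.75")
```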
Executors report heartbeats and partial metrics for active tasks to the HeartbeatReceiver on the driver. By default, the interval after which an executor reports a heartbeat and metrics for active tasks to the driver is 10 seconds.
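This interval is controlled by spark.executor.heartbeatInterval; a sketch of setting it explicitly (it must stay much smaller than spark.network.timeout):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("heartbeat-sketch")
  // How often each executor sends a heartbeat (with task metrics) to the
  // driver's HeartbeatReceiver; the default is 10 seconds.
  .set("spark.executor.heartbeatInterval", "10s")
  // The heartbeat interval should be significantly smaller than this timeout,
  // otherwise the driver may consider healthy executors lost.
  .set("spark.network.timeout", "120s")
```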
[Figure: Executor heartbeats to the HeartbeatReceiver on the driver]
I hope this helps.
answered Jul 15 by Jimmy
