Important Apache Spark Interview Questions Bank

Question

With questions and answers spanning Spark Core, Spark Streaming, Spark SQL, GraphX, and MLlib, among others, it can be difficult to prepare for your next Spark job interview. For a brief overview of the most frequently asked questions, see this link: http://bit.ly/2BxbJCi
If you were asked a question not covered in this blog, please share it below. I'll get it added to the blog so that interviewees can use it in the future.

findingbugs · Answer 1 · Aug 22, 2018

What is an RDD?

RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark. They represent the data coming into the system in object format and are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:

Immutable – An RDD cannot be altered once it is created; a transformation produces a new RDD instead.
Resilient – If a node holding a partition fails, another node takes over that partition's data, which Spark can recompute from the RDD's lineage.
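
To make "read-only, partitioned collection" concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the class name ToyRDD and its methods are made up for illustration, and everything runs in a single process.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: a read-only collection split into partitions."""
    partitions: Tuple[Tuple[int, ...], ...]

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never modifies this dataset; it builds a new one,
        # applying fn to each partition independently.
        return ToyRDD(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def collect(self) -> List[int]:
        # Gather the records of all partitions into one list.
        return [x for part in self.partitions for x in part]


numbers = ToyRDD(((1, 2), (3, 4)))      # two partitions
doubled = numbers.map(lambda x: x * 2)  # a new dataset; numbers is untouched

print(numbers.collect())  # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8]
```

Because the dataclass is frozen, assigning to numbers.partitions raises an error, which mirrors the "cannot be altered" property in the answer above.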

answered Aug 22, 2018 by findingbugs
• 4,780 points

Thank you findingbugs

commented Aug 22, 2018 by Priyaj
• 58,100 points

eatcodesleeprepeat · Answer 2 · Aug 22, 2018

Hello
I wanted to know what the different cluster managers in Apache Spark are.

answered Aug 22, 2018 by eatcodesleeprepeat
• 4,710 points

Well, to describe it in an easy way:
Three cluster managers supported in Apache Spark are:

    YARN – Hadoop's resource manager.
    Apache Mesos – Has rich resource scheduling capabilities and is well suited to running Spark alongside other applications. It is advantageous when several users run interactive shells, because it scales down the CPU allocation between commands.
    Standalone deployments – Well suited for new deployments that only run Spark and are easy to set up.
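
In practice, the cluster manager is chosen through the master URL handed to Spark, for example via spark-submit --master. The helper below is a hypothetical sketch (the function name master_url and the host name are illustrative assumptions) that builds that value for the three managers listed above, using their default ports.

```python
def master_url(manager: str, host: str = "localhost") -> str:
    """Return a --master value for the given cluster manager (illustrative helper)."""
    if manager == "yarn":
        # YARN takes no host here; the cluster location is read from the
        # Hadoop configuration (HADOOP_CONF_DIR or YARN_CONF_DIR).
        return "yarn"
    if manager == "mesos":
        return f"mesos://{host}:5050"  # 5050 is the default Mesos master port
    if manager == "standalone":
        return f"spark://{host}:7077"  # 7077 is the default standalone master port
    raise ValueError(f"unknown cluster manager: {manager}")


for name in ("yarn", "mesos", "standalone"):
    print(name, "->", master_url(name, host="cluster-head"))
```

The resulting string would be passed as spark-submit --master <value>, or set as spark.master in the application's configuration.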

commented Aug 22, 2018 by Priyaj
• 58,100 points

bug_seeker · Answer 3 · Aug 22, 2018

Hi Priyaj
I have this one question

What is a lineage graph?

answered Aug 22, 2018 by bug_seeker
• 15,520 points

Hello bug_seeker
Each RDD in Spark depends on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Spark uses the lineage graph to compute each RDD on demand, so whenever part of a persisted RDD is lost, the lost data can be recovered by recomputing it from the lineage graph.
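
The recovery idea can be sketched with a toy model in plain Python. This is not Spark's implementation: the class LineageRDD, its methods, and the computations counter are hypothetical names for illustration, and every dataset is treated as persisted (computed partitions are kept in a dictionary).

```python
from typing import Callable, Dict, List, Optional


class LineageRDD:
    """Hypothetical dataset that remembers its parent and how to derive itself."""

    def __init__(self, name: str,
                 compute: Callable[[int, Optional[List[int]]], List[int]],
                 parent: Optional["LineageRDD"] = None) -> None:
        self.name = name
        self.parent = parent              # one edge of the lineage graph
        self._compute = compute           # how to build a partition from the parent's
        self._cache: Dict[int, List[int]] = {}
        self.computations = 0             # counts how often a partition was (re)built

    def map(self, name: str, fn: Callable[[int], int]) -> "LineageRDD":
        return LineageRDD(name, lambda i, rows: [fn(x) for x in rows], parent=self)

    def partition(self, i: int) -> List[int]:
        # Computed on demand: walk up the lineage only if the partition is missing.
        if i not in self._cache:
            parent_rows = self.parent.partition(i) if self.parent is not None else None
            self._cache[i] = self._compute(i, parent_rows)
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulates a node failure that drops one persisted partition.
        self._cache.pop(i, None)

    def lineage(self) -> List[str]:
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


data = [[1, 2], [3, 4]]
source = LineageRDD("source", lambda i, _: list(data[i]))
squared = source.map("squared", lambda x: x * x)

print(squared.lineage())     # ['squared', 'source']
print(squared.partition(1))  # [9, 16]
squared.lose_partition(1)    # the persisted partition is gone
print(squared.partition(1))  # [9, 16], rebuilt from the parent via the lineage
```

Only the lost partition is rebuilt; partition 0 is never touched, which is the point of tracking dependencies per dataset rather than replicating the data itself.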