
Hadoop Administration Interview Questions and Answers For 2024

Last updated on Nov 02, 2023 | 56.6K Views


It is essential to prepare yourself in order to pass an interview and land your dream job. Here’s the first step to achieving this. The following are some frequently asked Hadoop Administration interview questions and answers that might be useful.

Which daemons are required to run a Hadoop cluster?

The daemons required to run a Hadoop cluster are:

  • DataNode – Stores the data in the Hadoop Distributed File System; a cluster contains more than one DataNode, with data replicated across them.
  • NameNode – The core of HDFS; it keeps the directory tree of all files in the file system and tracks where the file data is kept across the cluster.
  • SecondaryNameNode – A specially dedicated node in the HDFS cluster that keeps checkpoints of the file system metadata present on the NameNode.
  • NodeManager – Responsible for launching and managing containers on a node; the containers execute tasks as specified by the ApplicationMaster.
  • ResourceManager – The master that manages the distributed applications running on YARN by arbitrating all the available cluster resources.
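As a quick sanity check, the jps command lists the Hadoop daemons running on a node. On a single-node (pseudo-distributed) setup where all of the above daemons run together, the output looks roughly like this (process IDs are illustrative):

$ jps
2401 NameNode
2523 DataNode
2688 SecondaryNameNode
2846 ResourceManager
2967 NodeManager
3050 Jps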

How do you read a file from HDFS?

The following are the steps for doing this:

  1. The client uses a Hadoop client program to make the request.
  2. The client program reads the cluster config file on the local machine, which tells it where the namenode is located. This has to be configured ahead of time.
  3. The client contacts the NameNode and requests the file it would like to read.
  4. The client's identity is validated, either by username or by a strong authentication mechanism such as Kerberos.
  5. The client's validated request is checked against the owner and permissions of the file.
  6. If the file exists and the user has access to it, the NameNode responds with the ID of the first block and a list of datanodes on which a copy of the block can be found, sorted by their distance from the client (reader).
  7. The client then contacts the most appropriate datanode directly and reads the block data. This process repeats until all blocks in the file have been read or the client closes the file stream.

If a datanode dies while the file is being read, the library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If the block-location information returned by the NameNode is outdated by the time the client attempts to contact a datanode, a retry will occur if there are other replicas, or the read will fail. Explore and learn more about HDFS in this Big Data Course, which was designed by top industry experts.
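In practice, a read like this is most often issued through the HDFS shell; for example (the path below is just an illustrative placeholder):

hadoop fs -cat /user/hadoop/sample.txt
hadoop fs -get /user/hadoop/sample.txt /tmp/sample.txt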

Explain checkpointing in Hadoop. Why is it important?

Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient Namenode recovery and restart and is an important indicator of overall cluster health.

The Namenode persists filesystem metadata. At a high level, the Namenode's primary responsibility is to store the HDFS namespace: things like the directory tree, file permissions, and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.

This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
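As a quick illustration, an administrator can force a checkpoint of the current namespace manually by putting the namenode into safemode and saving the namespace (the same commands appear in the comments further down this post):

hadoop dfsadmin -safemode enter
hadoop dfsadmin -saveNamespace
hadoop dfsadmin -safemode leave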

What is the default block size in HDFS, and what are the benefits of having such large blocks?

Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB (128 MB in newer releases), and it is often configured even larger. This allows HDFS to decrease the amount of metadata storage required per file. Furthermore, it allows fast streaming reads of data by keeping large amounts of data laid out sequentially on disk. As a result, HDFS is designed for very large files that are read sequentially. Unlike file systems such as NTFS or EXT, which typically hold numerous small files, HDFS stores a modest number of very large files: hundreds of megabytes, or gigabytes, each. You can even check out the details of Big Data with the Data Engineering Course.
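The block size is set in hdfs-site.xml; a minimal sketch, assuming Hadoop 1.x property names (the property is dfs.block.size there; later releases call it dfs.blocksize), with the value given in bytes:

<property>
  <name>dfs.block.size</name>
  <!-- 134217728 bytes = 128 MB -->
  <value>134217728</value>
</property>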

What are the two main modules that help you interact with HDFS, and what are they used for?

user@machine:hadoop$ bin/hadoop moduleName -cmd args...

The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.

The two modules relevant to HDFS are: dfs and dfsadmin.

The dfs module, also known as ‘FsShell’, provides basic file manipulation operations and works with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole.
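For example (the paths shown are illustrative):

bin/hadoop dfs -ls /user/hadoop
bin/hadoop dfs -mkdir /user/hadoop/input
bin/hadoop dfsadmin -report
bin/hadoop dfsadmin -safemode get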

How can I set up Hadoop nodes (datanodes/namenodes) to use multiple volumes/disks?

Datanodes can store blocks in multiple directories, typically located on different local disk drives. To set up multiple directories, specify a comma-separated list of pathnames as the value of the config parameter dfs.data.dir (dfs.datanode.data.dir in newer releases). Datanodes will attempt to place equal amounts of data in each of the directories.

The Namenode also supports multiple directories, which store the namespace image and edit logs. To set up multiple directories, specify a comma-separated list of pathnames as the value of the config parameter dfs.name.dir (dfs.namenode.name.dir in newer releases). The namenode directories are used to replicate the namespace data, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails.
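A minimal hdfs-site.xml sketch, assuming hypothetical mount points /disk1 and /disk2 and Hadoop 1.x property names:

<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
</property>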

What are schedulers, and what are the three types of schedulers that can be used in a Hadoop cluster?

Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the jobtracker. The three types of schedulers are:

  • FIFO (First in First Out) Scheduler
  • Fair Scheduler
  • Capacity Scheduler
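In Hadoop 1.x the scheduler is selected in mapred-site.xml via the mapred.jobtracker.taskScheduler property; for example, a sketch that switches the jobtracker to the Fair Scheduler:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

(The corresponding Capacity Scheduler class is org.apache.hadoop.mapred.CapacityTaskScheduler.)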

Get a better understanding of Hadoop clusters from the Big Data Course in Bangalore.

How do you decide which scheduler to use?

The Capacity Scheduler (CS) can be used in the following situations:

  • When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
  • When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
  • When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
  • When you demand scheduler determinism.

The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:

  • When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
  • When you have a lot of variability in the utilization between pools, the Fair Scheduler's pre-emption model enables much greater overall cluster utilization by giving away otherwise-reserved resources when they're not being used.
  • When you require jobs within a pool to make equal progress rather than running in FIFO order.

Why are the ‘dfs.name.dir’ and ‘dfs.data.dir’ parameters used? Where are they specified, and what happens if you don’t specify them?

dfs.name.dir specifies the path of the directory in the Namenode’s local file system where HDFS metadata is stored, and dfs.data.dir specifies the path of the directory in the Datanode’s local file system where HDFS file blocks are stored. These parameters are specified in the hdfs-site.xml config file on all nodes in the cluster, including master and slave nodes.

If these parameters are not specified, the Namenode’s metadata and the Datanode’s file blocks are stored in /tmp, under a hadoop-<username> directory. This is not a safe location: /tmp is typically cleared when a node is restarted, so the data will be lost. This is especially critical for the Namenode, because if it is restarted its formatting information and filesystem metadata will be lost.

What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?

The file system checking utility fsck is used to check and display the health of the file system and of the files and blocks in it. When used with a path (for example, bin/hadoop fsck / -files -blocks -locations -racks), it recursively shows the health of all files under that path; passing ‘/’ checks the entire file system. By default, fsck ignores files still open for writing by a client. To list such files, run fsck with the -openforwrite option.

fsck checks the file system, prints a dot for each healthy file it finds, and prints a message for the files that are less than healthy, including those that have over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks, or missing replicas. You can even check out the details of Big Data with the Data Engineering Course in Australia.
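For example (the second path is an illustrative placeholder):

bin/hadoop fsck / -files -blocks -locations -racks
bin/hadoop fsck /user/hadoop -openforwrite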

What are the important configuration files that need to be updated/edited to set up a fully distributed Hadoop 1.x cluster (Apache distribution)?

The Configuration files that need to be updated to setup a fully distributed mode of Hadoop are:

  • hadoop-env.sh
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • masters
  • slaves

These files can be found in the conf directory of your Hadoop installation. If Hadoop daemons are started individually using ‘bin/hadoop-daemon.sh start xxxxxx’, where xxxxxx is the name of the daemon, then the masters and slaves files need not be updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to start the appropriate daemons. If Hadoop daemons are started using ‘bin/start-dfs.sh’ and ‘bin/start-mapred.sh’, then the masters and slaves configuration files on the namenode machine need to be updated.

masters – the IP address/hostname of the node where the secondarynamenode will run.

slaves – the IP addresses/hostnames of the nodes where the datanodes (and, eventually, tasktrackers) will run.
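A minimal sketch of the key entries for a Hadoop 1.x cluster, assuming hypothetical hostnames (master.example.com, slave1.example.com, slave2.example.com):

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master.example.com:9000</value>
</property>

mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>master.example.com:9001</value>
</property>

masters:
master.example.com

slaves:
slave1.example.com
slave2.example.com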

Learn all about the various properties of Namenode, Datanode and Secondary Namenode from the Hadoop Administration Course.

All the best!

Got a question for us? Please mention it in the comments section and we will get back to you.

 

Comments
  • omkar rom says:

    1Q)How many nodes do you think can be present in one cluster?

    2Q)Which MapReduce version have you configured on your Hadoop cluster?
    3Q)Explain any notable Hadoop use case by a company, that helped maximize its profitability?

    4Q)Do you follow a standard procedure to deploy Hadoop?
    5Q)How will you manage a Hadoop system?
    6Q)Which tool will you prefer to use for monitoring Hadoop and HBase clusters?

  • omkar rom says:

    Please answer to all of my questions in which ill make a note of all those answers and prepare for interviews.Thank you

    • EdurekaSupport says:

      Hey Omkar, that’s a really long list of questions. :) But, good news, we will be providing answers to many of these questions and more in an upcoming blog. Do subscribe to our blog to stay posted. Cheers!

  • omkar rom says:

    4Q)How will you decide the cluster size when setting up a Hadoop cluster?
    5Q)How can you run Hadoop and real-time processes on the same cluster?
    6Q)If you get a connection refused exception – when logging onto a machine of the cluster, what could be the reason? How will you solve this issue?
    7Q)How can you identify and troubleshoot a long running job?
    8Q)How can you decide the heap memory limit for a NameNode and Hadoop Service?
    9Q)If the Hadoop services are running slow in a Hadoop cluster, what would be the root cause for it and how will you identify it?
    10Q)Configure slots in Hadoop 2.0 and Hadoop 1.0.
    11Q)In case of high availability, if the connectivity between Standby and Active NameNode is lost. How will this impact the Hadoop cluster?
    12Q)What is the minimum number of ZooKeeper services required in Hadoop 2.0 and Hadoop 1.0?
    13QIf the hardware quality of few machines in a Hadoop Cluster is very low. How will it affect the performance of the job and the overall performance of the cluster?

    14Q)Explain the difference between blacklist node and dead node.
    15Q)How can you increase the NameNode heap memory?
    16Q)Configure capacity scheduler in Hadoop.
    17Q)After restarting the cluster, if the MapReduce jobs that were working earlier are failing now, what could have gone wrong while restarting?
    18Q)Explain the steps to add and remove a DataNode from the Hadoop cluster.
    In a large busy Hadoop cluster-how can you identify a long running job?
    19Q)When NameNode is down, what does the JobTracker do?
    20Q)When configuring Hadoop manually, which property file should be modified to configure slots?
    21Q)How will you add a new user to the cluster?
    22Q)What is the advantage of speculative execution? Under what situations, Speculative Execution might not be beneficial?

  • omkar rom says:

    1Q)How will you initiate the installation process if you have to setup a Hadoop Cluster for the first time?
    2Q)How will you install a new component or add a service to an existing Hadoop cluster?
    3Q)If Hive Metastore service is down, then what will be its impact on the Hadoop cluster?

    • EdurekaSupport says:

      Hey Omkar, thanks for checking out our tutorial! Here are the answers:

      1. You can do it virtually by using VMware or VirtualBox. You need at least 8 GB of RAM and sufficient hard disk space. Create 3 virtual machines, make one of them a namenode, and make the other two datanodes by changing the configurations and providing privileges.
      Now, for the connection between the nodes in a multinode cluster, you need to add the IP addresses of the datanodes to the /etc/hosts file on the namenode machine. After establishing the connection you can get to know how a cluster works.
      2. With the Hortonworks distribution, you can use Apache Ambari for adding/removing a service to/from a Hadoop cluster.
      Cloudera also provides its own cluster manager, called Cloudera Management Service.
      These tools provide easy installation of services. If you are not using these distributions, you need to set up the services on the nodes manually.
      3. It is not mandatory to have the metastore in the cluster itself. Any machine (inside or outside the cluster) with a JDBC-compliant database can be used for the metastore.
      Hence, if the Hive metastore service is down, the Hadoop cluster itself works just fine.
      Hive data (not metadata) is spread across the Hadoop HDFS DataNode servers. Typically, each block of data is stored on 3 different DataNodes. The NameNode keeps track of which DataNodes have which blocks of actual data.
      For a Hive production environment, the metastore service should run in an isolated JVM. Hive processes communicate with the metastore service using Thrift. The Hive metastore data is persisted in an ACID database such as Oracle DB or MySQL. You can use SQL to find out what is in the Hive metastore.
      Hope this helps. Cheers!

      • omkar rom says:

        Thanks a ton!! i also posted some more questions on your blog expecting the same from your end asap.

  • aefwon says:

    There are multiple server and client application components
    If I say that zookeeper server is configured to maintain the cluster configuration, does the server component run on namenode and zookeeper client component run on datanodes?

    • EdurekaSupport says:

      +aefwon, thanks for checking out our blog! Zookeeper stores configuration data and settings in a centralized repository so that it can be accessed from anywhere.
      Hadoop ZooKeeper is a distributed application that follows a simple client-server model, where clients are nodes that make use of the service and servers are nodes that provide the service. Multiple server nodes are collectively called a ZooKeeper ensemble. At any given time, a ZooKeeper client is connected to at least one ZooKeeper server. A master node is dynamically chosen by consensus within the ensemble; for this reason, an ensemble usually has an odd number of servers, so that a majority vote is possible. If the master node fails, another master is chosen in no time and takes over from the previous master. Hope this helps. Cheers!
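      For reference, a minimal sketch of a three-server ensemble configuration in zoo.cfg, using hypothetical hostnames:

      tickTime=2000
      dataDir=/var/lib/zookeeper
      clientPort=2181
      initLimit=5
      syncLimit=2
      server.1=zk1.example.com:2888:3888
      server.2=zk2.example.com:2888:3888
      server.3=zk3.example.com:2888:3888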

  • Prashant says:

    what is the default number of zookeeper services run in hadoop and WHY that many number?

    • EdurekaSupport says:

      Hey Prashant, thanks for checking out the blog. To answer your query: by default no ZooKeeper runs until we start it. Once we start ZooKeeper, only one ZooKeeper daemon will be running on that system. If you want to run more daemons, you have to go for a multinode cluster, where you can install ZooKeeper on each system; but on any one system there will be one ZooKeeper. Hope this helps.

      • Prashant says:

        Thank u for the reply and answer Edureka… I was asked in an interview what the default number of zookeeper services running is. I said 3, and the interviewer asked me why 3 and not 1, 2, 4 and so on..

        • EdurekaSupport says:

          Glad we could help. :) Cheers!

  • omkar rom says:

    how to restart a cluster without making the namenode shutdown

    • Ayan Mukhuty says:

      Can we actually do that?

      • EdurekaSupport says:

        Hey Ayan, please refer to the response above. Cheers!

    • EdurekaSupport says:

      Hey Omkar, thanks for checking out the blog. In general, restarting the cluster means restarting all the services of Hadoop. But if you want to restart a cluster without stopping the Namenode service, please follow the steps given below:
      1. Stop all the daemons except namenode

      mr-jobhistory-daemon.sh stop historyserver

      yarn-daemons.sh stop nodemanager

      yarn-daemon.sh stop resourcemanager

      hadoop-daemons.sh stop datanode

      2. Enter the namenode into safemode and save the namespace

      hadoop dfsadmin -safemode enter

      hadoop dfsadmin -saveNamespace

      3. Now, to start the cluster again, make the namenode leave safemode

      hadoop dfsadmin -safemode leave

      4. Start all the daemons except namenode
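      For example, mirroring the stop commands above:

      hadoop-daemons.sh start datanode

      yarn-daemon.sh start resourcemanager

      yarn-daemons.sh start nodemanager

      mr-jobhistory-daemon.sh start historyserver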
      Hope this helps!
