Hadoop Cluster Configuration Files


In the last few years, Apache Hadoop has emerged as a leading technology for solving Big Data problems and improving business analytics. One example is Sears Holdings, which moved to Hadoop from its traditional Oracle Exadata, Teradata, and SAS systems. Another recent big entrant to the Hadoop bandwagon is Walmart, with its own Hadoop implementation.

In our previous blog, we discussed how to create a Hadoop cluster on AWS in 30 minutes.
Continuing from that, this blog covers the important Hadoop cluster configuration files.

The following table lists the same.

All these files are available under the ‘conf’ directory of the Hadoop installation directory.

Here is a listing of these files in the File System:

Let’s look at the files and their usage one by one!

hadoop-env.sh

This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
Because the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the most important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java installation path on the system.

This file is also used to set other parts of the Hadoop daemon execution environment, such as the heap size (HADOOP_HEAPSIZE), the Hadoop home (HADOOP_HOME), and the log file location (HADOOP_LOG_DIR).

Note: To keep the cluster setup simple to understand, we have configured only the parameters necessary to start a cluster.

The following three files are the important configuration files for the runtime environment settings of a Hadoop cluster.

core-site.xml

This file tells the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

The NameNode location is given as a hostname and port: the machine and port on which the NameNode daemon runs and listens. This setting also tells the NameNode which IP address and port it should bind to. The commonly used port is 8020, and you can specify an IP address instead of a hostname.
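
As a minimal sketch of what this looks like in core-site.xml, the fragment below uses the Hadoop 1.x property name fs.default.name. The hostname namenode.example.com is a placeholder assumed for illustration; substitute the machine on which your NameNode runs.

```xml
<?xml version="1.0"?>
<!-- core-site.xml: a sketch, not a complete configuration -->
<configuration>
  <property>
    <!-- URI of the NameNode; clients and daemons use it to locate HDFS -->
    <name>fs.default.name</name>
    <!-- namenode.example.com is a hypothetical host; 8020 is the commonly used port -->
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```

An IP address can be used in place of the hostname in the value.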

hdfs-site.xml

This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.

You can also configure hdfs-site.xml to specify the default block replication and permission checking on HDFS. The actual number of replicas can also be specified when a file is created. The default is used if replication is not specified at create time.

The value “true” for property ‘dfs.permissions’ enables permission checking in HDFS and the value “false” turns off the permission checking. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.
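
A hedged sketch of the two settings discussed above might look like this in hdfs-site.xml. The property name dfs.replication and the value 3 are assumptions for illustration (3 is the usual default replication factor); dfs.permissions is the property named above.

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml: a sketch, not a complete configuration -->
<configuration>
  <property>
    <!-- Default number of replicas per block, used when a file is
         created without an explicit replication factor (assumed value) -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <!-- true enables permission checking in HDFS, false disables it -->
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
</configuration>
```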

mapred-site.xml

This file contains the configuration settings for the MapReduce daemons: the Job Tracker and the Task Trackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the Job Tracker listens for RPC communication. This parameter specifies the location of the Job Tracker to the Task Trackers and MapReduce clients.
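
The fragment below is a minimal sketch of the mapred.job.tracker setting in mapred-site.xml. The hostname jobtracker.example.com and port 8021 are placeholders assumed for illustration; use the host and port on which your Job Tracker actually listens.

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml: a sketch, not a complete configuration -->
<configuration>
  <property>
    <!-- host:port pair on which the Job Tracker listens for RPC -->
    <name>mapred.job.tracker</name>
    <!-- jobtracker.example.com and 8021 are hypothetical values -->
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```

Note that, unlike the NameNode address, this value is a plain host:port pair with no URI scheme.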

You can replicate all four of the files explained above to all the DataNodes and the Secondary NameNode. These files can then be adjusted for any node-specific configuration, e.g. a different JAVA_HOME on one of the DataNodes.

The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in a Hadoop cluster.

Masters

This file tells the Hadoop daemons where the Secondary NameNode is located. The ‘masters’ file on the master server contains the hostnames of the Secondary NameNode servers.

The ‘masters’ file on Slave Nodes is blank.

Slaves

The ‘slaves’ file at Master node contains a list of hosts, one per line, that are to host Data Node and Task Tracker servers.

The ‘slaves’ file on a slave server contains the IP address of that slave node. Notice that the ‘slaves’ file on a slave node contains only its own IP address and not those of any other DataNodes in the cluster.
