Hadoop Distributed File System

Uma Mahesh yadav C says:
Mar 20, 2017 at 11:51 pm GMT
Hi Ashish, thanks for explaining this so clearly. Is the splitting of a file into data blocks done by the HDFS client? If so, do the moveFromLocal and put commands also split the file into data blocks? Can you please explain how this happens?
- EdurekaSupport says:
  Mar 24, 2017 at 2:57 pm GMT
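  Hey Uma Mahesh, thanks for checking out our blog. We’re glad you found it useful.
  Yes. The moment we execute the copyFromLocal command (put and moveFromLocal behave the same way), the HDFS client fetches the file from the provided local path and splits it into blocks as it writes: for each block it asks the NameNode for target DataNodes and streams the data to them.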
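To make the splitting concrete, here is a minimal sketch of how a file's byte range divides into fixed-size blocks. It is only an illustration, not the HDFS client code: the function name `split_into_blocks` is hypothetical, and the 128 MB default is an assumption matching the usual `dfs.blocksize` default in Hadoop 2 and later.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Hypothetical helper: block_size defaults to 128 MB, which is assumed
    to match the cluster's dfs.blocksize setting.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block only holds the bytes that remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    for index, (offset, length) in enumerate(split_into_blocks(300 * MB)):
        print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

A 300 MB file therefore becomes two full 128 MB blocks and one 44 MB block; the last block does not occupy a full 128 MB on disk.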
  Hope this helps. Cheers!
Nam Nguyen says:
Feb 8, 2017 at 3:55 am GMT
Thank you so much.
somu says:
Dec 17, 2016 at 2:46 am GMT
Excellent write-up! Simple enough for a layman to understand, and that is what we need.
- EdurekaSupport says:
  Dec 21, 2016 at 2:56 pm GMT
  Thanks for the wonderful feedback, Somu! Do check out some of our other HDFS blogs here: https://www.edureka.co/blog/category/big-data-analytics?s=hdfs. Cheers!
Rishav Kumar says:
Nov 8, 2016 at 5:37 am GMT
Very well explained; the sequence of the explanation is very good.
I wanted to know whether Hadoop uses any compression techniques to cope with the increased disk space requirement (3 times, by default) associated with data replication.
- EdurekaSupport says:
  Nov 9, 2016 at 9:51 am GMT
  Hey Rishav, thanks for checking out the blog. First of all, HDFS is deployed on low-cost commodity hardware, which is bound to fail. This is the most important reason why data replication is done, i.e. to make the system fault tolerant and reliable. And yes, Hadoop supports many compression codecs such as gzip, bzip2 and Snappy. However, there is always a tradeoff between compression ratio and compression/decompression speed.
  Also, since the data is stored as blocks in HDFS, the choice of codec affects processing. With some codecs, a block cannot be decompressed without the other blocks of the same file (residing on other DataNodes); in other words, they need the whole file for decompression. These are called non-splittable codecs (gzip is one; bzip2, by contrast, is splittable), and a file compressed with one of them has to be read by a single task. Lastly, an HDFS cluster is scalable, i.e. you can add more nodes to the cluster to increase the storage capacity.
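The effect of a non-splittable codec on parallelism can be sketched with a little arithmetic. This is a simplified illustration, not Hadoop's split calculation: the function `count_input_splits` is hypothetical, and it assumes one input split per block for splittable input.

```python
import math

MB = 1024 * 1024


def count_input_splits(file_size, block_size, splittable):
    """Estimate how many tasks can read one file in parallel.

    Simplifying assumption: a splittable file yields one split per block,
    while a non-splittable file must be read end to end by a single task.
    """
    if file_size == 0:
        return 0
    if not splittable:
        return 1
    return math.ceil(file_size / block_size)


if __name__ == "__main__":
    size = 1024 * MB
    print("splittable (e.g. bzip2):", count_input_splits(size, 128 * MB, True))
    print("non-splittable (e.g. gzip):", count_input_splits(size, 128 * MB, False))
```

For a 1 GB file with 128 MB blocks, the splittable case gives eight parallel readers and the non-splittable case gives one, which is why the codec choice matters more as files grow.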
  Hope this helps. Cheers!
Tanmay Jambavlikar says:
Jun 27, 2016 at 8:29 am GMT
If the NameNode fails, what are the typical steps, after addressing the relevant hardware problem, to bring the NameNode back online? I am asking because the FsImage must be the last up-to-date copy of the metadata that is critical for the Hadoop cluster to operate, and there is no automatic failover capability. So do we somehow restore this copy on the NameNode and then start all the necessary daemons on the NameNode? Will the cluster take this FsImage file as a valid input and then start its operations normally?
- EdurekaSupport says:
  Nov 10, 2016 at 12:58 pm GMT
  Hey Tanmay, thanks for checking out the blog. Let us understand the NameNode recovery process with an example: I am a Hadoop admin and the NameNode has crashed in my HDFS cluster. I will take the following steps to get the cluster up and running again:
  1. I will use the file system metadata replica (FsImage) to start a new NameNode.
  2. Then, I will configure the DataNodes and clients so that they acknowledge this new NameNode.
  3. The new NameNode will start serving clients once it has finished loading the last checkpointed FsImage (for the metadata) and has received enough block reports from the DataNodes to leave safe mode.
  This takes about 30 minutes on average. On large Hadoop clusters the NameNode recovery process may take much longer, and it becomes an even greater challenge in the case of routine maintenance. This is why we have the HDFS HA architecture and the HDFS Federation architecture, which are covered in a separate blog here: https://www.edureka.co/blog/overview-of-hadoop-2-0-cluster-architecture-federation/. Hope this helps. Cheers!
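The "enough block reports to leave safe mode" condition in step 3 can be sketched as a simple ratio check. This is a simplified illustration rather than the NameNode's actual logic: the function name is hypothetical, the 0.999 default is assumed to correspond to the `dfs.namenode.safemode.threshold-pct` setting, and other conditions such as the safe mode extension period are ignored.

```python
def can_leave_safe_mode(reported_blocks, total_blocks, threshold_pct=0.999):
    """Return True once enough blocks have been reported by DataNodes.

    Assumption: threshold_pct mirrors dfs.namenode.safemode.threshold-pct.
    An empty namespace has nothing to wait for.
    """
    if total_blocks == 0:
        return True
    return reported_blocks / total_blocks >= threshold_pct


if __name__ == "__main__":
    total = 1_000_000
    for reported in (500_000, 990_000, 999_500):
        state = "can leave" if can_leave_safe_mode(reported, total) else "stays in"
        print(f"{reported} of {total} blocks reported: NameNode {state} safe mode")
```

The more blocks the cluster holds, the longer it takes for block reports to cross this threshold, which is one reason recovery time grows with cluster size.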
Hareesh@Disqus says:
Aug 9, 2015 at 4:51 pm GMT
I have the following questions regarding HDFS and MR:
1. Is it possible to store multiple files in HDFS with different block sizes?
2. Is it possible to give a whole file as input to a mapper?
Thanks
Hareesh A
- EdurekaSupport says:
  Aug 10, 2015 at 9:53 am GMT
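  Hi Hareesh,
  Thank you for reaching out to us.
  Yes, both are possible. 1. The block size is a per-file attribute fixed when the file is written, so files written with different dfs.blocksize values can sit side by side in HDFS. 2. To give a whole file to a single mapper, use an input format that does not split its files, for example a FileInputFormat subclass whose isSplitable() method returns false. You can get in touch with us for further clarification by contacting our sales team on +91-8880862004 (India) or 1800 275 9730 (US toll free). You can also mail us at sales@edureka.co.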
Bhupendra Pandey says:
Jun 17, 2015 at 5:56 am GMT
What is a block report? Why do DataNodes need to send it to the NameNode at regular intervals? Doesn’t the NameNode already store the metadata and block details in its namespace at the time of the file write?
- Shaheer Kidwai says:
  Jun 19, 2015 at 1:24 pm GMT
  At regular intervals, the DataNode sends a block report to the NameNode. The block report allows the NameNode to repair any divergence that may have occurred between the replica information on the NameNode and on the DataNodes. The block and replica management layer may use this revised information to enqueue block replication or deletion commands for this or other DataNodes.
  During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes.
  Heartbeats from a DataNode also carry information about total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode’s block allocation and load balancing decisions.
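The "repair any divergence" idea behind a block report can be sketched as a comparison of two sets of block IDs. This is a rough illustration only: the function and the block IDs are hypothetical, and the real NameNode bookkeeping (generation stamps, replica states, pending operations) is far more involved.

```python
def reconcile_block_report(expected_blocks, reported_blocks):
    """Compare the NameNode's view of one DataNode with that node's report.

    Returns (missing, unexpected):
      missing    - blocks the NameNode expected on the node but the node did
                   not report, so they may need new replicas elsewhere
      unexpected - blocks the node reported that the NameNode did not expect
                   there, so they may be registered or scheduled for deletion
    """
    expected = set(expected_blocks)
    reported = set(reported_blocks)
    return sorted(expected - reported), sorted(reported - expected)


if __name__ == "__main__":
    namenode_view = ["blk_1", "blk_2", "blk_3"]   # hypothetical block IDs
    datanode_report = ["blk_2", "blk_3", "blk_9"]
    missing, unexpected = reconcile_block_report(namenode_view, datanode_report)
    print("missing on DataNode:", missing)
    print("unexpected on DataNode:", unexpected)
```

This is also why the block report is needed even though the NameNode recorded the blocks at write time: disks fail and replicas get corrupted or deleted afterwards, and only the DataNode can say what it actually holds now.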
  - EdurekaSupport says:
    Jun 22, 2015 at 6:50 am GMT
    Thanks for responding to this question Shaheer.
  - Ujwala says:
    Jul 2, 2015 at 10:14 am GMT
    If the default heartbeat interval is three seconds, isn’t ten minutes too long to conclude that a DataNode is out of service? Is that the default value? Can it be configured?
    - EdurekaSupport says:
      Jul 16, 2015 at 8:59 am GMT
      Hi Ujwala, roughly 10 minutes is the default, and it can be configured. The NameNode derives the timeout from two settings: twice dfs.namenode.heartbeat.recheck-interval (default 5 minutes) plus ten times dfs.heartbeat.interval (default 3 seconds), which comes to 10 minutes 30 seconds. The NameNode waits that long for a heartbeat from the DataNode; if none arrives, it considers that DataNode to be out of service and creates new replicas of its blocks on other DataNodes.
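For reference, the timeout arithmetic can be worked through in a few lines. This sketch assumes the formula used by Apache Hadoop's NameNode, namely twice `dfs.namenode.heartbeat.recheck-interval` (milliseconds) plus ten times `dfs.heartbeat.interval` (seconds); the function name itself is hypothetical.

```python
def datanode_dead_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds without a heartbeat before a DataNode is treated as dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval,
    using the default values of the two settings as parameter defaults.
    """
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


if __name__ == "__main__":
    default_timeout = datanode_dead_timeout_seconds()
    print(f"default: {default_timeout:.0f} s")  # 630 s, i.e. 10 min 30 s
    # Lowering the recheck interval to 1 minute shortens the timeout.
    print(f"tuned:   {datanode_dead_timeout_seconds(recheck_interval_ms=60_000):.0f} s")
```

The long default is a deliberate tradeoff: a short timeout would trigger expensive re-replication every time a DataNode is briefly unreachable.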
Deven Kalra says:
Jan 15, 2015 at 4:42 am GMT
When writing the data into physical blocks on the nodes, if one node fails, does the writing process stop and go back to the NameNode, which then reassigns the nodes to write to? Or am I wrong?
- EdurekaSupport says:
  Jan 19, 2015 at 8:20 am GMT
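  Hi Deven, while data is being written to the blocks on the nodes, the NameNode receives heartbeats (a kind of signal) from the DataNodes, which indicate whether each node is alive. A failure during a write, however, is noticed first by the client: it removes the failed DataNode from the write pipeline and continues writing the block to the remaining DataNodes. The NameNode later notices that the block is under-replicated and arranges for another replica on a different DataNode; if it receives no heartbeat from the failed DataNode within the timeout period, it also marks that node as out of service.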
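What happens to in-flight data when a DataNode in the write pipeline fails can be sketched with two queues. This is a toy simulation under stated assumptions, not the HDFS client: `write_block`, the packet names and the node names are hypothetical, one packet is in flight at a time, and optional replacement of the failed node is left out.

```python
from collections import deque


def write_block(packets, pipeline, fail_node=None, fail_at_send=None):
    """Simulate a pipelined write and return (acked_packets, final_pipeline).

    fail_node drops out during the fail_at_send-th send (1-based). Packets
    that were sent but not acknowledged go back to the front of the data
    queue, so the write resumes from there instead of from the beginning.
    """
    data_queue = deque(packets)   # packets waiting to be sent
    ack_queue = deque()           # packets sent but not yet acknowledged
    acked = []
    pipeline = list(pipeline)
    sends = 0
    while data_queue:
        ack_queue.append(data_queue.popleft())
        sends += 1
        if fail_node in pipeline and sends == fail_at_send:
            pipeline.remove(fail_node)                   # drop the failed node
            data_queue.extendleft(reversed(ack_queue))   # requeue unacked packets in order
            ack_queue.clear()
            continue
        acked.append(ack_queue.popleft())                # all remaining nodes acknowledged
    return acked, pipeline


if __name__ == "__main__":
    acked, nodes = write_block(["p0", "p1", "p2", "p3"], ["dn1", "dn2", "dn3"],
                               fail_node="dn2", fail_at_send=3)
    print("acknowledged:", acked)
    print("pipeline after failure:", nodes)
```

In this run only packet `p2` is sent twice; `p0` and `p1` were already acknowledged and are not rewritten, and the block ends up on the two surviving nodes until another replica is created.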
  Hope this helps!!
  - Biswa Bihari panda says:
    Mar 5, 2015 at 4:49 pm GMT
    So will it write from the beginning?
    - EdurekaSupport says:
      Mar 10, 2015 at 7:29 am GMT
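      Not quite, Biswa! The write does not restart from the beginning. Only the packets that had not yet been acknowledged are sent again, and the rest of the block goes to the remaining DataNodes in the pipeline.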
rishikhurana17 says:
Oct 3, 2014 at 2:51 am GMT
Very well explained. Thanks!
- EdurekaSupport says:
  Oct 7, 2014 at 5:37 am GMT
  Thanks Rishi!! Feel free to go through our other blog posts as well: https://www.edureka.co/blog/category/big-data-analytics/
Praveen Sharp says:
Sep 3, 2013 at 6:07 am GMT
I’ve read similar things on other blogs. I’ll take your word for it. Stay solid! Your pal.
Hmm, that is some compelling information you’ve got going! Makes me scratch my head and think. Keep up the good writing!


