Hadoop Distributed File System | Apache Hadoop HDFS Architecture

Uma Mahesh yadav C says:
Mar 20, 2017 at 11:51 pm GMT
Hi Ashish, thanks for explaining so clearly. Is the splitting of a file into data blocks done by the HDFS client? If so, do the moveFromLocal and put commands also split the file into data blocks? Can you please explain how this happens?
- EdurekaSupport says:
  Mar 24, 2017 at 2:57 pm GMT
  Hey Uma Mahesh, thanks for checking out our blog. We’re glad you found it useful.
  Yes, the HDFS client does the splitting. The moment we execute the copyFromLocal command (put and moveFromLocal follow the same write path), the client reads the file from the provided local path and writes it to HDFS one block at a time.
  Hope this helps. Cheers!
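The splitting discussed in this thread is plain arithmetic on byte offsets. As a rough illustration (not HDFS client code), the sketch below computes block boundaries for a file, assuming a 128 MB block size, which is the default in Hadoop 2.x and later. The function name `split_into_blocks` is made up for this example.

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block only holds whatever bytes remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

For a 300 MB file this yields two full 128 MB blocks and a final 44 MB block; the last block occupies only the space it needs.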
Nam Nguyen says:
Feb 8, 2017 at 3:55 am GMT
Thank you so much.
somu says:
Dec 17, 2016 at 2:46 am GMT
Excellent write-up! Simple enough for a layman to understand, and that is what we need.
- EdurekaSupport says:
  Dec 21, 2016 at 2:56 pm GMT
  Thanks for the wonderful feedback, Somu! Do check out some of our other HDFS blogs here: https://www.edureka.co/blog/category/big-data-analytics?s=hdfs. Cheers!
Rishav Kumar says:
Nov 8, 2016 at 5:37 am GMT
Very well explained; the sequence of the explanation is very good.
I wanted to know if Hadoop uses any compression techniques to cope with the increased disk space requirement (by default, 3 times) associated with data replication.
- EdurekaSupport says:
  Nov 9, 2016 at 9:51 am GMT
  Hey Rishav, thanks for checking out the blog. First of all, HDFS is deployed on low-cost commodity hardware, which is bound to fail. This is the most important reason for data replication, i.e. to make the system fault tolerant and reliable. And yes, Hadoop supports many compression codecs such as gzip, bzip2 and Snappy. But there is always a tradeoff between compression ratio and compression/decompression speed.
  Also, since data is stored as blocks in HDFS, codecs that cannot decompress one block without the other blocks of the same file (residing on other DataNodes) are a poor fit for parallel processing. In other words, they need the whole file for decompression. These are called non-splittable codecs; gzip is one example, whereas bzip2 is splittable. Finally, an HDFS cluster is scalable, i.e. you can add more nodes to the cluster to increase the storage capacity.
  Hope this helps. Cheers!
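To make the "non-splittable" idea concrete, here is a small sketch using only Python's standard `gzip` module. It is not Hadoop code: it simply shows that a gzip stream can be decoded from its start, while a slice taken from the middle (standing in for a later HDFS block of the same compressed file) cannot be decoded on its own. The helper name `can_decompress_alone` is made up for this example.

```python
import gzip
import zlib

original = b"hdfs block data " * 10000
compressed = gzip.compress(original)

# Simulate a reader that is handed only a later "block" of the compressed file.
offset = len(compressed) // 2
# Skip the unlikely case where the slice happens to start with the gzip magic bytes.
while compressed[offset:offset + 2] == b"\x1f\x8b":
    offset += 1
tail = compressed[offset:]


def can_decompress_alone(chunk):
    """Return True if chunk is decodable as a gzip stream by itself."""
    try:
        gzip.decompress(chunk)
        return True
    except (OSError, EOFError, zlib.error):
        return False


print("whole stream decodable:", can_decompress_alone(compressed))
print("middle slice decodable:", can_decompress_alone(tail))
```

The whole stream decodes, but the middle slice is rejected because it lacks the gzip header and the decoder state built up from the earlier bytes. That is why a single mapper has to read such a file from the beginning.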
Tanmay Jambavlikar says:
Jun 27, 2016 at 8:29 am GMT
If the NameNode fails, what are the typical steps, after addressing the relevant hardware problem, to bring the NameNode back online? I ask because the FsImage must be the last up-to-date copy of the metadata critical for the Hadoop cluster to operate, and there is no automatic failover capability. So do we somehow restore this copy on the NameNode and then start all the necessary daemons on it? Will the cluster take this FsImage file as valid input and then start operating normally?
- EdurekaSupport says:
  Nov 10, 2016 at 12:58 pm GMT
  Hey Tanmay, thanks for checking out the blog. Let us understand the NameNode recovery process through an example: I am a Hadoop admin and the NameNode in my HDFS cluster has crashed. I will take the following steps to get the cluster up and running:
  1. I will use the file system metadata replica (FsImage) to start a new NameNode.
  2. Then, I will configure the DataNodes and clients so that they can acknowledge this new NameNode that I have started.
  3. The new NameNode will start serving clients once it has finished loading the last checkpointed FsImage (for metadata information) and has received enough block reports from the DataNodes to leave safe mode.
  This can take around 30 minutes, and on large Hadoop clusters the NameNode recovery process may take even longer. It becomes an even greater challenge during routine maintenance. This is why we have the HDFS HA architecture and the HDFS Federation architecture, which are covered in a separate blog here: https://www.edureka.co/blog/overview-of-hadoop-2-0-cluster-architecture-federation/. Hope this helps. Cheers!
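The "enough block reports to leave the safe mode" condition in step 3 can be pictured as a simple ratio check. The sketch below is a simplification, not NameNode code: it assumes a threshold of 0.999, which mirrors the default of the `dfs.namenode.safemode.threshold-pct` property, and ignores the other conditions a real NameNode applies (such as a minimum number of live DataNodes and an extension period). The function name is hypothetical.

```python
def can_leave_safe_mode(reported_blocks, total_blocks, threshold=0.999):
    """Return True once the fraction of reported blocks reaches the threshold.

    Simplified model: reported_blocks counts blocks for which the NameNode has
    heard of enough replicas; total_blocks is the number of blocks in the
    loaded FsImage.
    """
    if total_blocks == 0:
        # An empty namespace has nothing to wait for.
        return True
    return reported_blocks / total_blocks >= threshold


# With one million blocks in the namespace:
print(can_leave_safe_mode(900_000, 1_000_000))
print(can_leave_safe_mode(999_500, 1_000_000))
```

Under these assumptions the NameNode stays in safe mode while only 90% of the blocks have been reported and may leave it once 99.95% have been.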
Hareesh@Disqus says:
Aug 9, 2015 at 4:51 pm GMT
I have the following questions regarding HDFS and MapReduce:
1. Is it possible to store multiple files in HDFS with different block sizes?
2. Is it possible to give a whole file as input to a mapper?
Thanks
Hareesh A
- EdurekaSupport says:
  Aug 10, 2015 at 9:53 am GMT
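  Hi Hareesh,
  Thank you for reaching out to us.
  Yes, both are possible. Block size is a per-file attribute in HDFS, set when the file is written (the dfs.blocksize property supplies the default), so files with different block sizes can coexist in the same cluster. To give a whole file to a single mapper, use an input format whose isSplitable() method returns false, so that the file is never divided into several input splits. You can get in touch with us for further clarification by contacting our sales team on +91-8880862004 (India) or 1800 275 9730 (US toll free). You can also mail us at sales@edureka.co.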
Bhupendra Pandey says:
Jun 17, 2015 at 5:56 am GMT
What is a block report? Why do DataNodes need to send it to the NameNode at regular intervals? Doesn’t the NameNode already store the metadata and block details in the namespace at the time of the file write?
- Shaheer Kidwai says:
  Jun 19, 2015 at 1:24 pm GMT
  At regular intervals, the DataNode sends a block report to the NameNode. The block report allows the NameNode to repair any divergence that may have occurred between the replica information on the NameNode and on the DataNodes. Block and replica management may use this revised information to enqueue block replication or deletion commands for this or other DataNodes.
  During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes.
  Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode’s block allocation and load balancing decisions.
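The "repair any divergence" step can be thought of as comparing two sets of block IDs. The following is only a sketch of that idea, not NameNode code; the function name and the `blk_...` IDs are invented for illustration.

```python
def reconcile_block_report(expected_blocks, reported_blocks):
    """Compare the NameNode's view of one DataNode with that DataNode's block report.

    Returns (missing, unexpected):
      missing    - blocks the NameNode believes are on the DataNode but were not
                   reported; candidates for re-replication from other replicas.
      unexpected - blocks the DataNode reported that the NameNode has no record
                   of; candidates for deletion.
    """
    expected = set(expected_blocks)
    reported = set(reported_blocks)
    missing = sorted(expected - reported)
    unexpected = sorted(reported - expected)
    return missing, unexpected


namenode_view = ["blk_1", "blk_2", "blk_3"]
block_report = ["blk_2", "blk_3", "blk_9"]
missing, unexpected = reconcile_block_report(namenode_view, block_report)
print("re-replicate:", missing)
print("delete:", unexpected)
```

Here `blk_1` has gone missing from the DataNode and `blk_9` is a replica the NameNode does not know about, which matches the replication and deletion commands mentioned above.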
  - EdurekaSupport says:
    Jun 22, 2015 at 6:50 am GMT
    Thanks for responding to this question Shaheer.
  - Ujwala says:
    Jul 2, 2015 at 10:14 am GMT
    If the default heartbeat interval is three seconds, isn’t ten minutes too long to conclude that a DataNode is out of service? Is that the default? Can it be configured?
    - EdurekaSupport says:
      Jul 16, 2015 at 8:59 am GMT
      Hi Ujwala, the default timeout is about 10 minutes (10 minutes 30 seconds, to be exact), and it can be changed. The NameNode computes it as 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval, which with the defaults of 5 minutes and 3 seconds comes to 630 seconds. The NameNode waits that long for a heartbeat from a DataNode; if none arrives, it considers that DataNode to be out of service and creates new replicas of its blocks on other DataNodes.
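For reference, the roughly ten-minute figure is derived from two settings: `dfs.heartbeat.interval` (in seconds, default 3) and `dfs.namenode.heartbeat.recheck-interval` (in milliseconds, default 300000). The NameNode treats a DataNode as dead after 2 × recheck interval + 10 × heartbeat interval. The sketch below just reproduces that arithmetic; the function name is made up for this example.

```python
def datanode_dead_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds without a heartbeat before a DataNode is considered dead.

    Mirrors the formula 2 * recheck-interval + 10 * heartbeat-interval,
    with the recheck interval converted from milliseconds to seconds.
    """
    return 2 * recheck_interval_ms // 1000 + 10 * heartbeat_interval_s


print(datanode_dead_timeout_seconds())                            # defaults
print(datanode_dead_timeout_seconds(recheck_interval_ms=60_000))  # shorter recheck
```

With the defaults this gives 630 seconds (10 minutes 30 seconds). Lowering the recheck interval to one minute brings it down to 150 seconds, at the cost of reacting more aggressively to brief network hiccups.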
Deven Kalra says:
Jan 15, 2015 at 4:42 am GMT
When data is being written to physical blocks on the nodes and one node fails, does the write stop and go back to the NameNode, which then picks new nodes to write to? Or am I wrong?
- EdurekaSupport says:
  Jan 19, 2015 at 8:20 am GMT
  Hi Deven, while data is being written, the NameNode receives heartbeats (a kind of signal) from the DataNodes, which indicate whether each node is alive. A failure in the middle of a write, however, is handled by the client: it closes the pipeline, drops the failed DataNode, and continues writing the rest of the block to the remaining DataNodes. The NameNode later notices that the block is under-replicated and arranges for another replica to be created on a different DataNode.
  Hope this helps!!
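For context, when a DataNode fails in the middle of a write, the HDFS client does not start over. It puts the packets that were sent but not yet acknowledged back at the front of its data queue, removes the failed node from the pipeline, and carries on with the remaining nodes. The sketch below models only that bookkeeping with two queues; the names (`recover_from_datanode_failure`, `dn1`, `p1`, and so on) are illustrative and not taken from the Hadoop source.

```python
from collections import deque


def recover_from_datanode_failure(pipeline, failed_node, data_queue, ack_queue):
    """Re-queue unacknowledged packets and return the pipeline without the failed node."""
    # Move unacknowledged packets back to the FRONT of the data queue.
    # Popping from the right and pushing on the left preserves their order.
    while ack_queue:
        data_queue.appendleft(ack_queue.pop())
    return [node for node in pipeline if node != failed_node]


pipeline = ["dn1", "dn2", "dn3"]
data_queue = deque(["p3", "p4"])  # packets not yet sent
ack_queue = deque(["p1", "p2"])   # packets sent but not yet acknowledged

new_pipeline = recover_from_datanode_failure(pipeline, "dn2", data_queue, ack_queue)
print(new_pipeline)
print(list(data_queue))
```

After recovery the pipeline is `dn1`, `dn3` and the data queue holds `p1` through `p4` in order, so nothing that was already acknowledged is written again and nothing unacknowledged is lost.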
  - Biswa Bihari panda says:
    Mar 5, 2015 at 4:49 pm GMT
    So will it write from the beginning?
    - EdurekaSupport says:
      Mar 10, 2015 at 7:29 am GMT
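      No, Biswa. The write does not start over. Packets that have not yet been acknowledged are resent, and the rest of the block is written to the remaining DataNodes in the pipeline.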
rishikhurana17 says:
Oct 3, 2014 at 2:51 am GMT
Very well explained. Thanks!
- EdurekaSupport says:
  Oct 7, 2014 at 5:37 am GMT
  Thanks Rishi!! Feel free to go through our other blog posts as well: https://www.edureka.co/blog/category/big-data-analytics/
Praveen Sharp says:
Sep 3, 2013 at 6:07 am GMT
I’ve read similar things on other blogs. I’ll take your word for it. Stay solid! Your pal.
Hmm, that is some compelling information you’ve got going! Makes me scratch my head and think. Keep up the good writing!


