How to analyze block placement on datanodes and rebalancing data across Hadoop nodes

Question

I want to see that how block placement on datanodes and rebalancing data happens across Hadoop nodes? What is the case when it is required?

nitinrawat895 · Answer 1 · Jun 21, 2018

HDFS provides a tool for administrators i.e. BALANCER that analyzes block placement and rebalances data across the Datanode. This can be done manually by giving command (bin/Hadoop balancer) or can be set to run when disk usage reaches a particular percentage (bin/Hadoop balancer –threshold % ).

BALANCER moves blocks from over utilized nodes to under utilized nodes, thus making sure that data is evenly distributed across nodes. It ensures balanced data density across cluster.

When a new data node joins hdfs cluster, it does not hold much data. So any map task assigned to the machine most likely does not read local data, thus increasing the use of network bandwidth. On the other hand, when some data nodes become full, new data blocks are placed on only non-full data nodes, thus reducing their read parallelism. If a data node fails, data needs to be re replicated on existing nodes which might cause data density to be higher on some nodes.

Hope it will answer your query to some extent.