How are input splits handled when 2 blocks are stored on different nodes?

Question

Say I have a large file that is split into two HDFS blocks, and the blocks are physically stored on 2 different machines. Assume no node in the cluster hosts both blocks locally. As I understand it, with TextInputFormat the split size is normally the same as the HDFS block size. Since there are 2 splits, 2 map tasks will be spawned on the 2 separate machines that hold the blocks locally. Now assume the HDFS text file was cut in the middle of a line when the blocks were formed. Would Hadoop copy block 2 from the 2nd machine to the first machine so that it can supply the first (broken) half-line from the 2nd block to complete the last, broken line of the first block?

Gitika · Answer 1 · Dec 7, 2020

Hadoop doesn't copy whole blocks to the node running the map task. The data is streamed from the data node to the task node in small chunks (with some sensible transfer size, such as 4 KB). So in the example you give, the map task that processes the first block reads the entire first block and then keeps reading from the second block, over the network, until it finds the end-of-line character. The map task for the second block, in turn, skips the partial line at the start of its block, because that line belongs to the previous split. So the reads are 'mostly' local.
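To make the boundary rule concrete, here is a minimal Python sketch of how a line-oriented reader could decide which lines belong to which split. It is a simplified model, not Hadoop's actual code: the function name `read_split` and the in-memory `bytes` buffer standing in for the HDFS file are assumptions made for illustration, and it assumes `\n` as the only line terminator.

```python
def read_split(data: bytes, start: int, length: int) -> list:
    """Return the lines a reader assigned to data[start:start+length] would emit.

    Simplified model (assumption): the whole file is available as `data`,
    and reading past `start + length` stands in for a remote read from
    the next block.
    """
    end = start + length
    pos = start

    if start != 0:
        # Not the first split: discard everything up to and including the
        # first newline, because that line belongs to the previous split.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1

    records = []
    # A line is emitted by this split if it starts at or before `end`,
    # even if its bytes run past the split boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = stop + 1
    return records


# Three lines, 26 bytes in total; the 13-byte boundary falls inside "bravo charlie".
data = b"alpha\nbravo charlie\ndelta\n"
first = read_split(data, 0, 13)    # reads past byte 13 to finish the broken line
second = read_split(data, 13, 13)  # skips the tail of "bravo charlie"
print(first, second)
```

Every line comes out exactly once: the first reader finishes the broken line by reading a little past its own split, and the second reader drops the fragment it starts in.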

How much of the second block is read remotely depends on how long the line is. It's entirely possible for a file spanning 3 blocks to be processed by 3 map tasks where the second map task processes essentially no records (while still reading all of the data in block 2 and some of block 3), if a line starts in block 1 and ends in block 3.
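The 3-block case can be simulated with a small Python sketch. Again this is a simplified model under stated assumptions rather than Hadoop's implementation: `scan_split`, the 8-byte `BLOCK` size, and the sample data are all made up for illustration, with one long line that starts in block 1 and ends in block 3.

```python
def scan_split(data: bytes, start: int, length: int):
    """Return (records, stop_offset) for the split data[start:start+length].

    `stop_offset` is the file offset at which this reader stopped, which
    shows how far past its own block it had to read.
    """
    end = start + length
    pos = start
    if start != 0:
        # Skip the partial first line; it belongs to an earlier split.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = min(stop + 1, len(data))
    return records, pos


BLOCK = 8  # hypothetical tiny block size so the example stays readable
# 24 bytes = 3 blocks; the run of 'x' starts at offset 3 and its newline is at offset 20.
data = b"ab\n" + b"x" * 17 + b"\n" + b"yz\n"

results = [scan_split(data, i * BLOCK, BLOCK) for i in range(3)]
for i, (records, stop) in enumerate(results):
    print(f"task {i}: {len(records)} record(s), stopped at offset {stop}")
```

In this model the first task emits the long line after reading through block 2 and into block 3, the second task scans from offset 8 to offset 21 looking for a newline and emits nothing, and the third task emits only the final short line.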

Hope this makes sense