How input splits work in Hadoop

Question

How does the input split work? What is the process?

Omkar · Answer 1 · Dec 27, 2018

HDFS divides a large file into fixed-size blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later) and stores these blocks on the slave nodes (DataNodes). HDFS is unaware of the content of a block, so blocks are cut by byte count and not at record boundaries. If a record crosses a block boundary, the first part of the record is written to one block and the rest is written to the next block.
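To make the boundary problem concrete, here is a minimal sketch in plain Python (not the Hadoop API). The function name `split_into_blocks` and the tiny 8-byte block size are assumptions chosen for illustration only; real HDFS blocks are tens of megabytes.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into fixed-size chunks by byte count, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Three newline-terminated records, 20 bytes in total.
data = b"alpha\nbravo\ncharlie\n"

# Hypothetical 8-byte block size so the effect is visible on a small input.
blocks = split_into_blocks(data, 8)

for number, block in enumerate(blocks):
    print(number, block)
# 0 b'alpha\nbr'
# 1 b'avo\nchar'
# 2 b'lie\n'
```

The records "bravo" and "charlie" each end up spread over two blocks, which is exactly the situation described above.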

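Hadoop handles this through a logical representation of the data known as an input split. A split does not contain data; it only records which file to read, the starting byte offset, the length, and the nodes that host the data. For file-based input formats the MapReduce client calculates the splits from byte ranges, typically one per block. Record boundaries are resolved when the split is read: if the last record of a split overflows into the next block, the record reader keeps reading from that next block (over the network if needed) until the record is complete, and the reader of the following split skips the partial record at its start. This way every record is processed by exactly one mapper.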
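The sketch below shows one way a record that crosses a boundary can be completed. It is plain Python, loosely modeled on how line-oriented reading in Hadoop behaves, and is not the actual Hadoop implementation. The function name `read_split` and the 8-byte split size are assumptions made for illustration. A reader that does not start at offset 0 skips everything up to the first newline, and every reader is allowed to read past the end of its split to finish its last record.

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the newline-terminated records that belong to one split.

    Simplified model: a split is just (start offset, length) into `data`.
    """
    end = start + length
    pos = start

    if start != 0:
        # The reader of the previous split reads through the first newline
        # at or after `start`, so skip past it here.
        newline = data.find(b"\n", pos)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    # Start a record only if it begins at or before the split end;
    # reading may continue past `end` to complete that record.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop])
        pos = stop + 1
    return records


data = b"alpha\nbravo\ncharlie\n"
splits = [(0, 8), (8, 8), (16, 4)]  # hypothetical 8-byte splits

results = [read_split(data, start, length) for start, length in splits]
print(results)
# [[b'alpha', b'bravo'], [b'charlie'], []]
```

The first reader finishes "bravo" by reading past byte 8, the second reader skips the tail of "bravo" and picks up "charlie", and the third finds nothing left to read. Each record is returned exactly once.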