How does Hadoop process records that are split across block boundaries in HDFS?

Question

I learnt that files stored in HDFS are divided into blocks of 128 MB each by default. What if a record is larger than 128 MB and gets split across blocks when it is stored on HDFS? How does Hadoop keep track of where the pieces are? How does the Mapper know that the record in the first block is incomplete? How does it recognize the end of the file?

ravikiran · Answer 1 · Jul 1, 2019

I found some comments in the constructor of LineRecordReader.java in the Hadoop source code:

// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
  start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;

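For every split except the last one, Hadoop reads one extra line past the end of the split (that is, the first line of the next split). For every split except the first one, the first line is thrown away. A line that straddles a split boundary is therefore read in full by the split in which it starts and skipped by the split that follows, so no line is lost or read incompletely.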
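To make the rule concrete, here is a minimal sketch that imitates it on an in-memory byte array. It is not the Hadoop implementation: the class SplitReaderSketch and its methods endOfLine and readSplit are hypothetical names made up for this illustration, the splits are plain byte ranges, and only '\n' line endings are handled. The real LineRecordReader works on input streams and covers more cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only; not part of Hadoop.
class SplitReaderSketch {

    // Index just past the next '\n' at or after pos, or data.length if there is none.
    static int endOfLine(byte[] data, int pos) {
        int i = pos;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    // Returns the lines owned by the split that covers bytes [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: throw away the first (possibly partial) line,
            // because the previous split has already read it.
            pos = endOfLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // "<=" lets the reader take one line that starts at or before 'end',
        // which is the extra line the following split will discard.
        while (pos <= end && pos < data.length) {
            int next = endOfLine(data, pos);
            int len = next - pos;
            if (data[next - 1] == '\n') {
                len--; // drop the trailing newline from the returned text
            }
            lines.add(new String(data, pos, len, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 10; // assumed tiny split size so that boundaries fall mid-line
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("split [" + start + ", " + end + ") -> "
                    + readSplit(data, start, end));
        }
    }
}
```

With a split size of 10 bytes, the first split ends in the middle of "bravo" but still returns it whole, the second split skips the tail of "bravo" and returns "charlie" and "delta", and the third split returns nothing because its only line was already read by the second. Every line comes out exactly once.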