How does Hadoop process records that are split across block boundaries in HDFS?

0 votes
I learnt that files stored in HDFS are divided into blocks of 128 MB each by default. What if a record is larger than 128 MB, or otherwise ends up divided across blocks when it is stored on HDFS? How does Hadoop map the locations of the divided pieces? How does the Mapper know that the record in the first block is incomplete? And how does it recognize the end of the file?
Jul 1, 2019 in Big Data Hadoop by nitinrawat895

1 answer to this question.

0 votes

I found these comments in the Hadoop source code, in the constructor of LineRecordReader.java:

// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
  start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;

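For every split except the last, Hadoop reads one extra line: when the reader reaches the end of its own split, it keeps reading into the next split until that line is finished. Every split except the first throws away its first line, because the reader of the previous split has already consumed it. As a result, no line record is lost, duplicated, or left incomplete, even when it straddles a block boundary.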
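To see why this rule covers every line exactly once, here is a minimal sketch that imitates it on an in-memory byte array. It is not Hadoop's code or API: the class SplitReaderSketch and the methods readSplit, skipLine and readAll are hypothetical names made up for illustration, the byte array stands in for a file on HDFS, and splitSize stands in for the block size.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the "skip first line, read one extra line" rule.
// Assumption: '\n' is the only record delimiter and the data is in memory.
class SplitReaderSketch {

    // Lines that the reader responsible for split [start, end) would emit.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: discard everything up to and including
            // the first newline. The previous split's reader emits that line.
            pos = skipLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // "<=" (not "<") is the "one extra line": a line that starts exactly
        // at `end` belongs to this reader, since the next reader discards it.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            boolean endsWithNewline = next > pos && data[next - 1] == '\n';
            int lineEnd = endsWithNewline ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    // Index just past the next '\n' at or after pos, or data.length if none.
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Cuts the data into fixed-size splits and concatenates what each reader emits.
    static List<String> readAll(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // A split size of 6 cuts "charlie" and "delta" in the middle.
        System.out.println(readAll(data, 6));
    }
}
```

In this sketch a reader simply keeps indexing past `end` to finish a line. In a real cluster the equivalent read goes through the HDFS client, which may fetch the remaining bytes from a block stored on another node.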

answered Jul 1, 2019 by ravikiran
