How can Hadoop process records that are split across block boundaries?

Assume a record line is split between two blocks (b1 and b2). The mapper processing the first block (b1) notices that the last line has no EOL separator and fetches the remainder of the line from the next block (b2).

How does the mapper processing the second block (b2) understand that the first record is incomplete and that it should start processing from the second record in the block (b2)?
Apr 15 in Big Data Hadoop by nitinrawat895

First of all, the MapReduce framework does not operate on the physical blocks of a file; it is designed to work on logical input splits. HDFS divides a file into fixed-size blocks without regard to record boundaries, so a single record can be physically spread across two blocks. An input split, by contrast, respects record boundaries: its effective start and end are adjusted so that no record is cut in half, which means one mapper may read past its block into the next one, and a record split across two blocks is still processed by a single mapper.

HDFS divides files into blocks of 128 MB each by default and replicates each block before storing it (the default replication factor is three). These blocks are then distributed across different nodes in the Hadoop cluster.
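As a quick illustration of that division, the number of blocks for a given file is just a ceiling division of the file size by the block size (the 300 MB file size here is a made-up example):

```java
public class BlockCount {
    public static void main(String[] args) {
        final long BLOCK_SIZE = 128L * 1024 * 1024;   // HDFS default block size: 128 MB
        long fileSize = 300L * 1024 * 1024;           // hypothetical 300 MB file
        // Ceiling division: any remainder occupies one extra (partial) block
        long blocks = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
        System.out.println(blocks);                   // 3 (128 MB + 128 MB + 44 MB)
    }
}
```

Note that the block cut after the first 128 MB falls wherever it falls, typically in the middle of a record.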

HDFS has no regard for the data inside those files: a record can start in block A while the end of that record sits in block B.

To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a client submits a MapReduce job, the framework calculates the total number of input splits and, for each split, determines where the first complete record starts and where the last record finishes.

In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record.
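The boundary handling described above can be sketched in a few lines. This is a simplified, self-contained illustration, not Hadoop's actual LineRecordReader; the class and method names are invented for the example. The two rules it demonstrates are: a reader whose range does not start at a record boundary skips the partial first line (the previous split's reader owns it), and every reader keeps reading past the end of its range until the newline that completes its last record.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a reader assigned the byte range [start, end) recovers whole lines:
//  - if the range starts mid-record, skip ahead to the next record boundary,
//    because the previous split's reader will have consumed that partial line;
//  - keep reading past `end` until the newline that completes the last record.
public class SplitReader {
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // Skip a record begun in the previous split (only if we start mid-line)
        if (start > 0) {
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        List<String> records = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        // Read within the range, overrunning `end` only to finish an open record
        while (pos < data.length && (pos < end || line.length() > 0)) {
            char c = (char) data[pos++];
            if (c == '\n') {
                records.add(line.toString());
                line.setLength(0);
            } else {
                line.append(c);
            }
        }
        if (line.length() > 0) records.add(line.toString()); // no trailing newline
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        int boundary = 8; // falls in the middle of "bravo"
        System.out.println(readSplit(data, 0, boundary));           // [alpha, bravo]
        System.out.println(readSplit(data, boundary, data.length)); // [charlie]
    }
}
```

Notice that every record ends up in exactly one result list even though the boundary cuts "bravo" in half: the first reader overruns its range to finish the record, and the second reader skips the fragment it starts on.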

answered Apr 15 by nitinrawat895
