How can Hadoop process the records that are split across the block boundaries?

Assume a record line is split between two blocks (b1 and b2). The mapper processing the first block (b1) notices that the last line has no EOL separator and fetches the remainder of the line from the next block of data (b2).

But how does the mapper processing the second block (b2) know that its first record is incomplete and that it should start processing from the second record in the block (b2)?
Apr 15 in Big Data Hadoop by nitinrawat895


First of all, the MapReduce framework does not operate on the physical blocks of a file; it is designed to work on logical input splits. Every file you put into HDFS is divided into blocks of a default size, and those block boundaries fall wherever the byte count dictates, with no regard for record boundaries. An input split, by contrast, is defined in terms of records, so a record that physically spans two blocks still belongs to exactly one split and is processed by exactly one mapper.

HDFS divides files into blocks of 128 MB each by default and replicates the data before storing it; the default replication factor is three. These blocks are then distributed across different nodes in the Hadoop cluster.
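The block division described above is plain byte arithmetic. As a minimal sketch (the helper name and the 300 MB example file are illustrative, not part of HDFS):

```python
# Sketch: how a file is cut into fixed-size HDFS blocks (default 128 MB).
# Block boundaries are purely byte offsets -- they ignore record boundaries.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def block_ranges(file_size, block_size=BLOCK_SIZE):
    """Return (start_offset, length) byte ranges for each block of a file."""
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
print(block_ranges(300 * 1024 * 1024))
```

Because the last block simply holds whatever bytes remain, a record sitting across the 128 MB mark is cut mid-way, which is exactly the situation the question describes.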

HDFS has no regard for the content of those files. A record can therefore start in block A while the rest of it sits in block B.

To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a client submits a MapReduce job, Hadoop calculates the total number of input splits and works out where the first record in each split starts and where the last record finishes.
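The split size itself comes from a simple rule in Hadoop's `FileInputFormat` (its `computeSplitSize` method): the block size, clamped between a configurable minimum and maximum. A one-line sketch of that rule:

```python
# Sketch of FileInputFormat.computeSplitSize:
#   splitSize = max(minSize, min(maxSize, blockSize))
# With the default min (1) and max (Long.MAX_VALUE), splits align to the block size.
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    return max(min_size, min(max_size, block_size))

print(compute_split_size(128 * 1024 * 1024))  # 134217728 -- one split per block by default
```

This is why splits normally line up one-to-one with blocks, and why the record-boundary fix-up described next is needed at each split edge.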

In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record.
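Hadoop's `LineRecordReader` implements this with two complementary conventions: a reader whose split does not start at byte 0 discards its first (possibly partial) line, because the previous reader owns it; and every reader keeps reading past its split's end until it finishes the record it has started. A simplified simulation of those two rules (the function and sample data are illustrative, not Hadoop code):

```python
# Sketch (assumption): simplified simulation of the LineRecordReader rules.
# Rule 1: if the split does not start at byte 0, skip the first line -- the
#         previous split's reader will have read past its end to consume it.
# Rule 2: keep reading past `end` to finish the last record that starts
#         inside the split.

def read_split(data: bytes, start: int, end: int):
    """Yield the complete records belonging to the split [start, end)."""
    pos = start
    if start != 0:
        # Rule 1: skip the (possibly partial) first line.
        nl = data.find(b"\n", pos)
        pos = nl + 1 if nl != -1 else len(data)
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = nl + 1 if nl != -1 else len(data)
        yield data[pos:line_end].rstrip(b"\n")
        pos = line_end  # Rule 2: this may run past `end`.

data = b"record1\nrecord2\nrecord3\n"
# Cut the data mid-record at byte 10 (inside "record2"), like a block boundary.
part1 = list(read_split(data, 0, 10))           # finishes record2 past byte 10
part2 = list(read_split(data, 10, len(data)))   # skips the partial record2
print(part1)  # [b'record1', b'record2']
print(part2)  # [b'record3']
```

Together the two rules guarantee every record is read exactly once, no matter where the block boundary falls.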


answered Apr 15 by nitinrawat895

