How can Hadoop process records that are split across block boundaries?

Assume a record line is split between two blocks (b1 and b2). The mapper processing the first block (b1) notices that the last line has no EOL separator and fetches the remainder of the line from the next block (b2).

How does the mapper processing the second block (b2) understand that the first record is incomplete and that it should start processing from the second record in the block (b2)?
Apr 15 in Big Data Hadoop by nitinrawat895

First of all, the MapReduce framework does not operate on the physical blocks of a file; it is designed to work on logical input splits. HDFS divides a file into fixed-size blocks without regard to record boundaries, so a single record can be physically spread across two blocks. An input split, by contrast, respects record boundaries: its effective start and end are adjusted so that no record is cut in half, which means one mapper may read past its block into the next one, and a record split across two blocks is still processed by a single mapper.

HDFS divides files into blocks of 128 MB each by default and replicates each block before storing it (the default replication factor is three). These blocks are then distributed across different nodes in the Hadoop cluster.
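As a quick illustration of that division, the number of blocks for a given file is just a ceiling division of the file size by the block size (the 300 MB file size here is a made-up example):

```java
public class BlockCount {
    public static void main(String[] args) {
        final long BLOCK_SIZE = 128L * 1024 * 1024;   // HDFS default block size: 128 MB
        long fileSize = 300L * 1024 * 1024;           // hypothetical 300 MB file
        // Ceiling division: any remainder occupies one extra (partial) block
        long blocks = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
        System.out.println(blocks);                   // 3 (128 MB + 128 MB + 44 MB)
    }
}
```

Note that the block cut after the first 128 MB falls wherever it falls, typically in the middle of a record.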

HDFS has no regard for the data inside those files: a record can start in block A while the end of that record sits in block B.

To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a client submits a MapReduce job, the framework calculates the total number of input splits and, for each split, determines where the first complete record starts and where the last record finishes.

In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record.
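The boundary handling described above can be sketched in a few lines. This is a simplified, self-contained illustration, not Hadoop's actual LineRecordReader; the class and method names are invented for the example. The two rules it demonstrates are: a reader whose range does not start at a record boundary skips the partial first line (the previous split's reader owns it), and every reader keeps reading past the end of its range until the newline that completes its last record.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a reader assigned the byte range [start, end) recovers whole lines:
//  - if the range starts mid-record, skip ahead to the next record boundary,
//    because the previous split's reader will have consumed that partial line;
//  - keep reading past `end` until the newline that completes the last record.
public class SplitReader {
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // Skip a record begun in the previous split (only if we start mid-line)
        if (start > 0) {
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        List<String> records = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        // Read within the range, overrunning `end` only to finish an open record
        while (pos < data.length && (pos < end || line.length() > 0)) {
            char c = (char) data[pos++];
            if (c == '\n') {
                records.add(line.toString());
                line.setLength(0);
            } else {
                line.append(c);
            }
        }
        if (line.length() > 0) records.add(line.toString()); // no trailing newline
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        int boundary = 8; // falls in the middle of "bravo"
        System.out.println(readSplit(data, 0, boundary));           // [alpha, bravo]
        System.out.println(readSplit(data, boundary, data.length)); // [charlie]
    }
}
```

Notice that every record ends up in exactly one result list even though the boundary cuts "bravo" in half: the first reader overruns its range to finish the record, and the second reader skips the fragment it starts on.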

answered Apr 15 by nitinrawat895
