How does compression work in Hadoop?


In my MapReduce job, suppose I set the compression for either the map or reduce output to LZO. How does the data get compressed? Is the entire output of the map or reduce task first produced uncompressed and then compressed, or is it compressed incrementally as it is written? If it is compressed incrementally, how is that done?

Jul 26, 2018 in Big Data Hadoop by Neha

1 answer to this question.


It basically depends on the file type you use. If it is a plain text file, compression happens at the whole-file level.

If it is a SequenceFile, compression can happen at the record level or at the block level.

Note that here "block" means the in-memory buffer the SequenceFile uses, not an HDFS block.
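For example, here is a minimal sketch of writing a block-compressed SequenceFile with the Hadoop 2.x API; the output path and the choice of DefaultCodec (zlib) are placeholders for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedSeqFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("demo.seq"); // hypothetical output path
        // CompressionType.RECORD compresses each value on its own;
        // CompressionType.BLOCK buffers many records and compresses them together.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK,
                                                new DefaultCodec()))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }
    }
}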

With block compression, multiple records are compressed into a block at once. Records are added to a block until it reaches a minimum size in bytes. The maximum amount of input data that can be compressed at a time is the buffer size minus the maximum overhead of the compression algorithm. The default buffer size is 512 bytes, and for the zlib algorithm the compression overhead is 18 bytes (1% of the buffer size plus 12 bytes), so at most 512 - 18 = 494 bytes of input are compressed per call.

A BlockCompressorStream is then created with the given output stream and compressor, and the compressed data is written through it.
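To make that concrete, the snippet below is a rough illustration (not the exact SequenceFile internals) of driving a BlockCompressorStream directly with a zlib-backed compressor; the output file name is made up, and the 512/18 sizing simply mirrors the defaults mentioned above:

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BlockCompressorStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressorStreamDemo {
    public static void main(String[] args) throws Exception {
        DefaultCodec codec = new DefaultCodec(); // zlib-backed codec
        codec.setConf(new Configuration());
        Compressor compressor = codec.createCompressor();
        // bufferSize = 512 bytes, compressionOverhead = 18 bytes (1% + 12),
        // so each compressed block holds at most 494 bytes of input.
        // The output uses Hadoop's internal length-prefixed block format.
        try (BlockCompressorStream out = new BlockCompressorStream(
                new FileOutputStream("blocks.bin"), compressor, 512, 18)) {
            out.write("record one".getBytes("UTF-8"));
            out.write("record two".getBytes("UTF-8"));
            out.finish(); // compress and flush any data still buffered
        }
    }
}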

If you enable compression for the map stage (mapreduce.map.output.compress=true), the intermediate map output is compressed with whatever codec you have specified (mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.*) and written to disk when each map task completes (or earlier, if the map task exceeds its serialization buffer limit and begins to spill to disk). The compressed data is then read from disk and sent to the appropriate nodes during the Shuffle & Sort stage of your MapReduce job.
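As a sketch, enabling this from a driver might look like the following; the LzoCodec class name assumes the separately installed hadoop-lzo library (the Snappy and default zlib codecs ship with Hadoop itself):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CompressedMapOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output before it is spilled to disk.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Codec to use; assumes hadoop-lzo is on the classpath.
        conf.set("mapreduce.map.output.compress.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        Job job = Job.getInstance(conf, "compressed-map-output");
        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}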

I hope this answer helps :)

answered Jul 26, 2018 by Frankie
