How to pass large records to map reduce tasks

Question

I'm trying to use map/reduce to process large amounts of binary data. The application is characterized by the following: the number of records is potentially large, such that I don't really want to store each record as a separate file in HDFS (I was planning to concatenate them all into a single binary sequence file), and each record is a large coherent (i.e. non-splittable) blob, between one and several hundred MB in size. The records will be consumed and processed by a C++ executable. If it weren't for the size of the records, the Hadoop Pipes API would be fine: but this seems to be based around passing the input to map/reduce tasks as a contiguous block of bytes, which is impractical in this case.

I'm not sure of the best way to do this. Does any kind of buffered interface exist that would allow each M/R task to pull multiple blocks of data in manageable chunks? Otherwise I'm thinking of passing file offsets via the API and streaming in the raw data from HDFS on the C++ side.

I'd like to have any opinions from anyone who's tried anything similar - I'm pretty new to hadoop.

Frankie · Answer 1 · Sep 25, 2018

Hadoop is not designed for records about 100MB in size. You will get OutOfMemoryError and uneven splits because some records are 1MB and some are 100MB. By Ahmdal's Law your parallelism will suffer greatly, reducing throughput.

I see two options. You can use Hadoop streaming to map your large files into your C++ executable as-is. Since this will send your data via stdin it will naturally be streaming and buffered. Your first map task must break up the data into smaller records for further processing. Further tasks then operate on the smaller records.

If you really can't break it up, make your map reduce job operate on file names. The first mapper gets some file names, runs them thorough your mapper C++ executable, stores them in more files. The reducer is given all the names of the output files, repeat with a reducer C++ executable. This will not run out of memory but it will be slow. Besides the parallelism issue you won't get reduce jobs scheduled onto nodes that already have the data, resulting in non-local HDFS reads.