Storing collection of images in HDFS

Question

I want to perform analysis on a large collection of images. I want to store this data in HDFS and process it with MapReduce but I also want to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should I use to store this data in HDFS?

Please help!

nitinrawat895 · Answer 1 · Aug 9, 2018

Using Hadoop Sequence Files, you can store the collection of images in HDFS.
Those are map files that inherently can be read by map reduce applications – there is an input format especially for sequence files – and are splitable by map reduce, so we can have one huge file that will be the input of many map tasks. By using those sequence files we are letting hadoop use its advantages. It can split the work into chunks so the processing is parallel, but the chunks are big enough that the process stays efficient.

Since the sequence file are map file the desired format will be that the key will be text and hold the HDFS filename and the value will be BytesWritable and will contain the image content of the file.

Hope this helps!