Hadoop Streaming is a utility that allows one to run MapReduce jobs on the cluster with any executable. The executable reads its input from STDIN and writes its results to STDOUT.
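As a minimal sketch of what such an executable looks like, assuming a simple word-count job: a streaming mapper is just an ordinary script that reads lines from STDIN and prints tab-separated key-value pairs to STDOUT.

```python
import sys

def map_line(line):
    """Turn one input line into a list of (key, value) pairs.
    For a word count, the key is a word and the value is 1."""
    return [(word, 1) for word in line.split()]

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds input records on STDIN, one per line,
    # and expects "key<TAB>value" lines back on STDOUT.
    for line in stdin:
        for key, value in map_line(line):
            stdout.write(f"{key}\t{value}\n")

if __name__ == "__main__":
    run_mapper()
```

A script like this would then be handed to the streaming jar with its `-mapper` option.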
There is an input reader which is responsible for reading the raw data and turning it into key-value pairs. With MapReduce we can read and write arbitrary types of data, such as CSV or other delimited formats. These input readers contain all of the logic for interpreting the input. Hence, if the input is a sound file, all that has to be done is to put the logic for reading that particular .wav file into the appropriate input reader and then run it in MapReduce.
Similarly, if it’s an image, you have to create a proper input format, or input reader, which reads the image file from disk and produces key-value pairs. That is what an input format is all about. The goal of an input format, ultimately, is to read the raw data, convert it from its raw format into a list of key-value pairs, and feed those pairs into the map phase.
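The idea can be sketched with a toy "input reader" for CSV data. This is an illustration of the concept, not Hadoop's actual InputFormat API; the choice of the first column as the key is an assumption for the example.

```python
import csv
import io

def csv_input_reader(raw_data):
    """A toy 'input reader': converts raw CSV text into a list of
    (key, value) pairs, using the first column as the key and the
    remaining columns as the value."""
    pairs = []
    for row in csv.reader(io.StringIO(raw_data)):
        if row:  # skip blank lines in the raw input
            pairs.append((row[0], row[1:]))
    return pairs
```

An input reader for a .wav or image file would do the same job, only with format-specific parsing logic in place of the CSV parsing.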
Coming back to the dotted box, it’s much like a regular MapReduce job. Then comes the map function – it takes key-value pairs and produces another set of key-value pairs. This intermediate data is shuffled and sorted so that all values for the same key come together, and is sent to reduce. Reduce will receive a single key and the list of values associated with that key, and its result can finally be written into the output file.
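The reduce step above can be sketched as follows, assuming the intermediate pairs arrive sorted by key, as the shuffle guarantees. The `sum_reducer` name and the summing of counts are assumptions fitting a word-count job, not part of the original text:

```python
from itertools import groupby
import operator

def sum_reducer(pairs):
    """Receive (key, value) pairs sorted by key, and emit one
    (key, total) pair per distinct key -- the word-count reduce step."""
    results = []
    for key, group in groupby(pairs, key=operator.itemgetter(0)):
        # 'group' yields every pair sharing this key; combine their values
        results.append((key, sum(value for _, value in group)))
    return results
```

The grouping works only because the input is sorted by key; that is exactly why the shuffle phase sorts the intermediate data before handing it to reduce.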
Just as the input reader converts raw data into key-value pairs, the output writer does the opposite: it takes key-value pairs and produces the output format you wish to generate. So far this is a regular MapReduce job – so how can an executable in an arbitrary language be plugged in? Each map task runs on an individual node in the cluster as its own process, one process per map task. To use Perl, Python, and so on, the framework runs that language's executable in a separate process and streams records to and from it.
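That separate-process mechanism can be sketched with a child process: the framework writes each input record to the executable's STDIN and reads the transformed records back from its STDOUT. The inline upper-casing mapper below is a made-up stand-in for a script "in another language", not Hadoop's own code:

```python
import subprocess
import sys

# A made-up mapper standing in for a script written in another
# language -- here just a second Python interpreter that
# upper-cases every input line.
MAPPER_SCRIPT = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
)

def stream_through(records):
    """Pipe records through an external executable, the way Hadoop
    Streaming feeds a mapper via STDIN and collects its STDOUT."""
    proc = subprocess.run(
        [sys.executable, "-c", MAPPER_SCRIPT],
        input="".join(r + "\n" for r in records),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()
```

Because the contract is just lines on STDIN and STDOUT, the child process can be written in any language at all.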
Got a question for us? Mention it in the comments section and we will get back to you.