Clarification of some Hadoop Concepts

Question

I'm using Hadoop to process a video with HVPI, an open-source interface. However, in its input split implementation, the isSplitable(JobContext context, Path file) method returns false. By default this method returns true, but the current implementation has a reason to return false. If this method returns false, I will only have one map task. If I am not wrong, Hadoop allocates a container for each input split. The container corresponds to the computational resources of a certain node of the cluster where a map task is executed, and this node should preferably hold the data the task will process. So if the method returns false, I will have only one input split and consequently just one map task, and that map task will run on a single cluster node. The big question is: how can a single map task take advantage of all the CPU resources of a cluster, and not just a single container on a single node?

Frankie · Answer 1 · Sep 5, 2018

Let's try to understand what the problem is.
1. One takes a file and divides it into file splits.
2. Each split is consumed by one mapper.
3. How do you make sure a record in the file is not split across two file splits?
4. A record can't be ignored or read partially.
5. An InputFormat takes care of carefully splitting the file and handles the case where a record straddles the boundary between two file splits.
6. Hadoop has various input formats, such as TextInputFormat and KeyValueTextInputFormat.

Try to find an input format that can be used for your video files, or write one yourself. FileInputFormat is the base class for all file-based input formats.
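
To make the link between splits and map tasks concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It is a simplified model, not Hadoop's actual split logic: the class SplitSketch, the method computeSplits, and the 128 MiB split size are all assumptions made for illustration. The splittable flag plays the role of what isSplitable(JobContext, Path) returns in a FileInputFormat subclass.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of how a file-based input format carves a file into splits.
// NOT Hadoop code: SplitSketch and computeSplits are hypothetical names.
class SplitSketch {

    // Each split is returned as {startOffset, lengthInBytes}.
    static List<long[]> computeSplits(long fileLength, long splitSize, boolean splittable) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        if (fileLength <= 0) {
            return splits; // empty file: nothing to process in this model
        }
        if (!splittable) {
            // Equivalent of isSplitable(...) returning false:
            // the whole file becomes one split, so one mapper reads it all.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        // Splittable file: cut it into consecutive ranges of at most splitSize bytes.
        long start = 0L;
        while (start < fileLength) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new long[] {start, length});
            start += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 1024L * 1024 * 1024; // a 1 GiB video file (assumed size)
        long splitSize = 128L * 1024 * 1024;   // 128 MiB per split (assumed)

        // One map task is launched per split.
        System.out.println("splittable:     "
                + computeSplits(fileLength, splitSize, true).size() + " map task(s)");
        System.out.println("not splittable: "
                + computeSplits(fileLength, splitSize, false).size() + " map task(s)");
    }
}
```

In this model the same file yields several map tasks when it is splittable and exactly one when it is not, which is the situation described in the question. A real input format would also have to deal with records that cross a split boundary, which this sketch leaves out.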

Hope this answer helps :)