How is Impala faster than Hive in terms of query response?

Question

I am querying large CSV data sets stored in HDFS using both Hive and Impala, and I am getting better response times with Impala than with Hive for my queries.

Can anyone tell me some use cases where Impala is best suited and where Hive is best suited?

How does Impala achieve faster query response than Hive?

nitinrawat895 · Answer 1 · Mar 21, 2018

Impala responds faster because it uses MPP (massively parallel processing), unlike Hive, which uses MapReduce under the hood and therefore incurs some initial overhead (as Charles has specified). Massively parallel processing is a type of computing in which many separate CPUs run in parallel to execute a single program, with each CPU having its own dedicated memory. Because Impala is MPP-based, it avoids the overheads of a MapReduce job, viz. job setup and creation, slot assignment, split creation, map generation, etc., and that is what makes it so fast.
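To make the overhead argument concrete, here is a toy cost model, not a benchmark. It assumes latency is a fixed startup cost plus the work divided evenly across workers. The function name and every number in it are hypothetical and chosen only for illustration.

```python
def estimated_latency(work_seconds, startup_overhead_seconds, parallelism):
    """Toy model: fixed startup cost plus work split evenly across workers."""
    return startup_overhead_seconds + work_seconds / parallelism


# Hypothetical figures, for illustration only (not measurements).
JOB_BASED_OVERHEAD = 20.0  # assumed setup cost paid by every MapReduce-style job
MPP_OVERHEAD = 0.5         # assumed setup cost when no job has to be launched
WORKERS = 10

# A small ad hoc query: 30 seconds of total work.
small_job_based = estimated_latency(30.0, JOB_BASED_OVERHEAD, WORKERS)
small_mpp = estimated_latency(30.0, MPP_OVERHEAD, WORKERS)

# A large batch query: 10 hours of total work.
large_job_based = estimated_latency(36000.0, JOB_BASED_OVERHEAD, WORKERS)
large_mpp = estimated_latency(36000.0, MPP_OVERHEAD, WORKERS)

print(small_job_based, small_mpp)  # the fixed cost dominates the small query
print(large_job_based, large_mpp)  # the fixed cost is negligible for the large one
```

Under these assumed numbers the startup cost accounts for most of the small query's latency and almost none of the large one's. This is why the gap between the two engines tends to be most visible on short, interactive queries.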

But that doesn't mean that Impala is the solution to all your problems. Being MPP-based, it is highly memory-intensive, so it is not a good fit for tasks that require heavy data operations such as large joins, because you just can't fit everything into memory. This is where Hive is a better fit.

So, if you need real-time, ad hoc queries over a subset of your data, go for Impala. And if you have batch-processing needs over your Big Data, go for Hive.
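The rule of thumb above can be written down as a small helper. This is only a sketch of the reasoning in this answer: `suggest_engine` and its parameters are hypothetical names, and comparing the working set against total cluster memory is a deliberate simplification (a real cluster also has per-node and per-query limits).

```python
def suggest_engine(interactive, working_set_gb, cluster_memory_gb):
    """Return 'impala' or 'hive' following the rule of thumb in this answer.

    interactive       -- True for real-time, ad hoc queries; False for batch jobs
    working_set_gb    -- assumed size of the data the query must hold in memory
    cluster_memory_gb -- assumed memory available to the query across the cluster
    """
    fits_in_memory = working_set_gb <= cluster_memory_gb
    if interactive and fits_in_memory:
        return "impala"
    # Batch workloads, or heavy operations that will not fit in memory.
    return "hive"


print(suggest_engine(True, 50, 400))    # ad hoc query over a subset of the data
print(suggest_engine(True, 5000, 400))  # join too large for memory
print(suggest_engine(False, 50, 400))   # batch processing
```

The sizes passed in the three calls are made up. In practice you would estimate them from your own tables and cluster configuration.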