Hadoop on a Cassandra database

Question

I am using Cassandra to store my data and Hive to process it. I have set up Cassandra on 5 machines, and I use 2 machines as analytics nodes (where Hive runs). My question: does Hive run MapReduce only on the two analytics nodes and bring the data there, or does it also move the computation to the 5 Cassandra nodes and process the data on those machines? (As far as I know, in Hadoop the process moves to the data, not the data to the process.)

nitinrawat895 · Answer 1 · Mar 26, 2018

Regarding your question, there is a tradeoff:

a) If you run Hadoop / Hive on separate nodes, the MapReduce tasks run only on those nodes and have to pull the data from Cassandra over the network. You lose data locality, and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop / Hive on the same nodes that Cassandra runs on, you can get data locality, but the MapReduce processing behind Hive queries might clog your network (and other resources) and therefore degrade the quality of service you get from Cassandra.
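
As a rough illustration of this tradeoff, here is a back-of-envelope sketch in Python. Every number in it (table size, network bandwidth, disk read rate) is an assumption picked for illustration, not a measurement, and the function name scan_seconds is made up for this example. It only estimates raw scan time and ignores the resource contention described in (b).

```python
# Back-of-envelope estimate of a full table scan under the two layouts.
# All figures below are illustrative assumptions, not measurements.

def scan_seconds(data_gb: float, throughput_mb_s: float, workers: int) -> float:
    """Time to read data_gb when `workers` nodes each sustain throughput_mb_s."""
    total_mb = data_gb * 1024
    return total_mb / (throughput_mb_s * workers)

DATA_GB = 500          # assumed table size
NETWORK_MB_S = 110     # assumed usable bandwidth per analytics node (about 1 Gbit/s)
LOCAL_DISK_MB_S = 400  # assumed sequential read rate per Cassandra node

# (a) two separate analytics nodes pull all data over the network
remote = scan_seconds(DATA_GB, NETWORK_MB_S, workers=2)

# (b) tasks run on the five Cassandra nodes and read their local data
local = scan_seconds(DATA_GB, LOCAL_DISK_MB_S, workers=5)

print(f"separate nodes: {remote:.0f} s, co-located: {local:.0f} s")
```

With these assumed figures the co-located layout scans much faster, but that gain comes at the price of competing with Cassandra for disk, CPU and network on the same machines, which the estimate does not model.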

My suggestion is to use separate Hive nodes if the performance of your Cassandra cluster is critical.
If your Cassandra cluster is mostly used as a data store and does not handle real-time requests, then running Hive on each node will improve performance and hardware utilization.