How can Hadoop or Spark be used to build large analytics reports?

Question

I have a huge table in a relational database. Users work with it every day (CRUD operations and search).

Now there is a new task: to be able to build a huge aggregate report over a one- to two-year period on demand, and to do it fast. The table's records for the last two years are too big to fit in memory, so I should split the computation into chunks, right?

I don't want to reinvent the wheel, so my question is: are distributed processing systems like Hadoop suitable for this kind of task?

Frankie · Answer 1 · Aug 7, 2018

The non-Hadoop way would be to maintain intermediate (partial) aggregates that you can reuse to build larger ones, e.g. using 30 daily aggregates to create 1 monthly aggregate.
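
To make the roll-up idea concrete, here is a minimal sketch in plain Python. The data, the `daily` dictionary, and the `roll_up_monthly` helper are all made up for illustration; in practice the daily aggregates would probably live in a summary table in your database. The point it tries to show is that you keep counts and totals (not averages) per day, so they can be added up again at the monthly level.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily aggregates: day -> (row_count, total_amount).
daily = {
    date(2018, 6, 29): (3, 30.0),
    date(2018, 6, 30): (2, 50.0),
    date(2018, 7, 1): (4, 100.0),
}


def roll_up_monthly(daily_aggregates):
    """Combine daily (count, total) pairs into monthly (count, total) pairs."""
    monthly = defaultdict(lambda: (0, 0.0))
    for day, (count, total) in daily_aggregates.items():
        key = (day.year, day.month)
        prev_count, prev_total = monthly[key]
        monthly[key] = (prev_count + count, prev_total + total)
    return dict(monthly)


monthly = roll_up_monthly(daily)

# Averages are derived at the end from the summed counts and totals.
monthly_avg = {month: total / count for month, (count, total) in monthly.items()}
```

With this layout a two-year report only has to read about 24 monthly rows (or about 730 daily rows) instead of scanning the raw table.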

In some cases that may not be possible, so you can pull the data into a Spark cluster (or similar) and do the aggregation there. A relational database usually won't give you data locality, so you may want to move the data to a NoSQL store such as Cassandra, HBase, or Elasticsearch. Another key question: do you need the answer in real time? Unless you put in extra effort (for example, running a job server), Spark and Hadoop jobs are usually batch jobs: you submit the job and get the answer later. Spark Streaming is an exception.

I hope this answer helps you :)

kurt_cobain · Answer 2 · Aug 7, 2018

Spark (or PySpark, its Python API) is a good fit for this task. Let me tell you why:
Spark can run on top of Hadoop, so you can use HDFS to store the data and split it into partitions. Spark's in-memory computation and lazy evaluation then let you run the aggregation partition by partition across the cluster, without ever loading the whole table into a single machine's memory.
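
To give a rough feel for the "split, aggregate each part, merge" pattern, here is a small sketch in plain Python. It is not the Spark API, only an analogy: the function names and the sample rows are made up, and a real cluster would run the partitions on different machines. Python's `map` is also lazy, so nothing is computed until `reduce` consumes it, which loosely mirrors how Spark delays work until an action is called.

```python
from functools import reduce


def split_into_partitions(rows, num_partitions):
    """Distribute rows round-robin over a fixed number of partitions."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_aggregate(partition):
    """Sum amounts per key inside a single partition."""
    totals = {}
    for key, amount in partition:
        totals[key] = totals.get(key, 0) + amount
    return totals


def merge(left, right):
    """Merge two partial results by adding totals for matching keys."""
    merged = dict(left)
    for key, amount in right.items():
        merged[key] = merged.get(key, 0) + amount
    return merged


# Hypothetical (key, amount) rows standing in for the big table.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = split_into_partitions(rows, 2)

# map() is lazy: partial aggregates are only computed as reduce() pulls them.
result = reduce(merge, map(partial_aggregate, partitions), {})
```

Each partition only needs to hold its own slice of the data, and the partial results that get merged are small, which is why this pattern scales to tables that do not fit in memory.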

Hope this helps