POC for Hadoop in real time scenario


I want to learn Hadoop and how I might use it to handle data streams in real time. To that end, I want to build a meaningful POC that I can showcase to a potential employer or use to introduce Hadoop at my present firm.

I should also mention that I am limited in hardware resources: it's just my laptop and me :) I know the basics of Hadoop and have written 2-3 basic MR jobs. I want to do something more meaningful, closer to a real-world use case.

Please suggest.

Oct 11, 2018 in Big Data Hadoop by Neha

1 answer to this question.


I'd like to point out a few things.

If you want to do a POC with just 1 laptop, there's little point in using Hadoop.

Also, as is often pointed out, Hadoop is not designed for real-time applications, because running Map/Reduce jobs carries some overhead.

That being said, Cloudera released Impala, which works with the Hadoop ecosystem (specifically the Hive metastore) to achieve real-time query performance. Be aware that it achieves this by not generating Map/Reduce jobs at all. It was in beta at the time of writing, so use it carefully.

So I would really advise looking at Impala so you can still use the Hadoop ecosystem, but if you're also considering alternatives, here are a few other frameworks that could be of use:

  • Druid : was open-sourced by MetaMarkets. Looks interesting, even though I've not used it myself.
  • Storm : no integration with HDFS, it just processes data as it comes.
  • HStreaming : integrates with Hadoop.
  • Yahoo S4 : seems pretty close to Storm.

In the end, I think you should really analyze your needs and see whether Hadoop is what you need, because it is only getting started in the real-time space. There are several other projects that could help you achieve real-time performance.


If you want ideas for projects to showcase, I suggest looking at this link. Here are some examples:

  • Finance/Insurance
    • Classify investment opportunities as good or not, e.g. based on industry/company metrics, portfolio diversity and currency risk.
    • Classify credit card transactions as valid or invalid, e.g. based on the location of the transaction and of the credit card holder, date, amount, purchased item or service, history of transactions and similar transactions.
  • Biology/Medicine
    • Classification of proteins into structural or functional classes
    • Diagnostic classification, e.g. cancer tumours based on images
  • Internet
    • Document Classification and Ranking
    • Malware classification, email/tweet/web spam classification
  • Production Systems (e.g. in energy or petrochemical industries)
    • Classify and detect situations (e.g. sweet spots or risk situations) based on realtime and historic data from sensors
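
To make one of these ideas concrete on a single laptop, here is a minimal sketch of the credit card example in plain Python. It is not tied to Hadoop, Storm or any other framework mentioned above; it only shows the "process each event as it arrives" shape of the problem. The field names (`card`, `amount`, `country`), the `home_country` lookup and both thresholds are assumptions made up for illustration, and the rule itself (flag a foreign country or an amount far above the card's running average) is a toy stand-in for a real classifier.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt


@dataclass
class CardStats:
    """Running count, mean and sum of squared deviations (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, amount: float) -> None:
        # Incremental update, so no past transactions need to be stored.
        self.count += 1
        delta = amount - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (amount - self.mean)

    def stddev(self) -> float:
        # Sample standard deviation; 0.0 until there are two data points.
        return sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


def classify_stream(transactions, home_country, z_threshold=3.0, min_history=5):
    """Yield (transaction, label) pairs, handling one event at a time.

    transactions: iterable of dicts with hypothetical keys
                  'card', 'amount' and 'country'.
    home_country: dict mapping a card id to the holder's country.
    """
    history = defaultdict(CardStats)
    for tx in transactions:
        stats = history[tx["card"]]
        label = "valid"
        if tx["country"] != home_country.get(tx["card"]):
            # Location of the transaction differs from the holder's location.
            label = "suspicious"
        elif stats.count >= min_history:
            spread = stats.stddev()
            if spread > 0 and (tx["amount"] - stats.mean) / spread > z_threshold:
                # Amount is far above this card's usual spending.
                label = "suspicious"
        if label == "valid":
            # Only trusted transactions feed the running statistics.
            stats.update(tx["amount"])
        yield tx, label
```

Because `classify_stream` is a generator, it can be fed from a file, a socket or a message queue without loading everything first, e.g. `for tx, label in classify_stream(source, {"c1": "IN"}): print(tx, label)`. Swapping the toy rule for a trained model, and the in-memory `history` for a store that survives restarts, is where a real POC would start to differ.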
answered Oct 11, 2018 by Frankie