Published on Jul 19, 2018

Apache Hadoop is quickly becoming the technology of choice for organizations investing in big data, powering their next-generation data architectures. With Hadoop serving as both a scalable data platform and a computational engine, data science is re-emerging as a centerpiece of enterprise innovation, with applied data solutions such as online product recommendation, automated fraud detection, and customer sentiment analysis.

In this article, we provide an overview of data science and how to take advantage of Hadoop for large scale data science projects.

How is Hadoop Useful to Data Scientists?

Hadoop is a boon to data scientists. Let’s look at how Hadoop boosts their productivity. Hadoop’s key capability is that all data can be stored in, and retrieved from, a single place. This enables the following:

  • Ability to store all data in its raw format
  • Data silo convergence
  • Data scientists can discover innovative uses for the combined data assets.


Key to Hadoop’s Power:

  • Reduced time and cost – Hadoop dramatically reduces the time and cost of building large-scale data products.
  • Computation is co-located with data – The data and computation systems are co-designed to work together.
  • Affordable at scale – Hadoop runs on ‘commodity’ hardware nodes, is self-healing, and excels at batch processing of large datasets.
  • Designed for one write and multiple reads – There are no random writes, and the system is optimized for minimum seek on hard drives.
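The “computation co-located with data” point is the essence of the MapReduce pattern: a map step runs locally on each data block and a reduce step aggregates the results. A minimal single-machine sketch of that pattern in plain Python (this is illustrative only, not actual Hadoop code) is:

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit (word, 1) pairs, as a Hadoop mapper would per data block."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum counts per key, mirroring a Hadoop reducer."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop stores data once", "Hadoop reads data many times"]
counts = reduce_phase(map_phase(lines))
print(counts["hadoop"])  # 2
print(counts["data"])    # 2
```

In a real cluster, each mapper runs on the node that physically holds its block of the file, so only the small intermediate (word, count) pairs move across the network.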

Why Hadoop With Data Science?

Reason#1: Explore Large Datasets

The first and foremost reason is that one can explore large datasets directly with Hadoop by integrating Hadoop into the data analysis flow.

This is achieved by utilizing simple statistics like:

  • Mean
  • Median
  • Quantile
  • Pre-processing: grep, regex

One can also use ad-hoc sampling and filtering: random sampling (with or without replacement), sampling by unique key, and k-fold cross-validation.
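The statistics and sampling techniques above can be sketched on a small in-memory stand-in for a column pulled from HDFS (a plain-Python illustration, not Hadoop code):

```python
import random
import statistics

data = list(range(1, 101))  # stand-in for a numeric column pulled from HDFS

# Simple summary statistics used in first-pass exploration
mean = statistics.mean(data)
median = statistics.median(data)
q3 = statistics.quantiles(data, n=4)[2]  # upper quartile

# Ad-hoc sampling: without replacement, then with replacement
without_repl = random.sample(data, 10)
with_repl = [random.choice(data) for _ in range(10)]

# K-fold split (k=5) for cross-validation
k = 5
folds = [data[i::k] for i in range(k)]

print(mean, median)              # 50.5 50.5
print(len(folds), len(folds[0])) # 5 20
```

On a real cluster these same statistics are computed as map/reduce jobs (or via tools such as Hive or Pig), so the full dataset never has to fit on one machine.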


Reason#2: Ability to Mine Large Datasets

Training learning algorithms on large datasets poses its own challenges:

  • Data won’t fit in memory.
  • Learning takes much longer.

With Hadoop, one can distribute data across the nodes of the cluster and implement a distributed/parallel algorithm. For recommendations, the Alternating Least Squares (ALS) algorithm can be used; for clustering, K-Means.
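K-Means maps naturally onto this distributed pattern: assigning points to centroids is a map step that runs on each node’s data, and recomputing centroids is a reduce step. A hypothetical single-machine sketch of that iteration (1-D points for simplicity; not real Hadoop or Mahout code) is:

```python
def assign(points, centroids):
    """Map step: assign each point to its nearest centroid."""
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[i].append(p)
    return clusters

def update(clusters):
    """Reduce step: recompute each centroid as the mean of its cluster."""
    return [sum(pts) / len(pts) for pts in clusters.values()]

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [2.0, 11.0]
for _ in range(5):  # iterate map/reduce rounds until stable
    centroids = update(assign(points, centroids))

print(centroids)  # [2.0, 11.0]
```

In a cluster, each node assigns only its local points and emits partial sums and counts per cluster, so the reduce step needs only those small aggregates, never the raw points.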


Reason#3: Large-Scale Data Preparation

We all know that about 80% of data science work involves data preparation. Hadoop is ideal for the batch preparation and cleanup of large datasets.
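Such cleanup jobs are typically pure map steps: each record is normalized or dropped independently, which parallelizes trivially. A small illustrative sketch in plain Python (the records and rules are made up for the example):

```python
raw = [
    "  alice , 34 , NY ",
    "BOB,29,ca",
    "carol,,tx",        # missing age -> row is dropped
    "dave,41,TX",
]

def clean(record):
    """Normalize one raw CSV line; return None for unusable rows."""
    fields = [f.strip() for f in record.split(",")]
    if len(fields) != 3 or not fields[1].isdigit():
        return None
    name, age, state = fields
    return (name.lower(), int(age), state.upper())

cleaned = [r for r in (clean(line) for line in raw) if r is not None]
print(cleaned)
# [('alice', 34, 'NY'), ('bob', 29, 'CA'), ('dave', 41, 'TX')]
```

Because `clean` touches one record at a time, the same logic can run as a Hadoop mapper over billions of rows with no change to the per-record code.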


Reason#4: Accelerate Data-Driven Innovation

Traditional data architectures have barriers to speed. An RDBMS uses schema-on-write, so change is expensive, which creates a high barrier to data-driven innovation.


Hadoop uses “schema-on-read,” which means a faster time to insight and thus a low barrier to data-driven innovation.
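The difference is easy to see in miniature: with schema-on-read, raw events land in storage as-is, and each analysis applies its own projection at query time instead of requiring an upfront table redesign. A hedged plain-Python sketch (the event format and field names are invented for the example):

```python
import json

# Raw events are stored as-is: no upfront table design (schema-on-read)
raw_events = [
    '{"user": "u1", "action": "click", "page": "/home"}',
    '{"user": "u2", "action": "purchase", "amount": 19.99}',
    '{"user": "u1", "action": "click", "page": "/cart"}',
]

# The "schema" is just the projection each analysis applies at read time
def read_clicks(lines):
    for line in lines:
        event = json.loads(line)
        if event.get("action") == "click":
            yield (event["user"], event["page"])

clicks = list(read_clicks(raw_events))
print(clicks)  # [('u1', '/home'), ('u1', '/cart')]
```

Note that the purchase event carries a field (`amount`) the click schema never declared; under schema-on-write that new field would have forced a schema migration before a single row could be loaded.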


To summarize, the four main reasons for using Hadoop with data science are:

  1. Mining large datasets
  2. Data exploration with full datasets
  3. Pre-processing at scale
  4. Faster data-driven cycles


We therefore see that organizations can leverage Hadoop to mine their data and gather useful insights from it.

Got a question for us? Please mention it in the comments section and we will get back to you.

