The above video is the recorded session of the webinar “Introduction to Hadoop”, conducted on 8th August 2014.
Introduction to Hadoop
Big Data is a term for collections of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
Big Data has now become a popular term to describe the explosion of data, and Hadoop has become synonymous with Big Data. Doug Cutting created Apache Hadoop for this very reason. Hadoop has now become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data. Hadoop allows distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data.
The above video covers the following topics in detail:
- What is Big Data
- Traditional Warehouse Vs Hadoop
- Why Should you Learn Hadoop and Related Technologies
- Jobs and Trends in Big Data
- Hadoop Architecture and Ecosystem
Why Should you Learn Hadoop & Related Technologies?
- Unstructured Data is Exploding – The digital universe grew by 62% last year, to 800,000 petabytes, and will grow further to 1.2 zettabytes by the end of this year.
- Big Data Challenges – Increasing volumes of data from various sources, with different data types, are imposing huge challenges.
Big Data Customer Scenarios:
Here are some use cases of Big Data in Retail, Banking and Financial sectors:
Banking and Financial Services:
- Modeling True Risk
- Threat Analysis
- Fraud Detection
- Trade Surveillance
- Credit Scoring and Analysis
Retail:
- Point of Sales Transaction Analysis
- Customer Churn Analysis
- Sentiment Analysis
This video includes a case study discussing the use of Hadoop at Sears. Sears previously used traditional systems such as Oracle Exadata, Teradata and SAS to store and process customer activity and sales data. On adopting Hadoop, Sears gained valuable advantages such as:
- Insights into the data provided a valuable business advantage
- Key early indicators that mean a fortune to the business
- More precise analysis with more data
Limitations of the Existing Data Analytics Architecture and How Hadoop Overcomes Them:
The video gives a step-by-step explanation of the flow of data, the limitations it faces in the existing data analytics architecture, and how Hadoop overcomes them. Hadoop provides a solution in which a combined storage and compute layer is utilized. As a result, Sears moved to a 300-node Hadoop cluster to keep 100% of its data for processing, rather than the meager 10% that was available in the existing non-Hadoop solutions.
Why Move to Hadoop?
The following reasons make it pretty clear as to why one must move to Hadoop.
- Allows distributed processing of large data sets across clusters of computers using a simple programming model.
- Has become the de facto standard for storing, processing and analyzing hundreds of terabytes and petabytes of data.
- Cheaper to use than traditional proprietary technologies.
- Handles all types of data from disparate systems.
Hadoop – Growth and Job Opportunities:
“We’ve heard it’s a fad, heard it’s hyped and heard it’s fleeting, yet it’s clear that data professionals are in demand and well paid. Tech professionals who analyse large data streams and strategically impact the overall business goals of a firm have an opportunity to write their own ticket.” said Alice Hill, Managing Director of Dice.com.
As per the 2012-13 Salary Survey by Dice, a leading career site for technology and engineering professionals:
- Big Data jobs are having positive, disproportionate impact on salaries.
- Professionals with Hadoop, NoSQL and MongoDB skills can earn more than $100,000.
- Gartner says that by 2015, Big Data will create 4.4 million IT jobs globally to support Big Data.
Hadoop Ecosystem & Architecture:
Hadoop comprises two main components:
HDFS – Hadoop Distributed File System – For Storage
- Highly fault-tolerant
- High throughput access to application data
- Suitable for applications that have large data sets
- Natively redundant
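The "natively redundant" property above comes from the way HDFS stores files: each file is split into fixed-size blocks, and every block is replicated on several DataNodes (by default, three). A minimal toy sketch of that idea, in plain Python rather than real HDFS code, with block size scaled down for illustration:

```python
# Toy sketch of HDFS-style block storage (not actual Hadoop code).
# A file is split into fixed-size blocks, and each block is copied
# onto several distinct nodes, so losing one node loses no data.

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a byte string into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes (round-robin sketch;
    real HDFS placement is rack-aware)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)      # 4 blocks: 256+256+256+232
placement = place_replicas(len(blocks),
                           nodes=["node1", "node2", "node3", "node4"])
# Every block lives on 3 of the 4 nodes, so any single-node
# failure still leaves at least 2 copies of each block.
```

In real HDFS the default block size is 128 MB and the NameNode tracks the block-to-node mapping; the round-robin placement here is only a stand-in for HDFS's rack-aware placement policy.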
MapReduce – For Processing
- A software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner
- Splits a task across processors
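The MapReduce model above boils down to three steps: a map function emits (key, value) pairs, the framework shuffles (groups) the pairs by key, and a reduce function aggregates each group. A single-process Python sketch of the classic word-count example makes the flow concrete (this illustrates the model only, not the actual Hadoop Java API):

```python
# Toy single-process sketch of the MapReduce model (not the Hadoop API).
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (here, by summing counts)."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

In a real Hadoop job, the map and reduce functions run on many nodes in parallel, each mapper working on one HDFS block of the input, and the framework handles the shuffle, scheduling and fault tolerance.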