21 Sep 2015

Hadoop-A Highly Available And Secure Enterprise Data Warehousing Solution

A Data Warehouse is a data integration and rationalization engine that requires multiple technologies in order to work efficiently. The need to extract value from Big Data is driving organizations to look for data warehouse providers with advanced capabilities. As a result, distributed processing systems like Hadoop are in huge demand.

Data warehousing poses its own set of security challenges. It requires a flexible and powerful security infrastructure, and it must operate in an environment with stringent performance and scalability requirements.

This webinar discusses how Hadoop fulfills these requirements. It covers the following topics:

What is Big Data
Why do Enterprises care about Big Data
Why your DWH needs Hadoop?
Security in Hadoop
How Hadoop maintains high Availability
Data warehousing tools in Hadoop

Before we head on to the topic, let’s look at the challenges faced by traditional Data Warehousing solutions.

What is wrong with our traditional DWH solutions?

Traditionally, data warehouses do not contain current data. This is a major drawback, because the fast pace of business today makes purely historical systems less valuable. Businesses now make decisions based on real-time data, and the systems that support those decisions need to keep up. It is only logical that Data Warehouses also begin to use real-time data. The following are the drawbacks of the traditional DWH solution:

Increasingly difficult to scale and copy data from multiple data sources in multiple organizations in multiple locations.
Data owners lose control over their data, raising ownership, security and privacy issues.
Long initial implementation time and associated high cost.
Limited flexibility of use and types of users.
Difficult to accommodate changes in data types and ranges, data source schema, indexes and queries.
Cannot actively monitor changes in data.

When does an RDBMS make no sense?

An RDBMS is not a good fit for the following tasks:

Storing unstructured data like images and videos
Processing images and videos
Storing and processing other large files like PDFs, Excel files
Processing large blocks of natural language text
Processing semi-structured data like CSV, JSON, XML, log files, sensor data
Ad-hoc, exploratory analytics
Integrating data from external sources
Data clean up tasks
Very advanced analytics (machine learning)

Technical challenges that come with RDBMS:

Storage capacity
Storage throughput
Pipeline throughput
Processing power
Parallel processing
System Integration
Data Analysis

So why do enterprises care about Big Data?

According to a survey by CIO, a whopping 80% of the respondents say Big Data is important to their ongoing business operations, with 43% of them saying it is “mission-critical”.

Hadoop – The solution for Big Data problems

How does Hadoop differ from an RDBMS?

Hadoop can store all types of data, structured or unstructured, which gives you the flexibility to analyze all of it. You can drill down into Big Data to find even rare insights that were not reachable earlier.

Hadoop is the new DWH solution:

In ETL, data is transformed before it is loaded into the warehouse. Hadoop follows ELT instead: the raw data is loaded first and transformed later, so there is no need to decide on a schema or transformation beforehand. With ELT you keep the freedom to work with all of the data. This is possible because of cheap storage and the distributed file system, HDFS.
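To make the load-first, transform-later idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the sample records, the field names and the functions `load_raw`, `total_by_user` and `count_by_country` are all hypothetical, and an in-memory list stands in for files on HDFS.

```python
import json

# Hypothetical raw events, kept exactly as they arrived.
RAW_LINES = [
    '{"user": "a", "amount": "12.50", "country": "IN"}',
    '{"user": "b", "amount": "7.25"}',
    'not valid json',
    '{"user": "a", "amount": "3.00", "country": "US"}',
]


def load_raw(lines):
    """Load step: store every record untouched, including malformed ones."""
    return list(lines)


def parse_valid(store):
    """Parse records at read time and skip rows that are not valid JSON."""
    for line in store:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # the bad row stays in the store, it is only skipped here


def total_by_user(store):
    """One transformation over the raw store: sum the amounts per user."""
    totals = {}
    for record in parse_valid(store):
        user = record["user"]
        totals[user] = totals.get(user, 0.0) + float(record["amount"])
    return totals


def count_by_country(store):
    """A second, later transformation over the same untouched raw data."""
    counts = {}
    for record in parse_valid(store):
        country = record.get("country", "unknown")
        counts[country] = counts.get(country, 0) + 1
    return counts


store = load_raw(RAW_LINES)
print(total_by_user(store))
print(count_by_country(store))
```

Because nothing was discarded or reshaped at load time, the second question (records per country) can be asked later without reloading the source data.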

Hadoop is the new Data Warehouse for all kinds of BI requirements. It will complement, not replace, the existing Data Warehouse and BI infrastructure, providing new flexibility for generating insights as business requirements keep changing. Considering its favorable price-performance ratio, it can help organizations lower costs while maintaining their existing applications and reporting infrastructure. Hadoop provides an economical platform to offload data transformation cycles, and it can also simplify the process and reduce errors in it.

Core features of Hadoop:

Maintaining High Availability:

In distributed computing, failure is the norm, so core services such as HDFS and YARN must be designed to stay available when individual nodes fail.
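The general idea behind this kind of availability is an active/standby pair: one instance serves requests and a standby takes over when the active one fails. The sketch below is a toy simulation of that idea only. The `Node` and `Cluster` classes and the names `rm1` and `rm2` are hypothetical, and real Hadoop deployments typically rely on ZooKeeper-based election rather than a loop like this.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True
    active: bool = False


class Cluster:
    """Toy active/standby group; the first node starts as the active one."""

    def __init__(self, names):
        self.nodes = [Node(n) for n in names]
        self.nodes[0].active = True

    def active_node(self):
        for node in self.nodes:
            if node.active:
                return node
        return None

    def check_and_failover(self):
        """Return the name of the active node, promoting a standby if needed."""
        current = self.active_node()
        if current is not None and current.healthy:
            return current.name
        if current is not None:
            current.active = False  # demote the failed active node
        for node in self.nodes:
            if node.healthy:
                node.active = True  # promote the first healthy standby
                return node.name
        return None  # no healthy node left: the service is unavailable


cluster = Cluster(["rm1", "rm2"])
print(cluster.check_and_failover())   # rm1 is healthy and stays active
cluster.nodes[0].healthy = False      # simulate a failure of rm1
print(cluster.check_and_failover())   # rm2 is promoted
```

The service stays reachable as long as at least one node is healthy, which is the "acceptable amount of availability" the text refers to.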

Interested to know how to achieve HDFS and YARN High Availability? – Watch the video!

Security:

The Hadoop ecosystem has only partially adopted Kerberos: many services remain unprotected and use trivial authentication systems. Security is maintained through service-level authorization and web proxy capabilities in YARN, and through ACLs (Access Control Lists). HDFS implements a permission model for files and directories that shares much of the POSIX model.
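As a rough illustration of how a POSIX-style permission check combined with ACL entries can work, here is a simplified sketch in Python. It is not the HDFS implementation: the `Inode` class, the `allowed` function and the user names are hypothetical, and details such as the ACL mask, named-group entries and the superuser are deliberately left out.

```python
from dataclasses import dataclass, field

# Permission bits within one owner/group/other triplet.
BITS = {"r": 4, "w": 2, "x": 1}


@dataclass
class Inode:
    owner: str
    group: str
    mode: int  # POSIX-style mode, for example 0o640
    # Hypothetical named-user ACL entries: user name -> subset of "rwx".
    acl_users: dict = field(default_factory=dict)


def allowed(inode, user, groups, action):
    """Return True if `user` (a member of `groups`) may perform `action`."""
    bit = BITS[action]
    if user == inode.owner:
        return bool((inode.mode >> 6) & bit)   # owner triplet
    if user in inode.acl_users:
        return action in inode.acl_users[user]  # named-user ACL entry
    if inode.group in groups:
        return bool((inode.mode >> 3) & bit)   # group triplet
    return bool(inode.mode & bit)              # everyone else


report = Inode(owner="alice", group="analysts", mode=0o640,
               acl_users={"carol": "rw"})

print(allowed(report, "alice", [], "w"))           # owner has rw-
print(allowed(report, "bob", ["analysts"], "w"))   # group has r-- only
print(allowed(report, "carol", [], "w"))           # granted through the ACL
```

The ACL entry is what lets `carol` write to the file without being the owner or joining the `analysts` group, which is the kind of case plain owner/group/other bits cannot express.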

Most security tools fail to scale and perform in Big Data environments. The following are the main security risks that might occur:

Insufficient Authentication
No Privacy and No Integrity
Arbitrary Code Execution

Watch the video for the demo on ACL.

Questions asked during the webinar:

1. What are the job roles for developers, analysts and administrators?

You can check out the following links to know more about the responsibilities for these roles:

Hadoop Admin Responsibilities

Hadoop Developer Roles & Responsibilities

2. With YARN, is it possible to have different workloads in one design, like Impala and Spark?

Yes, we can. This is the kind of capability provided by Hadoop. MRv2, Hive, HBase, Oozie, etc. can also be part of the workload.

3. How can you compare Hadoop with Vertica?

Vertica is a relational SQL database system used for read-intensive analytic database applications such as data warehouses and data marts. It is optimized for databases with ad hoc query and OLAP-style workloads that include some update operations. Hadoop, by contrast, is the underlying data warehouse where all the data will be stored.

4. Is Sqoop recommended for OLTP to HDFS migration?

Yes. Sqoop is the tool in the Hadoop ecosystem designed for bulk transfer of data between relational databases and HDFS.

Got a question for us? Please mention it in the comments section and we will get back to you.
