Hadoop: A Highly Available and Secure Enterprise Data Warehousing Solution
A Data Warehouse is a data integration and rationalization engine that relies on multiple technologies to work efficiently. The pressing need to extract value from Big Data is driving organizations to look for data warehouse providers with advanced capabilities. As a result, distributed processing systems like Hadoop are in huge demand.
Data warehousing poses its own security challenges. It requires a flexible and powerful security infrastructure, and it must operate in an environment with stringent performance and scalability requirements.
This webinar discusses how Hadoop meets these requirements when it comes to security. The video covers the following topics:
What is Big Data
Why do Enterprises care about Big Data
Why your DWH needs Hadoop?
Security in Hadoop
How Hadoop maintains High Availability
Data warehousing tools in Hadoop
Before we get to the topic, let’s look at the challenges faced by traditional Data Warehousing solutions.
What is wrong with our traditional DWH solutions?
Traditional data warehouses do not contain current data. This is a major drawback, because the fast pace of business today makes these historical systems less valuable. Businesses now make decisions based on real-time data, and the systems that support those decisions need to be kept up to date. It is only logical that Data Warehouses also begin to use real-time data. The following are the drawbacks of the traditional DWH solution:
Increasingly difficult to scale and copy data from multiple data sources in multiple organizations in multiple locations.
Data owners lose control over their data, raising ownership, security and privacy issues.
Long initial implementation time and associated high cost.
Limited flexibility of use and types of users.
Difficult to accommodate changes in data types and ranges, data source schema, indexes and queries.
Cannot actively monitor changes in data.
When does an RDBMS make no sense?
An RDBMS is not suitable for the following tasks:
Storing unstructured data like images and videos
Processing images and videos
Storing and processing other large files like PDFs, Excel files
Processing large blocks of natural language text
Processing semi-structured data like CSV, JSON, XML, log files, sensor data
Ad-hoc, exploratory analytics
Integrating data from external sources
Data clean up tasks
Very advanced analytics (machine learning)
Technical challenges that come with RDBMS:
So why do enterprises care about Big Data?
According to a survey by CIO, a whopping 80% of the respondents say Big Data is important to their ongoing business operations, with 43% of them saying it is “mission-critical”.
Hadoop – The solution for Big Data problems
How does Hadoop differ from an RDBMS?
Hadoop can store all types of data, which gives you the flexibility to analyze all of it. You can drill down into Big Data to find even rare insights, which was not possible earlier.
The above clearly explains the difference between the new approach and the traditional approach.
Hadoop is the new DWH solution:
In ETL, the data is transformed before it is loaded into the warehouse. Hadoop does ELT rather than ETL: the data is loaded first and transformed later, so there is no need to transform it beforehand. With ELT you have the freedom to work with all the data. This is possible because of the availability of cheap storage and the distributed file system, HDFS.
Hadoop is the new Data Warehouse for all kinds of BI requirements. It will complement, not replace, existing Data Warehouse and BI infrastructure, providing new flexibility for generating insights as business requirements keep changing. Given its favorable price-performance ratio, it can help organizations lower costs while maintaining their existing applications and reporting infrastructure. Hadoop provides an economical platform for offloading data transformation cycles, and it can also simplify the process and reduce errors in it.
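The load-first, transform-later idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample records, the field names and the functions `load`, `transform` and `total_by_user` are all hypothetical, and an in-memory list stands in for raw files landed in HDFS.

```python
import json

# Hypothetical raw events, standing in for files landed in HDFS untouched.
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "country": "IN"}',
    '{"user": "bob", "amount": "7.25"}',
    'not valid json',
    '{"user": "alice", "amount": "3.00", "country": "US"}',
]


def load(lines):
    """Load step: keep every record exactly as it arrived, no schema enforced."""
    return list(lines)


def transform(raw_store):
    """Transform step, applied later at read time."""
    for line in raw_store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Malformed rows are skipped for this query,
            # but they are still kept in the raw store.
            continue
        yield {
            "user": record["user"],
            "amount": float(record["amount"]),
            "country": record.get("country", "unknown"),
        }


def total_by_user(raw_store):
    """One possible analysis over the raw data; others can be added later."""
    totals = {}
    for row in transform(raw_store):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


raw_store = load(RAW_LINES)
totals = total_by_user(raw_store)
print(totals)
```

Because nothing is discarded at load time, a different question (for example, totals by country) only needs a new read-time function, not a reload of the source data.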
Core features of Hadoop:
- Maintaining High Availability:
In distributed computing, failure is the norm, which means YARN must offer an acceptable level of availability.
- NameNode – Single Point of Failure:
- HDFS High Availability:
- YARN High Availability:
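As a rough illustration of the High Availability points above, HDFS HA is commonly set up as an active/standby NameNode pair, so that the NameNode is no longer a single point of failure. The sketch below is a toy simulation, not Hadoop code: the classes `NameNode` and `FailoverClient` and the function `fail_over` are hypothetical names invented for this example.

```python
class NameNode:
    """Hypothetical stand-in for a NameNode; not a real Hadoop class."""

    def __init__(self, name, state):
        self.name = name
        self.state = state  # "active", "standby" or "down"

    def handle(self, request):
        # Only the active node serves client requests.
        if self.state != "active":
            raise ConnectionError(f"{self.name} is {self.state}")
        return f"{self.name} served {request}"


class FailoverClient:
    """Tries each configured NameNode in turn until one answers as active."""

    def __init__(self, namenodes):
        self.namenodes = namenodes

    def call(self, request):
        errors = []
        for namenode in self.namenodes:
            try:
                return namenode.handle(request)
            except ConnectionError as exc:
                errors.append(str(exc))
        raise RuntimeError("no active NameNode: " + "; ".join(errors))


def fail_over(failed, standby):
    """Simulate a failover: the failed node goes down, the standby is promoted.

    In a real cluster this is coordinated by the cluster itself,
    not by application code.
    """
    failed.state = "down"
    standby.state = "active"


nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")
client = FailoverClient([nn1, nn2])

before = client.call("ls /")
fail_over(nn1, nn2)
after = client.call("ls /")
print(before)
print(after)
```

The client keeps working after the failover because it knows about both nodes and simply retries against the next one.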
Interested in how to achieve HDFS and YARN High Availability? Watch the video!
The Hadoop ecosystem has only partially adopted Kerberos: many services remain unprotected and use trivial authentication systems. Security is maintained through service-level authorization and web proxy capabilities in YARN, and through ACLs (Access Control Lists). HDFS implements a permission model for files and directories that shares much of the POSIX model.
Most security tools fail to scale and perform in Big Data environments. The following are the security risks that might occur:
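To make the POSIX-style permission model and ACLs a little more concrete, here is a minimal sketch of how such a check could work. It is a simplified illustration, not HDFS's actual implementation: the `Inode` class, the `allowed` function and the sample users are hypothetical, and details such as the ACL mask and named-group entries are deliberately left out.

```python
READ, WRITE, EXECUTE = 4, 2, 1


class Inode:
    """Hypothetical file entry carrying POSIX-style mode bits plus named-user ACL entries."""

    def __init__(self, owner, group, mode, acl_users=None):
        self.owner = owner
        self.group = group
        self.mode = mode                  # for example 0o640
        self.acl_users = acl_users or {}  # user name -> permission bits


def allowed(inode, user, groups, wanted):
    """Return True if `user` holds every permission bit in `wanted`."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 7      # owner bits
    elif user in inode.acl_users:
        bits = inode.acl_users[user]      # named-user ACL entry
    elif inode.group in groups:
        bits = (inode.mode >> 3) & 7      # group bits
    else:
        bits = inode.mode & 7             # everyone else
    return (bits & wanted) == wanted


# rw- for the owner, r-- for the group, nothing for others,
# plus an ACL entry granting read access to one extra user.
report = Inode("etl", "analysts", 0o640, {"auditor": READ})

print(allowed(report, "etl", [], READ | WRITE))
print(allowed(report, "carol", ["analysts"], WRITE))
print(allowed(report, "auditor", [], READ))
print(allowed(report, "mallory", [], READ))
```

The ACL entry is what lets `auditor` read the file without being added to the `analysts` group or opening the file to everyone.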
No Privacy and No Integrity
Arbitrary Code Execution
Watch the video for the demo on ACL.
Questions asked during the webinar:
1. What are the job roles for developers, analysts and administrators?
You can check out the following links to know more about the responsibilities for these roles:
2. With YARN, is it possible to have different workloads in one design, such as Impala and Spark?
Yes, it is. This is the kind of capability Hadoop provides. MRv2, Hive, HBase, Oozie, etc. can also be part of the workload.
3. How can you compare Hadoop with Vertica?
Vertica is a relational SQL database system used for read-intensive analytic database applications such as data warehouses and data marts. It is optimized for databases with ad hoc query and OLAP-style workloads that include some update operations. Hadoop, by contrast, is the underlying data warehouse where all the data is stored.
4. Is Sqoop recommended for OLTP to HDFS migration?
Yes, Sqoop is the best tool in the Hadoop ecosystem for getting that data into HDFS.
Got a question for us? Please mention it in the comments section and we will get back to you.