Why Learn Cassandra with Hadoop?
“Companies are realizing they can mine valuable business intelligence to improve decision making and gain competitive edge. Tools such as Hadoop and Cassandra are making all of this possible and because of it, NoSQL skills at all levels are in extremely high-demand.” – Analysts on TechRepublic
Cassandra is an open source distributed database management system, originally developed as an in-house project at Facebook to power its Inbox Search feature. It was released as an open source project on Google Code in 2008 and became a top-level project of the Apache Software Foundation in 2010.
Cassandra is the next BIG Thing:
- Apache Cassandra is designed to handle very large amounts of data (in terms of velocity, volume and variety) across many commodity servers, providing high availability with no single point of failure (SPOF).
- Cassandra also offers strong support for clusters spanning multiple data centers. Unlike traditional master-slave architectures, it has no master node, so the cluster keeps serving requests when a particular node goes down.
- University of Toronto researchers studying NoSQL systems state that, in terms of scalability and maximum throughput per node, Cassandra emerges as a clear winner. The main focus of a NoSQL DBMS is to ensure scalability, performance and high availability. Like most NoSQL DBMSs, Cassandra can handle both structured and unstructured data, and it performs considerably well on these parameters.
- Cassandra can serve both as a real-time datastore (“the system of record”) for online/transactional applications and as a read-intensive database for business intelligence systems. For more information, read our blog post on the advantages offered by Cassandra.
Why go for Hadoop with Cassandra?
In simple terms, to have:
- Unified workload
- Simpler deployment
When it comes to Hadoop, businesses are interested less in its underlying storage structure than in its cost-effective methods for analyzing and processing vast amounts of data. Being able to make decisions from the output of MapReduce, Hive, Pig, Mahout, and other operations is what matters most to these organizations.
Key Points to Remember:
- The Hadoop Distributed File System (HDFS) is one of many components and projects within the Hadoop ecosystem. The Apache Hadoop project defines HDFS as the primary storage system used by Hadoop applications. HDFS can store massive, distributed, unstructured data sets. Data can be stored directly in HDFS, or it can be stored in a semi-structured format in HBase, which allows rapid record-level data access and is modeled after Google’s BigTable system. Cassandra, on the other hand, is a non-relational system that uses the BigTable data model but employs Amazon’s Dynamo scheme for data distribution and clustering.
- Hadoop does many things well, and its core MapReduce capabilities are very strong. Industry experts adore Hive and its SQL-like design. However, HDFS is complex to set up, has single points of failure and, according to feedback from major businesses, is not ready to do what they want it to do. Cassandra, on the other hand, provides all the capabilities of the lower level of the Hadoop stack, and it also provides low-latency, real-time application capabilities on that same infrastructure.
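To make the Dynamo-style data distribution mentioned above a little more concrete, here is a minimal sketch of a consistent-hash ring in Python. It is a toy illustration, not Cassandra's actual implementation: the `Ring` class, the `token` helper, the node names and the use of MD5 are all assumptions made for this example. The point it tries to show is that every node is a peer, each key is owned by several nodes, and losing one node still leaves live copies of the data.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 chosen only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring: every node owns one token and no node is a master."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sort nodes by their token so the list can be walked "clockwise".
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key):
        """Return the nodes that hold a key: the first node clockwise from
        the key's token, followed by the next nodes on the ring."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]

    def live_replicas(self, key, down):
        """Return the replicas of a key that are not in the set of failed nodes."""
        return [node for node in self.replicas(key) if node not in down]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
    owners = ring.replicas("user:42")
    print("replicas:", owners)
    # Simulate the first replica going down: two copies remain reachable.
    print("still live:", ring.live_replicas("user:42", down={owners[0]}))
```

With a replication factor of 3, taking any single node out of this ring leaves at least two replicas for every key, which is the intuition behind the "no single point of failure" claim. A real cluster adds much more (failure detection, repair, multi-data-center placement) that this sketch leaves out.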
How can Cassandra and Hadoop Work Together?
A number of vendors are offering alternatives to HDFS. A recent paper by GigaOM provides a high-level overview of how the Cassandra File System can be used to replace HDFS with minimal programming changes from a development perspective, and of the benefits that can be gained in the process. DataStax, a leading commercial provider of Cassandra distributions, has combined Cassandra with Hadoop and named the result Brisk. With Brisk, HDFS is replaced by the Cassandra File System.
Advantages of the Cassandra-Hadoop Combination:
- Cassandra and Hadoop can be deployed on the same cluster, which means that you can have the best of both worlds.
- Time-based and real-time applications run under Cassandra (real-time being the strength of Cassandra), while batch-based analytics and queries that do not require a timestamp can run on Hadoop. In this kind of ecosystem, HDFS is replaced by Cassandra, and this is invisible to the developer. Nodes can be reassigned dynamically between the Cassandra and Hadoop environments as appropriate.
- The Cassandra File System removes the single points of failure associated with an HDFS-based Hadoop deployment, namely the NameNode and the JobTracker.
The idea, therefore, is to combine Cassandra, which excels at high-volume real-time transaction processing, with Hadoop, which excels at batch-oriented analytical solutions.
Cassandra and the Biggies:
Many organizations across industry verticals are embracing Cassandra to achieve various business objectives. Some prominent ones are:
- Netflix – Uses Cassandra as their back-end database for their streaming services.
- Cisco’s WebEx – Uses Cassandra to store user feed and activity in near real time.
- SoundCloud – Uses Cassandra to store its users' dashboards.
- IBM – Has done research on building a scalable email system based on Cassandra.
Job Titles Involving Hadoop and Cassandra Skills:
A study by SimplyHired shows that Cassandra jobs are in high demand because of Cassandra's high adoption rate in the industry, especially in the last couple of years, and the future looks very promising.
Let’s look at some of the job titles involving Hadoop-Cassandra skills and their salaries as listed on Indeed.com:
- Data Architect: This position nets an average salary of $107,000. Data architects are required to have some experience in creating data models, data warehousing, analyzing data, and data migration.
- Data Scientist: They gather data, analyze it, present the data visually, and use the data to make predictions/forecasts. The average salary for a data scientist is $104,000.
- Systems Engineer: The average salary for systems engineers is $89,000.
- DBA: DBAs make an average of over $100,000.
- Software Application Developer: Software developers make an average salary of $107,000 and application developers $93,000.
People with these skills can get ample freelance work or can launch their own startup if they have the entrepreneurial spirit.