Introduction to Column Family with Cassandra
A column family is similar to a table in a Relational Database Management System (RDBMS): it is a logical division that groups related data. In other words, a column family holds data about one subject.
For example, order data is stored in a single column family. The order ID serves as the row key, and columns such as the product that was bought as part of that order are stored in that column family. Similarly, a user column family uses a user ID as the row key. You are free to choose the ID, and the columns can be, for example, ‘name=Kunal’ and ‘city=Bangalore’, where Kunal and Bangalore are the column values. A second column family, the Cricketer column family, stores cricketer statistics. For example, it can hold Sachin Tendulkar in one column and his number of centuries in another. A row can have any number of columns.
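One way to picture the examples above is as a nested mapping from row key to columns. The snippet below is only a conceptual sketch in plain Python, not a Cassandra API; the `orders` row keys, the `product` and `quantity` columns, and the `get_column` helper are hypothetical names chosen for illustration.

```python
# Conceptual sketch only: a column family modelled as
# {row key: {column name: column value}}.
users = {
    "user1": {"name": "Kunal", "city": "Bangalore"},
}

# Hypothetical order data; the row keys and columns are made up.
orders = {
    "order-1001": {"product": "laptop", "quantity": 1},
    "order-1002": {"product": "phone"},  # rows need not share the same columns
}


def get_column(column_family, row_key, column_name):
    """Return one column value, or None if the row or column is absent."""
    return column_family.get(row_key, {}).get(column_name)


print(get_column(users, "user1", "name"))            # Kunal
print(get_column(orders, "order-1002", "quantity"))  # None
```

Each row carries its own set of columns, which is why the second order can omit `quantity` without any change to the first.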
Types of Column Families
A column family is analogous to a table in the RDBMS world. There are two types:
Static Column Family – A static column family is one where the column names and data types are defined up front. When the column family is created, you specify each column’s name and data type. It is called static because the set of columns is fixed and the number of columns is known in advance.
Dynamic Column Family – A dynamic column family, on the other hand, does not define column names up front. Instead, it relies on Cassandra’s ability to store data under arbitrary, application-supplied column names. This helps with unstructured data, because a dynamic column family can absorb new fields that are added later.
If you have a static column family and want to add a dynamic column from your code while loading data, you can add it to the static column family at any time. Cassandra gives you the freedom to choose column names.
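The difference between the two types can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Cassandra is implemented: the `ColumnFamily` class, its `declared` argument, and the `nickname` and `events` examples are all hypothetical. Declared columns are type-checked, while undeclared columns are accepted as they arrive.

```python
class ColumnFamily:
    """Toy model: optional declared columns plus arbitrary extra columns."""

    def __init__(self, name, declared=None):
        self.name = name
        # Declared column name -> expected Python type (empty means fully dynamic).
        self.declared = dict(declared or {})
        self.rows = {}

    def insert(self, row_key, column, value):
        expected = self.declared.get(column)
        # Only declared columns are validated; any other name is stored as-is.
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"column {column!r} expects {expected.__name__}")
        self.rows.setdefault(row_key, {})[column] = value


# Static style: column names and types are known when the family is created.
users = ColumnFamily("users", declared={"name": str, "city": str})
users.insert("user1", "name", "Kunal")
users.insert("user1", "city", "Bangalore")

# A column that was never declared can still be added later.
users.insert("user1", "nickname", "K")

# Dynamic style: nothing is declared, the application invents names as it goes.
events = ColumnFamily("events")
events.insert("user1", "2013-05-01-login", "ok")
```

In this sketch, inserting a non-string value into `city` raises `TypeError`, while `nickname` is accepted even though it was never declared.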
Difference With RDBMS
A Cassandra column family is schema-free and highly scalable. It has two attributes: Name and Comparator. The name is mandatory, and the comparator is the data type for column names. If you don’t specify a comparator, Cassandra uses a default one.
Cassandra also has super column families, which use super columns internally. A super column is a logical grouping that adds another level of grouping on top of columns. For example, a user column family can have two super columns: one for the user’s personal information and one for product information.
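Because the comparator is the data type of the column names, it also decides how those names compare with each other, and therefore the order in which columns sort. The sketch below imitates that idea with plain Python; `ordered_columns` is a hypothetical helper, and the built-in `str` and `int` types merely stand in for a text comparator and an integer comparator.

```python
def ordered_columns(row, comparator):
    """Return the row's column names sorted as the given comparator sees them."""
    return sorted(row, key=comparator)


# The same three column names, interpreted two different ways.
row = {"10": "c", "9": "b", "100": "d"}

text_order = ordered_columns(row, str)  # compared as text
int_order = ordered_columns(row, int)   # compared as integers

print(text_order)  # ['10', '100', '9']
print(int_order)   # ['9', '10', '100']
```

The same column names sort differently depending on the type used to compare them, which is why the comparator is chosen per column family.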
A column is the smallest increment of data in Cassandra. It has 3 components: a name, a value, and a time stamp. The time stamp and two special kinds of column work as follows:
- Time stamp – Used in conflict resolution. It is internal metadata that records when the column was last updated, not a value that applications edit like ordinary data.
- Expiring columns – An expiry time (a time to live, or TTL) can be set on a column, after which the column expires.
- Counter columns – A counter column stores a running count that you can increment and decrement.
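The three ideas in the list above can be sketched in Python as follows. This is a simplified model, not Cassandra's implementation: `Column`, `resolve`, `is_expired`, and `add_to_counter` are hypothetical names, time stamps and TTLs are both expressed in seconds for simplicity, and the tie-breaking rule for equal time stamps is an arbitrary choice made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    value: object
    timestamp: int             # time of the write, in seconds (assumption)
    ttl: Optional[int] = None  # seconds to live; None means it never expires


def resolve(a, b):
    """Conflict resolution: keep the version with the newer time stamp."""
    return a if a.timestamp >= b.timestamp else b


def is_expired(column, now):
    """An expiring column stops being visible once its TTL has elapsed."""
    if column.ttl is None:
        return False
    return now >= column.timestamp + column.ttl


def add_to_counter(counters, name, delta):
    """Counter column: apply a positive or negative delta to a running count."""
    counters[name] = counters.get(name, 0) + delta
    return counters[name]


old = Column("city", "Bangalore", timestamp=100)
new = Column("city", "Mumbai", timestamp=200)
print(resolve(old, new).value)  # Mumbai, because it was written later

session = Column("session", "abc", timestamp=100, ttl=60)
print(is_expired(session, now=120))  # False
print(is_expired(session, now=160))  # True

page_views = {}
add_to_counter(page_views, "home", 1)
add_to_counter(page_views, "home", 1)
print(add_to_counter(page_views, "home", -1))  # 1
```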
A super column groups columns together according to a business need or logical grouping. It adds another level of nesting to the regular column family structure. Column families built from super columns form a super column family.
The primary use case for super columns is to denormalize multiple rows from other column families into a single row, allowing for materialized-view-style data retrieval.
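The extra level of nesting can be pictured as one more layer in the mapping: row key, then super column, then sub-column. The snippet below is a conceptual sketch in plain Python, not a Cassandra API; the `personal` and `products` super column names, the `last_order` sub-column, and the `get_sub_column` helper are hypothetical.

```python
# Conceptual sketch: {row key: {super column: {sub-column name: value}}}.
user_profiles = {
    "user1": {
        "personal": {"name": "Kunal", "city": "Bangalore"},
        # Data that might otherwise live in a separate orders column family,
        # denormalized into the same row (values are made up).
        "products": {"last_order": "order-1001"},
    },
}


def get_sub_column(column_family, row_key, super_column, sub_column):
    """Walk the three levels and return the value, or None if any level is absent."""
    return column_family.get(row_key, {}).get(super_column, {}).get(sub_column)


print(get_sub_column(user_profiles, "user1", "personal", "city"))  # Bangalore
```

A single read of row `user1` returns both groups, which is the materialized-view style of retrieval described above.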
Limitations of Super Columns
One limitation is that all sub-columns of a super column must be deserialized in order to read a single sub-column. Another limitation is that you cannot create secondary indexes on the sub-columns of a super column.
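The first limitation can be illustrated by treating a super column as a single serialized blob. In the sketch below, JSON is only a stand-in for Cassandra's real storage format, and `serialize_super_column` and `read_sub_column` are hypothetical helpers. The point is simply that fetching one sub-column still decodes all of them.

```python
import json


def serialize_super_column(sub_columns):
    """Store every sub-column of one super column together as a single blob."""
    return json.dumps(sub_columns)


def read_sub_column(blob, sub_column):
    """Return one sub-column value plus how many sub-columns had to be decoded."""
    decoded = json.loads(blob)  # the whole super column is decoded here
    return decoded[sub_column], len(decoded)


blob = serialize_super_column({"name": "Kunal", "city": "Bangalore"})
value, decoded_count = read_sub_column(blob, "city")
print(value, decoded_count)  # Bangalore 2
```

With only two sub-columns the cost is negligible, but it grows with the number of sub-columns stored in the super column.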
Column Data types
In the image above, there are different data types. The data type for a column value is called a validator, and the data type for a column name is called a comparator. In the example, name and address are column names. Column names do not have to be text: you can use a date as a column name, which lets you store time series data.
Rows can be wide or skinny. Wide rows can hold millions of columns, and a column family can hold millions of rows.
Skinny rows have a small, limited number of columns.
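To make the validator and comparator roles concrete, here is a small time series sketch in Python where dates act as column names. It is a toy model, not a Cassandra API: `check_types` and `slice_by_name` are hypothetical helpers, the temperature readings are made-up values, and Python's `date` and `float` types stand in for a date comparator and a numeric validator.

```python
from datetime import date


def check_types(row, comparator, validator):
    """True if every column name matches the comparator type and every
    column value matches the validator type."""
    return all(
        isinstance(name, comparator) and isinstance(value, validator)
        for name, value in row.items()
    )


def slice_by_name(row, start, end):
    """Return (name, value) pairs whose names fall in [start, end], in name order."""
    return [(name, row[name]) for name in sorted(row) if start <= name <= end]


# One row of a time series: the column NAME is the date, the VALUE is the reading.
temperatures = {
    date(2013, 1, 3): 19.0,
    date(2013, 1, 1): 21.5,
    date(2013, 1, 2): 20.0,
}

print(check_types(temperatures, comparator=date, validator=float))  # True
first_two_days = slice_by_name(temperatures, date(2013, 1, 1), date(2013, 1, 2))
print(first_two_days)
```

Because the names are dates, a range of days can be read back in chronological order regardless of the order in which the columns were written.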
A key consists of one or more primary key fields. Suppose the name of a city is used as a row key: there may be cities with the same name in two different states, so you have to specify both the city and the state. In other words, the key has to be stated in full. Instead of having a simple column key of one type, you can aggregate several values of several types, also called components, to form one unique column key.
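The city and state example can be sketched with a Python tuple acting as the multi-component key. This is only an illustration of the idea, not a Cassandra API; `make_key`, the `cities` mapping, and the `is_state_capital` column are hypothetical names.

```python
def make_key(city, state):
    """Combine two components into one key; both are needed for uniqueness."""
    return (city, state)


cities = {}
cities[make_key("Springfield", "Illinois")] = {"is_state_capital": True}
cities[make_key("Springfield", "Missouri")] = {"is_state_capital": False}

# The city name alone would be ambiguous; the two-part key keeps the rows apart.
print(len(cities))  # 2
print(cities[make_key("Springfield", "Illinois")]["is_state_capital"])  # True
```

Using the city name alone as the key would make the second insert overwrite the first.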