In today’s world, data is the main ingredient of internet applications and typically encompasses the following :
- Page visits and clicks
- User activities
- Events corresponding to logins
- Social networking activities such as likes, shares and comments
- Application-specific metrics (e.g. logs, page load time, performance etc.)
This data can be used to run analytics in real time serving various purposes, some of which are:
- Delivering advertisements
- Tracking abnormal user behaviors
- Displaying search based on relevance
- Showing recommendations based on previous activities
Problem: Collecting all the data is not easy as data is generated from various sources in different formats
Solution: One of the ways to solve this problem is to use a messaging system. Messaging systems provide a seamless integration between distributed applications with the help of messages.
Apache Kafka :
Apache Kafka is a distributed publish subscribe messaging system which was originally developed at LinkedIn and later on became a part of the Apache project. Kafka is fast, agile, scalable and distributed by design.
Kafka Architecture and Terminology :
Topic : A stream of messages belonging to a particular category is called a topic
Producer : A producer can be any application that can publish messages to a topic
Consumer : A consumer can be any application that subscribes to topics and consumes the messages
Broker : Kafka cluster is a set of servers, each of which is called a broker
Kafka is scalable and allows creation of multiple types of clusters.
- Single Node Single Broker Cluster
- Single Node Multiple Broker Cluster
- Multiple Nodes Multiple Broker Cluster
What’s the role of ZooKeeper ?
Each Kafka broker coordinates with other Kafka brokers using ZooKeeper. Producers and Consumers are notified by the ZooKeeper service about the presence of new brokers or failure of the broker in theKafka system.
Single Node Multiple Brokers
Multiple Nodes Multiple Brokers
Kafka @ LinkedIn
LinkedIn Newsfeed is powered by Kafka
LinkedIn notifications are powered by Kafka
Note: Apart from this, LinkedIn uses Kafka for many other tasks like log monitoring, performance metrics, search improvement, among others.
Who else uses Kafka ?
DataSift: DataSift uses Kafka as a collector of monitoring events and to track users’ consumption of data streams in real time
Wooga: Wooga uses Kafka to aggregate and process tracking data from all their Facebook games (hosted at various providers) in a central location
Spongecell: Spongecell uses Kafka to run its entire analytics and monitoring pipeline driving both real time and ETL applications
Loggly : Loggly is the world’s most popular cloud-based log management. It uses Kafka for log collection.
Comparative Study: Kafka vs. ActiveMQ vs. RabbitMQ
Kafka has a more efficient storage format.On an average, each message has an overhead of 9 bytes in Kafka, versus 144 bytes in ActiveMQ
With the wide adoption of Kafka in production, it looks to be a promising solution for solving real world problems. Apache Kafka training can help you get ahead of your peers in a real-time analytics career. Get started with an Apache Kafka tutorial here.
Got a question for us? Please mention it in the comments section and we will get back to you.