Apache Kafka With Spark Streaming: Real-Time Analytics Redefined
Apache projects like Kafka and Spark remain popular choices for stream processing. Engineers have started integrating Kafka with Spark Streaming to benefit from the advantages both offer. This webinar discusses the advantages of Kafka, its components, and its use cases, along with Kafka-Spark integration.
The video above covers the following topics. You can check out the presentation at the end of the post:
What is Kafka?
Why do we need Kafka?
How does Kafka work?
Which companies are using Kafka?
Kafka and Spark integration
What is Kafka and why do we need it?
Kafka is an open-source message broker developed by the Apache Software Foundation and written in Scala. Kafka’s objective is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka’s design is predominantly based on the commit log: an append-only, ordered sequence of records.
Kafka is used in scenarios that exploit its scalability, data partitioning, low latency, and ability to handle large volumes of data. In short, Kafka is a strong fit for data-integration use cases.
Kafka is adopted in the following use cases:
- Transportation of logs
- Real-time activity streams
- Time taken to load a web page
- Time taken by multiple services while building a web page
- Number of requests
- Number of hits on a particular page/URL
Why did LinkedIn build Kafka?
Kafka was originally developed at LinkedIn and became an Apache project in July 2011. It is used by LinkedIn for applications including log aggregation, queuing, and real-time monitoring and event processing.
Before delving into the reasons behind the creation of Kafka, we must understand the actual processes that happen at LinkedIn:
LinkedIn uses Apache Kafka as a central publish-subscribe log for integrating data between applications, stream processing, and Hadoop data ingestion.
LinkedIn relies heavily on the scalability and reliability of Kafka and a surrounding ecosystem of both open-source and internal components. A test was conducted with a Kafka cluster set up on three machines, each with six drives mounted directly with no RAID (JBOD style). Here are the test results.
How fast is Kafka?
Kafka sustained up to two million writes per second on three cheap machines. This was achieved using three producers on three different machines with 3x asynchronous replication, and only one producer per machine because the NIC was already saturated. Kafka has shown sustained throughput as stored data grows. The result below uses a slightly different test configuration than the 2M writes/sec test above:
Why is Kafka so fast?
Apache Kafka persists all data to disk, but writes go first to the OS page cache, i.e. RAM; the hardware and OS are tuned so that this write path is fast.
Kafka efficiently transfers data from the page cache to a network socket via the sendfile() system call on Linux.
This combination of fast writes and fast reads is what makes Kafka fast.
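The zero-copy read path can be sketched with Python's stdlib wrapper around sendfile(). This is a minimal, Linux-oriented illustration, not Kafka's actual code: the socket pair stands in for a broker-to-consumer connection, and the temporary file stands in for a log segment sitting in the page cache.

```python
import os
import socket
import tempfile

# Write a payload to a regular file; on Linux these bytes land in the
# OS page cache, which is where Kafka's recent log data usually lives.
payload = b"kafka-style log segment " * 100
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

# A connected socket pair stands in for a broker-to-consumer connection.
rx, tx = socket.socketpair()

# os.sendfile() copies file bytes to the socket inside the kernel,
# with no round trip through user-space buffers.
with open(path, "rb") as f:
    sent = 0
    while sent < len(payload):
        sent += os.sendfile(tx.fileno(), f.fileno(), sent, len(payload) - sent)
tx.close()

received = b""
while True:
    chunk = rx.recv(65536)
    if not chunk:
        break
    received += chunk
rx.close()
os.unlink(path)

assert received == payload
```

The point of the technique is what is absent: there is no `read()` into a user-space buffer followed by a `send()`, which is the double copy sendfile() avoids.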
On a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as data is served entirely from cache.
Example: According to loggly.com, which runs Kafka on Amazon AWS, “99.99999% of the time the data is coming from disk cache and RAM; only very rarely do we hit the disk.”
“One of our consumer groups (8 threads) which maps a log to a customer can process about 200,000 events per second draining from 192 partitions spread across three brokers.”
How Kafka Works
Here’s a pictorial representation of how Kafka works.
In Kafka, a message stream is defined by a topic, divided into one or more partitions. Replication happens at the partition level and each partition has one or more replicas.
The replicas are assigned evenly to different servers (called brokers) in a Kafka cluster. Each replica maintains a log on disk. Published messages are appended sequentially in the log and each message is identified by a monotonically increasing offset within the log.
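The even spreading of replicas across brokers can be sketched as a round-robin placement. This is a simplification for illustration: Kafka's actual assignment algorithm also staggers starting positions per topic, but the load-balancing idea is the same.

```python
def assign_replicas(num_partitions, replication_factor, brokers):
    """Round-robin placement: replica i of partition p lands on
    broker (p + i) mod len(brokers), so each partition's replicas
    sit on distinct brokers and every broker carries equal load."""
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }

# 6 partitions, 3 replicas each, spread over 3 brokers.
layout = assign_replicas(6, 3, ["broker-0", "broker-1", "broker-2"])
print(layout[0])  # ['broker-0', 'broker-1', 'broker-2']
print(layout[1])  # ['broker-1', 'broker-2', 'broker-0']
```

With this placement, losing any single broker still leaves two replicas of every partition alive.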
When a consumer subscribes to a topic, it keeps track of an offset in each partition for consumption and uses it to issue fetch requests to the broker.
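The append-only log with monotonically increasing offsets, and a consumer that tracks its own position per partition, can be modeled in a few lines of plain Python. This is a toy model of the concepts above, not the Kafka wire protocol or client API.

```python
class Partition:
    """Append-only log: each message gets a monotonically increasing offset."""
    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the appended message

    def fetch(self, offset, max_messages=100):
        return self.log[offset:offset + max_messages]


class Consumer:
    """Tracks an offset per partition and advances it after each fetch."""
    def __init__(self, partitions):
        self.offsets = {name: 0 for name in partitions}

    def poll(self, partitions):
        records = {}
        for name, part in partitions.items():
            batch = part.fetch(self.offsets[name])
            self.offsets[name] += len(batch)
            records[name] = batch
        return records


# A topic with two partitions, as a broker-side structure.
topic = {"clicks-0": Partition(), "clicks-1": Partition()}
topic["clicks-0"].append("page:/home")
topic["clicks-1"].append("page:/jobs")
topic["clicks-0"].append("page:/feed")

consumer = Consumer(topic)
first = consumer.poll(topic)
print(first["clicks-0"])  # ['page:/home', 'page:/feed']

# A second poll returns nothing new: the consumer is "caught up".
assert consumer.poll(topic) == {"clicks-0": [], "clicks-1": []}
```

Because the consumer, not the broker, owns the offset, re-reading old data is just a matter of resetting that number.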
Putting it all together, this is what a Kafka cluster looks like:
Real-time Analysis with Spark Streaming
There are numerous data ingestion sources besides Kafka, such as Flume, HDFS, Kinesis, and Twitter. The image below shows how an ingestion source such as Kafka fits into a streaming application and how the results are stored.
Streaming In Detail:
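Spark Streaming consumes a source like Kafka as a sequence of micro-batches: incoming events are grouped into fixed-width time windows and each window is processed as a batch. The sketch below models that windowing in plain Python for illustration; a real integration would use Spark's Kafka DStream/Structured Streaming APIs rather than this hand-rolled grouping.

```python
from collections import Counter

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-width windows, the way
    Spark Streaming slices a stream into per-interval micro-batches."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // batch_interval, []).append(value)
    return [batches[k] for k in sorted(batches)]

# Simulated Kafka messages: (seconds since start, page hit).
events = [(0, "/home"), (1, "/jobs"), (2, "/home"), (5, "/feed"), (6, "/home")]

# A 5-second batch interval yields two micro-batches; counting hits per
# page in each batch mirrors a word-count-style streaming transformation.
counts = [Counter(batch) for batch in micro_batches(events, batch_interval=5)]
for c in counts:
    print(dict(c))
```

Each micro-batch is an independent unit of work, which is why Spark Streaming's latency floor is roughly the batch interval.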
Kafka adoption and use cases
Here’s a list of companies that have adopted Kafka and their corresponding use cases:
- LinkedIn – Activity streams, operational metrics, data bus
- Netflix – Real-time monitoring and event processing
- Twitter – Part of their Storm real-time data pipelines
- Spotify – Log delivery (from 4h down to 10s), Hadoop
- Loggly – Log collection and processing
- Mozilla – Telemetry data
Other companies like Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber and many more have also adopted Kafka.
Questions asked during the webinar
1. Does Kafka help to build lineage?
No; Kafka just publishes the data.
2. What is an example configuration for a scalable system?
It depends on the size of the messages being passed, plus the RAM available on the nodes where you are going to set up Kafka.
3. Is aggregation possible on Kafka?
Aggregation is not something that Kafka was designed to do.
Got a question for us? Please mention it in the comments section and we will get back to you.