What is Apache Kafka?
Apache Kafka is an open-source, distributed event streaming platform originally developed at LinkedIn and open-sourced in 2011. It is primarily used for building real-time data pipelines and streaming applications. Kafka uses a publish-subscribe model built on a durable, replicated commit log, which makes it highly scalable and fault-tolerant.
Key Characteristics:
- Asynchronous Communication: Multiple producers can send data to a topic while multiple consumers subscribe and consume data independently.
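- Durability: Messages are persisted to disk and kept for a configurable retention period (time-based or size-based), whether or not they have been consumed, so multiple consumers can process them independently.
- Ordering and Fault Tolerance: Kafka preserves the order of messages within a partition (not across partitions) and replicates each partition across brokers so that a single broker failure does not lose data.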
Real-World Example:
In an e-commerce application:
- A payment service produces transaction records.
- A fraud detection service and a notification service consume the same topic to take appropriate actions (e.g., alerting or flagging transactions).
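The example above can be pictured with a toy in-memory model. This is only a sketch of the publish-subscribe idea, not Kafka client code: `InMemoryTopic`, the consumer names, and the 5000 threshold are all hypothetical, and the "topic" is a plain Python list standing in for a single partition.

```python
from collections import defaultdict


class InMemoryTopic:
    """Toy append-only log; a stand-in for a single-partition topic."""

    def __init__(self):
        self.log = []                      # records stay in the log after being read
        self.positions = defaultdict(int)  # next offset to read, tracked per consumer

    def produce(self, record):
        self.log.append(record)

    def poll(self, consumer_name):
        """Return the records this consumer has not seen yet and advance its position."""
        start = self.positions[consumer_name]
        records = self.log[start:]
        self.positions[consumer_name] = len(self.log)
        return records


payments = InMemoryTopic()
payments.produce({"txn_id": 1, "amount": 25.00})
payments.produce({"txn_id": 2, "amount": 9800.00})

# Both services read the full stream; neither removes records from the other's view.
fraud_batch = payments.poll("fraud-detection")
notify_batch = payments.poll("notification")

# Hypothetical rule: flag any transaction above 5000.
flagged = [r["txn_id"] for r in fraud_batch if r["amount"] > 5000]
```

Because each consumer keeps its own read position, adding a third service later would not affect the two existing ones.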
Kafka Architecture Overview
Kafka’s architecture revolves around a clustered environment composed of the following components:
1. Kafka Cluster
- A Kafka cluster consists of one or more Kafka brokers (servers) that share cluster metadata and are coordinated by a controller.
2. Kafka Broker
- A broker is responsible for storing data, handling producer and consumer requests, and replicating messages.
- For each partition, one broker hosting a replica acts as the leader and by default serves reads and writes, while the brokers hosting the other replicas act as followers.
3. Topics
- A topic is a logical channel to which records are published.
- Topics are split into multiple partitions to enable scalability.
4. Partitions
- Each partition is an ordered, immutable sequence of records that is only ever appended to.
- Kafka replicates partitions across brokers, according to the topic's replication factor, to ensure data availability.
Diagram: Kafka Topic Partitioning
             +-------------+     +-------------+
Producer --> | Partition 0 | <-- |  Consumer   |
             +-------------+     +-------------+
Producer --> | Partition 1 | <-- |  Consumer   |
             +-------------+     +-------------+
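How a keyed record ends up in a particular partition can be sketched as a stable hash of the key modulo the partition count. The sketch below uses CRC32 purely as a stand-in (Kafka's Java client uses a murmur2 hash for keyed records), and `choose_partition` and the order keys are hypothetical names chosen for illustration.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 here, as a stand-in)."""
    return zlib.crc32(key) % num_partitions


partitions = {p: [] for p in range(2)}
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-1", "shipped"),
]
for key, event in events:
    partitions[choose_partition(key.encode(), 2)].append((key, event))

# Every event for the same key lands in the same partition, in the order it was sent.
order_1_partition = choose_partition(b"order-1", 2)
order_1_events = [e for k, e in partitions[order_1_partition] if k == "order-1"]
```

The design consequence is that ordering is guaranteed per key only as long as the partition count stays the same, since changing it changes the result of the modulo.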
5. Offset
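- An offset is a sequential integer that identifies a message's position within a partition; it is unique only within that partition.
- Consumers commit offsets so that Kafka can track how much of the log each consumer group has read.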
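A minimal sketch of offset tracking, assuming a single partition held in memory: `PartitionLog` and the variable names are hypothetical, and the "restart" is only simulated by reading again from the committed position. It follows the convention that the committed offset is the position of the next record to read.

```python
class PartitionLog:
    """Toy partition: each appended value gets the next sequential offset."""

    def __init__(self):
        self.records = []

    def append(self, value):
        offset = len(self.records)
        self.records.append(value)
        return offset

    def read_from(self, offset):
        """Return (offset, value) pairs starting at the given offset."""
        return list(enumerate(self.records))[offset:]


log = PartitionLog()
for value in ["a", "b", "c", "d"]:
    log.append(value)

committed = 0  # position of the next record to read

# Process two records, then commit "last processed offset + 1".
first_batch = log.read_from(committed)[:2]
committed = first_batch[-1][0] + 1

# After a simulated restart, reading resumes from the committed position.
resumed = log.read_from(committed)
```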
6. Producers
- Producers create a ProducerRecord specifying:
  - Topic (mandatory)
  - Message content (mandatory)
  - Partition, Key, Headers (optional)
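The mandatory and optional fields listed above can be modeled as a small data class. This is a hypothetical Python stand-in named `RecordSketch`, not the `ProducerRecord` class from the Kafka client library; it only illustrates which fields must be supplied and which may be left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecordSketch:
    """Hypothetical model of the fields a producer record carries."""

    topic: str                        # mandatory
    value: bytes                      # mandatory: the message content
    partition: Optional[int] = None   # optional: explicit target partition
    key: Optional[bytes] = None       # optional: used to pick a partition when set
    headers: tuple = ()               # optional: (name, value) pairs

    def __post_init__(self):
        if not self.topic:
            raise ValueError("topic is required")
        if self.partition is not None and self.partition < 0:
            raise ValueError("partition must be non-negative")


# Only the mandatory fields.
minimal = RecordSketch(topic="payments", value=b'{"txn_id": 1}')

# All fields set explicitly.
full = RecordSketch(
    topic="payments",
    value=b'{"txn_id": 2}',
    partition=0,
    key=b"customer-42",
    headers=(("source", b"checkout"),),
)
```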
7. Consumers
- Consumers subscribe to topics, process messages, and commit offsets.
- Consumers that share a group ID form a consumer group; each partition is assigned to exactly one consumer in the group, which distributes the load.
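How a consumer group spreads partitions across its members can be sketched with a simple round-robin assignment. This is an illustrative simplification with a hypothetical `assign_round_robin` function; Kafka ships several assignment strategies and this sketch does not reproduce any of them exactly.

```python
def assign_round_robin(partitions, members):
    """Deal partitions out to group members one at a time, in sorted order."""
    if not members:
        raise ValueError("a consumer group needs at least one member")
    ordered = sorted(members)
    assignment = {member: [] for member in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


# Six partitions shared by two consumers.
before = assign_round_robin(range(6), ["consumer-a", "consumer-b"])

# A third consumer joins the group and the partitions are redistributed.
after = assign_round_robin(range(6), ["consumer-a", "consumer-b", "consumer-c"])
```

Each partition still has exactly one owner within the group after the change, which is why adding consumers beyond the partition count leaves the extra consumers idle.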
8. ZooKeeper
- Coordinates Kafka brokers in clusters that do not run in KRaft mode
- Manages:
  - Controller election
  - Metadata (topics, partitions)
  - Broker registration
Why Use Kafka?
1. Scalability
Kafka scales horizontally: adding brokers increases cluster capacity, and adding partitions increases the parallelism of a topic.
2. Durability
Messages remain on disk after consumption until the retention period expires, which allows multiple subscribers to read the same data.
3. Real-Time Processing
Kafka enables low-latency streaming through sequential disk I/O, heavy use of the operating system's page cache, and lightweight offset-based reads.
4. High Throughput
A well-provisioned Kafka cluster can handle millions of messages per second, making it well suited to high-traffic scenarios.
5. Retention Policy
Kafka retains data for a configurable period, so a consumer that was offline can resume from its last committed offset, provided it returns before the data expires.
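Time-based retention can be sketched as a cutoff on record timestamps. The function below is a simplification with hypothetical names: it filters individual records, whereas Kafka removes whole log segments once they age out, and the real setting is a topic-level configuration such as `retention.ms`.

```python
def apply_retention(records, now_ms, retention_ms):
    """Keep only (timestamp_ms, value) records that fall inside the retention window."""
    cutoff = now_ms - retention_ms
    return [(ts, value) for ts, value in records if ts >= cutoff]


log = [(1_000, "a"), (5_000, "b"), (9_000, "c")]

# With a 6-second window at t = 10 s, anything older than t = 4 s is dropped.
kept = apply_retention(log, now_ms=10_000, retention_ms=6_000)
```

Whether a record has been consumed plays no part in the decision; only its age does.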
6. Dynamic Configuration
Topic-level settings can be changed at runtime without restarting brokers, and a topic's partition count can be increased (but not decreased).
7. Open Source
Kafka is an Apache Software Foundation project with strong community support and wide adoption by companies such as LinkedIn, Netflix, and Uber.
Role of Zookeeper in Kafka
ZooKeeper acts as the central coordinator for Kafka clusters that do not run in KRaft mode.
Responsibilities:
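- Leader Election: Elects the controller broker, which in turn chooses a new partition leader when the current leader fails.
- Broker Registration: Maintains the list of active brokers.
- Metadata Management: Stores topic and partition metadata, including which brokers hold each partition's replicas.
- Consumer Group Management: Stored consumer offsets for the legacy consumer; since Kafka 0.9, offsets are kept in the internal __consumer_offsets topic instead.
- Health Monitoring: Detects broker failures through expired sessions, which prompts the controller to elect new partition leaders.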
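The failover idea behind leader election can be sketched as picking the first assigned replica that is both in sync and still alive. This is a simplified picture under stated assumptions: `elect_leader` and the broker IDs are hypothetical, and a real election involves more state than shown here.

```python
def elect_leader(replicas, in_sync, live_brokers):
    """Pick the first assigned replica that is both in sync and alive."""
    for broker in replicas:
        if broker in in_sync and broker in live_brokers:
            return broker
    return None  # no eligible replica: the partition is unavailable


replicas = [1, 2, 3]   # brokers assigned to hold this partition
in_sync = {1, 2, 3}    # replicas that are caught up with the leader

# All brokers healthy: the first replica leads.
leader_before = elect_leader(replicas, in_sync, live_brokers={1, 2, 3})

# Broker 1 fails: the next in-sync, live replica takes over.
leader_after = elect_leader(replicas, in_sync - {1}, live_brokers={2, 3})

# Only the failed broker was in sync: nobody is eligible.
offline = elect_leader(replicas, {1}, live_brokers={2, 3})
```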
Kafka Without Zookeeper
Kafka Evolution:
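- Before Kafka 2.8.0, ZooKeeper was mandatory.
- Kafka 2.8.0 introduced KRaft mode (the result of the KIP-500 initiative) as an early-access way to run Kafka without ZooKeeper; KRaft was declared production-ready in Kafka 3.3.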
Benefits:
- Simplified architecture
- Fewer moving parts
- Faster controller failover and support for more partitions per cluster
Current Status:
- Kafka 4.0 removed ZooKeeper support entirely; clusters on 4.x run only in KRaft mode.
Diagram: Kafka Evolution Without Zookeeper
[Old Architecture]            [New Architecture]
Kafka <--> ZooKeeper    =>    Kafka (Self-managed metadata)
Summary Table
| Component | Description |
|---|---|
| Producer | Publishes messages to topics |
| Consumer | Subscribes to topics and processes messages |
| Broker | Kafka server handling requests and data storage |
| Topic | Logical channel for messages |
| Partition | Unit of parallelism and ordering within a topic |
| Offset | Sequential position of a message within a partition |
| ZooKeeper | Cluster coordinator (optional since 2.8 with KRaft, removed in 4.0) |