Kafka Architecture Overview
Main Components:
- Producer: Generates and sends messages to Kafka topics.
- Consumer: Retrieves messages from Kafka topics.
- Cluster: Composed of multiple brokers for scalability and fault tolerance.
- Zookeeper: Coordinates Kafka brokers, stores cluster metadata, and handles leader election.
Hierarchical Structure:
Kafka Cluster
└── Brokers
    └── Topics
        └── Partitions
            └── Offsets (each identifies one message)
Key Terminologies
1. Topic
- Logical feed name or category to which messages are sent.
- Analogy: Like a folder where each message is a file.
- Supports multiple producers and consumers.
- Messages persist for a configurable retention period (set by time or size) and are not deleted after consumption.
2. Partition
- Topics are split into partitions (like subfolders).
- Messages are appended and stored in strict order within a partition; there is no ordering guarantee across partitions.
- Each message has an offset (sequential ID) that is unique within its partition.
- Parallel Consumption: Different consumers can read from different partitions.
- Replication:
  - Implemented at partition level.
  - Each partition has:
    - Leader: Handles all reads/writes by default.
    - Followers: Replicate data from the leader.
  - If the leader fails, an in-sync follower takes over as the new leader.
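The leader/follower behavior above can be sketched with a toy model. This is a simplified illustration, not Kafka code: the `ReplicatedPartition` class and broker names are hypothetical, followers are copied synchronously here (real followers fetch from the leader), and the first remaining follower is assumed to be in sync.

```python
class ReplicatedPartition:
    """Toy model of one partition replicated across brokers (hypothetical)."""

    def __init__(self, replica_brokers):
        # The first replica starts as leader; the rest are followers.
        self.leader = replica_brokers[0]
        self.followers = list(replica_brokers[1:])
        self.logs = {broker: [] for broker in replica_brokers}

    def write(self, message):
        # Writes go to the leader, then each follower copies the record.
        self.logs[self.leader].append(message)
        for follower in self.followers:
            self.logs[follower].append(message)

    def fail_leader(self):
        # Drop the failed leader and promote the first remaining follower.
        failed = self.leader
        del self.logs[failed]
        self.leader = self.followers.pop(0)
        return failed


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write("m0")
partition.write("m1")
partition.fail_leader()   # broker-1 goes down
partition.write("m2")     # the new leader keeps accepting writes
```

Because the promoted follower already holds every record written before the failure, no data is lost and writes continue on the new leader.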
3. Offset
- Unique identifier for each message within a partition.
- Messages are read in sequential order by offset.
- Helps consumers track what they have read.
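To make offsets concrete, here is a minimal sketch that treats a partition as an append-only list. The `PartitionLog` class is a hypothetical in-memory stand-in, not a Kafka API; it only shows how sequential offsets let a reader resume where it left off.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy in-memory stand-in for one partition (hypothetical, not a Kafka API)."""
    records: list = field(default_factory=list)

    def append(self, value):
        # The next offset is simply the current length of the log.
        offset = len(self.records)
        self.records.append(value)
        return offset

    def read_from(self, offset, max_records=10):
        # Return (offset, value) pairs starting at the requested offset.
        chunk = self.records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
first = log.append("order-created")
second = log.append("order-paid")
third = log.append("order-shipped")

# A consumer that has already processed offset 0 resumes from offset 1.
resumed = log.read_from(1)
```

Reading never removes records from the log, which is why a second reader could still start from offset 0.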
4. Broker (Kafka Server/Node)
- Kafka server that handles read/write operations.
- Stores data to disk.
- Enables load balancing and fault tolerance through clustering.
- A single broker cannot provide replication or fault tolerance.
5. Kafka Cluster
- Group of brokers that together host topics and their partitions.
- Provides scalability, redundancy, and fault tolerance.
Producer Internals
- Sends data to Kafka topics.
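- Leader Discovery: Fetches cluster metadata to identify the partition leader before sending.
- Partition Assignment:
  - When a message has a key, the key's hash determines the partition, so messages with the same key land in the same partition.
  - The broker appends each message to the end of the partition and assigns it the next offset.
  - Tip: Avoid using the same key for all messages to prevent partition imbalance.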
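Key-based partition assignment can be sketched as a hash modulo the partition count. This is an illustration under assumptions: `choose_partition` is a hypothetical helper, and it uses CRC32 as a stand-in hash, whereas Kafka's Java client hashes keys with murmur2, so real partition numbers will differ.

```python
import zlib
from collections import Counter


def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in hash (CRC32); Kafka's Java client uses murmur2 for keyed records.
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3

# The same key always maps to the same partition, preserving per-key order.
p1 = choose_partition(b"user-42", NUM_PARTITIONS)
p2 = choose_partition(b"user-42", NUM_PARTITIONS)

# Varied keys spread the load; one constant key sends everything to one partition.
varied = Counter(
    choose_partition(f"user-{i}".encode(), NUM_PARTITIONS) for i in range(1000)
)
constant = Counter(
    choose_partition(b"same-key", NUM_PARTITIONS) for _ in range(1000)
)
```

Comparing `varied` with `constant` shows the imbalance the tip warns about: the constant key fills a single partition while the others stay empty.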
Consumer Internals
- Pulls messages from Kafka topics.
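- Offset Management:
  - Tracks the read position (committed offset) per partition for each consumer group.
  - Helps avoid duplication or data loss after a restart or rebalance.
- Consumer Group:
  - Consumers with the same group ID form a group.
  - Each partition is assigned to exactly one consumer within a group; a single consumer may own several partitions.
  - Parallel Consumption: Multiple consumers can read different partitions.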
- Pull Model: Consumers actively pull data (Kafka doesn’t push).
- Resilience: Consumers can reset offset to reprocess messages.
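The consumer-group and offset ideas above can be sketched in a few lines. Everything here is hypothetical and simplified: `assign_partitions` uses plain round-robin (Kafka offers several assignment strategies), and `committed` is an in-memory dictionary standing in for committed offsets stored by the cluster.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition goes to exactly one consumer."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
two_consumers = assign_partitions(partitions, ["c1", "c2"])
# With more consumers than partitions, the extra consumer sits idle.
five_consumers = assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5"])

# Committed offsets are tracked per (group, partition): the next offset to read.
committed = {("billing", 0): 0}
log = ["m0", "m1", "m2"]


def poll(group, partition):
    # Pull everything after the committed position, then commit the new position.
    start = committed[(group, partition)]
    batch = log[start:]
    committed[(group, partition)] = start + len(batch)
    return batch


first_poll = poll("billing", 0)
second_poll = poll("billing", 0)   # nothing new to read
committed[("billing", 0)] = 0      # reset the offset to reprocess
replayed = poll("billing", 0)
```

Resetting the committed offset replays the same messages because consuming never deletes them from the partition.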
Zookeeper in Kafka
- Manages broker metadata and coordinates the Kafka cluster.
- Functions:
  - Broker registration and de-registration.
  - Controller election; the controller broker then elects partition leaders.
  - Failure detection: notices when a broker goes offline so the cluster can reassign leadership.
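- Requirement: Kafka running in Zookeeper mode cannot function without Zookeeper; newer releases can instead run in KRaft mode, which needs no Zookeeper.
- Deployment: Should run with an odd number of Zookeeper nodes (e.g., 3) so that a majority quorum survives a node failure.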
- Note: End users don’t interact directly with Zookeeper.
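The odd-number recommendation follows from majority-quorum arithmetic, sketched below. The helper names are hypothetical; the calculation assumes the ensemble stays available only while a strict majority of nodes is reachable.

```python
def quorum_size(nodes: int) -> int:
    # A strict majority of nodes must agree on any change.
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    # Nodes that can fail while a majority is still reachable.
    return nodes - quorum_size(nodes)


# Maps ensemble size to (quorum size, tolerated failures).
summary = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

Under this model a 4-node ensemble tolerates one failure, the same as a 3-node ensemble, so the extra even node adds cost without adding resilience.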
Illustrative Diagram of Kafka Architecture
            +------------------+
            |    Zookeeper     |
            +------------------+
                     ↑
      +--------------+--------------+
      |              |              |
+------------+ +------------+ +------------+
|  Broker 1  | |  Broker 2  | |  Broker 3  |
+------------+ +------------+ +------------+
      ↑              ↑              ↑
   Topic A        Topic A        Topic B
 Partition 0    Partition 1    Partition 0
Producer ---> Broker (writes to topic partition)
Consumer ---> Broker (reads from topic partition)
Analogies for Understanding
- Topic as Folder → Messages as Files
- Partition as Subfolder → Stores files (messages) in order
- Offset as Line Number → Helps reader (consumer) track progress
- Broker as Post Office → Receives and stores mail (messages)
- Producer as Sender → Sends letters (data)
- Consumer as Receiver → Picks up letters from mailbox (broker)
- Zookeeper as City Coordinator → Keeps post offices running smoothly
Interview Tips
- Always describe data flow: Producer → Broker → Consumer.
- Emphasize offset management for data reliability.
- Highlight partitioning for scalability.
- Understand and explain consumer groups and replication clearly.
- Know the role of Zookeeper and of KRaft mode, which replaces Zookeeper in modern Kafka versions (Kafka 4.0 removed Zookeeper mode entirely).
These concepts form the backbone of any Kafka-based distributed messaging architecture and are crucial for interviews in backend or data engineering roles.