5 Lessons from Building Event-Driven Systems with Kafka
After working with Kafka across three companies (Paytm, 17Live, and Priority IDC), here are the lessons I wish I had known from day one.
1. Consumer Group Rebalancing Will Bite You
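When a consumer joins or leaves a group, Kafka triggers a rebalance. With the eager rebalance protocol, every consumer in the group gives up its partitions and stops processing until the new assignment is complete. In a high-throughput system, even a few seconds of pause can cause a backlog.
Solution: Use CooperativeStickyAssignor instead of the default RangeAssignor. It rebalances incrementally, so only the partitions that actually move are revoked and the remaining consumers keep processing.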
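As a minimal sketch, the assignor is selected through the consumer's `partition.assignment.strategy` setting. The class name, method name, broker address, and group id below are placeholders for illustration, and the String deserializers are an assumption about the payload type.

```java
import java.util.Properties;

final class ConsumerConfigSketch {
    // Builds consumer properties that opt in to cooperative (incremental) rebalancing.
    // The caller supplies the broker address and group id; both are illustrative here.
    static Properties cooperativeConsumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Assumption: keys and values are plain strings.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // The setting this lesson is about.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }
}
```

These properties can be passed to a `KafkaConsumer` constructor. When switching a group that is already running, the Kafka upgrade notes describe a rolling migration rather than a single restart, so check them before changing the assignor in production.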
2. Exactly-Once Is Harder Than You Think
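Kafka supports exactly-once semantics (EOS), but only within Kafka-to-Kafka flows, where consuming, processing, and producing happen inside a Kafka transaction. The moment you involve an external system (like a database), the same event can be delivered more than once, so your processing has to be idempotent.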
@KafkaListener(topics = "transactions")
public void processTransaction(TransactionEvent event) {
    // Idempotency check
    if (processedRepo.existsByEventId(event.getId())) {
        return; // Already processed
    }
    // Process and record
}
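One caveat with a separate "exists" check followed by a later write: two deliveries of the same event can both pass the check before either one records it. A common way to close that gap is to make the check and the record a single atomic step, for example a unique constraint on the event id written in the same database transaction as the business change. The sketch below illustrates the idea with an in-memory set standing in for that store; `IdempotentProcessor` and its methods are hypothetical names, not part of Spring Kafka.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class IdempotentProcessor {
    // In-memory stand-in for a durable table with a unique constraint on event id.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    // Counts how many times the (placeholder) side effect actually ran.
    private final AtomicInteger applied = new AtomicInteger();

    // Returns true if the event was applied, false if it was a duplicate.
    boolean process(String eventId) {
        // add() returns false when the id is already present,
        // so claiming the id and checking for duplicates happen in one step.
        if (!processedIds.add(eventId)) {
            return false;
        }
        applied.incrementAndGet(); // placeholder for the real business logic
        return true;
    }

    int appliedCount() {
        return applied.get();
    }
}
```

In a real service the set would not survive a restart, so the durable store has to play this role; the in-memory version only shows the shape of the check.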
3. Dead Letter Topics Are Non-Negotiable
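Poison messages (messages that repeatedly fail processing) will block your consumer: it keeps retrying the same offset, and every message behind it in that partition waits. Always configure a dead letter topic so that, after a bounded number of retries, the failing message is parked there and the consumer moves on.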
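The retry-then-park flow can be sketched in plain Java as follows. `DeadLetterRouter`, its handler, and its sink are hypothetical names for illustration; the sink stands in for a producer that writes to the dead letter topic. In Spring Kafka this is usually configured through `DefaultErrorHandler` with a `DeadLetterPublishingRecoverer` rather than written by hand.

```java
import java.util.function.Consumer;

final class DeadLetterRouter<T> {
    private final Consumer<T> handler;        // the normal processing logic
    private final Consumer<T> deadLetterSink; // stand-in for publishing to the dead letter topic
    private final int maxAttempts;

    DeadLetterRouter(Consumer<T> handler, Consumer<T> deadLetterSink, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.handler = handler;
        this.deadLetterSink = deadLetterSink;
        this.maxAttempts = maxAttempts;
    }

    // Returns true if the message was handled, false if it was routed to the dead letter sink.
    boolean handle(T message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                // Swallow and retry; a real consumer would log the error and back off here.
            }
        }
        // Retries exhausted: park the message so the partition can make progress.
        deadLetterSink.accept(message);
        return false;
    }
}
```

The important property is that `handle` always returns, so one bad message cannot stall the messages behind it.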
4. Schema Evolution Matters
We learned this the hard way: adding a required field to an event broke all consumers. Use Avro or Protobuf with a schema registry, enforce a compatibility mode such as backward compatibility, and add new fields as optional or with a default value.
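To see why a default makes the difference, here is a rough sketch of what a reader on the new schema does with an event written under the old one: fields present in the data win, and missing fields fall back to the declared default. This mimics the idea with plain maps and is not Avro's or Protobuf's API; `BackwardCompatibleReader` and the field names used with it are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class BackwardCompatibleReader {
    // Defaults declared by the new schema for fields that older writers never sent.
    private final Map<String, Object> fieldDefaults;

    BackwardCompatibleReader(Map<String, Object> fieldDefaults) {
        this.fieldDefaults = Map.copyOf(fieldDefaults);
    }

    // Returns the record as the new schema sees it:
    // written fields take precedence, missing fields take their default.
    Map<String, Object> read(Map<String, Object> writtenRecord) {
        Map<String, Object> result = new HashMap<>(fieldDefaults);
        result.putAll(writtenRecord);
        return result;
    }
}
```

For example, with a default of `"INR"` for a new `currency` field, an old event without that field still decodes to a complete record. A required field with no default has nothing to fall back to, which is the failure described above.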
5. Monitor Consumer Lag
Consumer lag (the difference between a partition's latest offset and the consumer group's committed offset on that partition) is the single most important Kafka metric. Set alerts for lag exceeding your SLA threshold.
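The arithmetic behind a lag alert is simple enough to sketch. The code below assumes you already have the end offsets and committed offsets per partition (from your monitoring stack or the admin client); `LagCheck` and its methods are hypothetical names, and treating a partition with no committed offset as position 0 is an assumption made for this example.

```java
import java.util.Map;

final class LagCheck {
    // Sums lag across partitions: end offset minus committed offset, floored at zero.
    // Assumption: a partition with no committed offset is treated as committed at 0.
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            total += Math.max(0L, entry.getValue() - committed);
        }
        return total;
    }

    // True when the total lag is above the alerting threshold.
    static boolean breachesThreshold(long totalLag, long maxLag) {
        return totalLag > maxLag;
    }
}
```

For a quick manual check, `kafka-consumer-groups.sh --describe --group <group>` prints a LAG column per partition. A message-count threshold only maps to an SLA once you know your typical processing rate, so it helps to derive the threshold from that rate rather than pick a round number.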