Choosing Apache Kafka for Event Streaming
Context
As the translation platform transitioned from a monolithic architecture to microservices, we needed a reliable inter-service communication mechanism that could handle high-throughput event streaming, ensure message durability, and support event replay for debugging and reprocessing.
Decision
Adopt Apache Kafka as the primary event streaming platform for inter-service communication
Alternatives Considered
AWS SQS with SNS fan-out
Pros:
- Fully managed, zero operational overhead
- Already integrated with our AWS infrastructure (ECS, Lambda)
- Pay-per-message pricing is cost-effective at lower volumes
- Native dead-letter queue support
Cons:
- No message replay capability -- once consumed, messages are gone
- Limited ordering guarantees (FIFO queues have throughput caps)
- Fan-out patterns with SNS add complexity and latency
- No built-in stream processing capabilities
RabbitMQ
Pros:
- Mature and well-understood message broker
- Flexible routing with exchanges and bindings
- Good Spring Boot integration via Spring AMQP
- Lower resource footprint for smaller workloads
Cons:
- Not designed for high-throughput event streaming
- No native event replay or log compaction
- Persistence can become a bottleneck at high message rates
- Operational complexity of clustering and mirrored queues
Apache Kafka
Pros:
- Built for high-throughput event streaming with durable log
- Message replay allows reprocessing and debugging
- Partitioning provides natural parallelism and ordering
- Strong ecosystem: Kafka Streams, Connect, Schema Registry
- Excellent Spring Boot integration via Spring Kafka
Cons:
- Higher operational complexity than managed SQS
- Requires ZooKeeper (or KRaft) cluster management
- Steeper learning curve for the team
- Higher baseline infrastructure cost
Reasoning
Kafka was chosen because our translation pipeline generates events that multiple downstream services need to consume independently -- translation completed events trigger billing, notification, quality scoring, and analytics services simultaneously. Kafka's consumer group model handles this fan-out natively without the SNS+SQS plumbing. The replay capability proved invaluable during the migration period, allowing us to reprocess events when new services came online. We also anticipated needing stream processing for real-time translation analytics, which Kafka Streams provides without additional infrastructure.
Context and Background
The translation platform was evolving from a monolithic Spring Boot application into a set of focused microservices. The monolith had relied on in-process method calls and shared database tables for communication between modules. As we extracted services — translation processing, billing, notification, quality assurance — we needed a communication backbone that could replace these tight couplings with asynchronous, durable messaging.
Our event volume was significant but not extreme: roughly 50,000 translation events per day at the time of the decision, with projections to reach 500,000 within a year as the platform onboarded enterprise clients with batch translation workloads. More important than raw throughput was the need for durability and replay. Translation jobs are high-value operations, and losing events between services was not acceptable.
We also had a specific technical requirement: multiple services needed to react to the same events independently. When a translation completes, the billing service calculates charges, the notification service alerts the customer, the quality service runs automated checks, and the analytics pipeline records metrics. This fan-out pattern needed to work without tightly coupling the producer to its consumers.
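In Kafka terms, this fan-out falls out of the consumer group model: each service subscribes to the same topic under its own group.id, and Kafka delivers every record on the topic to each group independently, with no changes to the producer. A minimal sketch of that configuration (broker address and service names here are illustrative, not our actual deployment values):

```java
import java.util.Properties;

public class FanOutConfig {
    // Each consuming service uses the same topic but its own group.id,
    // so the broker delivers every event to each group exactly once per group.
    static Properties consumerProps(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder host
        props.put("group.id", groupId); // distinct per service => full fan-out
        props.put("enable.auto.commit", "false"); // commit after processing
        return props;
    }

    public static void main(String[] args) {
        // Four independent consumer groups on the same completion topic
        String[] services = {"billing", "notification", "quality", "analytics"};
        for (String svc : services) {
            Properties p = consumerProps(svc + "-service");
            System.out.println(p.getProperty("group.id"));
        }
    }
}
```

Adding a new downstream service is then just another group.id subscribing to the existing topic.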
Implementation
- Cluster setup on AWS: Deployed a 3-broker Kafka cluster using Amazon MSK (Managed Streaming for Apache Kafka) to reduce operational burden. Configured with KRaft mode to eliminate the ZooKeeper dependency. Chose m5.large instances with 500GB EBS volumes per broker.
- Topic design: Defined a topic naming convention (platform.{domain}.{event-type}) and created topics with appropriate partition counts. The platform.translation.completed topic was partitioned by customer ID to maintain per-customer ordering. Set retention to 7 days for standard topics and 30 days for audit-critical topics.
- Schema Registry integration: Deployed Confluent Schema Registry alongside the cluster. All events were defined as Avro schemas with backward compatibility enforced. This prevented producers from publishing breaking schema changes that would crash consumers.
- Spring Kafka configuration: Created a shared Kafka starter library for all services with standardized producer/consumer configurations, error handling with exponential backoff, and dead-letter topic routing for poison messages. Used @KafkaListener with concurrency matching the partition count for each consumer group.
- Monitoring and alerting: Integrated Kafka metrics with CloudWatch via MSK's built-in metric export. Set up alerts for consumer lag exceeding 10,000 messages, under-replicated partitions, and broker disk usage above 75%.
- Gradual migration: Ran the Kafka-based event system in parallel with the old direct database polling approach for 4 weeks. Consumers wrote to a reconciliation table that compared results between the two systems before we cut over fully.
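The per-customer ordering in the topic design above rests on key-based partitioning: records with the same key always land on the same partition, and each partition is consumed in order. A simplified sketch of the principle (Kafka's default partitioner actually applies a murmur2 hash to the serialized key; String.hashCode stands in here, and the partition count is hypothetical):

```java
public class KeyPartitioning {
    // Deterministic hash of the key modulo the partition count. This is a
    // stand-in for Kafka's default partitioner, which hashes the serialized
    // key with murmur2 -- the ordering property it gives us is the same.
    static int partitionFor(String customerId, int numPartitions) {
        int hash = customerId.hashCode();
        return (hash & 0x7fffffff) % numPartitions; // strip sign bit first
    }

    public static void main(String[] args) {
        int partitions = 12; // hypothetical partition count
        // The same customer ID always maps to the same partition, so all
        // events for that customer are consumed in production order.
        int p1 = partitionFor("customer-42", partitions);
        int p2 = partitionFor("customer-42", partitions);
        System.out.println(p1 == p2); // prints true
    }
}
```

The trade-off is that one very active customer can make one partition hot; partition counts need headroom for skewed keys.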
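The exponential backoff in the shared starter library can be illustrated as a doubling delay with a cap, after which a poison message is routed to the dead-letter topic. The constants below are illustrative, not the values we shipped:

```java
public class RetryBackoff {
    // Hypothetical sketch: delay doubles with each retry attempt
    // (base * 2^attempt) and is capped at a maximum.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // 2^attempt growth
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        // e.g. base 1s, cap 30s: 1000, 2000, 4000, 8000, 16000, 30000, ...
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(delayMillis(attempt, 1000, 30000));
        }
        // Once retries are exhausted, the record goes to the dead-letter topic.
    }
}
```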
Results
- Successfully decoupled 6 microservices with zero message loss over the first 3 months of production operation
- Consumer lag stayed consistently under 500 messages during normal operations, with events processed within 200ms of being produced, on average
- Event replay capability was used 4 times during the first quarter to reprocess events when deploying new consumer services, saving significant manual data reconciliation effort
- The fan-out pattern for translation completion events scaled from 3 to 6 consuming services without any changes to the producer
- Reduced inter-service latency from ~30 seconds (the old database polling interval) to under 500ms end-to-end
- MSK operational costs came in at roughly $450/month, which was higher than the equivalent SQS cost would have been, but the replay and stream processing capabilities justified the premium