Choosing Apache Kafka for Event Streaming
Context
As the translation platform transitioned from a monolithic architecture to microservices, we needed a reliable inter-service communication mechanism that could handle high-throughput event streaming, ensure message durability, and support event replay for debugging and reprocessing.
Decision
Adopt Apache Kafka as the primary event streaming platform for inter-service communication
Alternatives Considered
AWS SQS with SNS fan-out
Pros:
- Fully managed, zero operational overhead
- Already integrated with our AWS infrastructure (ECS, Lambda)
- Pay-per-message pricing is cost-effective at lower volumes
- Native dead-letter queue support
Cons:
- No message replay capability -- once consumed, messages are gone
- Limited ordering guarantees (FIFO queues have throughput caps)
- Fan-out patterns with SNS add complexity and latency
- No built-in stream processing capabilities
RabbitMQ
Pros:
- Mature and well-understood message broker
- Flexible routing with exchanges and bindings
- Good Spring Boot integration via Spring AMQP
- Lower resource footprint for smaller workloads
Cons:
- Not designed for high-throughput event streaming
- No native event replay or log compaction
- Persistence can become a bottleneck at high message rates
- Operational complexity of clustering and mirrored queues
Apache Kafka
Pros:
- Built for high-throughput event streaming with durable log
- Message replay allows reprocessing and debugging
- Partitioning provides natural parallelism and ordering
- Strong ecosystem: Kafka Streams, Connect, Schema Registry
- Excellent Spring Boot integration via Spring Kafka
Cons:
- Higher operational complexity than managed SQS
- Requires ZooKeeper (or KRaft) cluster management
- Steeper learning curve for the team
- Higher baseline infrastructure cost
Reasoning
Kafka was chosen because our translation pipeline generates events that multiple downstream services need to consume independently -- translation completed events trigger billing, notification, quality scoring, and analytics services simultaneously. Kafka's consumer group model handles this fan-out natively without the SNS+SQS plumbing. The replay capability proved invaluable during the migration period, allowing us to reprocess events when new services came online. We also anticipated needing stream processing for real-time translation analytics, which Kafka Streams provides without additional infrastructure.
Context and Background
The translation platform was evolving from a monolithic Spring Boot application into a set of focused microservices. The monolith had relied on in-process method calls and shared database tables for communication between modules. As we extracted services — translation processing, billing, notification, quality assurance — we needed a communication backbone that could replace these tight couplings with asynchronous, durable messaging.
Our event volume was significant but not extreme: roughly 50,000 translation events per day at the time of the decision, with projections to reach 500,000 within a year as the platform onboarded enterprise clients with batch translation workloads. More important than raw throughput was the need for durability and replay. Translation jobs are high-value operations, and losing events between services was not acceptable.
We also had a specific technical requirement: multiple services needed to react to the same events independently. When a translation completes, the billing service calculates charges, the notification service alerts the customer, the quality service runs automated checks, and the analytics pipeline records metrics. This fan-out pattern needed to work without tightly coupling the producer to its consumers.
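In Kafka terms, this fan-out falls out of the consumer group model: each service subscribes to the same topic under its own group.id, and Kafka delivers every record on the topic to each group independently, with no changes to the producer. A minimal sketch of that configuration (broker address and service names here are illustrative, not our actual deployment values):

```java
import java.util.Properties;

public class FanOutConfig {
    // Each consuming service uses the same topic but its own group.id,
    // so the broker delivers every event to each group exactly once per group.
    static Properties consumerProps(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder host
        props.put("group.id", groupId); // distinct per service => full fan-out
        props.put("enable.auto.commit", "false"); // commit after processing
        return props;
    }

    public static void main(String[] args) {
        // Four independent consumer groups on the same completion topic
        String[] services = {"billing", "notification", "quality", "analytics"};
        for (String svc : services) {
            Properties p = consumerProps(svc + "-service");
            System.out.println(p.getProperty("group.id"));
        }
    }
}
```

Adding a new downstream service is then just another group.id subscribing to the existing topic.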
Implementation
- Cluster setup on AWS: Deployed a 3-broker Kafka cluster using Amazon MSK (Managed Streaming for Apache Kafka) to reduce operational burden. Configured with KRaft mode to eliminate the ZooKeeper dependency. Chose m5.large instances with 500GB EBS volumes per broker.
- Topic design: Defined a topic naming convention (platform.{domain}.{event-type}) and created topics with appropriate partition counts. The platform.translation.completed topic was partitioned by customer ID to maintain per-customer ordering. Set retention to 7 days for standard topics and 30 days for audit-critical topics.
- Schema Registry integration: Deployed Confluent Schema Registry alongside the cluster. All events were defined as Avro schemas with backward compatibility enforced. This prevented producers from publishing breaking schema changes that would crash consumers.
- Spring Kafka configuration: Created a shared Kafka starter library for all services with standardized producer/consumer configurations, error handling with exponential backoff, and dead-letter topic routing for poison messages. Used @KafkaListener with concurrency matching the partition count for each consumer group.
- Monitoring and alerting: Integrated Kafka metrics with CloudWatch via MSK's built-in metric export. Set up alerts for consumer lag exceeding 10,000 messages, under-replicated partitions, and broker disk usage above 75%.
- Gradual migration: Ran the Kafka-based event system in parallel with the old direct database polling approach for 4 weeks. Consumers wrote to a reconciliation table that compared results between the two systems before we cut over fully.
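The per-customer ordering in the topic design above rests on key-based partitioning: records with the same key always land on the same partition, and each partition is consumed in order. A simplified sketch of the principle (Kafka's default partitioner actually applies a murmur2 hash to the serialized key; String.hashCode stands in here, and the partition count is hypothetical):

```java
public class KeyPartitioning {
    // Deterministic hash of the key modulo the partition count. This is a
    // stand-in for Kafka's default partitioner, which hashes the serialized
    // key with murmur2 -- the ordering property it gives us is the same.
    static int partitionFor(String customerId, int numPartitions) {
        int hash = customerId.hashCode();
        return (hash & 0x7fffffff) % numPartitions; // strip sign bit first
    }

    public static void main(String[] args) {
        int partitions = 12; // hypothetical partition count
        // The same customer ID always maps to the same partition, so all
        // events for that customer are consumed in production order.
        int p1 = partitionFor("customer-42", partitions);
        int p2 = partitionFor("customer-42", partitions);
        System.out.println(p1 == p2); // prints true
    }
}
```

The trade-off is that one very active customer can make one partition hot; partition counts need headroom for skewed keys.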
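The exponential backoff in the shared starter library can be illustrated as a doubling delay with a cap, after which a poison message is routed to the dead-letter topic. The constants below are illustrative, not the values we shipped:

```java
public class RetryBackoff {
    // Hypothetical sketch: delay doubles with each retry attempt
    // (base * 2^attempt) and is capped at a maximum.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // 2^attempt growth
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        // e.g. base 1s, cap 30s: 1000, 2000, 4000, 8000, 16000, 30000, ...
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(delayMillis(attempt, 1000, 30000));
        }
        // Once retries are exhausted, the record goes to the dead-letter topic.
    }
}
```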
Results
- Successfully decoupled 6 microservices with zero message loss over the first 3 months of production operation
- Consumer lag stayed consistently under 500 messages during normal operations, with events processed within 200ms of being produced, on average
- Event replay capability was used 4 times during the first quarter to reprocess events when deploying new consumer services, saving significant manual data reconciliation effort
- The fan-out pattern for translation completion events scaled from 3 to 6 consuming services without any changes to the producer
- Reduced inter-service latency from ~30 seconds (the old database polling interval) to under 500ms end-to-end
- MSK operational costs came in at roughly $450/month, which was higher than the equivalent SQS cost would have been, but the replay and stream processing capabilities justified the premium