Splitting the Monolith into Microservices

architecture · microservices · scalability

The translation platform had grown into a single monolithic Spring Boot application -- internally referred to as the 'god-service' -- that handled translation processing, user management, billing, notifications, file storage, and analytics. Deployments took over 30 minutes, a bug in billing could take down translation processing, and developers could not work on features in parallel without constant merge conflicts.

Decompose the monolith into domain-bounded microservices using the strangler fig pattern, extracting one domain at a time while keeping the monolith operational.

Keep the monolith, improve internal modularity

Pros
  • No distributed systems complexity
  • Single deployment unit is simpler to manage
  • No network latency between modules
  • Easier debugging and tracing
Cons
  • Deployment coupling remains -- all-or-nothing releases
  • Scaling requires scaling everything, even modules that do not need it
  • Single failure domain -- one module crash takes everything down
  • Team coupling -- everyone works in the same codebase with constant conflicts

Full rewrite as microservices

Pros
  • Clean slate -- design each service properly from scratch
  • No legacy baggage in new services
  • Can choose optimal technology per service
Cons
  • Extremely high risk -- rewrites frequently fail
  • Long period with no feature delivery while rewriting
  • Business logic may be lost or incorrectly reimplemented
  • Team cannot maintain two full systems simultaneously

Strangler fig pattern -- incremental extraction

Pros
  • Low risk -- extract one bounded context at a time
  • Monolith keeps running, no big-bang cutover
  • Each extraction delivers immediate value
  • Team learns distributed patterns gradually
Cons
  • Longer total migration timeline
  • Temporary complexity of running hybrid architecture
  • Requires careful API boundary design upfront
  • Some shared database coupling during transition

The strangler fig approach was chosen because a full rewrite was too risky for a revenue-generating platform, and staying with the monolith was no longer viable given the deployment and scaling issues. By extracting one bounded context at a time -- starting with the most painful (billing) -- we could deliver value incrementally while learning the operational patterns of microservices before applying them to more critical domains like translation processing.

Context and Background

By mid-2023, the translation platform monolith had grown to over 300,000 lines of Java code across roughly 40 Spring components, all deployed as a single application on AWS ECS. What started as a reasonable monolith had accumulated responsibilities far beyond its original scope: translation job orchestration, real-time file format conversion, billing and invoicing, email and webhook notifications, user and organization management, and analytics data aggregation.

The pain was felt across the entire team. Deployments required a full restart of the application, taking 30+ minutes including health check convergence. A memory leak in the analytics aggregation module caused OOM kills that took down translation processing for paying customers. Two developers working on unrelated features -- say, adding a new notification channel and optimizing translation memory matching -- frequently had merge conflicts because the shared service layer touched everything. Scaling for peak translation loads meant scaling the entire application, including modules like billing that needed minimal resources.

The final catalyst was a production incident where a billing calculation bug corrupted invoice data, and the emergency hotfix required redeploying the entire platform during business hours, causing a 15-minute translation service interruption. The team agreed: we needed service isolation, and we needed it without a risky rewrite.

Implementation

  1. Domain mapping: Conducted an event storming session to identify bounded contexts within the monolith. Identified 6 core domains: Translation Processing, User Management, Billing, Notifications, File Storage, and Analytics. Mapped the existing code modules to these domains and identified the coupling points (shared database tables, direct method calls, shared DTOs).

  2. API gateway introduction: Deployed an API gateway (Spring Cloud Gateway) in front of the monolith. All external traffic was routed through the gateway, which allowed us to redirect specific routes to new microservices without changing client URLs. This was the foundation of the strangler fig pattern.
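The strangler-fig routing described above can be expressed directly in the gateway's configuration. The sketch below is illustrative only -- the hostnames, path prefixes, and route names are assumptions, not the platform's actual setup. The key idea is precedence: extracted services claim their path prefixes, and a catch-all route with the lowest precedence sends everything else to the monolith.

```yaml
# application.yml for the gateway (illustrative -- hostnames and paths are assumptions)
spring:
  cloud:
    gateway:
      routes:
        # Extracted services claim their path prefixes first
        - id: billing-service
          uri: http://billing-service.internal:8080
          predicates:
            - Path=/api/billing/**
        - id: notification-service
          uri: http://notification-service.internal:8080
          predicates:
            - Path=/api/notifications/**
        # Everything else still falls through to the monolith
        - id: monolith-fallback
          uri: http://monolith.internal:8080
          order: 100
          predicates:
            - Path=/**
```

As each new service goes live, migration is a one-line route change at the gateway rather than a client-side URL update.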

  3. Billing service extraction (first): Extracted billing as the first microservice because it was the most painful coupling point and had relatively clean domain boundaries. Created a new Spring Boot service with its own PostgreSQL schema. Defined Kafka events for translation.completed and subscription.changed that the billing service consumed. Ran both old and new billing in parallel for 2 weeks, comparing outputs for reconciliation.
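The two-week parallel run boils down to a reconciliation check: both billing implementations price the same translation.completed event, and any mismatch is flagged before the old path is switched off. The sketch below is a minimal illustration -- the event fields and per-word pricing are invented, since the real schema and rates are not part of this write-up.

```java
import java.math.BigDecimal;

// Illustrative sketch of the parallel-run reconciliation between the
// monolith's billing module and the extracted billing service.
// Field names and pricing are assumptions, not the real schema.
public class BillingReconciliation {

    // Simplified translation.completed event payload.
    record TranslationCompleted(String jobId, long wordCount, String planTier) {}

    interface BillingCalculator {
        BigDecimal charge(TranslationCompleted event);
    }

    // Both implementations are fed the same event; any divergence is
    // logged for investigation before the legacy path is retired.
    static boolean reconcile(BillingCalculator legacy, BillingCalculator extracted,
                             TranslationCompleted event) {
        BigDecimal oldCharge = legacy.charge(event);
        BigDecimal newCharge = extracted.charge(event);
        return oldCharge.compareTo(newCharge) == 0;
    }

    public static void main(String[] args) {
        // Hypothetical per-word rate, used by both sides here.
        BillingCalculator perWord =
            e -> BigDecimal.valueOf(e.wordCount()).multiply(new BigDecimal("0.02"));
        TranslationCompleted event = new TranslationCompleted("job-42", 1500, "pro");

        // Identical implementations reconcile cleanly.
        assert reconcile(perWord, perWord, event);
    }
}
```

In practice the comparison would run as a Kafka consumer on the same topic as both billing paths, writing mismatches to a reconciliation report rather than asserting inline.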

  4. Notification service extraction: The notification module was next -- it was a natural consumer of events from other domains with minimal write-back. Extracted email, webhook, and SMS notification logic into a dedicated service consuming Kafka events. This service was the first to run on a smaller ECS task definition, demonstrating the cost benefits of independent scaling.
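At its core, the notification service is an event-to-channel router: one domain event may fan out to several delivery channels. A minimal sketch of that dispatch logic, with invented event names and routing rules (the real routing table is not documented here):

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the notification service's fan-out.
// Event types and channel mappings are assumptions for illustration.
public class NotificationRouter {

    enum Channel { EMAIL, WEBHOOK, SMS }

    // Static routing table: which channels fire for which event type.
    private static final Map<String, List<Channel>> ROUTES = Map.of(
        "translation.completed", List.of(Channel.EMAIL, Channel.WEBHOOK),
        "subscription.changed",  List.of(Channel.EMAIL),
        "invoice.overdue",       List.of(Channel.EMAIL, Channel.SMS)
    );

    // Unknown event types produce no notifications rather than an error,
    // so new upstream events can be added before routing rules exist.
    static List<Channel> channelsFor(String eventType) {
        return ROUTES.getOrDefault(eventType, List.of());
    }
}
```

In the real service each channel would have its own sender with retry and dead-letter handling; the router only decides the fan-out.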

  5. Shared database decomposition: The hardest part was untangling the shared database. Used the database-per-service pattern with a transition period where services had read-only access to the monolith database via database views, while writes went through APIs. Gradually migrated data ownership until each service controlled its own tables.
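During the transition, the read-only bridge can be as simple as a database view over the monolith's tables plus a read-only grant for the service's database role. A sketch in SQL, with invented table, schema, and role names:

```sql
-- Illustrative only: table, schema, and role names are assumptions.
-- The billing service reads legacy invoice data through a view while
-- writes go through the monolith's API during the transition.
CREATE VIEW billing_readonly.legacy_invoices AS
SELECT id, organization_id, amount_cents, issued_at
FROM monolith.invoices;

-- The extracted service's database role can only read, never write.
GRANT SELECT ON billing_readonly.legacy_invoices TO billing_service_ro;
```

Once data ownership migrates, the view is dropped and the service reads exclusively from its own schema.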

  6. Translation processing extraction (last): Saved the core translation processing for last because it was the most complex and highest-risk domain. By this point, the team had 6 months of microservice operational experience. Extracted it with its own optimized PostgreSQL schema, Redis caching layer, and Kafka-based job orchestration.
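Kafka's at-least-once delivery means the translation service must tolerate duplicate job events. A minimal sketch of idempotent consumption using a processed-job set -- in production that set would live in Redis or the service's own database, and the names here are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of idempotent job handling for at-least-once Kafka delivery.
// In the real service the dedupe set would be backed by Redis or Postgres;
// an in-memory set is used here only to show the pattern.
public class IdempotentJobConsumer {

    private final Set<String> processedJobIds = new HashSet<>();
    private int jobsExecuted = 0;

    // Returns true only when the job is actually executed;
    // redelivered duplicates are acknowledged but skipped.
    public boolean handle(String jobId) {
        if (!processedJobIds.add(jobId)) {
            return false; // duplicate delivery, already processed
        }
        jobsExecuted++;   // stand-in for the real translation work
        return true;
    }

    public int jobsExecuted() {
        return jobsExecuted;
    }
}
```

Making the consumer idempotent is what allows Kafka offsets to be committed after processing without risking double-billed or double-run translation jobs.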

Results

  • Deployment frequency increased from roughly once per week (due to risk and coordination overhead) to multiple independent deployments per day across services
  • Mean time to recovery improved from approximately 30 minutes (full monolith restart) to under 5 minutes (individual service restart or rollback)
  • The billing incident scenario that originally triggered the migration was eliminated -- billing service issues no longer affected translation processing
  • Infrastructure costs decreased by approximately 20% due to right-sizing: notification and analytics services run on smaller ECS tasks than translation processing
  • Developer velocity improved noticeably -- teams could work on their services independently without cross-domain merge conflicts
  • The full extraction took about 10 months from the first service (billing) to the last (translation processing), with continuous feature delivery throughout
  • Operational complexity did increase as expected -- invested in centralized logging (CloudWatch), distributed tracing, and a service mesh to manage the added complexity