Splitting the Monolith into Microservices

architecture · microservices · scalability

The translation platform had grown into a single monolithic Spring Boot application -- internally referred to as the 'god-service' -- that handled translation processing, user management, billing, notifications, file storage, and analytics. Deployments took over 30 minutes, a bug in billing could take down translation processing, and developers could not work on features in parallel without constant merge conflicts.

Decompose the monolith into domain-bounded microservices using the strangler fig pattern, extracting one domain at a time while keeping the monolith operational.

Keep the monolith, improve internal modularity

Pros
  • No distributed systems complexity
  • Single deployment unit is simpler to manage
  • No network latency between modules
  • Easier debugging and tracing
Cons
  • Deployment coupling remains -- all-or-nothing releases
  • Scaling requires scaling everything, even modules that do not need it
  • Single failure domain -- one module crash takes everything down
  • Team coupling -- everyone works in the same codebase with constant conflicts

Full rewrite as microservices

Pros
  • Clean slate -- design each service properly from scratch
  • No legacy baggage in new services
  • Can choose optimal technology per service
Cons
  • Extremely high risk -- rewrites frequently fail
  • Long period with no feature delivery while rewriting
  • Business logic may be lost or incorrectly reimplemented
  • Team cannot maintain two full systems simultaneously

Strangler fig pattern -- incremental extraction

Pros
  • Low risk -- extract one bounded context at a time
  • Monolith keeps running, no big-bang cutover
  • Each extraction delivers immediate value
  • Team learns distributed patterns gradually
Cons
  • Longer total migration timeline
  • Temporary complexity of running hybrid architecture
  • Requires careful API boundary design upfront
  • Some shared database coupling during transition

The strangler fig approach was chosen because a full rewrite was too risky for a revenue-generating platform, and staying with the monolith was no longer viable given the deployment and scaling issues. By extracting one bounded context at a time -- starting with the most painful (billing) -- we could deliver value incrementally while learning the operational patterns of microservices before applying them to more critical domains like translation processing.

Context and Background

By mid-2023, the translation platform monolith had grown to over 300,000 lines of Java code across roughly 40 Spring components, all deployed as a single application on AWS ECS. What started as a reasonable monolith had accumulated responsibilities far beyond its original scope: translation job orchestration, real-time file format conversion, billing and invoicing, email and webhook notifications, user and organization management, and analytics data aggregation.

The pain was felt across the entire team. Deployments required a full restart of the application, taking 30+ minutes including health check convergence. A memory leak in the analytics aggregation module caused OOM kills that took down translation processing for paying customers. Two developers working on unrelated features -- say, adding a new notification channel and optimizing translation memory matching -- frequently had merge conflicts because the shared service layer touched everything. Scaling for peak translation loads meant scaling the entire application, including modules like billing that needed minimal resources.

The final catalyst was a production incident where a billing calculation bug corrupted invoice data, and the emergency hotfix required redeploying the entire platform during business hours, causing a 15-minute translation service interruption. The team agreed: we needed service isolation, and we needed it without a risky rewrite.

Implementation

  1. Domain mapping: Conducted an event storming session to identify bounded contexts within the monolith. Identified 6 core domains: Translation Processing, User Management, Billing, Notifications, File Storage, and Analytics. Mapped the existing code modules to these domains and identified the coupling points (shared database tables, direct method calls, shared DTOs).

  2. API gateway introduction: Deployed an API gateway (Spring Cloud Gateway) in front of the monolith. All external traffic was routed through the gateway, which allowed us to redirect specific routes to new microservices without changing client URLs. This was the foundation of the strangler fig pattern.
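The strangler-fig routing described above can be expressed directly in the gateway's configuration. The sketch below is illustrative only -- the hostnames, path prefixes, and route names are assumptions, not the platform's actual setup. The key idea is precedence: extracted services claim their path prefixes, and a catch-all route with the lowest precedence sends everything else to the monolith.

```yaml
# application.yml for the gateway (illustrative -- hostnames and paths are assumptions)
spring:
  cloud:
    gateway:
      routes:
        # Extracted services claim their path prefixes first
        - id: billing-service
          uri: http://billing-service.internal:8080
          predicates:
            - Path=/api/billing/**
        - id: notification-service
          uri: http://notification-service.internal:8080
          predicates:
            - Path=/api/notifications/**
        # Everything else still falls through to the monolith
        - id: monolith-fallback
          uri: http://monolith.internal:8080
          order: 100
          predicates:
            - Path=/**
```

As each new service goes live, migration is a one-line route change at the gateway rather than a client-side URL update.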

  3. Billing service extraction (first): Extracted billing as the first microservice because it was the most painful coupling point and had relatively clean domain boundaries. Created a new Spring Boot service with its own PostgreSQL schema. Defined Kafka events for translation.completed and subscription.changed that the billing service consumed. Ran both old and new billing in parallel for 2 weeks, comparing outputs for reconciliation.
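The two-week parallel run boils down to a reconciliation check: both billing implementations price the same translation.completed event, and any mismatch is flagged before the old path is switched off. The sketch below is a minimal illustration -- the event fields and per-word pricing are invented, since the real schema and rates are not part of this write-up.

```java
import java.math.BigDecimal;

// Illustrative sketch of the parallel-run reconciliation between the
// monolith's billing module and the extracted billing service.
// Field names and pricing are assumptions, not the real schema.
public class BillingReconciliation {

    // Simplified translation.completed event payload.
    record TranslationCompleted(String jobId, long wordCount, String planTier) {}

    interface BillingCalculator {
        BigDecimal charge(TranslationCompleted event);
    }

    // Both implementations are fed the same event; any divergence is
    // logged for investigation before the legacy path is retired.
    static boolean reconcile(BillingCalculator legacy, BillingCalculator extracted,
                             TranslationCompleted event) {
        BigDecimal oldCharge = legacy.charge(event);
        BigDecimal newCharge = extracted.charge(event);
        return oldCharge.compareTo(newCharge) == 0;
    }

    public static void main(String[] args) {
        // Hypothetical per-word rate, used by both sides here.
        BillingCalculator perWord =
            e -> BigDecimal.valueOf(e.wordCount()).multiply(new BigDecimal("0.02"));
        TranslationCompleted event = new TranslationCompleted("job-42", 1500, "pro");

        // Identical implementations reconcile cleanly.
        assert reconcile(perWord, perWord, event);
    }
}
```

In practice the comparison would run as a Kafka consumer on the same topic as both billing paths, writing mismatches to a reconciliation report rather than asserting inline.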

  4. Notification service extraction: The notification module was next -- it was a natural consumer of events from other domains with minimal write-back. Extracted email, webhook, and SMS notification logic into a dedicated service consuming Kafka events. This service was the first to run on a smaller ECS task definition, demonstrating the cost benefits of independent scaling.
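At its core, the notification service is an event-to-channel router: one domain event may fan out to several delivery channels. A minimal sketch of that dispatch logic, with invented event names and routing rules (the real routing table is not documented here):

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the notification service's fan-out.
// Event types and channel mappings are assumptions for illustration.
public class NotificationRouter {

    enum Channel { EMAIL, WEBHOOK, SMS }

    // Static routing table: which channels fire for which event type.
    private static final Map<String, List<Channel>> ROUTES = Map.of(
        "translation.completed", List.of(Channel.EMAIL, Channel.WEBHOOK),
        "subscription.changed",  List.of(Channel.EMAIL),
        "invoice.overdue",       List.of(Channel.EMAIL, Channel.SMS)
    );

    // Unknown event types produce no notifications rather than an error,
    // so new upstream events can be added before routing rules exist.
    static List<Channel> channelsFor(String eventType) {
        return ROUTES.getOrDefault(eventType, List.of());
    }
}
```

In the real service each channel would have its own sender with retry and dead-letter handling; the router only decides the fan-out.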

  5. Shared database decomposition: The hardest part was untangling the shared database. Used the database-per-service pattern with a transition period where services had read-only access to the monolith database via database views, while writes went through APIs. Gradually migrated data ownership until each service controlled its own tables.
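During the transition, the read-only bridge can be as simple as a database view over the monolith's tables plus a read-only grant for the service's database role. A sketch in SQL, with invented table, schema, and role names:

```sql
-- Illustrative only: table, schema, and role names are assumptions.
-- The billing service reads legacy invoice data through a view while
-- writes go through the monolith's API during the transition.
CREATE VIEW billing_readonly.legacy_invoices AS
SELECT id, organization_id, amount_cents, issued_at
FROM monolith.invoices;

-- The extracted service's database role can only read, never write.
GRANT SELECT ON billing_readonly.legacy_invoices TO billing_service_ro;
```

Once data ownership migrates, the view is dropped and the service reads exclusively from its own schema.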

  6. Translation processing extraction (last): Saved the core translation processing for last because it was the most complex and highest-risk domain. By this point, the team had 6 months of microservice operational experience. Extracted it with its own optimized PostgreSQL schema, Redis caching layer, and Kafka-based job orchestration.
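Kafka's at-least-once delivery means the translation service must tolerate duplicate job events. A minimal sketch of idempotent consumption using a processed-job set -- in production that set would live in Redis or the service's own database, and the names here are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of idempotent job handling for at-least-once Kafka delivery.
// In the real service the dedupe set would be backed by Redis or Postgres;
// an in-memory set is used here only to show the pattern.
public class IdempotentJobConsumer {

    private final Set<String> processedJobIds = new HashSet<>();
    private int jobsExecuted = 0;

    // Returns true only when the job is actually executed;
    // redelivered duplicates are acknowledged but skipped.
    public boolean handle(String jobId) {
        if (!processedJobIds.add(jobId)) {
            return false; // duplicate delivery, already processed
        }
        jobsExecuted++;   // stand-in for the real translation work
        return true;
    }

    public int jobsExecuted() {
        return jobsExecuted;
    }
}
```

Making the consumer idempotent is what allows Kafka offsets to be committed after processing without risking double-billed or double-run translation jobs.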

Results

  • Deployment frequency increased from roughly once per week (due to risk and coordination overhead) to multiple independent deployments per day across services
  • Mean time to recovery improved from approximately 30 minutes (full monolith restart) to under 5 minutes (individual service restart or rollback)
  • The billing incident scenario that originally triggered the migration was eliminated -- billing service issues no longer affected translation processing
  • Infrastructure costs decreased by approximately 20% due to right-sizing: notification and analytics services run on smaller ECS tasks than translation processing
  • Developer velocity improved noticeably -- teams could work on their services independently without cross-domain merge conflicts
  • The full extraction took about 10 months from the first service (billing) to the last (translation processing), with continuous feature delivery throughout
  • Operational complexity did increase as expected -- invested in centralized logging (CloudWatch), distributed tracing, and a service mesh to manage the added complexity