Implementing Multi-Level Caching with Redis and Local Cache

performance · caching · redis

The translation platform was experiencing high latency on frequently accessed endpoints -- translation memory lookups, glossary fetches, and user preference loading. Database queries were the bottleneck, with the same data being fetched repeatedly across requests. A single caching layer was insufficient due to the varied access patterns and consistency requirements across different data types.

Implement a two-level caching strategy using Caffeine as the L1 local cache and Redis as the L2 distributed cache, following the cache-aside pattern with configurable TTLs per data type.

Redis-only caching

Pros
  • Single cache layer simplifies architecture
  • Consistent view across all service instances
  • Well-understood invalidation model
Cons
  • Every cache hit still requires a network round-trip to Redis (~1-2ms)
  • Redis becomes a single point of failure for all cached data
  • Network latency adds up for high-frequency lookups like translation memory segments

Local cache only (Caffeine/Guava)

Pros
  • Sub-microsecond access times, no network overhead
  • No additional infrastructure to manage
  • Simple implementation with Spring Cache abstraction
Cons
  • Each service instance maintains its own cache, leading to inconsistency
  • Cache is lost on deployment or restart
  • Memory pressure on application heap
  • No sharing between service instances

Two-level cache: Caffeine L1 + Redis L2

Pros
  • Hot data served from local memory with sub-microsecond latency
  • Redis provides shared, durable cache across instances
  • Reduces Redis load by 60-70% since most hits resolve at L1
  • Graceful degradation -- if Redis is unavailable, L1 still serves
Cons
  • More complex invalidation logic across two layers
  • Potential for short-lived inconsistency between L1 caches on different instances
  • Requires careful TTL tuning per data type

The translation memory lookup endpoint was called approximately 200 times per translation job, and with hundreds of concurrent jobs the aggregate query rate on this data was enormous. Redis-only caching reduced database load, but the 1-2ms network round-trip per lookup still added 200-400ms per job. By placing a Caffeine L1 cache in front of Redis, we eliminated the network hop for the hottest data. The short-lived inconsistency between instances was acceptable for our use case -- translation memory and glossary data changes infrequently, and a 30-second staleness window had no practical impact on translation quality.

Context and Background

The translation platform’s core workflow involves looking up previously translated segments (translation memory) to suggest matches for new content. Each document being translated generates hundreds of segment lookups against the translation memory database. At peak load with enterprise batch jobs, the PostgreSQL database was processing upwards of 15,000 translation memory queries per minute, and response times were degrading.

We had already optimized the database layer with proper indexing and query tuning, but the fundamental problem was that the same translation memory segments were being fetched repeatedly. A customer translating a 200-page technical manual would hit the same glossary terms and recurring phrases thousands of times. The data was highly cacheable — translation memories are append-mostly, and glossary entries change perhaps once a week.

Our initial approach with Redis-only caching improved things significantly, cutting database load by about 70%. But profiling revealed that the Redis network round-trips were now the bottleneck for the translation memory lookup endpoint. Each lookup took 1-2ms to Redis, and with 200+ lookups per job, this added 200-400ms of pure network overhead per translation job. We needed a cache that could serve the hottest data without any network hop.

Implementation

  1. Cache abstraction layer: Built a custom MultiLevelCacheManager implementing Spring’s CacheManager interface. This allowed existing @Cacheable annotations to work transparently with the two-level cache. The manager delegates to a MultiLevelCache that checks Caffeine first, then Redis, then the database.
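The read path of that delegation can be sketched as follows. This is a minimal illustration, not the production class: plain maps stand in for the Caffeine and Redis tiers, and all names (`MultiLevelLookup`, `dbLoader`) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the cache-aside read path: L1, then L2, then the database.
// ConcurrentHashMaps stand in for Caffeine (l1) and Redis (l2).
public class MultiLevelLookup {
    private final Map<String, String> l1 = new ConcurrentHashMap<>(); // stand-in for Caffeine
    private final Map<String, String> l2 = new ConcurrentHashMap<>(); // stand-in for Redis

    public String get(String key, Function<String, String> dbLoader) {
        String value = l1.get(key);            // 1. check the local cache
        if (value != null) return value;

        value = l2.get(key);                   // 2. check the distributed cache
        if (value != null) {
            l1.put(key, value);                // promote L2 hits into L1
            return value;
        }

        value = dbLoader.apply(key);           // 3. fall through to the database
        l2.put(key, value);                    // populate the shared tier first
        l1.put(key, value);                    // then the local tier on this instance
        return value;
    }
}
```

Promoting L2 hits into L1 is what lets the hottest keys migrate to local memory after the first lookup on each instance.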

  2. L1 configuration (Caffeine): Configured Caffeine with per-cache-name settings. Translation memory cache: max 10,000 entries, 30-second TTL. Glossary cache: max 5,000 entries, 5-minute TTL. User preferences: max 2,000 entries, 60-second TTL. Used recordStats() for hit rate monitoring via Micrometer.
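Wiring those per-cache-name settings into Spring might look like the fragment below. The cache names (`translationMemory`, `glossary`, `userPreferences`) are illustrative; only the size/TTL numbers come from the text above.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import java.time.Duration;

// Configuration fragment: register one Caffeine instance per logical cache,
// each with its own size bound and TTL, and stats recording for Micrometer.
CaffeineCacheManager l1Manager = new CaffeineCacheManager();
l1Manager.registerCustomCache("translationMemory",
        Caffeine.newBuilder().maximumSize(10_000)
                .expireAfterWrite(Duration.ofSeconds(30))
                .recordStats().build());
l1Manager.registerCustomCache("glossary",
        Caffeine.newBuilder().maximumSize(5_000)
                .expireAfterWrite(Duration.ofMinutes(5))
                .recordStats().build());
l1Manager.registerCustomCache("userPreferences",
        Caffeine.newBuilder().maximumSize(2_000)
                .expireAfterWrite(Duration.ofSeconds(60))
                .recordStats().build());
```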

  3. L2 configuration (Redis): Used Spring Data Redis with Lettuce client in cluster mode. Translation memory entries cached with 10-minute TTL. Glossary entries with 30-minute TTL. Serialization via Kryo for compact binary representation, reducing Redis memory usage by roughly 40% compared to JSON serialization.

  4. Write-through with L1 invalidation: On cache writes, data flows to Redis first (source of truth for cache layer), then populates the local Caffeine cache on the writing instance. Other instances pick up changes on their next L1 miss, which falls through to Redis. For critical invalidations (glossary updates), we publish a Redis Pub/Sub message that triggers L1 eviction across all instances.
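The L1 eviction triggered by a Pub/Sub message can be sketched like this. The real listener is registered with Spring Data Redis; here the message payload is assumed to be a `cacheName:key` string, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the cross-instance L1 invalidation handler. Each instance
// subscribes to an invalidation channel and evicts the named entry from
// its local tier when a message arrives.
public class L1InvalidationHandler {
    private final Map<String, Map<String, Object>> l1Caches = new ConcurrentHashMap<>();

    public Map<String, Object> cache(String name) {
        return l1Caches.computeIfAbsent(name, n -> new ConcurrentHashMap<>());
    }

    // Invoked for every message on the invalidation channel.
    // Assumed payload format: "cacheName:key".
    public void onInvalidationMessage(String payload) {
        String[] parts = payload.split(":", 2);
        if (parts.length != 2) return;              // ignore malformed payloads
        Map<String, Object> cache = l1Caches.get(parts[0]);
        if (cache != null) cache.remove(parts[1]);  // evict only the named entry
    }
}
```

Because the message only evicts (never writes), a race simply causes the next read to fall through to Redis, which holds the fresh value.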

  5. Cache warming on startup: Implemented a CacheWarmer component that pre-loads the top 1,000 most-accessed translation memory segments and active glossary entries into both L1 and L2 during service startup. This eliminated the cold-start penalty after deployments.
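The warming step amounts to a write-through loop over a pre-ranked key list. A minimal sketch, again with maps standing in for the two tiers and all names (`CacheWarmer`, `warm`, `dbLoader`) being illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of startup cache warming: load the hottest entries into both
// tiers before the service takes traffic. The real component queries the
// database for the top segments; dbLoader stands in for that query.
public class CacheWarmer {
    private final Map<String, String> l1 = new ConcurrentHashMap<>(); // stand-in for Caffeine
    private final Map<String, String> l2 = new ConcurrentHashMap<>(); // stand-in for Redis

    public int warm(List<String> hotKeys, Function<String, String> dbLoader) {
        int loaded = 0;
        for (String key : hotKeys) {
            String value = dbLoader.apply(key);
            if (value == null) continue;  // entry deleted since the ranking was computed
            l2.put(key, value);           // shared tier first
            l1.put(key, value);           // then the local tier
            loaded++;
        }
        return loaded;
    }

    public boolean isWarm(String key) {
        return l1.containsKey(key);
    }
}
```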

  6. Monitoring dashboard: Exposed cache metrics via Micrometer to CloudWatch: L1 hit rate, L2 hit rate, overall hit rate, eviction counts, and cache size. Set up alerts for L1 hit rate dropping below 60% or overall hit rate below 85%.

Results

  • Overall cache hit rate stabilized at approximately 92%, with L1 (Caffeine) resolving about 75% of all lookups and L2 (Redis) handling another 17%
  • Translation memory lookup endpoint p95 latency dropped from ~45ms to ~3ms for cached segments
  • PostgreSQL query volume for translation memory decreased by roughly 90%, freeing database capacity for write-heavy operations
  • Redis connection count dropped by approximately 65% since most reads are now served from local cache
  • Translation job end-to-end processing time improved by about 15% on average, with larger improvements on jobs with repetitive content
  • Cache warming eliminated the post-deployment latency spike that previously lasted 2-3 minutes while caches repopulated organically
  • Memory overhead of the L1 cache was modest at roughly 150MB per instance, well within the headroom of our ECS task definitions