Implementing Multi-Level Caching with Redis and Local Cache
Context
The translation platform was experiencing high latency on frequently accessed endpoints -- translation memory lookups, glossary fetches, and user preference loading. Database queries were the bottleneck, with the same data being fetched repeatedly across requests. A single caching layer was insufficient due to the varied access patterns and consistency requirements across different data types.
Decision
Implement a two-level caching strategy using Caffeine as the L1 local cache and Redis as the L2 distributed cache, with a cache-aside pattern and configurable TTLs per data type.
Alternatives Considered
Redis-only caching
Pros:
- Single cache layer simplifies architecture
- Consistent view across all service instances
- Well-understood invalidation model
Cons:
- Every cache hit still requires a network round-trip to Redis (~1-2ms)
- Redis becomes a single point of failure for all cached data
- Network latency adds up for high-frequency lookups like translation memory segments
Local cache only (Caffeine/Guava)
Pros:
- Sub-microsecond access times, no network overhead
- No additional infrastructure to manage
- Simple implementation with the Spring Cache abstraction
Cons:
- Each service instance maintains its own cache, leading to inconsistency
- Cache is lost on deployment or restart
- Memory pressure on the application heap
- No sharing between service instances
Two-level cache: Caffeine L1 + Redis L2
Pros:
- Hot data served from local memory with sub-microsecond latency
- Redis provides a shared, durable cache across instances
- Reduces Redis load by 60-70% since most hits resolve at L1
- Graceful degradation -- if Redis is unavailable, L1 still serves
Cons:
- More complex invalidation logic across two layers
- Potential for short-lived inconsistency between L1 caches on different instances
- Requires careful TTL tuning per data type
Reasoning
The translation memory lookup endpoint was called approximately 200 times per translation job, and with hundreds of concurrent jobs, the total QPS on this data was enormous. Redis-only caching reduced database load but the 1-2ms network round-trip per lookup was still adding 200-400ms per job. By placing a Caffeine L1 cache in front of Redis, we eliminated the network hop for the hottest data. The short-lived inconsistency between instances was acceptable for our use case -- translation memory and glossary data changes infrequently, and a 30-second staleness window had no practical impact on translation quality.
Context and Background
The translation platform’s core workflow involves looking up previously translated segments (translation memory) to suggest matches for new content. Each document being translated generates hundreds of segment lookups against the translation memory database. At peak load with enterprise batch jobs, the PostgreSQL database was processing upwards of 15,000 translation memory queries per minute, and response times were degrading.
We had already optimized the database layer with proper indexing and query tuning, but the fundamental problem was that the same translation memory segments were being fetched repeatedly. A customer translating a 200-page technical manual would hit the same glossary terms and recurring phrases thousands of times. The data was highly cacheable — translation memories are append-mostly, and glossary entries change perhaps once a week.
Our initial approach with Redis-only caching improved things significantly, cutting database load by about 70%. But profiling revealed that the Redis network round-trips were now the bottleneck for the translation memory lookup endpoint. Each lookup took 1-2ms to Redis, and with 200+ lookups per job, this added 200-400ms of pure network overhead per translation job. We needed a cache that could serve the hottest data without any network hop.
Implementation
- Cache abstraction layer: Built a custom `MultiLevelCacheManager` implementing Spring's `CacheManager` interface. This allowed existing `@Cacheable` annotations to work transparently with the two-level cache. The manager delegates to a `MultiLevelCache` that checks Caffeine first, then Redis, then the database.
- L1 configuration (Caffeine): Configured Caffeine with per-cache-name settings. Translation memory cache: max 10,000 entries, 30-second TTL. Glossary cache: max 5,000 entries, 5-minute TTL. User preferences: max 2,000 entries, 60-second TTL. Used `recordStats()` for hit rate monitoring via Micrometer.
- L2 configuration (Redis): Used Spring Data Redis with the Lettuce client in cluster mode. Translation memory entries cached with a 10-minute TTL. Glossary entries with a 30-minute TTL. Serialization via Kryo for a compact binary representation, reducing Redis memory usage by roughly 40% compared to JSON serialization.
- Write-through with L1 invalidation: On cache writes, data flows to Redis first (the source of truth for the cache layer), then populates the local Caffeine cache on the writing instance. Other instances pick up changes on their next L1 miss, which falls through to Redis. For critical invalidations (glossary updates), we publish a Redis Pub/Sub message that triggers L1 eviction across all instances.
- Cache warming on startup: Implemented a `CacheWarmer` component that pre-loads the top 1,000 most-accessed translation memory segments and active glossary entries into both L1 and L2 during service startup. This eliminated the cold-start penalty after deployments.
- Monitoring dashboard: Exposed cache metrics via Micrometer to CloudWatch: L1 hit rate, L2 hit rate, overall hit rate, eviction counts, and cache size. Set up alerts for the L1 hit rate dropping below 60% or the overall hit rate below 85%.
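The read path of the `MultiLevelCache` described above can be sketched in plain Java. In-memory maps stand in for Caffeine (L1) and Redis (L2), and a loader function stands in for the database query; the class below is an illustrative simplification under those assumptions, not the production implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the cache-aside read path: L1 (local) -> L2 (shared) -> loader (DB).
// Maps stand in for Caffeine and Redis; TTLs and serialization are omitted.
public class MultiLevelCache<K, V> {
    private final Map<K, V> l1 = new ConcurrentHashMap<>(); // per-instance, e.g. Caffeine
    private final Map<K, V> l2;                             // shared, e.g. Redis
    private final Function<K, V> loader;                    // database fallback

    public MultiLevelCache(Map<K, V> l2, Function<K, V> loader) {
        this.l2 = l2;
        this.loader = loader;
    }

    public V get(K key) {
        V value = l1.get(key);
        if (value != null) return value;          // L1 hit: no network hop
        value = l2.get(key);
        if (value != null) {
            l1.put(key, value);                   // promote to L1 for subsequent reads
            return value;
        }
        value = loader.apply(key);                // miss on both levels: load from DB
        l2.put(key, value);                       // populate L2 first (shared source of truth)
        l1.put(key, value);                       // then the local L1
        return value;
    }

    public void evictLocal(K key) { l1.remove(key); } // hook used by Pub/Sub invalidation
}
```

A second `get` for the same key resolves at L1 without touching L2 or the loader, which is the behavior that removed the Redis round-trip for hot translation memory segments.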
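The per-cache-name policy in the L1 bullet maps onto Caffeine's `maximumSize`, `expireAfterWrite`, and `recordStats` builder settings. The dependency-free sketch below reproduces just the policy (bounded size with LRU-style eviction, write-time expiry) so the behavior is concrete; `BoundedTtlCache` is a hypothetical stand-in, not Caffeine itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Caffeine's maximumSize + expireAfterWrite, illustrating the
// per-cache-name settings (e.g. translation memory: 10,000 entries / 30s TTL).
public class BoundedTtlCache<K, V> {
    private record CacheEntry<V>(V value, long expiresAtMillis) {}

    private final int maxEntries;
    private final long ttlMillis;
    // access-order LinkedHashMap gives simple LRU-style eviction at capacity
    private final LinkedHashMap<K, CacheEntry<V>> store;

    public BoundedTtlCache(int maxEntries, long ttlMillis) {
        this.maxEntries = maxEntries;
        this.ttlMillis = ttlMillis;
        this.store = new LinkedHashMap<>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, CacheEntry<V>> eldest) {
                return size() > BoundedTtlCache.this.maxEntries;
            }
        };
    }

    public synchronized void put(K key, V value) {
        store.put(key, new CacheEntry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    public synchronized V get(K key) {
        CacheEntry<V> e = store.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() >= e.expiresAtMillis()) { // expired: treat as miss
            store.remove(key);
            return null;
        }
        return e.value();
    }
}
```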
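The Pub/Sub invalidation flow for glossary updates can be illustrated with an in-process subscriber list standing in for the Redis channel; in production each instance would subscribe via Spring Data Redis and evict the named key from its own Caffeine cache. `InvalidationBus` is a hypothetical name for this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the Pub/Sub invalidation step. The subscriber list stands in for the
// Redis channel; each registered map stands in for one instance's L1 cache.
public class InvalidationBus {
    private final List<Map<String, ?>> l1Caches = new CopyOnWriteArrayList<>();

    // Equivalent of an instance subscribing to the invalidation channel.
    public void register(Map<String, ?> l1) { l1Caches.add(l1); }

    // Equivalent of publishing an "evict" message on the channel.
    public void publishEvict(String key) {
        for (Map<String, ?> l1 : l1Caches) {
            l1.remove(key); // each instance's subscriber evicts from its own L1
        }
    }
}
```

After `publishEvict`, the next read on any instance misses L1 and falls through to Redis, which is how all instances converge on the updated glossary entry.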
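The cache-warming step amounts to iterating the hottest keys and populating both levels, preferring L2 when another instance has already warmed it. A minimal sketch, with maps standing in for the two levels and a loader for the database; the real `CacheWarmer` would pull its key list from access statistics rather than take it as a parameter.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of startup warming: pre-load the hottest keys into both cache levels so
// the first requests after a deployment hit warm caches instead of the database.
public class CacheWarmer {
    public static <K, V> int warm(List<K> topKeys, Function<K, V> loader,
                                  Map<K, V> l1, Map<K, V> l2) {
        int warmed = 0;
        for (K key : topKeys) {
            V value = l2.get(key);            // prefer L2 if another instance warmed it
            if (value == null) {
                value = loader.apply(key);    // otherwise load once from the database
                l2.put(key, value);
            }
            l1.put(key, value);               // always populate the local L1
            warmed++;
        }
        return warmed;
    }
}
```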
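The alert thresholds from the monitoring bullet reduce to simple hit-rate arithmetic over three counters. The sketch below shows that computation, with plain fields standing in for the Micrometer meters; `CacheStats` is an illustrative name.

```java
// Sketch of the hit-rate arithmetic behind the dashboard alerts. L1 and L2 hits
// are counted separately; the thresholds (L1 < 60%, overall < 85%) match the
// alerting rules described above. Plain fields stand in for Micrometer meters.
public class CacheStats {
    long l1Hits, l2Hits, misses;

    long total() { return l1Hits + l2Hits + misses; }

    double l1HitRate() { return total() == 0 ? 0 : (double) l1Hits / total(); }

    double overallHitRate() { return total() == 0 ? 0 : (double) (l1Hits + l2Hits) / total(); }

    boolean shouldAlert() {
        return l1HitRate() < 0.60 || overallHitRate() < 0.85;
    }
}
```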
Results
- Overall cache hit rate stabilized at approximately 92%, with L1 (Caffeine) resolving about 75% of all lookups and L2 (Redis) handling another 17%
- Translation memory lookup endpoint p95 latency dropped from ~45ms to ~3ms for cached segments
- PostgreSQL query volume for translation memory decreased by roughly 90%, freeing database capacity for write-heavy operations
- Redis connection count dropped by approximately 65% since most reads are now served from local cache
- Translation job end-to-end processing time improved by about 15% on average, with larger improvements on jobs with repetitive content
- Cache warming eliminated the post-deployment latency spike that previously lasted 2-3 minutes while caches repopulated organically
- Memory overhead of the L1 cache was modest at roughly 150MB per instance, well within the headroom of our ECS task definitions