Optimizing Distributed Systems: From Heap Leaks to Multi-Level Caching

Introduction

Performance optimization in distributed systems is fundamentally different from optimizing a single application. When your request path crosses network boundaries, hits multiple caches, queries several databases, and aggregates results from downstream services, identifying the actual bottleneck requires a systematic approach. Intuition is unreliable because distributed systems are full of surprising interactions.

This article covers two categories of optimization I have worked on extensively: diagnosing memory issues (heap leaks, GC pressure, excessive object allocation) and implementing multi-level caching strategies. Both are common in Java microservices architectures, and both have a disproportionate impact on system health.

Diagnosing Heap Issues in Production

The first sign of a memory problem is usually not an OutOfMemoryError. It is increasing p99 latency caused by longer and more frequent garbage collection pauses. By the time you see OOM errors, the problem has been building for days or weeks.
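A cheap early-warning signal is the JVM's own GC log, which makes the pause-time trend visible long before an OOM. The flags below are one way to enable unified GC logging with rotation on Java 9+ (the path and rotation sizes are illustrative, not our production values):

```shell
# Unified GC logging with rotation (Java 9+); log path and sizes are examples
java -Xlog:gc*:file=/var/log/app/gc.log:time,uptime,level,tags:filecount=5,filesize=20m \
     -jar app.jar
```

Plotting pause durations from this log over a week is often enough to confirm the "slowly worsening GC" pattern described above.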

Continuous profiling is the single most valuable tool for catching memory issues before they become incidents. I use async-profiler in production with low overhead to capture allocation profiles:

# Attach to running JVM and sample allocations for 60 seconds
./asprof -d 60 -e alloc --alloc 512k -f /tmp/alloc-profile.jfr <PID>

The --alloc 512k flag samples every 512KB of allocation, which keeps overhead under 2% while still capturing meaningful data. The output shows which code paths are allocating the most memory, which is often not where you expect.

In one case, our translation memory service had a slow memory leak that manifested as full GC pauses increasing from 50ms to 800ms over a week. The allocation profile revealed the culprit: a caching layer that stored entire TranslationMemory objects including their full revision history, when only the current translation was needed.

// Before: Caching entire entity with lazy-loaded collections
@Cacheable(value = "translationMemory", key = "#sourceHash")
public TranslationMemory findBySourceHash(String sourceHash) {
    TranslationMemory tm = repository.findBySourceHash(sourceHash);
    // This triggers lazy loading of the entire revision history
    tm.getRevisions().size(); // force initialization for cache serialization
    return tm;
}

// After: Cache only the projection needed
@Cacheable(value = "translationMemory", key = "#sourceHash")
public TranslationMemoryEntry findBySourceHash(String sourceHash) {
    return repository.findProjectedBySourceHash(sourceHash);
}

// Lean projection record
public record TranslationMemoryEntry(
    String sourceHash,
    String sourceText,
    String translatedText,
    String targetLanguage,
    float confidenceScore,
    Instant lastUsed
) {}

This change reduced the average cached object size from 12KB to 280 bytes, which cut memory usage by 97% for that cache and eliminated the GC pressure entirely.
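For completeness, the repository method backing that projection could be written with a JPQL constructor expression, so only the six needed columns are ever fetched. The sketch below is a plausible shape, not our actual repository; the package name and `@Param` wiring are assumptions:

```java
public interface TranslationMemoryRepository extends JpaRepository<TranslationMemory, Long> {

    // Constructor expression: selects only the columns the record needs,
    // so the revision history collection is never initialized
    @Query("""
        select new com.example.tm.TranslationMemoryEntry(
            t.sourceHash, t.sourceText, t.translatedText,
            t.targetLanguage, t.confidenceScore, t.lastUsed)
        from TranslationMemory t
        where t.sourceHash = :sourceHash
        """)
    TranslationMemoryEntry findProjectedBySourceHash(@Param("sourceHash") String sourceHash);
}
```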

Heap dump analysis is necessary when continuous profiling does not reveal the issue. I take heap dumps with:

# Trigger heap dump without stopping the JVM
jcmd <PID> GC.heap_dump /tmp/heapdump.hprof

Then analyze with Eclipse MAT (Memory Analyzer Tool). The most useful report is the Leak Suspects report, which identifies the largest retained object trees. In distributed systems, the most common leaks I encounter are:

  1. Unbounded caches - Maps used as caches without eviction policies that grow indefinitely
  2. Connection pool exhaustion - HTTP clients or database connections not being returned to the pool
  3. Listener/callback accumulation - Event listeners registered but never deregistered
  4. Large byte buffers - Response bodies from HTTP clients held in memory longer than necessary

The first of these is the pattern I see most often: a plain ConcurrentHashMap pressed into service as a cache, with nothing ever evicted.

// Common leak pattern: unbounded concurrent map used as cache
private final Map<String, TranslationResult> recentResults = new ConcurrentHashMap<>();

public TranslationResult translate(String key, String text) {
    return recentResults.computeIfAbsent(key, k -> doTranslate(text));
    // Never evicts! Grows forever.
}

// Fix: Use Caffeine with bounded size and TTL
private final Cache<String, TranslationResult> recentResults = Caffeine.newBuilder()
    .maximumSize(50_000)
    .expireAfterWrite(Duration.ofHours(1))
    .recordStats()
    .build();

public TranslationResult translate(String key, String text) {
    return recentResults.get(key, k -> doTranslate(text));
}

Multi-Level Caching with Caffeine and Redis

The most impactful performance optimization in our system was implementing multi-level caching: a local Caffeine cache (L1) backed by a distributed Redis cache (L2), with the database as the final source of truth. This pattern reduced our database query load by 85% and cut p50 latency from 45ms to 3ms for cached entries.

The principle is simple: local cache is fast (sub-millisecond) but limited in size and not shared across instances. Redis is slower (1-3ms network round trip) but shared and larger. The database is the slowest but always has the authoritative data.
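These tiers compose into a simple expected-latency model: a hit-rate-weighted average across the three levels. The sketch below plugs in illustrative numbers (an 80% L1 hit rate at 0.1ms, 15% L2 at 2ms, 5% misses at 45ms); the rates and timings are assumptions for the arithmetic, not measurements from our system:

```java
public class CacheLatencyModel {

    // Expected lookup latency = sum over tiers of (hit rate * tier latency)
    static double expectedLatencyMs(double l1Rate, double l1Ms,
                                    double l2Rate, double l2Ms,
                                    double missRate, double dbMs) {
        return l1Rate * l1Ms + l2Rate * l2Ms + missRate * dbMs;
    }

    public static void main(String[] args) {
        // 0.80*0.1 + 0.15*2.0 + 0.05*45.0 = 2.63 ms expected latency
        System.out.println(expectedLatencyMs(0.80, 0.1, 0.15, 2.0, 0.05, 45.0));
    }
}
```

The takeaway from the model: the database miss path dominates the average even at a 5% miss rate, which is why pushing the hit rate up at L1 and L2 pays off so disproportionately.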

@Service
@Slf4j
public class MultiLevelCacheService<K, V> {

    private final Cache<K, V> localCache;
    private final RedisTemplate<String, V> redisTemplate;
    private final String cachePrefix;
    private final Duration redisTtl;
    private final MeterRegistry meterRegistry;

    public MultiLevelCacheService(String cacheName, int localMaxSize,
                                   Duration localTtl, Duration redisTtl,
                                   RedisTemplate<String, V> redisTemplate,
                                   MeterRegistry meterRegistry) {
        this.cachePrefix = cacheName + ":";
        this.redisTtl = redisTtl;
        this.redisTemplate = redisTemplate;
        this.meterRegistry = meterRegistry;

        this.localCache = Caffeine.newBuilder()
            .maximumSize(localMaxSize)
            .expireAfterWrite(localTtl)
            .recordStats()
            .build();
    }

    public V get(K key, Function<K, V> loader) {
        // L1: Local cache check
        V value = localCache.getIfPresent(key);
        if (value != null) {
            meterRegistry.counter("cache.hit", "level", "local", "cache", cachePrefix).increment();
            return value;
        }

        // L2: Redis cache check
        String redisKey = cachePrefix + key.toString();
        value = redisTemplate.opsForValue().get(redisKey);
        if (value != null) {
            meterRegistry.counter("cache.hit", "level", "redis", "cache", cachePrefix).increment();
            localCache.put(key, value);
            return value;
        }

        // L3: Load from source
        meterRegistry.counter("cache.miss", "cache", cachePrefix).increment();
        value = loader.apply(key);
        if (value != null) {
            localCache.put(key, value);
            redisTemplate.opsForValue().set(redisKey, value, redisTtl);
        }
        return value;
    }

    public void evict(K key) {
        localCache.invalidate(key);
        redisTemplate.delete(cachePrefix + key.toString());
    }

    public void evictLocal(K key) {
        localCache.invalidate(key);
    }
}
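Wiring one of these up is then a matter of constructing it with per-cache parameters. The bean below is a hypothetical configuration, showing how a translation-memory lookup might use the service (the sizes and TTLs are illustrative):

```java
@Configuration
public class TranslationCacheConfig {

    @Bean
    public MultiLevelCacheService<String, TranslationMemoryEntry> tmCache(
            RedisTemplate<String, TranslationMemoryEntry> redisTemplate,
            MeterRegistry meterRegistry) {
        // Local tier: small and short-lived; Redis tier: larger and longer-lived
        return new MultiLevelCacheService<>(
            "translationMemory", 10_000,
            Duration.ofMinutes(5), Duration.ofHours(6),
            redisTemplate, meterRegistry);
    }
}

// Call site: falls through L1 -> L2 -> database
TranslationMemoryEntry entry =
    tmCache.get(sourceHash, h -> repository.findProjectedBySourceHash(h));
```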

Cache invalidation is the hard part. In a multi-instance deployment, when one instance updates data, all instances need to invalidate their local caches. I use Kafka for cache invalidation events:

@Component
@Slf4j
@RequiredArgsConstructor
public class CacheInvalidationListener {

    private final Map<String, MultiLevelCacheService<?, ?>> caches;

    @KafkaListener(topics = "cache.invalidation", groupId = "${spring.application.name}-${random.uuid}")
    public void onInvalidation(CacheInvalidationEvent event) {
        MultiLevelCacheService<?, ?> cache = caches.get(event.cacheName());
        if (cache != null) {
            log.debug("Invalidating local cache {} key {}", event.cacheName(), event.key());
            cache.evictLocal(event.key());
        }
    }
}

@Component
@RequiredArgsConstructor
public class CacheInvalidationPublisher {

    private final KafkaTemplate<String, CacheInvalidationEvent> kafkaTemplate;

    public void publishInvalidation(String cacheName, String key) {
        kafkaTemplate.send("cache.invalidation",
            new CacheInvalidationEvent(cacheName, key, Instant.now()));
    }
}

The Kafka consumer group ID includes a random UUID so that every instance receives every invalidation event (broadcast behavior). The Redis cache is the authoritative distributed cache, so even if a local invalidation message is delayed, the worst case is serving slightly stale data from the local cache until its TTL expires.
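The event itself can be a plain record. This is a sketch of the shape the listener and publisher above assume, inferred from the accessors they call; the name of the timestamp field is my assumption:

```java
import java.time.Instant;

// Fields inferred from event.cacheName(), event.key(), and the
// Instant.now() argument in the publisher; publishedAt is an assumed name
public record CacheInvalidationEvent(String cacheName, String key, Instant publishedAt) {}
```

Whatever the exact shape, it needs to round-trip through the Kafka serializer configured for the topic (for example, Jackson JSON), so keep the fields to simple, serializable types.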

Profiling Request Paths Across Services

For cross-service performance analysis, distributed tracing is essential but not sufficient. Traces show you where time is spent, but they do not tell you why. I combine traces with targeted profiling to identify the root cause.

Latency breakdown analysis starts with identifying which service in the request path contributes the most latency:

@Aspect
@Component
@Slf4j
@RequiredArgsConstructor
public class LatencyBreakdownAspect {

    private final MeterRegistry meterRegistry;

    @Around("@annotation(io.micrometer.core.annotation.Timed)")
    public Object measureLatency(ProceedingJoinPoint joinPoint) throws Throwable {
        String operationName = joinPoint.getSignature().toShortString();
        Timer.Sample sample = Timer.start(meterRegistry);

        try {
            Object result = joinPoint.proceed();
            sample.stop(Timer.builder("operation.latency")
                .tag("operation", operationName)
                .tag("outcome", "success")
                .register(meterRegistry));
            return result;
        } catch (Throwable t) {
            sample.stop(Timer.builder("operation.latency")
                .tag("operation", operationName)
                .tag("outcome", "error")
                .register(meterRegistry));
            throw t;
        }
    }
}

Once you identify the slow operation, method-level profiling reveals the hotspot. I use JFR (Java Flight Recorder) for this because it has negligible overhead and produces detailed execution profiles:

@Component
public class ProfilingService {

    public void startProfiling(Duration duration, Path outputPath) throws Exception {
        Configuration config = Configuration.getConfiguration("profile");

        try (Recording recording = new Recording(config)) {
            recording.setMaxAge(duration);
            recording.setDestination(outputPath);
            recording.start();

            Thread.sleep(duration.toMillis());

            recording.stop();
        }
    }
}

The same recording can be triggered from the command line:

# Start a 5-minute JFR recording in production
jcmd <PID> JFR.start duration=300s filename=/tmp/recording.jfr settings=profile

Connection Pool and Thread Pool Sizing

The most common performance anti-pattern I see is oversized thread pools and connection pools. More threads and connections do not mean better performance. Past a certain point, they make things worse by increasing context switching, lock contention, and database load.

For database connection pools, I follow the well-known HikariCP guideline: optimal pool size = (core_count * 2) + effective_spindle_count. SSDs have no spindles, so this simplifies to roughly core_count * 2. For a 4-core ECS task, that means 8-10 connections:

spring:
  datasource:
    hikari:
      maximum-pool-size: 10
      minimum-idle: 5
      connection-timeout: 5000
      validation-timeout: 3000
      leak-detection-threshold: 30000

For HTTP client connection pools, the sizing depends on the downstream service’s capacity:

@Bean
public RestClient translationEngineClient() {
    var connectionManager = PoolingHttpClientConnectionManagerBuilder.create()
        .setMaxConnTotal(100)
        .setMaxConnPerRoute(50)
        .setDefaultConnectionConfig(ConnectionConfig.custom()
            .setConnectTimeout(Timeout.ofSeconds(3))
            .setSocketTimeout(Timeout.ofSeconds(10))
            .build())
        .build();

    var httpClient = HttpClients.custom()
        .setConnectionManager(connectionManager)
        .setKeepAliveStrategy((response, context) -> TimeValue.ofSeconds(30))
        .build();

    return RestClient.builder()
        .requestFactory(new HttpComponentsClientHttpRequestFactory(httpClient))
        .baseUrl(translationEngineBaseUrl)
        .build();
}
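The same discipline applies to worker thread pools. A common starting point is the sizing rule from Java Concurrency in Practice: threads = cores * target_utilization * (1 + wait_time / compute_time). The sketch below computes it for illustrative numbers (a 4-core task whose requests spend 30ms waiting on I/O for every 10ms of CPU); treat the result as a starting point to tune under load, not a fixed answer:

```java
public class ThreadPoolSizer {

    // N_threads = N_cpu * U_cpu * (1 + W/C), where W/C is the wait-to-compute ratio
    static int optimalPoolSize(int cores, double targetUtilization,
                               double waitMs, double computeMs) {
        return (int) Math.ceil(cores * targetUtilization * (1 + waitMs / computeMs));
    }

    public static void main(String[] args) {
        // 4 cores, full utilization, 30ms I/O wait per 10ms of CPU -> 16 threads
        System.out.println(optimalPoolSize(4, 1.0, 30, 10));
    }
}
```

Note how quickly the number grows with the wait-to-compute ratio: I/O-bound workloads justify more threads than cores, while CPU-bound workloads collapse back to roughly core_count.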

Key Takeaways

Performance optimization in distributed systems requires systematic measurement, not intuition. Start with distributed tracing to identify which service or operation dominates your latency. Use continuous allocation profiling to catch memory issues early. Implement multi-level caching to eliminate redundant work across the request path.

The highest-impact optimizations are often surprisingly simple: caching the right data at the right layer, sizing connection pools correctly, and fixing memory leaks caused by unbounded collections. Exotic optimizations like custom serialization or lock-free data structures are rarely the answer. The basics, done well, get you 90% of the way there.