Introduction
Every non-trivial application depends on third-party APIs. Payment processors, translation engines, notification services, identity providers, cloud storage — the list grows with every feature. Each integration is a liability: a third-party outage becomes your outage unless you design for it. A rate limit change breaks your workflows. An API version deprecation becomes an emergency.
After integrating with dozens of third-party APIs across multiple projects, I have developed a consistent set of patterns for making these integrations safe, observable, and maintainable. This article walks through each pattern with Spring Boot implementation details.
The SDK Wrapper Pattern
Never let third-party API details leak into your business logic. Every external integration gets a wrapper service that translates between the external API’s model and your internal domain model. This is not just clean architecture; it is a survival strategy.
When a vendor changes their API, deprecates endpoints, or you need to swap vendors entirely, the blast radius is contained to a single wrapper class. Your business logic, tests, and other services are unaffected.
// Internal domain model - your model, your rules
public record TranslationResult(
        String translatedText,
        String sourceLanguage,
        String targetLanguage,
        float confidenceScore,
        Duration processingTime
) {}

// Wrapper interface - stable contract for your application
public interface TranslationEngine {
    TranslationResult translate(String text, String sourceLang, String targetLang);
    List<TranslationResult> translateBatch(List<String> texts, String sourceLang, String targetLang);
    boolean supportsLanguagePair(String sourceLang, String targetLang);
}
// Vendor-specific implementation
@Service
@Slf4j
@RequiredArgsConstructor
public class VendorATranslationEngine implements TranslationEngine {

    private final VendorAClient client;
    private final MeterRegistry meterRegistry;

    @Override
    public TranslationResult translate(String text, String sourceLang, String targetLang) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            // Vendor-specific request format
            VendorARequest request = VendorARequest.builder()
                    .content(text)
                    .from(mapToVendorLanguageCode(sourceLang))
                    .to(mapToVendorLanguageCode(targetLang))
                    .model("nmt-v3")
                    .build();

            VendorAResponse response = client.translate(request);

            sample.stop(Timer.builder("translation.engine.duration")
                    .tag("vendor", "vendor-a")
                    .tag("outcome", "success")
                    .register(meterRegistry));

            // Map to internal model
            return new TranslationResult(
                    response.getTranslation().getText(),
                    sourceLang,
                    targetLang,
                    response.getTranslation().getScore(),
                    Duration.ofMillis(response.getMetadata().getProcessingTimeMs())
            );
        } catch (VendorAException e) {
            sample.stop(Timer.builder("translation.engine.duration")
                    .tag("vendor", "vendor-a")
                    .tag("outcome", "error")
                    .register(meterRegistry));
            throw mapToInternalException(e);
        }
    }

    // translateBatch and supportsLanguagePair follow the same
    // translate-map-instrument shape and are omitted for brevity.

    private String mapToVendorLanguageCode(String isoCode) {
        // Vendor uses non-standard codes for some languages
        return switch (isoCode) {
            case "zh-CN" -> "chi_sim";
            case "zh-TW" -> "chi_tra";
            case "pt-BR" -> "por_bra";
            default -> isoCode;
        };
    }

    private TranslationEngineException mapToInternalException(VendorAException e) {
        return switch (e.getStatusCode()) {
            case 429 -> new RateLimitedException("Translation engine rate limited", e);
            case 503 -> new ServiceUnavailableException("Translation engine unavailable", e);
            case 400 -> new InvalidRequestException("Bad translation request: " + e.getMessage(), e);
            default -> new TranslationEngineException("Translation failed: " + e.getMessage(), e);
        };
    }
}
When it is time to add a second vendor or swap providers, you implement the same interface:
@Service
@ConditionalOnProperty(name = "translation.engine", havingValue = "vendor-b")
public class VendorBTranslationEngine implements TranslationEngine {
    // Completely different vendor API, same internal interface
}
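The wrapper interface also pays off in tests: business logic can be exercised against a fake engine with no vendor SDK on the classpath. A self-contained sketch of the idea — the types are trimmed to a single method here, and the fake's bracket-prefix behavior is invented purely for illustration:

```java
import java.time.Duration;

// Minimal copies of the article's types so this sketch compiles standalone.
record TranslationResult(String translatedText, String sourceLanguage,
                         String targetLanguage, float confidenceScore,
                         Duration processingTime) {}

interface TranslationEngine {
    TranslationResult translate(String text, String sourceLang, String targetLang);
}

// A fake engine: callers depend only on the interface, never on vendor classes.
class FakeTranslationEngine implements TranslationEngine {
    @Override
    public TranslationResult translate(String text, String sourceLang, String targetLang) {
        // Deterministic stand-in behavior, convenient for assertions.
        return new TranslationResult("[" + targetLang + "] " + text,
                sourceLang, targetLang, 1.0f, Duration.ZERO);
    }
}

public class WrapperSketch {
    public static void main(String[] args) {
        TranslationEngine engine = new FakeTranslationEngine();
        TranslationResult result = engine.translate("hello", "en", "de");
        System.out.println(result.translatedText()); // prints "[de] hello"
    }
}
```

Because nothing above mentions VendorAClient, swapping vendors later cannot break these tests.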
Retry Strategies That Actually Work
Not all failures are retryable, and not all retryable failures should use the same strategy. I categorize failures into three buckets:
- Retryable transient errors (network timeouts, 502/503/504): Retry with exponential backoff and jitter
- Retryable rate limit errors (429): Retry after the delay given in the Retry-After header if present, otherwise use exponential backoff with a longer initial delay
- Non-retryable errors (400, 401, 403, 404): Fail immediately; do not waste resources retrying
@Configuration
public class RetryConfig {

    @Bean
    public Retry translationEngineRetry() {
        return Retry.of("translationEngine", io.github.resilience4j.retry.RetryConfig.custom()
                .maxAttempts(3)
                .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
                        Duration.ofMillis(500),  // initial interval
                        2.0,                     // multiplier
                        Duration.ofSeconds(10)   // max interval
                ))
                .retryOnException(this::isRetryable)
                .failAfterMaxAttempts(true)
                .build());
    }

    private boolean isRetryable(Throwable throwable) {
        if (throwable instanceof RateLimitedException) return true;
        if (throwable instanceof ServiceUnavailableException) return true;
        if (throwable instanceof java.net.SocketTimeoutException) return true;
        if (throwable instanceof java.net.ConnectException) return true;
        // Client errors are never retryable
        if (throwable instanceof InvalidRequestException) return false;
        if (throwable instanceof AuthenticationException) return false;
        return false;
    }
}
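To make the backoff schedule concrete, here is a stdlib-only sketch of exponential backoff with full jitter, mirroring the 500 ms initial interval, 2.0 multiplier, and 10 s cap configured above. Note the hedge: resilience4j's ofExponentialRandomBackoff randomizes around the exponential interval by a randomization factor rather than sampling from zero, so this helper illustrates the shape of the schedule, not the library's exact arithmetic:

```java
import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {
    // Delay before retry attempt n (1-based), with full jitter:
    // pick uniformly from [0, min(initial * multiplier^(n-1), max)].
    static long backoffMillis(int attempt, long initialMillis, double multiplier, long maxMillis) {
        double exponential = initialMillis * Math.pow(multiplier, attempt - 1);
        long ceiling = (long) Math.min(exponential, maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        // Ceilings for the three attempts configured in RetryConfig: 500, 1000, 2000 ms.
        for (int attempt = 1; attempt <= 3; attempt++) {
            long ceiling = (long) Math.min(500 * Math.pow(2.0, attempt - 1), 10_000);
            System.out.printf("attempt %d: waiting up to %d ms%n", attempt, ceiling);
        }
    }
}
```

Jitter matters more than the exact distribution: without it, every client that failed at the same moment retries at the same moment, producing synchronized thundering herds against an already struggling API.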
For rate-limited APIs, I implement a proactive rate limiter on our side to avoid hitting the vendor’s limits in the first place:
@Service
public class RateLimitedTranslationEngine implements TranslationEngine {

    private final TranslationEngine delegate;
    private final RateLimiter rateLimiter;

    public RateLimitedTranslationEngine(
            @Qualifier("vendorATranslationEngine") TranslationEngine delegate) {
        this.delegate = delegate;
        this.rateLimiter = RateLimiter.of("translationEngine",
                RateLimiterConfig.custom()
                        .limitForPeriod(100)                       // 100 requests
                        .limitRefreshPeriod(Duration.ofSeconds(1)) // per second
                        .timeoutDuration(Duration.ofSeconds(5))    // wait up to 5s for a permit
                        .build());
    }

    @Override
    public TranslationResult translate(String text, String sourceLang, String targetLang) {
        return RateLimiter.decorateSupplier(rateLimiter,
                () -> delegate.translate(text, sourceLang, targetLang)).get();
    }

    @Override
    public List<TranslationResult> translateBatch(List<String> texts, String sourceLang, String targetLang) {
        return RateLimiter.decorateSupplier(rateLimiter,
                () -> delegate.translateBatch(texts, sourceLang, targetLang)).get();
    }

    @Override
    public boolean supportsLanguagePair(String sourceLang, String targetLang) {
        // Local lookup, no vendor call: no permit needed
        return delegate.supportsLanguagePair(sourceLang, targetLang);
    }
}
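Under the hood, the permit accounting behind limitForPeriod and limitRefreshPeriod can be pictured with a few lines of stdlib Java. This is a deliberately simplified fixed-window sketch; resilience4j's RateLimiter additionally parks waiting threads up to timeoutDuration and handles refresh atomically:

```java
import java.time.Duration;

public class WindowLimiterSketch {
    private final int limitForPeriod;
    private final long refreshPeriodNanos;
    private long windowStart;
    private int permitsUsed;

    public WindowLimiterSketch(int limitForPeriod, Duration refreshPeriod) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriod.toNanos();
        this.windowStart = System.nanoTime();
    }

    // Try to take one permit; returns false if the current window is exhausted.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        if (now - windowStart >= refreshPeriodNanos) {
            windowStart = now;   // new window: reset the counter
            permitsUsed = 0;
        }
        if (permitsUsed < limitForPeriod) {
            permitsUsed++;
            return true;
        }
        return false;
    }
}
```

The proactive version of this idea is the point: by refusing (or delaying) the 101st request yourself, you never see the vendor's 429 at all, which keeps your retry budget free for genuinely transient failures.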
Circuit Breakers with Fallback Chains
Circuit breakers prevent cascading failures when a third-party API goes down. The key design decision is what to do when the circuit is open. The answer depends on your business requirements.
I use a fallback chain pattern: a prioritized list of alternatives that the system tries in order when the primary integration fails:
@Service
@Slf4j
@RequiredArgsConstructor
public class ResilientTranslationService {

    private final TranslationEngine primaryEngine;
    private final TranslationEngine fallbackEngine;
    private final TranslationCacheService cacheService;
    private final CircuitBreaker primaryCircuitBreaker;
    private final CircuitBreaker fallbackCircuitBreaker;

    public TranslationResult translate(String text, String sourceLang, String targetLang) {
        // Strategy 1: Try primary engine with circuit breaker
        try {
            return CircuitBreaker.decorateSupplier(primaryCircuitBreaker,
                    () -> primaryEngine.translate(text, sourceLang, targetLang)).get();
        } catch (Exception primaryEx) {
            log.warn("Primary translation engine failed, trying fallback", primaryEx);
        }

        // Strategy 2: Try fallback engine with its own circuit breaker
        try {
            TranslationResult result = CircuitBreaker.decorateSupplier(fallbackCircuitBreaker,
                    () -> fallbackEngine.translate(text, sourceLang, targetLang)).get();
            log.info("Fallback engine succeeded for {} -> {}", sourceLang, targetLang);
            return result;
        } catch (Exception fallbackEx) {
            log.warn("Fallback translation engine also failed, trying cache", fallbackEx);
        }

        // Strategy 3: Serve from cache if available
        return cacheService.getCachedTranslation(text, sourceLang, targetLang)
                .orElseThrow(() -> new ServiceDegradedException(
                        "All translation engines are unavailable and no cached result exists"));
    }
}
Each engine has its own circuit breaker because their failure modes are independent. The primary engine might be down while the fallback is healthy, and vice versa.
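For intuition about what those circuit breakers are doing, the core state machine fits in a small stdlib sketch: CLOSED counts consecutive failures, OPEN rejects calls until a cool-down elapses, then a single trial call decides between re-closing and re-opening. Resilience4j layers sliding windows, half-open permit counts, and metrics on top of this idea, and the thresholds below are made up for illustration:

```java
import java.util.function.Supplier;

public class BreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openDurationMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    public BreakerSketch(int failureThreshold, long openDurationMillis) {
        this.failureThreshold = failureThreshold;
        this.openDurationMillis = openDurationMillis;
    }

    public synchronized <T> T call(Supplier<T> supplier) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openDurationMillis) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            state = State.HALF_OPEN;   // cool-down elapsed: allow one trial call
        }
        try {
            T result = supplier.get();
            state = State.CLOSED;      // success closes the circuit
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;    // trip: fail fast until the cool-down passes
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }

    public synchronized State state() { return state; }
}
```

The fail-fast branch is what protects you: once the breaker is open, threads stop queuing behind a dead vendor and the fallback chain takes over immediately instead of after a timeout.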
Timeout Configuration and HTTP Client Setup
Timeouts are your most important resilience mechanism. Every external call needs three timeouts:
- Connection timeout: How long to wait for a TCP connection (short: 3-5 seconds)
- Read/socket timeout: How long to wait for a response after connecting (depends on the API: 5-30 seconds)
- Overall request timeout: Maximum wall-clock time for the entire operation including retries (application-specific)
@Configuration
public class HttpClientConfig {

    @Bean("translationEngineHttpClient")
    public RestClient translationEngineHttpClient(
            @Value("${translation.engine.base-url}") String baseUrl,
            @Value("${translation.engine.api-key}") String apiKey) {

        var connectionManager = PoolingHttpClientConnectionManagerBuilder.create()
                .setMaxConnTotal(50)
                .setMaxConnPerRoute(25)
                .setDefaultConnectionConfig(ConnectionConfig.custom()
                        .setConnectTimeout(Timeout.ofSeconds(3))
                        .setSocketTimeout(Timeout.ofSeconds(15))
                        .build())
                .build();

        var httpClient = HttpClients.custom()
                .setConnectionManager(connectionManager)
                .setDefaultRequestConfig(RequestConfig.custom()
                        .setResponseTimeout(Timeout.ofSeconds(15))
                        .build())
                .evictIdleConnections(TimeValue.ofSeconds(30))
                .build();

        return RestClient.builder()
                .requestFactory(new HttpComponentsClientHttpRequestFactory(httpClient))
                .baseUrl(baseUrl)
                .defaultHeader("Authorization", "Bearer " + apiKey)
                .defaultHeader("Content-Type", "application/json")
                .defaultHeader("Accept", "application/json")
                .defaultHeader("User-Agent", "TranslationPlatform/1.0")
                .defaultStatusHandler(
                        status -> status.is4xxClientError() || status.is5xxServerError(),
                        (request, response) -> {
                            String body = new String(response.getBody().readAllBytes());
                            throw new TranslationEngineException(
                                    "API error %d: %s".formatted(response.getStatusCode().value(), body));
                        })
                .build();
    }
}
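The config above covers the first two timeouts in the list; the third, an overall wall-clock deadline spanning retries, has to be enforced one level up, around the whole operation. A stdlib sketch of that idea using CompletableFuture.orTimeout, where slowTranslateCall is a made-up stand-in for the retrying client call:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class DeadlineSketch {
    // Enforce a wall-clock budget around the whole operation, retries included.
    // Throws CompletionException (caused by TimeoutException) if the budget is exceeded.
    static String translateWithDeadline(long budgetMillis) {
        return CompletableFuture
                .supplyAsync(DeadlineSketch::slowTranslateCall)  // stand-in for client call + retries
                .orTimeout(budgetMillis, TimeUnit.MILLISECONDS)
                .join();
    }

    private static String slowTranslateCall() {
        return "hallo";  // pretend vendor response
    }
}
```

Without this outer bound, three retries with 15 s read timeouts can legally hold a request thread for 45 seconds or more, which is usually far past what any caller is still waiting for.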
Connection pool management is often overlooked. If you create HTTP connections without a pool, each request opens a new TCP connection (and potentially a new TLS handshake). With a properly configured pool, connections are reused, dramatically reducing latency for repeated calls to the same API.
Observability for Third-Party Integrations
You cannot fix what you cannot see. Every third-party integration needs metrics for:
- Request rate (are we approaching rate limits?)
- Error rate by error type (transient vs. permanent failures)
- Latency at p50, p95, and p99 (is the API degrading?)
- Circuit breaker state (is the circuit open, closed, or half-open?)
@Component
@Slf4j
@RequiredArgsConstructor
public class IntegrationHealthDashboard {

    private final CircuitBreakerRegistry circuitBreakerRegistry;
    private final MeterRegistry meterRegistry;

    @Scheduled(fixedRate = 30_000)
    public void reportIntegrationHealth() {
        circuitBreakerRegistry.getAllCircuitBreakers().forEach(cb -> {
            CircuitBreaker.Metrics metrics = cb.getMetrics();

            meterRegistry.gauge("integration.circuit_breaker.failure_rate",
                    Tags.of("integration", cb.getName()),
                    metrics.getFailureRate());
            meterRegistry.gauge("integration.circuit_breaker.slow_call_rate",
                    Tags.of("integration", cb.getName()),
                    metrics.getSlowCallRate());
            meterRegistry.gauge("integration.circuit_breaker.state",
                    Tags.of("integration", cb.getName(),
                            "state", cb.getState().name()),
                    cb.getState() == CircuitBreaker.State.CLOSED ? 1 : 0);

            if (cb.getState() != CircuitBreaker.State.CLOSED) {
                log.warn("Circuit breaker {} is in {} state. "
                                + "Failure rate: {}%, Slow call rate: {}%",
                        cb.getName(), cb.getState(),
                        metrics.getFailureRate(), metrics.getSlowCallRate());
            }
        });
    }
}
I also create alerts for:
- Circuit breaker opening (immediate notification)
- Error rate exceeding 5% over a 5-minute window (early warning)
- p99 latency exceeding 2x the normal baseline (degradation detection)
Key Takeaways
Third-party API integration is a risk management problem disguised as a coding problem. The code itself is usually straightforward. The challenge is building sufficient resilience around it so that vendor issues do not become your incidents.
The patterns that matter most are: wrapping every integration behind a stable interface (SDK wrapper pattern), implementing appropriate retry strategies per failure type, using circuit breakers with meaningful fallbacks, configuring timeouts at every level, and instrumenting everything with metrics.
The investment in these patterns pays for itself the first time a vendor has an outage and your system gracefully degrades instead of going down with them. And that first time always comes sooner than you expect.