The Problem: Cascade Failures
In a microservices architecture, services call other services. When one service becomes slow or unavailable, its callers queue up waiting for responses. Those callers become slow too, so their callers queue up — and within seconds, a single slow service can cascade into a system-wide outage.
Gateway → Orders → Inventory (slow, 30s timeout)
  ↳ holds 100 threads waiting
    ↳ new requests queue up
      ↳ Gateway itself becomes slow
        ↳ All APIs appear unavailable
The circuit breaker pattern, popularized by Michael Nygard's *Release It!* and the Netflix Hystrix library, prevents this cascade by detecting upstream failures and "opening the circuit" — rejecting requests immediately with a fast failure rather than waiting for a slow timeout.
Circuit Breaker States
A circuit breaker operates as a state machine with three states:
Closed (Normal Operation)
The circuit is closed — current flows — and requests pass through to the upstream service. The breaker monitors responses, tracking the error rate and slow call rate. As long as metrics stay below configured thresholds, the breaker stays closed.
State: CLOSED
Request → Upstream Service → Response
[monitor: error_rate = 2%, slow_rate = 5%] → thresholds OK, stay closed
Open (Failing Fast)
When the error rate or slow call rate exceeds the threshold, the breaker trips to open. In the open state, requests are rejected immediately without contacting the upstream service. This prevents resource exhaustion and gives the failing service time to recover:
State: OPEN
Request → [Circuit Breaker] → 503 Service Unavailable (no upstream call)
[wait: open_duration = 30 seconds]
The response time of a tripped breaker is microseconds — no threads held, no connections consumed. This fast failure allows healthy parts of the system to continue operating while the failing service recovers.
Half-Open (Testing Recovery)
After the open duration elapses, the breaker transitions to half-open and allows a limited number of probe requests through to test whether the upstream has recovered:
State: HALF-OPEN
Request (probe) → Upstream → Success? → close circuit
                           → Failure? → reopen circuit (reset timer)
If the probe requests succeed, the breaker closes. If they fail, it reopens and waits again. The half-open state prevents thundering herd — a just-recovered service being immediately hammered by all the queued traffic.
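The three states and their transitions can be sketched in a few dozen lines. This is a minimal illustration, not any particular library's API; the class and parameter names are invented for this example, and the clock is injectable so the open-duration logic is testable:

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"

class CircuitBreaker:
    """Minimal three-state breaker: counts consecutive failures,
    fails fast while open, and probes after open_duration elapses."""

    def __init__(self, failure_threshold=5, open_duration=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_duration = open_duration
        self.clock = clock          # injectable for testing
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.open_duration:
                self.state = State.HALF_OPEN   # let a probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe in half-open reopens immediately; in closed,
        # only the threshold trips the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()      # restart the open timer

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED
```

A production breaker would track rates over a window rather than consecutive failures, but the state transitions are the same.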
Failure Detection Configuration
Effective circuit breakers require careful threshold tuning:
Error Rate Threshold
The percentage of recent requests that must return 5xx errors before the breaker opens:
# Example: open if more than 50% of recent requests fail
failure_rate_threshold: 50 # percent
minimum_number_of_calls: 20 # ignore rate until this many calls observed
sliding_window_size: 100 # evaluate last 100 calls
The minimum_number_of_calls prevents the breaker from opening after just one or two failures at startup or during low-traffic periods.
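A count-based sliding window implementing the check above might look like this sketch (the names mirror the config keys; this is not any particular library's API):

```python
from collections import deque

class FailureRateWindow:
    """Count-based sliding window for the error-rate threshold.
    Records the last sliding_window_size outcomes; the rate is only
    evaluated once minimum_number_of_calls outcomes are present."""

    def __init__(self, failure_rate_threshold=50,
                 minimum_number_of_calls=20, sliding_window_size=100):
        self.threshold = failure_rate_threshold
        self.min_calls = minimum_number_of_calls
        self.window = deque(maxlen=sliding_window_size)  # True = failure

    def record(self, failed):
        self.window.append(failed)

    def should_open(self):
        if len(self.window) < self.min_calls:
            return False    # too few calls: don't trip on startup noise
        failure_rate = 100.0 * sum(self.window) / len(self.window)
        return failure_rate > self.threshold
```

Note how the minimum-calls guard comes first: with fewer than 20 observations, even a 100% failure rate leaves the breaker closed.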
Slow Call Threshold
Slow responses are as harmful as errors — they hold threads and connections. Configure a separate threshold to open the breaker based on response latency:
slow_call_rate_threshold: 80 # percent of calls considered "slow"
slow_call_duration_threshold: 2000 # milliseconds — calls over 2s are "slow"
Window Types
- Count-based: evaluate the last N calls (memory-efficient)
- Time-based: evaluate calls in the last N seconds (more responsive to spikes)
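The two window types differ only in eviction: a count-based window is a fixed-size deque of outcomes (as sketched earlier), while a time-based window drops entries by timestamp. A time-based sketch applying the slow-call thresholds from the previous section (class and parameter names are illustrative, not from any particular library):

```python
import time
from collections import deque

class TimeBasedWindow:
    """Time-based window: only calls from the last window_seconds
    count toward the slow-call rate."""

    def __init__(self, window_seconds=10,
                 slow_call_duration_threshold=2.0,   # seconds
                 slow_call_rate_threshold=80,        # percent
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.slow_threshold = slow_call_duration_threshold
        self.rate_threshold = slow_call_rate_threshold
        self.clock = clock
        self.calls = deque()    # (timestamp, was_slow)

    def record(self, duration_seconds):
        self.calls.append((self.clock(),
                           duration_seconds > self.slow_threshold))

    def _evict(self):
        cutoff = self.clock() - self.window_seconds
        while self.calls and self.calls[0][0] < cutoff:
            self.calls.popleft()     # drop calls older than the window

    def slow_rate(self):
        self._evict()
        if not self.calls:
            return 0.0
        return 100.0 * sum(slow for _, slow in self.calls) / len(self.calls)

    def should_open(self):
        return self.slow_rate() > self.rate_threshold
```

Because old entries age out continuously, a burst of slow calls raises the rate within seconds and decays just as quickly, which is why time-based windows respond faster to spikes.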
Gateway Implementation
Envoy Outlier Detection
Envoy's built-in outlier detection implements circuit breaking at the cluster level (per upstream service). It ejects unhealthy hosts from the load balancing pool:
clusters:
- name: inventory_service
  outlier_detection:
    consecutive_5xx: 5              # eject after 5 consecutive 5xx
    interval: 10s                   # evaluation interval
    base_ejection_time: 30s         # ejection duration
    max_ejection_percent: 50        # never eject more than 50% of hosts
    consecutive_gateway_failure: 5
    enforcing_consecutive_5xx: 100  # enforcement percentage
  circuit_breakers:
    thresholds:
    - max_connections: 100
      max_pending_requests: 100
      max_requests: 200
      max_retries: 3
Envoy circuit breakers (circuit_breakers) limit connection and request counts — they prevent resource exhaustion even before the service starts returning errors.
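The effect of such a concurrency cap can be sketched in a few lines: acquire a slot without blocking, and shed load when none is free. This is a hypothetical helper illustrating the idea, not Envoy code:

```python
import threading

class ConcurrencyLimit:
    """Sketch of a max_requests-style cap: take a slot before calling
    upstream, and fail fast (instead of queueing) when all slots are
    in use, so waiting requests never pile up."""

    def __init__(self, max_requests=200):
        self._slots = threading.Semaphore(max_requests)

    def call(self, fn):
        # blocking=False is the key: no thread ever waits for a slot
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("overloaded: max_requests reached")
        try:
            return fn()
        finally:
            self._slots.release()
```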
Kong Circuit Breaker (passive health checks)
upstreams:
- name: inventory-upstream
  healthchecks:
    passive:
      healthy:
        successes: 2          # mark healthy after 2 successes
      unhealthy:
        http_failures: 5      # mark unhealthy after 5 HTTP failures
        tcp_failures: 2
        timeouts: 3
        http_statuses: [500, 502, 503, 504]
Fallback Responses
When the circuit is open, the gateway must return *something*. Options include:
Cached Response
Return the last successful response from cache. Ideal for read-heavy, infrequently changing data (product catalogs, configuration):
-- Kong custom plugin (Lua)
local function handle_upstream_error(conf)
  local route_id = kong.router.get_route().id
  local cached = kong.cache:get('last_good_response:' .. route_id)
  if cached then
    kong.response.set_header('X-Served-From', 'cache-fallback')
    return kong.response.exit(200, cached)
  end
end
Default Payload
Return a hardcoded "degraded" response. Useful when fresh data is unavailable but a partial response is better than an error:
{
  "status": "degraded",
  "message": "Inventory data temporarily unavailable. Showing estimated availability.",
  "items": [{"in_stock": true, "quantity": null}]
}
503 with Retry-After
When no fallback is appropriate, return a 503 with a Retry-After header indicating when the client should try again. This prevents retry storms:
HTTP/1.1 503 Service Unavailable
Retry-After: 30
X-Circuit-Breaker: open
{
  "error": "service_unavailable",
  "retry_after": 30
}
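A gateway handler producing this response takes only a few lines. A sketch (the function name and tuple shape are illustrative, not a specific framework's API):

```python
import json

def open_circuit_response(retry_after_seconds=30):
    """Build the fast-failure response above: a 503 with a Retry-After
    header so well-behaved clients back off for a known interval
    instead of retrying immediately."""
    body = {
        "error": "service_unavailable",
        "retry_after": retry_after_seconds,
    }
    headers = {
        "Retry-After": str(retry_after_seconds),
        "X-Circuit-Breaker": "open",
        "Content-Type": "application/json",
    }
    return 503, headers, json.dumps(body)
```

Ideally the Retry-After value matches the breaker's remaining open duration, so clients return right when probes are next allowed.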
Monitoring Circuit State
Circuit state transitions are operationally significant events. Emit metrics and alerts when circuits open or close:
# Prometheus metrics for circuit breaker state
from prometheus_client import Counter, Gauge

circuit_breaker_state = Gauge(
    'gateway_circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half-open)',
    ['service', 'route']
)

circuit_breaker_transitions_total = Counter(
    'gateway_circuit_breaker_transitions_total',
    'Number of circuit breaker state transitions',
    ['service', 'from_state', 'to_state']
)
Set alerts on gateway_circuit_breaker_state == 1 (open) with a 2-minute delay to avoid alert fatigue from transient flaps. Page on-call when a circuit stays open for more than 5 minutes.
Summary
Circuit breakers at the gateway prevent cascade failures by detecting upstream degradation and failing fast instead of holding connections open. Configure failure thresholds carefully — too sensitive causes flapping, too lenient allows cascades. Always implement fallback responses so clients receive meaningful errors rather than timeouts, and emit circuit state changes as observable metrics.