The Problem: Cascade Failures
In a microservices architecture, services call other services. When one service becomes slow or unavailable, its callers queue up waiting for responses. Those callers become slow too, so their callers queue up — and within seconds, a single slow service can cascade into a system-wide outage.
Gateway → Orders → Inventory (slow, 30s timeout)
  ↳ holds 100 threads waiting
    ↳ new requests queue up
      ↳ Gateway itself becomes slow
        ↳ All APIs appear unavailable
The circuit breaker pattern, popularized by Michael Nygard's *Release It!* and the Netflix Hystrix library, prevents this cascade by detecting upstream failures and "opening the circuit" — rejecting requests immediately with a fast failure rather than waiting for a slow timeout.
Circuit Breaker States
A circuit breaker operates as a state machine with three states:
Closed (Normal Operation)
The circuit is closed — current flows — and requests pass through to the upstream service. The breaker monitors responses, tracking the error rate and slow call rate. As long as metrics stay below configured thresholds, the breaker stays closed.
State: CLOSED
Request → Upstream Service → Response
[monitor: error_rate = 2%, slow_rate = 5%] → thresholds OK, stay closed
Open (Failing Fast)
When the error rate or slow call rate exceeds the threshold, the breaker trips to open. In the open state, requests are rejected immediately without contacting the upstream service. This prevents resource exhaustion and gives the failing service time to recover:
State: OPEN
Request → [Circuit Breaker] → 503 Service Unavailable (no upstream call)
[wait: open_duration = 30 seconds]
The response time of a tripped breaker is microseconds — no threads held, no connections consumed. This fast failure allows healthy parts of the system to continue operating while the failing service recovers.
Half-Open (Testing Recovery)
After the open duration elapses, the breaker transitions to half-open and allows a limited number of probe requests through to test whether the upstream has recovered:
State: HALF-OPEN
Request (probe) → Upstream → Success? → close circuit
                           → Failure? → reopen circuit (reset timer)
If the probe requests succeed, the breaker closes. If they fail, it reopens and waits again. The half-open state prevents thundering herd — a just-recovered service being immediately hammered by all the queued traffic.
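The three states and their transitions can be sketched in a few dozen lines. This is a minimal illustration, not any particular library's API; the class and parameter names are invented for this example, and the clock is injectable so the open-duration logic is testable:

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"

class CircuitBreaker:
    """Minimal three-state breaker: counts consecutive failures,
    fails fast while open, and probes after open_duration elapses."""

    def __init__(self, failure_threshold=5, open_duration=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_duration = open_duration
        self.clock = clock          # injectable for testing
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.open_duration:
                self.state = State.HALF_OPEN   # let a probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe in half-open reopens immediately; in closed,
        # only the threshold trips the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()      # restart the open timer

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED
```

A production breaker would track rates over a window rather than consecutive failures, but the state transitions are the same.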
Failure Detection Configuration
Effective circuit breakers require careful threshold tuning:
Error Rate Threshold
The percentage of recent requests that must return 5xx errors before the breaker opens:
# Example: open if more than 50% of recent requests fail
failure_rate_threshold: 50 # percent
minimum_number_of_calls: 20 # ignore rate until this many calls observed
sliding_window_size: 100 # evaluate last 100 calls
The minimum_number_of_calls prevents the breaker from opening after just one or two failures at startup or during low-traffic periods.
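A count-based sliding window implementing the check above might look like this sketch (the names mirror the config keys; this is not any particular library's API):

```python
from collections import deque

class FailureRateWindow:
    """Count-based sliding window for the error-rate threshold.
    Records the last sliding_window_size outcomes; the rate is only
    evaluated once minimum_number_of_calls outcomes are present."""

    def __init__(self, failure_rate_threshold=50,
                 minimum_number_of_calls=20, sliding_window_size=100):
        self.threshold = failure_rate_threshold
        self.min_calls = minimum_number_of_calls
        self.window = deque(maxlen=sliding_window_size)  # True = failure

    def record(self, failed):
        self.window.append(failed)

    def should_open(self):
        if len(self.window) < self.min_calls:
            return False    # too few calls: don't trip on startup noise
        failure_rate = 100.0 * sum(self.window) / len(self.window)
        return failure_rate > self.threshold
```

Note how the minimum-calls guard comes first: with fewer than 20 observations, even a 100% failure rate leaves the breaker closed.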
Slow Call Threshold
Slow responses are as harmful as errors — they hold threads and connections. Configure a separate threshold to open the breaker based on response latency:
slow_call_rate_threshold: 80 # percent of calls considered "slow"
slow_call_duration_threshold: 2000 # milliseconds — calls over 2s are "slow"
Window Types
- Count-based: evaluate the last N calls (memory-efficient)
- Time-based: evaluate calls in the last N seconds (more responsive to spikes)
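The two window types differ only in eviction: a count-based window is a fixed-size deque of outcomes (as sketched earlier), while a time-based window drops entries by timestamp. A time-based sketch applying the slow-call thresholds from the previous section (class and parameter names are illustrative, not from any particular library):

```python
import time
from collections import deque

class TimeBasedWindow:
    """Time-based window: only calls from the last window_seconds
    count toward the slow-call rate."""

    def __init__(self, window_seconds=10,
                 slow_call_duration_threshold=2.0,   # seconds
                 slow_call_rate_threshold=80,        # percent
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.slow_threshold = slow_call_duration_threshold
        self.rate_threshold = slow_call_rate_threshold
        self.clock = clock
        self.calls = deque()    # (timestamp, was_slow)

    def record(self, duration_seconds):
        self.calls.append((self.clock(),
                           duration_seconds > self.slow_threshold))

    def _evict(self):
        cutoff = self.clock() - self.window_seconds
        while self.calls and self.calls[0][0] < cutoff:
            self.calls.popleft()     # drop calls older than the window

    def slow_rate(self):
        self._evict()
        if not self.calls:
            return 0.0
        return 100.0 * sum(slow for _, slow in self.calls) / len(self.calls)

    def should_open(self):
        return self.slow_rate() > self.rate_threshold
```

Because old entries age out continuously, a burst of slow calls raises the rate within seconds and decays just as quickly, which is why time-based windows respond faster to spikes.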
Gateway Implementation
Envoy Outlier Detection
Envoy's built-in outlier detection implements circuit breaking at the cluster level (per upstream service). It ejects unhealthy hosts from the load balancing pool:
clusters:
- name: inventory_service
  outlier_detection:
    consecutive_5xx: 5              # eject after 5 consecutive 5xx
    interval: 10s                   # evaluation interval
    base_ejection_time: 30s         # ejection duration
    max_ejection_percent: 50        # never eject more than 50% of hosts
    consecutive_gateway_failure: 5
    enforcing_consecutive_5xx: 100  # enforcement percentage
  circuit_breakers:
    thresholds:
    - max_connections: 100
      max_pending_requests: 100
      max_requests: 200
      max_retries: 3
Envoy circuit breakers (circuit_breakers) limit connection and request counts — they prevent resource exhaustion even before the service starts returning errors.
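The effect of such a concurrency cap can be sketched in a few lines: acquire a slot without blocking, and shed load when none is free. This is a hypothetical helper illustrating the idea, not Envoy code:

```python
import threading

class ConcurrencyLimit:
    """Sketch of a max_requests-style cap: take a slot before calling
    upstream, and fail fast (instead of queueing) when all slots are
    in use, so waiting requests never pile up."""

    def __init__(self, max_requests=200):
        self._slots = threading.Semaphore(max_requests)

    def call(self, fn):
        # blocking=False is the key: no thread ever waits for a slot
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("overloaded: max_requests reached")
        try:
            return fn()
        finally:
            self._slots.release()
```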
Kong Circuit Breaker (passive health checks)
upstreams:
- name: inventory-upstream
  healthchecks:
    passive:
      healthy:
        successes: 2          # mark healthy after 2 successes
      unhealthy:
        http_failures: 5      # mark unhealthy after 5 HTTP failures
        tcp_failures: 2
        timeouts: 3
        http_statuses: [500, 502, 503, 504]
Fallback Responses
When the circuit is open, the gateway must return *something*. Options include:
Cached Response
Return the last successful response from cache. Ideal for read-heavy, infrequently changing data (product catalogs, configuration):
-- Kong custom plugin (Lua)
local function handle_upstream_error(conf)
  local route_id = kong.router.get_route().id
  local cached = kong.cache:get('last_good_response:' .. route_id)
  if cached then
    kong.response.set_header('X-Served-From', 'cache-fallback')
    return kong.response.exit(200, cached)
  end
end
Default Payload
Return a hardcoded "degraded" response. Useful when fresh data is unavailable but a partial response is better than an error:
{
  "status": "degraded",
  "message": "Inventory data temporarily unavailable. Showing estimated availability.",
  "items": [{"in_stock": true, "quantity": null}]
}
503 with Retry-After
When no fallback is appropriate, return a 503 with a Retry-After header indicating when the client should try again. This prevents retry storms:
HTTP/1.1 503 Service Unavailable
Retry-After: 30
X-Circuit-Breaker: open
{
  "error": "service_unavailable",
  "retry_after": 30
}
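A gateway handler producing this response takes only a few lines. A sketch (the function name and tuple shape are illustrative, not a specific framework's API):

```python
import json

def open_circuit_response(retry_after_seconds=30):
    """Build the fast-failure response above: a 503 with a Retry-After
    header so well-behaved clients back off for a known interval
    instead of retrying immediately."""
    body = {
        "error": "service_unavailable",
        "retry_after": retry_after_seconds,
    }
    headers = {
        "Retry-After": str(retry_after_seconds),
        "X-Circuit-Breaker": "open",
        "Content-Type": "application/json",
    }
    return 503, headers, json.dumps(body)
```

Ideally the Retry-After value matches the breaker's remaining open duration, so clients return right when probes are next allowed.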
Monitoring Circuit State
Circuit state transitions are operationally significant events. Emit metrics and alerts when circuits open or close:
# Prometheus metrics for circuit breaker state
from prometheus_client import Counter, Gauge

circuit_breaker_state = Gauge(
    'gateway_circuit_breaker_state',
    'Circuit breaker state (0=closed, 1=open, 2=half-open)',
    ['service', 'route']
)

circuit_breaker_transitions_total = Counter(
    'gateway_circuit_breaker_transitions_total',
    'Number of circuit breaker state transitions',
    ['service', 'from_state', 'to_state']
)
Set alerts on gateway_circuit_breaker_state == 1 (open) with a 2-minute delay to avoid alert fatigue from transient flaps. Page on-call when a circuit stays open for more than 5 minutes.
Summary
Circuit breakers at the gateway prevent cascade failures by detecting upstream degradation and failing fast instead of holding connections open. Configure failure thresholds carefully — too sensitive causes flapping, too lenient allows cascades. Always implement fallback responses so clients receive meaningful errors rather than timeouts, and emit circuit state changes as observable metrics.