Why Rate Limit at the Gateway?
Rate limiting protects backend services from being overwhelmed by too many requests — whether from a runaway client, a misconfigured script, or a deliberate denial-of-service attack. Enforcing limits at the gateway means no abusive traffic reaches your services at all; it is rejected at the front door with minimal compute cost.
Rate limiting also enables tiered API monetization: free-tier clients get 100 requests per minute, paid clients get 10,000. The gateway enforces these quotas centrally without any changes to the upstream services.
Rate Limiting Algorithms
Fixed Window Counter
The simplest algorithm: count requests within a fixed time window (e.g., per minute) and reject once the count exceeds the threshold. Because the counter resets at each window boundary, a client can legitimately send 2× the limit in a short burst — 100 requests in the last five seconds of one window plus 100 more in the first five seconds of the next:
Window 00:00-00:59 → 100 requests allowed
Window 01:00-01:59 → 100 requests allowed
Problem: burst of 200 at window boundary (00:55 to 01:05)
Fixed window is easy to implement and has O(1) memory per client, but the boundary burst problem makes it unsuitable for strict rate control.
Sliding Window Log
Store a timestamp log of every request and count how many fall within the last N seconds. Accurate but memory-intensive: each request entry must be stored and expired, which is expensive for high-volume clients:
# Redis ZSET-based sliding window log
def is_allowed(client_id: str, limit: int, window_seconds: int) -> bool:
now = time.time()
pipe = redis.pipeline()
pipe.zremrangebyscore(client_id, 0, now - window_seconds) # evict old
pipe.zadd(client_id, {str(now): now}) # record request
pipe.zcard(client_id) # count in window
pipe.expire(client_id, window_seconds)
_, _, count, _ = pipe.execute()
return count <= limit
Sliding Window Counter
A practical compromise: maintain counters for the current window and the previous window, then interpolate based on how far into the current window you are:
rate = prev_window_count × (1 - elapsed/window) + current_window_count
This approximates a true sliding window with only two counters. Cloudflare uses this algorithm in its global rate limiting product.
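The interpolation can be sketched as a pure function (a hypothetical helper, not Cloudflare's code):

```python
def sliding_window_allowed(prev_count: int, curr_count: int,
                           elapsed: float, window: float, limit: int) -> bool:
    """Weight the previous window's count by the fraction of the sliding
    window that still overlaps it, then add the current window's count."""
    estimated = prev_count * (1 - elapsed / window) + curr_count
    return estimated < limit  # the request being checked is not yet counted
```

Halfway through the current window (elapsed = 30 of 60 s), a previous count of 100 contributes 50 to the estimate.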
Token Bucket
The token bucket algorithm maintains a "bucket" of tokens that refills at a constant rate. Each request consumes one token; if the bucket is empty, the request is rejected. The bucket size controls the maximum burst; the refill rate controls the sustained throughput:
Bucket capacity: 50 tokens (max burst)
Refill rate: 10 tokens/second (sustained rate)
Scenario: idle for 5 seconds → bucket fills to 50 tokens
Client sends 50 requests at once → all allowed (burst)
51st request in the same instant → rejected (bucket empty)
After 1 second of waiting → bucket has 10 tokens again
Token bucket is well-suited for APIs that allow bursty traffic. AWS API Gateway's throttling uses token bucket semantics: a burst limit (bucket capacity) plus a steady-state rate. (Kong's open-source rate-limiting plugin, by contrast, counts requests per fixed window, as its minute/hour configuration suggests.)
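The same semantics in a single-process Python sketch (hypothetical code, not a specific gateway's implementation; the explicit `now` parameter just makes the refill arithmetic visible):

```python
class TokenBucket:
    """In-memory token bucket: capacity bounds bursts, refill_rate bounds
    sustained throughput."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full: a full burst is available
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily, based on time elapsed since the last check
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The lazy refill means no background timer is needed: each check computes how many tokens accrued since the previous one.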
Leaky Bucket
Leaky bucket is the inverse of token bucket: requests enter a queue and are processed at a fixed output rate. Unlike token bucket, leaky bucket smooths traffic — no bursts allowed, even if the client has been idle. This is useful for protecting downstream services that cannot handle spikes, but it adds queuing latency.
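Leaky bucket can also be implemented without an actual queue, as a "meter": a water level that drains at the output rate and rejects requests that would overflow. A hedged sketch of that variant (a queue-based version would delay requests instead of rejecting them):

```python
class LeakyBucket:
    """Leaky bucket as a meter: the level drains at leak_rate; an arriving
    request is rejected if admitting it would overflow capacity."""

    def __init__(self, capacity: float, leak_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests drained per second
        self.level = 0.0
        self.last_leak = now

    def allow(self, now: float) -> bool:
        # Drain lazily based on elapsed time, then try to admit one request
        elapsed = max(0.0, now - self.last_leak)
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_leak = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```

Unlike token bucket, an idle period earns no burst allowance: the level only drains to zero, so the admission rate never exceeds `leak_rate` for long.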
Distributed Rate Limiting with Redis
A single gateway instance can track counters in memory, but a cluster of gateway instances needs a shared store. Redis is the standard solution:
# Atomic token bucket in Redis using a Lua script
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])          -- current timestamp (ms)
local cost = tonumber(ARGV[4])         -- tokens this request costs

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = math.max(0, now - last_refill) / 1000
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= cost then
    tokens = tokens - cost
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- allowed
else
    return 0  -- rejected
end
"""
Lua scripts execute atomically in Redis, avoiding race conditions between the read of the current count and the write of the new count. This is critical — a non-atomic read-then-write allows two concurrent requests to both read a "1 token left" state and both succeed, effectively defeating the limit.
Redis Cluster Considerations
With Redis Cluster, keys hash to specific shards. Rate limit keys for a given client always land on the same shard (the key includes the client ID), so atomicity is preserved. Shard failure means that client's rate limiting temporarily fails open (allow) or closed (reject), depending on your policy.
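The fail-open/fail-closed choice can be isolated in a small wrapper. A hypothetical sketch (`check` stands in for any Redis-backed limit check; substitute your client library's connection exception for the built-in `ConnectionError` used here):

```python
def check_with_policy(check, client_id: str, fail_open: bool = True) -> bool:
    """Run a rate limit check, degrading per policy if the backing
    store is unreachable rather than propagating the error."""
    try:
        return check(client_id)
    except ConnectionError:  # substitute e.g. your Redis client's connection error
        # Fail open keeps the API available during a Redis shard outage;
        # fail closed keeps limits strict at the cost of availability.
        return fail_open
```

Fail open is the common default for rate limiting, since the limiter protects capacity rather than enforcing security; quotas tied to billing may warrant fail closed.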
Per-Client Quotas and Usage Plans
Quotas operate on longer time windows (daily, monthly) and represent a hard cap on total consumption rather than a throughput rate:
# AWS API Gateway usage plan
UsagePlan:
  Throttle:
    BurstLimit: 200   # token bucket capacity
    RateLimit: 50     # sustained requests/second
  Quota:
    Limit: 10000      # total requests per period
    Period: MONTH
Tier Design
Define distinct tiers with clear upgrade incentives:
| Tier | Rate | Daily Quota | Monthly Quota |
|---|---|---|---|
| Free | 10 req/min | 1,000 | 10,000 |
| Starter | 100 req/min | 50,000 | 500,000 |
| Pro | 1,000 req/min | 500,000 | 5,000,000 |
| Enterprise | Custom | Custom | Custom |
Overage Handling
When a client hits their quota:
- Hard stop: return 429 until the quota resets — simple, predictable
- Soft cap with notification: allow overage, send warning email, bill for excess
- Graceful degradation: allow overage but throttle to a lower rate
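The three policies reduce to a small decision function. A sketch with illustrative policy names (not any gateway's real configuration keys):

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    THROTTLE = "throttle"
    REJECT = "reject"

def on_quota_check(used: int, quota: int, policy: str) -> Action:
    """Map a client's consumption against its quota to an action."""
    if used < quota:
        return Action.ALLOW
    if policy == "hard_stop":
        return Action.REJECT    # 429 until the quota resets
    if policy == "soft_cap":
        return Action.ALLOW     # let it through; flag the overage for billing
    if policy == "degrade":
        return Action.THROTTLE  # let it through at a reduced rate
    raise ValueError(f"unknown overage policy: {policy}")
```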
429 Response Design
RFC 6585 defines 429 Too Many Requests. A well-designed 429 response tells the client exactly when to retry and how many requests they have remaining:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1709035260
{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded your rate limit of 100 requests per minute.",
  "retry_after": 30,
  "limit": 100,
  "remaining": 0,
  "reset_at": "2024-02-27T12:01:00Z"
}
Standard Rate Limit Headers
The IETF draft draft-ietf-httpapi-ratelimit-headers standardizes response headers:
| Header | Meaning |
|---|---|
| `RateLimit-Limit` | Maximum requests in the current window |
| `RateLimit-Remaining` | Requests remaining in the current window |
| `RateLimit-Reset` | Time until the window resets (the IETF draft specifies delta seconds; many APIs send a Unix timestamp instead) |
| `Retry-After` | Seconds until the client can retry (also on 503) |
Include these headers on *every* response, not just 429s. This lets well-behaved clients implement adaptive throttling — they slow down before hitting the limit.
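On the client side, adaptive throttling can be as simple as computing a delay from these headers. A hypothetical heuristic (this sketch treats `RateLimit-Reset` as seconds until reset; adjust if your API sends a Unix timestamp):

```python
def backoff_seconds(headers: dict) -> float:
    """Delay before the next request: zero while budget is plentiful,
    pace requests over the window as it runs low, honor Retry-After at zero."""
    remaining = int(headers.get("RateLimit-Remaining", 1))
    reset = float(headers.get("RateLimit-Reset", 0))
    if remaining <= 0:
        return max(reset, float(headers.get("Retry-After", 1)))
    limit = int(headers.get("RateLimit-Limit", remaining))
    if remaining > limit * 0.2:  # over 20% of the budget left: full speed
        return 0.0
    return reset / remaining     # spread the remaining budget over the window
```

The 20% threshold is arbitrary; the point is that a client holding these headers can slow down smoothly instead of slamming into a 429.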
Gateway Configuration Examples
Kong Rate Limiting Plugin
plugins:
  - name: rate-limiting
    config:
      minute: 100            # 100 req/min
      hour: 5000             # 5,000 req/hour
      policy: redis          # use Redis for cluster-wide limits
      redis_host: redis.internal
      redis_port: 6379
      limit_by: consumer     # per authenticated consumer (or: ip, service)
      hide_client_headers: false
Envoy Global Rate Limit
# Envoy filter chain with global rate limit
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: api_gateway
      request_type: external
      rate_limit_service:
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit_service
Envoy delegates rate limit decisions to a separate gRPC rate limit service (e.g., Lyft's open-source ratelimit service), which stores counters in Redis.
Summary
Choose your algorithm based on traffic characteristics: fixed window for simplicity, sliding window counter for accuracy without memory overhead, token bucket for burst-friendly APIs. Use Redis Lua scripts for atomic distributed counting. Design 429 responses with Retry-After and RateLimit-* headers so clients can recover gracefully rather than blindly retrying.