Why Rate Limit at the Gateway?
Rate limiting protects backend services from being overwhelmed by too many requests — whether from a runaway client, a misconfigured script, or a deliberate denial-of-service attack. Enforcing limits at the gateway means no abusive traffic reaches your services at all; it is rejected at the front door with minimal compute cost.
Rate limiting also enables tiered API monetization: free-tier clients get 100 requests per minute, paid clients get 10,000. The gateway enforces these quotas centrally without any changes to the upstream services.
Rate Limiting Algorithms
Fixed Window Counter
The simplest algorithm: count requests within a fixed time window (e.g., per minute) and reject once the count exceeds the threshold. Because the counter resets at each window boundary, a client can legitimately send 2× the limit in a short burst — 100 requests in the last five seconds of one window plus 100 more in the first five seconds of the next:
Window 00:00-00:59 → 100 requests allowed
Window 01:00-01:59 → 100 requests allowed
Problem: burst of 200 at window boundary (00:55 to 01:05)
Fixed window is easy to implement and has O(1) memory per client, but the boundary burst problem makes it unsuitable for strict rate control.
Sliding Window Log
Store a timestamp log of every request and count how many fall within the last N seconds. Accurate but memory-intensive: each request entry must be stored and expired, which is expensive for high-volume clients:
# Redis ZSET-based sliding window log
def is_allowed(client_id: str, limit: int, window_seconds: int) -> bool:
now = time.time()
pipe = redis.pipeline()
pipe.zremrangebyscore(client_id, 0, now - window_seconds) # evict old
pipe.zadd(client_id, {str(now): now}) # record request
pipe.zcard(client_id) # count in window
pipe.expire(client_id, window_seconds)
_, _, count, _ = pipe.execute()
return count <= limit
Sliding Window Counter
A practical compromise: maintain counters for the current window and the previous window, then interpolate based on how far into the current window you are:
rate = prev_window_count × (1 - elapsed/window) + current_window_count
This approximates a true sliding window with only two counters. Cloudflare uses this algorithm in its global rate limiting product.
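The interpolation can be sketched as a pure function (a hypothetical helper, not Cloudflare's code):

```python
def sliding_window_allowed(prev_count: int, curr_count: int,
                           elapsed: float, window: float, limit: int) -> bool:
    """Weight the previous window's count by the fraction of the sliding
    window that still overlaps it, then add the current window's count."""
    estimated = prev_count * (1 - elapsed / window) + curr_count
    return estimated < limit  # the request being checked is not yet counted
```

Halfway through the current window (elapsed = 30 of 60 s), a previous count of 100 contributes 50 to the estimate.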
Token Bucket
The token bucket algorithm maintains a "bucket" of tokens that refills at a constant rate. Each request consumes one token; if the bucket is empty, the request is rejected. The bucket size controls the maximum burst; the refill rate controls the sustained throughput:
Bucket capacity: 50 tokens (max burst)
Refill rate: 10 tokens/second (sustained rate)
Scenario: idle for 5 seconds → bucket fills to 50 tokens
Client sends 50 requests at once → all allowed (burst)
51st request in the same instant → rejected (bucket empty)
After 1 second of waiting → bucket has 10 tokens again
Token bucket is well-suited for APIs that allow bursty traffic. AWS API Gateway's throttling uses token bucket semantics: a burst limit (bucket capacity) plus a steady-state rate. (Kong's open-source rate-limiting plugin, by contrast, counts requests per fixed window, as its minute/hour configuration suggests.)
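The same semantics in a single-process Python sketch (hypothetical code, not a specific gateway's implementation; the explicit `now` parameter just makes the refill arithmetic visible):

```python
class TokenBucket:
    """In-memory token bucket: capacity bounds bursts, refill_rate bounds
    sustained throughput."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full: a full burst is available
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily, based on time elapsed since the last check
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The lazy refill means no background timer is needed: each check computes how many tokens accrued since the previous one.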
Leaky Bucket
Leaky bucket is the inverse of token bucket: requests enter a queue and are processed at a fixed output rate. Unlike token bucket, leaky bucket smooths traffic — no bursts allowed, even if the client has been idle. This is useful for protecting downstream services that cannot handle spikes, but it adds queuing latency.
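Leaky bucket can also be implemented without an actual queue, as a "meter": a water level that drains at the output rate and rejects requests that would overflow. A hedged sketch of that variant (a queue-based version would delay requests instead of rejecting them):

```python
class LeakyBucket:
    """Leaky bucket as a meter: the level drains at leak_rate; an arriving
    request is rejected if admitting it would overflow capacity."""

    def __init__(self, capacity: float, leak_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests drained per second
        self.level = 0.0
        self.last_leak = now

    def allow(self, now: float) -> bool:
        # Drain lazily based on elapsed time, then try to admit one request
        elapsed = max(0.0, now - self.last_leak)
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_leak = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```

Unlike token bucket, an idle period earns no burst allowance: the level only drains to zero, so the admission rate never exceeds `leak_rate` for long.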
Distributed Rate Limiting with Redis
A single gateway instance can track counters in memory, but a cluster of gateway instances needs a shared store. Redis is the standard solution:
# Atomic token bucket in Redis using a Lua script
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])          -- current timestamp (ms)
local cost = tonumber(ARGV[4])         -- tokens this request costs

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens based on elapsed time
local elapsed = math.max(0, now - last_refill) / 1000
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens >= cost then
    tokens = tokens - cost
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return 1  -- allowed
else
    return 0  -- rejected
end
"""
Lua scripts execute atomically in Redis, avoiding race conditions between the read of the current count and the write of the new count. This is critical — a non-atomic read-then-write allows two concurrent requests to both read a "1 token left" state and both succeed, effectively defeating the limit.
Redis Cluster Considerations
With Redis Cluster, keys hash to specific shards. Rate limit keys for a given client always land on the same shard (the key includes the client ID), so atomicity is preserved. Shard failure means that client's rate limiting temporarily fails open (allow) or closed (reject), depending on your policy.
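The fail-open/fail-closed choice can be isolated in a small wrapper. A hypothetical sketch (`check` stands in for any Redis-backed limit check; substitute your client library's connection exception for the built-in `ConnectionError` used here):

```python
def check_with_policy(check, client_id: str, fail_open: bool = True) -> bool:
    """Run a rate limit check, degrading per policy if the backing
    store is unreachable rather than propagating the error."""
    try:
        return check(client_id)
    except ConnectionError:  # substitute e.g. your Redis client's connection error
        # Fail open keeps the API available during a Redis shard outage;
        # fail closed keeps limits strict at the cost of availability.
        return fail_open
```

Fail open is the common default for rate limiting, since the limiter protects capacity rather than enforcing security; quotas tied to billing may warrant fail closed.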
Per-Client Quotas and Usage Plans
Quotas operate on longer time windows (daily, monthly) and represent a hard cap on total consumption rather than a throughput rate:
# AWS API Gateway usage plan
UsagePlan:
  Throttle:
    BurstLimit: 200   # token bucket capacity
    RateLimit: 50     # sustained requests/second
  Quota:
    Limit: 10000      # total requests per period
    Period: MONTH
Tier Design
Define distinct tiers with clear upgrade incentives:
| Tier | Rate | Daily Quota | Monthly Quota |
|---|---|---|---|
| Free | 10 req/min | 1,000 | 10,000 |
| Starter | 100 req/min | 50,000 | 500,000 |
| Pro | 1,000 req/min | 500,000 | 5,000,000 |
| Enterprise | Custom | Custom | Custom |
Overage Handling
When a client hits their quota:
- Hard stop: return 429 until the quota resets — simple, predictable
- Soft cap with notification: allow overage, send warning email, bill for excess
- Graceful degradation: allow overage but throttle to a lower rate
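The three policies reduce to a small decision function. A sketch with illustrative policy names (not any gateway's real configuration keys):

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    THROTTLE = "throttle"
    REJECT = "reject"

def on_quota_check(used: int, quota: int, policy: str) -> Action:
    """Map a client's consumption against its quota to an action."""
    if used < quota:
        return Action.ALLOW
    if policy == "hard_stop":
        return Action.REJECT    # 429 until the quota resets
    if policy == "soft_cap":
        return Action.ALLOW     # let it through; flag the overage for billing
    if policy == "degrade":
        return Action.THROTTLE  # let it through at a reduced rate
    raise ValueError(f"unknown overage policy: {policy}")
```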
429 Response Design
RFC 6585 defines 429 Too Many Requests. A well-designed 429 response tells the client exactly when to retry and how many requests they have remaining:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1709035260
{
  "error": "rate_limit_exceeded",
  "message": "You have exceeded your rate limit of 100 requests per minute.",
  "retry_after": 30,
  "limit": 100,
  "remaining": 0,
  "reset_at": "2024-02-27T12:01:00Z"
}
Standard Rate Limit Headers
The IETF draft draft-ietf-httpapi-ratelimit-headers standardizes response headers:
| Header | Meaning |
|---|---|
| `RateLimit-Limit` | Maximum requests in the current window |
| `RateLimit-Remaining` | Requests remaining in the current window |
| `RateLimit-Reset` | Time until the window resets (the IETF draft specifies delta seconds; many APIs send a Unix timestamp instead) |
| `Retry-After` | Seconds until the client can retry (also on 503) |
Include these headers on *every* response, not just 429s. This lets well-behaved clients implement adaptive throttling — they slow down before hitting the limit.
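On the client side, adaptive throttling can be as simple as computing a delay from these headers. A hypothetical heuristic (this sketch treats `RateLimit-Reset` as seconds until reset; adjust if your API sends a Unix timestamp):

```python
def backoff_seconds(headers: dict) -> float:
    """Delay before the next request: zero while budget is plentiful,
    pace requests over the window as it runs low, honor Retry-After at zero."""
    remaining = int(headers.get("RateLimit-Remaining", 1))
    reset = float(headers.get("RateLimit-Reset", 0))
    if remaining <= 0:
        return max(reset, float(headers.get("Retry-After", 1)))
    limit = int(headers.get("RateLimit-Limit", remaining))
    if remaining > limit * 0.2:  # over 20% of the budget left: full speed
        return 0.0
    return reset / remaining     # spread the remaining budget over the window
```

The 20% threshold is arbitrary; the point is that a client holding these headers can slow down smoothly instead of slamming into a 429.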
Gateway Configuration Examples
Kong Rate Limiting Plugin
plugins:
  - name: rate-limiting
    config:
      minute: 100            # 100 req/min
      hour: 5000             # 5,000 req/hour
      policy: redis          # use Redis for cluster-wide limits
      redis_host: redis.internal
      redis_port: 6379
      limit_by: consumer     # per authenticated consumer (or: ip, service)
      hide_client_headers: false
Envoy Global Rate Limit
# Envoy filter chain with global rate limit
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: api_gateway
      request_type: external
      rate_limit_service:
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit_service
Envoy delegates rate limit decisions to a separate gRPC rate limit service (e.g., Lyft's open-source ratelimit service), which stores counters in Redis.
Summary
Choose your algorithm based on traffic characteristics: fixed window for simplicity, sliding window counter for accuracy without memory overhead, token bucket for burst-friendly APIs. Use Redis Lua scripts for atomic distributed counting. Design 429 responses with Retry-After and RateLimit-* headers so clients can recover gracefully rather than blindly retrying.