
504 Gateway Timeout — Upstream Not Responding

Symptoms

- Browser or API client receives `504 Gateway Timeout` after a long wait (default 60 s)
- Nginx error log shows: `upstream timed out (110: Connection timed out) while reading response header from upstream`
- AWS ALB access logs record a `504` with `-1` in the target processing time field
- The endpoint works fine for simple requests but times out under load or for complex queries
- Health check pages return 200 while specific API routes return 504 intermittently
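
To confirm the intermittent pattern and see which routes are affected, you can tally 504s per request path in the Nginx access log. A sketch assuming the default `combined` log format (status is field 9, request path field 7); the demo input below stands in for `cat /var/log/nginx/access.log`:

```shell
# Tally 504 responses per request path. Demo input shown; in production,
# replace the printf with:  cat /var/log/nginx/access.log
printf '%s\n' \
  '10.0.0.5 - - [01/Jan/2025:00:00:00 +0000] "GET /api/slow/ HTTP/1.1" 504 160 "-" "curl/8"' \
  '10.0.0.5 - - [01/Jan/2025:00:00:01 +0000] "GET /api/slow/ HTTP/1.1" 504 160 "-" "curl/8"' |
awk '$9 == 504 { n[$7]++ } END { for (p in n) print n[p], p }' | sort -rn
# → 2 /api/slow/
```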

Root causes

- Gunicorn worker processing a slow database query exceeds the `proxy_read_timeout` window
- Database lock contention causing queries to wait indefinitely for row-level locks
- External API call inside a Django view has no timeout and hangs the worker
- Too few Gunicorn workers to handle concurrent requests, causing queue buildup
- Nginx `proxy_read_timeout` set lower than the Gunicorn `--timeout`, creating a mismatch

Diagnosis

1. **Identify which endpoint is timing out** and measure its latency baseline:
```bash
time curl -sI https://example.com/api/slow-endpoint/
# If > 60 s, the Nginx default timeout is the boundary
```

2. **Check Nginx timeout configuration:**
```bash
grep -r 'proxy_read_timeout\|proxy_connect_timeout\|proxy_send_timeout' \
/etc/nginx/sites-available/ /etc/nginx/nginx.conf
```

3. **Monitor active database queries** to find long-running or locked queries:
```sql
-- PostgreSQL: find queries running > 10 s
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '10 seconds'
ORDER BY duration DESC;
```
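
If step 3 shows sessions stuck waiting on locks, a follow-up query can show who is blocking whom. A sketch using `pg_blocking_pids` (PostgreSQL 9.6+):

```sql
-- Which sessions are blocked, and which PIDs hold the locks they wait on
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```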

4. **Check Gunicorn worker saturation:**
```bash
# Count gunicorn processes (the total is one master plus N workers);
# the [g] trick keeps grep from matching itself
ps aux | grep '[g]unicorn' | wc -l
# ps alone cannot distinguish busy from idle: if latency climbs while
# these processes sit idle on CPU, workers are blocked and requests queue
```
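
To see what a saturated worker is actually doing, you can dump its Python stack (assuming `py-spy` is installed); a worker hung on a DB query or external HTTP call shows up immediately in the stack:

```bash
# Replace <PID> with a worker PID from the ps listing above
sudo py-spy dump --pid <PID>
```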

5. **Reproduce against the affected endpoint** with a client-side timeout longer than the proxy's, to confirm the 504 is produced at the Nginx boundary rather than by the client:
```bash
# --max-time 70 outlives Nginx's 60 s default, so any 504 seen here
# came from the proxy, not from curl giving up first
curl --max-time 70 -sv https://example.com/api/slow/ 2>&1 | \
  grep -E 'HTTP/|504|timed out'
```

Resolution

**Fix 1: Align Nginx and Gunicorn timeouts** (Nginx must be higher than Gunicorn):
```nginx
# /etc/nginx/sites-available/myapp
location / {
    proxy_pass http://127.0.0.1:8000;
    proxy_read_timeout 120s;     # > gunicorn --timeout
    proxy_connect_timeout 10s;
    proxy_send_timeout 120s;
}
```
```bash
sudo nginx -t && sudo systemctl reload nginx
```

**Fix 2: Increase Gunicorn worker count and timeout:**
```ini
# /etc/systemd/system/gunicorn-myapp.service
ExecStart=... gunicorn \
    --workers 8 \
    --timeout 90 \
    --worker-class gthread \
    --threads 2 \
    config.wsgi:application
```
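
The `--workers 8` above is only an example; the Gunicorn documentation suggests `(2 × cores) + 1` as a starting point, which you can compute on the host:

```python
import multiprocessing

# Gunicorn docs' rule of thumb for sync/gthread workers: (2 x cores) + 1
workers = 2 * multiprocessing.cpu_count() + 1
print(workers)
```

After editing the unit file, apply it with `sudo systemctl daemon-reload && sudo systemctl restart gunicorn-myapp` (unit name taken from the path above).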

**Fix 3: Add a statement timeout to PostgreSQL** to surface slow queries early:
```python
# config/settings/base.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        # Abort any statement running longer than 30 000 ms (30 s)
        'OPTIONS': {'options': '-c statement_timeout=30000'},
    }
}
```
Queries that exceed the limit now fail fast with `OperationalError: canceling statement due to statement timeout` instead of holding a worker until the proxy gives up.

**Fix 4: Add timeouts to all external HTTP calls** in Django views:
```python
import httpx

# Never call an external service without a timeout: a hung upstream
# ties up a Gunicorn worker for the full proxy timeout window
try:
    response = httpx.get('https://external-api.com/data', timeout=10.0)
except httpx.TimeoutException:
    ...  # fail fast / degrade gracefully instead of hanging the worker
```

Prevention

- **Set timeouts at every layer** (ALB → Nginx → Gunicorn → DB → external HTTP) and ensure each layer's timeout is larger than the one below it
- **Offload slow work to background tasks** — any view taking > 3 seconds is a 504 risk under load; use django-tasks or Celery
- **Run `EXPLAIN ANALYZE` on slow Django queries** and add missing indexes before they cause production timeouts
- **Use APM or Sentry performance monitoring** to catch p99 latency regressions before they manifest as 504s at the Nginx boundary
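
As one way to apply the offloading rule above, a minimal Celery sketch (task and module names are hypothetical; assumes a configured Celery app and broker). The view enqueues the slow work and returns immediately instead of blocking a Gunicorn worker:

```python
# tasks.py -- hypothetical example; requires a configured Celery app/broker
from celery import shared_task

@shared_task
def generate_report(report_id):
    # the slow query or external API call moves here, off the request path
    ...

# In the view: enqueue and respond right away (e.g. with a 202)
# generate_report.delay(report_id)
```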
