408 Request Timeout on Slow API Calls

Symptoms

- Server closes the connection and returns 408 before the client finishes sending the request body
- `curl: (28) Operation timed out` or `requests.exceptions.ReadTimeout` in client logs
- Timeouts appear intermittently, correlating with large payloads or heavy server load
- Idle keep-alive connections are closed by the server, and the next request the client sends on them fails
- Nginx error logs show `upstream timed out (110: Connection timed out)` entries (Apache reports an equivalent proxy timeout)

Root Causes

- Server-side `client_header_timeout` or `client_body_timeout` set too low (Nginx default: 60 s)
- Slow client upload speed causing the request body to arrive after the timeout window
- Idle keep-alive connections being reused after the server has already closed them
- Upstream application server (Gunicorn, uWSGI) worker taking longer than the proxy timeout allows
- Network congestion or high latency between client and server inflating transfer time

Diagnosis

1. **Reproduce with curl and measure timing** to isolate which phase is timing out:
```bash
curl -v --max-time 30 \
  -w '\nConnect: %{time_connect}s\nFirst byte: %{time_starttransfer}s\nTotal: %{time_total}s\n' \
  -X POST https://api.example.com/upload \
  -H 'Content-Type: application/json' \
  -d @large_payload.json
```

2. **Check Nginx timeout settings** on the server:
```bash
grep -r 'timeout' /etc/nginx/sites-available/ /etc/nginx/nginx.conf
# Key directives: client_body_timeout, client_header_timeout,
# proxy_read_timeout, proxy_send_timeout
```

3. **Inspect Gunicorn worker timeout** in the systemd service or config:
```bash
systemctl cat gunicorn-myapp | grep -i timeout
# Default: --timeout 30 (kills workers taking longer than 30 s)
```

4. **Test with a keep-alive header** to see if reused connections are the culprit:
```bash
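curl -v https://api.example.com/ping https://api.example.com/ping
# curl reuses one connection for both URLs ("Re-using existing connection" in the -v output).
# If the second request fails with 408 or a reset, the server is closing keep-alive connections early.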
```
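curl sends its requests back to back, so it does not easily reproduce a connection that sits idle before being reused. The sketch below illustrates that failure mode in isolation, using only the Python standard library. It starts a throwaway local server that drops idle connections after a short timeout, then sends two requests on one connection with a pause in between. The names `make_handler` and `second_request_survives`, and the 0.3 s idle timeout, are assumptions made for this illustration rather than anything from a real deployment.

```python
import http.client
import http.server
import threading
import time


def make_handler(idle_timeout):
    """Build a handler that drops keep-alive connections idle for idle_timeout seconds."""

    class PingHandler(http.server.BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"  # enables keep-alive
        timeout = idle_timeout         # socket timeout while waiting for the next request

        def do_GET(self):
            body = b"pong"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the demo quiet

    return PingHandler


def second_request_survives(idle_seconds, server_idle_timeout=0.3):
    """Send two requests on one connection, pausing idle_seconds in between."""
    server = http.server.ThreadingHTTPServer(
        ("127.0.0.1", 0), make_handler(server_idle_timeout)
    )
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.request("GET", "/ping")
        conn.getresponse().read()     # first request opens the connection
        time.sleep(idle_seconds)      # let the connection sit idle
        conn.request("GET", "/ping")  # reuse the same socket
        conn.getresponse().read()
        return True
    except ConnectionError:
        return False                  # the server had already closed the socket
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for idle in (0.0, 1.0):
        print(f"idle {idle:.1f}s -> reuse ok: {second_request_survives(idle)}")
```

Against a real service, the same two-request pattern with a pause slightly longer than the server's keep-alive timeout should show whether stale reuse is the cause.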

5. **Profile the slow endpoint** to find the bottleneck:
```bash
# Assumes the app logs each request's duration in milliseconds as the last field of the line
grep -i 'duration' logs/app.log | awk '$NF > 5000' | tail -20
```
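When the duration is not the last field of the line, a small script is easier to adapt than the awk filter. The following is a minimal sketch that assumes log lines contain a token such as `duration=8421ms`. That format, the regular expression, and the function name `slow_requests` are hypothetical and would need adjusting to the application's real log layout.

```python
import re

# Hypothetical log format: "... duration=8421ms" or "... duration: 8421 ms"
DURATION_RE = re.compile(r"duration[=:]\s*(\d+(?:\.\d+)?)\s*ms", re.IGNORECASE)


def slow_requests(lines, threshold_ms=5000.0):
    """Return (duration_ms, line) pairs above the threshold, slowest first."""
    slow = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) > threshold_ms:
            slow.append((float(match.group(1)), line.rstrip("\n")))
    return sorted(slow, reverse=True)


if __name__ == "__main__":
    sample = [
        "POST /api/upload duration=8421ms",
        "GET /api/ping duration=12ms",
        "POST /api/upload duration=6100ms",
    ]
    for duration, line in slow_requests(sample):
        print(f"{duration:>8.0f} ms  {line}")
```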

Resolution

**1. Increase Nginx timeout directives** for slow endpoints:
```nginx
# /etc/nginx/sites-available/myapp
location /api/upload/ {
    proxy_pass http://127.0.0.1:8000;
    proxy_read_timeout 120s;  # was 60s
    proxy_send_timeout 120s;
    client_body_timeout 60s;
}
```

**2. Increase Gunicorn worker timeout** in the systemd service:
```ini
# /etc/systemd/system/gunicorn-myapp.service
[Service]
ExecStart=/var/www/myapp/.venv/bin/gunicorn \
    --workers 4 \
    --timeout 120 \
    --graceful-timeout 30 \
    config.wsgi:application
```
```bash
sudo systemctl daemon-reload && sudo systemctl restart gunicorn-myapp
```

**3. Disable keep-alive on the client side** if stale connections cause spurious 408s:
```python
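import httpx
# Disable connection reuse so every request opens a fresh connection
limits = httpx.Limits(max_keepalive_connections=0)
# retries=3 only retries failed connection attempts, not 408 responses
transport = httpx.HTTPTransport(retries=3, limits=limits)
client = httpx.Client(transport=transport, timeout=60.0)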
```
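A less drastic option than turning keep-alive off is to stop trusting a connection once it has been idle longer than the server is likely to keep it open. The sketch below shows the idea with the standard library's `http.client`. The class name `IdleAwareConnection`, the `max_idle` default, and the injectable `clock` parameter are assumptions for illustration. In practice `max_idle` would be set a little below the server's keep-alive timeout.

```python
import http.client
import time


class IdleAwareConnection:
    """Reconnect instead of reusing a connection that has sat idle too long."""

    def __init__(self, host, port, max_idle=5.0, timeout=10.0, clock=time.monotonic):
        # HTTPConnection does not open a socket until the first request
        self._conn = http.client.HTTPConnection(host, port, timeout=timeout)
        self._max_idle = max_idle
        self._clock = clock
        self._last_used = None
        self.reconnects = 0

    def is_stale(self):
        """True when the connection has been idle longer than max_idle seconds."""
        if self._last_used is None:
            return False
        return self._clock() - self._last_used > self._max_idle

    def get(self, path):
        if self.is_stale():
            self._conn.close()  # http.client reopens the socket on the next request
            self.reconnects += 1
        self._conn.request("GET", path)
        response = self._conn.getresponse()
        body = response.read()
        self._last_used = self._clock()
        return response.status, body
```

A caller would create one `IdleAwareConnection` per host and call `get()` as usual. The cost is an occasional extra handshake, in exchange for not sending a request on a socket the server may already have closed.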

**4. Move long-running work off the request path** to a background task:
```python
# views.py
from django.http import JsonResponse
from django_tasks import task


@task()
def process_upload(file_id: int) -> None:
    ...  # slow processing here


def upload_view(request):
    file_id = save_file(request)  # save_file: your own helper that stores the upload
    process_upload.enqueue(file_id)  # returns immediately
    return JsonResponse({'status': 'queued', 'id': file_id})
```

Prevention

- **Always set explicit timeouts on HTTP clients** rather than relying on the default, which in some libraries (`requests`, for example) is no timeout at all: `httpx.Client(timeout=httpx.Timeout(30.0, connect=5.0))`
- **Use connection pooling with health checks** so stale keep-alive connections are detected and replaced before they cause 408s
- **Offload anything over 5 seconds** to a background queue, since HTTP is not designed for long-running synchronous work
- **Monitor p95/p99 endpoint latency** in production and set timeout thresholds at 2–3x the p99 to provide headroom without masking real slowness
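The 2–3x guideline in the last item can be turned into a quick calculation. The sketch below uses Python's `statistics.quantiles` to derive p95 and p99 from a list of observed latencies and then scales the p99. The function name `suggest_timeout`, the default multiplier of 2.5, and the sample data are assumptions chosen for illustration. Real input would come from your monitoring system.

```python
import statistics


def suggest_timeout(latencies_s, multiplier=2.5):
    """Derive p95/p99 from latency samples (seconds) and scale p99 for headroom."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile, 98 the 99th
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    p95, p99 = cuts[94], cuts[98]
    return {"p95": p95, "p99": p99, "timeout": p99 * multiplier}


if __name__ == "__main__":
    # Made-up samples: mostly fast requests with a slow tail
    samples = [0.2] * 90 + [0.8] * 8 + [3.0, 4.5]
    result = suggest_timeout(samples)
    print(f"p95={result['p95']:.2f}s p99={result['p99']:.2f}s "
          f"suggested timeout={result['timeout']:.2f}s")
```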

Related Status Codes

- HTTP 502
- HTTP 504

Related Terms

- Keep-Alive
- Timeout
- Connection Pooling