1013 Try Again Later — Server Overloaded
Symptoms
- New WebSocket connections are immediately closed with code 1013 and reason 'Try Again Later'
- Existing connections continue working while new ones are rejected
- Server CPU or memory is at or near capacity
- Connection count in monitoring shows the server has hit its configured max concurrent connections
- Burst of reconnection attempts after an outage triggers a thundering herd, amplifying the overload
Root Causes
- Server's maximum WebSocket connection limit has been reached
- Memory usage is too high — server proactively closes connections to avoid OOM kill
- Event loop lag in Node.js or thread pool exhaustion in a Java server, rejecting new connections
- Thundering herd — many clients reconnecting simultaneously after an outage overload the server
- No connection backpressure or admission control mechanism to gracefully shed load
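The thundering-herd effect is easy to quantify: if every client retries at the same fixed delay, the full reconnect load lands in a single second, while adding random jitter spreads it out. A minimal simulation (hypothetical client counts and delays, not measurements from any real system):

```python
import random

def reconnect_times(n_clients, base_delay_s=1.0, jitter_s=0.0, seed=42):
    """Reconnect timestamps for n_clients after an outage ends at t=0."""
    rng = random.Random(seed)
    return [base_delay_s + rng.uniform(0, jitter_s) for _ in range(n_clients)]

def peak_per_second(times):
    """Max number of reconnect attempts landing in any one-second bucket."""
    buckets = {}
    for t in times:
        buckets[int(t)] = buckets.get(int(t), 0) + 1
    return max(buckets.values())

herd = peak_per_second(reconnect_times(10_000))                   # all clients retry at t=1.0
spread = peak_per_second(reconnect_times(10_000, jitter_s=30.0))  # retries spread over 30 s
print(herd, spread)
```

With no jitter the server absorbs all 10,000 handshakes in one second; with 30 seconds of jitter the per-second peak drops to a few hundred.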
Diagnosis
**Step 1 — Count active WebSocket connections**
```bash
# Count established WebSocket connections on the server:
ss -s
# Or a per-process breakdown on port 8001 (ss -t lists established sockets by default):
ss -tp | grep ':8001' | wc -l
# For a Node.js `ws` server, check the client set size in application code:
#   console.log('Active connections:', wss.clients.size); // wss.clients is a Set
```
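If you need the count programmatically (for a metrics exporter, say), `/proc/net/tcp` can be parsed directly on Linux. A sketch, assuming the standard `/proc/net/tcp` column layout where state `01` is ESTABLISHED and the local port is hex-encoded:

```python
def count_established(proc_net_tcp, port):
    """Count ESTABLISHED sockets (state 01) bound locally to `port`,
    given text in /proc/net/tcp format."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state = fields[1], fields[3]
        if state == "01" and int(local_addr.split(":")[1], 16) == port:
            count += 1
    return count

# On a live Linux box:
# with open("/proc/net/tcp") as f:
#     print(count_established(f.read(), 8001))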
**Step 2 — Check server resource usage**
```bash
# Memory:
free -h
grep VmRSS /proc/"$(pgrep -nf gunicorn)"/status   # -n: newest matching PID
# CPU and memory per process:
ps aux --sort=-%mem | head -10
# File descriptor limits (each WebSocket = 1 fd):
grep 'open files' /proc/"$(pgrep -nf yourapp)"/limits
ulimit -n  # fd limit of the current shell, not necessarily of the service
```
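Since every WebSocket holds a file descriptor, fd headroom is a direct capacity signal. A minimal in-process check, assuming Linux (`/proc/self/fd`) and the stdlib `resource` module:

```python
import os
import resource

def fd_headroom():
    """Current fd usage vs the soft RLIMIT_NOFILE (Linux-only: /proc/self/fd)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = len(os.listdir("/proc/self/fd"))
    return used, soft

used, soft = fd_headroom()
print(f"{used}/{soft} fds in use ({used / soft:.1%})")
```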
**Step 3 — Observe the thundering herd pattern**
```bash
# Watch connection rate per second after an outage:
watch -n 1 'ss -s | grep estab'
# Check if all reconnects hit within the same 1-second window:
# In application logs, count connection events per second
awk '/WebSocket connected/{t=substr($1,1,19); count[t]++} END{for(k in count) print k, count[k]}' \
/var/log/myapp/app.log | sort | tail -20
```
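The same per-second bucketing as the awk one-liner above, in Python, which is easier to extend with thresholds or plotting. It assumes log lines start with an ISO-8601 timestamp whose first 19 characters identify the second (the sample lines are illustrative, not from a real log):

```python
from collections import Counter

def connects_per_second(lines, needle="WebSocket connected"):
    """Bucket matching log lines by the first 19 chars of their timestamp
    (YYYY-MM-DDTHH:MM:SS), i.e. one bucket per second."""
    counts = Counter()
    for line in lines:
        if needle in line:
            counts[line[:19]] += 1
    return counts

log = [
    "2024-05-01T10:00:00.123 WebSocket connected client=a",
    "2024-05-01T10:00:00.456 WebSocket connected client=b",
    "2024-05-01T10:00:01.001 WebSocket connected client=c",
    "2024-05-01T10:00:01.200 heartbeat",
]
print(connects_per_second(log).most_common())
```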
**Step 4 — Identify the configured connection limit**
```javascript
// The `ws` package has no maxConnections option — the cap usually lives on
// the underlying net/http server, or in a manual check in application code:
server.maxConnections = 10000;           // net.Server-level cap — find this value
wss.on('connection', (ws) => {
  if (wss.clients.size > MAX_CLIENTS) {  // or an app-level check like this
    ws.close(1013, 'Try Again Later');
  }
});
```
```python
# Django Channels — check channel layer capacity:
CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            'capacity': 1000,  # messages per channel before backpressure
        },
    }
}
```
**Step 5 — Check Node.js event loop lag**
```javascript
// Measure event loop lag (histogram values are in nanoseconds):
const { monitorEventLoopDelay } = require('perf_hooks');
const h = monitorEventLoopDelay({ resolution: 10 });
h.enable();
setInterval(() => {
  console.log('Event loop lag P99:', h.percentile(99) / 1e6, 'ms');
  h.reset();
}, 5000);
// > 100 ms lag = event loop is saturated
```
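The same idea works for Python asyncio servers: schedule short sleeps and measure how late they fire; the lateness approximates event-loop lag. A sketch (thresholds are illustrative):

```python
import asyncio
import time

async def measure_worst_lag(samples=20, interval=0.01):
    """Schedule short sleeps and record how late each one fires;
    the lateness approximates event-loop lag."""
    worst = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        worst = max(worst, time.perf_counter() - start - interval)
    return worst

worst = asyncio.run(measure_worst_lag())
print(f"worst observed lag: {worst * 1000:.2f} ms")
```

On an idle loop this should report low single-digit milliseconds; sustained readings above ~100 ms mean handlers are blocking the loop.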
Solutions
**Fix 1 — Client-side: exponential backoff on 1013**
```javascript
function connectWithBackoff(url) {
  let attempt = 0;
  const MAX_DELAY = 60_000;
  function connect() {
    const ws = new WebSocket(url);
    ws.onopen = () => { attempt = 0; };
    ws.onclose = (e) => {
      if (e.code === 1013) {
        // Server asked us to try later — back off exponentially with jitter
        const jitter = Math.random() * 1000;
        const delay = Math.min(1000 * 2 ** attempt + jitter, MAX_DELAY);
        attempt++;
        console.log(`Server overloaded, retrying in ${Math.round(delay)}ms`);
        setTimeout(connect, delay);
      } else {
        attempt = 0; // non-1013 closes reset the counter; reconnect per your own policy
      }
    };
  }
  connect();
}
```
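The delay schedule this produces is worth sanity-checking: each attempt doubles the base delay, adds up to one second of jitter, and clamps at the cap. A quick sketch of the same formula (parameter values copied from the client code above):

```python
import random

def backoff_delays(attempts, base_ms=1_000, cap_ms=60_000, seed=7):
    """Exponential backoff with up to 1 s of random jitter, capped at cap_ms."""
    rng = random.Random(seed)
    return [min(base_ms * 2 ** a + rng.random() * 1_000, cap_ms) for a in range(attempts)]

print([round(d) for d in backoff_delays(8)])
```

Roughly 1 s, 2 s, 4 s, 8 s, 16 s, 32 s, then pinned at the 60 s cap.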
**Fix 2 — Server-side: implement connection admission control**
```python
# Django Channels — limit concurrent connections. Note: this counter is
# per-process; for a cluster-wide limit, keep the count in a shared store
# such as Redis.
from channels.generic.websocket import AsyncWebsocketConsumer

MAX_CONNECTIONS = 5000
_active_connections = 0

class AdmissionConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        global _active_connections
        if _active_connections >= MAX_CONNECTIONS:
            await self.close(code=1013)
            return
        _active_connections += 1
        await self.accept()

    async def disconnect(self, close_code: int) -> None:
        global _active_connections
        _active_connections = max(0, _active_connections - 1)
```
**Fix 3 — Scale horizontally with sticky sessions**
```nginx
# Nginx upstream with WebSocket-aware load balancing:
upstream websocket_backends {
    ip_hash;  # sticky sessions: a given client IP always hits the same backend
    server backend1.example.com:8001;
    server backend2.example.com:8001;
    server backend3.example.com:8001;
}
# Or use least_conn for more even distribution (an alternative, not an addition —
# upstream names must be unique):
upstream websocket_backends {
    least_conn;
    server backend1.example.com:8001;
    server backend2.example.com:8001;
}
```
**Fix 4 — Increase OS file descriptor limits**
```bash
# /etc/security/limits.conf — raise the fd limit for the app user
# (applies to PAM/login sessions; requires root, hence tee):
echo 'www-data soft nofile 65536' | sudo tee -a /etc/security/limits.conf
echo 'www-data hard nofile 65536' | sudo tee -a /etc/security/limits.conf
# systemd services ignore limits.conf — use a service override instead:
sudo systemctl edit gunicorn-myapp
# Add:
#   [Service]
#   LimitNOFILE=65536
sudo systemctl daemon-reload && sudo systemctl restart gunicorn-myapp
```
Prevention
- Implement client-side exponential backoff with jitter for all WebSocket reconnect logic
- Set a maximum connection limit on the server and monitor it against capacity headroom
- Use horizontal scaling with a shared pub/sub backend (Redis, NATS) so connections are distributed across multiple server processes
- Add connection rate limiting at the load balancer level to prevent thundering herd after outages
- Alert when active connection count exceeds 70% of the server's capacity ceiling
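The 70% alert threshold above can be encoded directly in a health check or metrics exporter. A minimal sketch (the function name, message format, and example numbers are illustrative; wire the return value into your alerting system):

```python
def connection_alert(active, max_connections, warn_ratio=0.7):
    """Return a warning string once utilization crosses warn_ratio, else None."""
    ratio = active / max_connections
    if ratio >= warn_ratio:
        return f"WARN: {active}/{max_connections} connections ({ratio:.0%}) - shed load or scale out"
    return None

print(connection_alert(3600, 5000))  # 72% of capacity: warning fires
print(connection_alert(1000, 5000))  # 20% of capacity: no alert
```

Alerting well below the hard ceiling leaves room to scale out before the server starts answering new handshakes with 1013.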