Intermediate 10 min SIP 503

503 Service Unavailable — SIP Trunk Overloaded

Symptoms

- Outbound calls fail immediately with SIP 503 Service Unavailable during peak hours while lower-volume periods succeed normally
- Some calls go through while others receive 503, which indicates the trunk is at or near its concurrent channel limit rather than completely down
- SIP provider dashboard or CDR reports show the trunk at 100% capacity when failures occur
- sngrep or a SIP trace shows the 503 arriving from the SIP provider's proxy IP within milliseconds of the INVITE, which suggests the request was rejected before any call processing took place
- Retry-After header may be present in the 503 response, indicating when the server expects to be available again

Root Causes

- SIP trunk concurrent channel limit exceeded: the provider's contract allows N simultaneous calls, and all channels are occupied when the new INVITE arrives
- The SIP provider is experiencing an outage or scheduled maintenance and returns 503 to all new requests during the window
- The DNS SRV record for the SIP trunk resolves to a server that is down, and the client does not fail over to secondary SRV targets because it does not implement SRV priority/weight selection correctly
- The SIP proxy is overloaded and shedding load: it returns 503 with Retry-After to protect downstream servers from being overwhelmed
- Network congestion or packet loss drops SIP UDP packets (for example a BYE) before they reach the provider, leaving stale call sessions that count against the channel limit
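The SRV failover cause above is easier to reason about with a concrete ordering routine. The following is a minimal sketch of priority/weight selection in the spirit of RFC 2782, not a full implementation: `SrvRecord` and `order_srv_targets` are hypothetical names, the hostnames are placeholders, and zero-weight records are simplified so that they are only picked once weighted records in the same priority group are exhausted.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_srv_targets(records: List[SrvRecord],
                      rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Return SRV records in the order a client should try them."""
    rng = rng or random.Random()
    ordered: List[SrvRecord] = []
    # Lower priority value means "try first".
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) > 0:
                # Weighted random pick among the remaining records.
                pick = rng.choices(group, weights=weights, k=1)[0]
            else:
                # All remaining weights are zero: pick uniformly.
                pick = rng.choice(group)
            ordered.append(pick)
            group.remove(pick)
    return ordered


if __name__ == "__main__":
    records = [
        SrvRecord(20, 0, 5060, "sip2.provider.example.com"),
        SrvRecord(10, 60, 5060, "sip1a.provider.example.com"),
        SrvRecord(10, 40, 5060, "sip1b.provider.example.com"),
    ]
    for rec in order_srv_targets(records):
        print(rec.priority, rec.weight, rec.target)
```

A client that receives a 503 or a timeout from the first target would move on to the next entry in the returned list instead of retrying the same server.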

Diagnosis

**Step 1: Confirm it's a capacity issue, not an outage**
```bash
# Check current active calls on your PBX (Asterisk example)
asterisk -rx 'core show channels count'
# Compare with your trunk's contracted concurrent call limit
# If at or near limit → capacity issue
# If well below limit → provider outage or configuration issue
```
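The comparison in Step 1 can be scripted. This sketch assumes the CLI output contains a line of the form `N active calls`, as `core show channels count` typically prints; `active_calls`, `classify_503` and the 90% threshold are illustrative choices, not part of Asterisk. Note that the count covers every call on the PBX, so it overstates trunk usage if internal calls are common.

```python
import re


def active_calls(cli_output: str) -> int:
    """Extract the active call count from 'core show channels count' output."""
    match = re.search(r"(\d+) active calls?", cli_output)
    if match is None:
        raise ValueError("no 'active calls' line found in CLI output")
    return int(match.group(1))


def classify_503(cli_output: str, trunk_limit: int, near_ratio: float = 0.9) -> str:
    """Return 'capacity' when usage is at or near the contracted limit."""
    calls = active_calls(cli_output)
    if calls >= trunk_limit * near_ratio:
        return "capacity"
    return "provider-or-config"


if __name__ == "__main__":
    sample = "12 active channels\n10 active calls\n534 calls processed\n"
    print(classify_503(sample, trunk_limit=10))
```

In practice the `cli_output` string would come from running the `asterisk -rx` command above, for example through `subprocess.run`.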

**Step 2: Read the 503 response headers**
```bash
sngrep -d eth0 port 5060
# Look for Retry-After header in the 503 response
# Retry-After: 60 → server will accept again in 60 seconds
# No Retry-After → could be a permanent config or outage issue
```

**Step 3: Verify DNS SRV resolution for the SIP trunk**
```bash
# SIP providers typically publish SRV records
dig SRV _sip._udp.provider.example.com
dig SRV _sip._tcp.provider.example.com
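# Confirm all SRV targets resolve and are reachable (substitute the targets dig returned)
# nc -u cannot detect a dead UDP host (no reply still counts as success),
# so probe TCP 5060 instead, assuming the provider also accepts SIP over TCP
for host in sip1.provider.example.com sip2.provider.example.com; do
  echo -n "$host: "; nc -z -w 3 "$host" 5060 && echo OK || echo FAIL
done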
```

**Step 4: Check provider status page**

Most SIP providers have a status page or support channel. If the 503 started at a fixed time and affects all calls, it is likely a provider outage rather than a capacity issue.
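The outage-versus-capacity distinction can also be estimated from your own recent call attempts. The sketch below is a rough heuristic under stated assumptions: `diagnose_503_pattern` is a hypothetical helper, the input is a list of final SIP status codes from a recent time window (for example pulled from CDRs), and the 95% threshold is arbitrary.

```python
from typing import Iterable


def diagnose_503_pattern(status_codes: Iterable[int],
                         outage_ratio: float = 0.95) -> str:
    """Classify a window of final SIP status codes for outbound attempts."""
    codes = list(status_codes)
    if not codes:
        return "no-data"
    rejected = sum(1 for code in codes if code == 503)
    ratio = rejected / len(codes)
    if ratio == 0:
        return "healthy"
    if ratio >= outage_ratio:
        # Nearly every attempt is rejected: points to a provider outage.
        return "likely-outage"
    # A mix of successes and 503s: points to a full trunk.
    return "likely-capacity"


if __name__ == "__main__":
    print(diagnose_503_pattern([200, 503, 200, 200, 503]))
    print(diagnose_503_pattern([503] * 20))
```

Treat the result as a hint to confirm against the provider's status page, not as a definitive diagnosis.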

**Step 5: Review your call routing config for zombie sessions**
```bash
# In Asterisk, look for calls stuck in non-up states
asterisk -rx 'core show channels verbose'
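# Hang up a stuck channel, using the exact channel name from the output above
# (PJSIP trunks appear as PJSIP/<endpoint>-<id>; chan_sip trunks as SIP/<peer>-<id>)
asterisk -rx 'channel request hangup PJSIP/primary_trunk-00000001'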
```

Solution

**Fix 1: Increase trunk capacity with your SIP provider**

Contact your SIP provider to increase the concurrent channel limit. Most providers allow on-demand scaling. As a short-term workaround, shorten call timeouts to free channels faster.

**Fix 2: Configure SIP trunk failover to a secondary provider**
```ini
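; Asterisk pjsip.conf: define two trunk endpoints
; (the referenced auth and aor sections must be defined separately)
[primary_trunk]
type=endpoint
outbound_auth=primary_auth
aors=primary_aor

[secondary_trunk]
type=endpoint
outbound_auth=secondary_auth
aors=secondary_aor

; extensions.conf: try primary, fall back to secondary when the trunk rejects the call
; Asterisk typically reports a SIP 503 as DIALSTATUS=CONGESTION (hangup cause 34); verify on your version
exten => _X.,1,Dial(PJSIP/${EXTEN}@primary_trunk)
same => n,GotoIf($["${DIALSTATUS}" = "CONGESTION" | "${DIALSTATUS}" = "CHANUNAVAIL"]?failover)
same => n,Hangup()
same => n(failover),Dial(PJSIP/${EXTEN}@secondary_trunk)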
```

**Fix 3: Implement call queuing instead of hard rejection**
```ini
; Asterisk queues.conf: queue callers when all agents/trunks are busy
[outbound-queue]
strategy=ringall
maxlen=20
retry=5
timeout=30
```

**Fix 4: Honor Retry-After and implement exponential backoff**
```python
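# In a SIP application layer (response.headers is assumed to be dict-like)
import re
import time

def handle_503(response, retry_call, attempt=0, max_attempts=5):
    if attempt >= max_attempts:
        raise RuntimeError("trunk still returning 503; giving up")
    # SIP allows values such as "120;duration=3600" or "120 (comment)"
    match = re.match(r"\s*(\d+)", response.headers.get("Retry-After", ""))
    # Honor the server's delay if given, else back off exponentially (capped at 60 s)
    delay = int(match.group(1)) if match else min(5 * 2 ** attempt, 60)
    time.sleep(delay)
    return retry_call()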
```

Prevention

- **Monitor concurrent call counts** in real time and alert when usage exceeds 80% of the trunk capacity limit
- **Provision redundant SIP trunks** from separate providers so that 503 from one provider triggers automatic failover to the second
- **Implement call admission control (CAC)** in the PBX to prevent exceeding the trunk limit before sending INVITEs that will be rejected
- **Use TCP or TLS transport** instead of UDP for the SIP trunk — TCP ensures retransmission and avoids packet-loss-caused ghost sessions
- **Test failover regularly** by simulating a 503 response in a staging environment to verify your dial plan handles it correctly
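The monitoring and call admission control items above can be combined in a small gatekeeper that sits in front of the trunk. This is a minimal sketch assuming an application layer that places calls itself; `TrunkAdmission` and its methods are hypothetical names, and the 80% alert ratio mirrors the threshold suggested in the list.

```python
import threading


class TrunkAdmission:
    """Track concurrent calls and refuse new ones at the trunk limit."""

    def __init__(self, limit: int, alert_ratio: float = 0.8) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self.alert_ratio = alert_ratio
        self._active = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Reserve a channel; return False instead of sending a doomed INVITE."""
        with self._lock:
            if self._active >= self.limit:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Free a channel when the call ends or setup fails."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    def should_alert(self) -> bool:
        """True once usage exceeds the alert ratio of the trunk limit."""
        with self._lock:
            return self._active > self.limit * self.alert_ratio


if __name__ == "__main__":
    cac = TrunkAdmission(limit=10)
    admitted = [cac.try_admit() for _ in range(11)]
    print(admitted.count(True), "admitted,", admitted.count(False), "rejected")
    print("alert:", cac.should_alert())
```

A rejected `try_admit()` is the point at which the caller could be queued or routed to a secondary trunk rather than receiving a 503 from the provider.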
