gRPC UNAVAILABLE — Connection Failures
症状
- Client throws `StatusCode.UNAVAILABLE: Connection refused` or `failed to connect to all addresses`
- gRPC channel state shows `TRANSIENT_FAILURE` in channel debug output
- DNS resolution failure: `StatusCode.UNAVAILABLE: Resolver transient failure`
- TLS errors: `StatusCode.UNAVAILABLE: SSL_ERROR_SYSCALL` or `certificate verify failed`
- Load balancer returns 503 and gRPC client surfaces it as UNAVAILABLE
- gRPC channel state shows `TRANSIENT_FAILURE` in channel debug output
- DNS resolution failure: `StatusCode.UNAVAILABLE: Resolver transient failure`
- TLS errors: `StatusCode.UNAVAILABLE: SSL_ERROR_SYSCALL` or `certificate verify failed`
- Load balancer returns 503 and gRPC client surfaces it as UNAVAILABLE
根本原因
- gRPC server process is not running or is listening on a different port
- DNS name for the gRPC endpoint does not resolve or points to a wrong IP
- TLS certificate mismatch between client trust store and server certificate
- Load balancer health check configured for HTTP/1.1 instead of gRPC (HTTP/2)
- Kubernetes Service or Ingress not routing to the correct pod or port
诊断
**Step 1 — Verify the server is listening**
```bash
# On the server host
ss -tlnp | grep 50051
# or
netstat -tlnp | grep 50051
# Expected: server process listening on 0.0.0.0:50051 or :::50051
```
**Step 2 — Test connectivity with grpcurl**
```bash
# Install: https://github.com/fullstorydev/grpcurl
# Plain text
grpcurl -plaintext localhost:50051 list
# TLS (insecure for self-signed certs)
grpcurl -insecure myservice.example.com:443 list
# With CA cert
grpcurl -cacert ca.crt myservice.example.com:443 list
```
**Step 3 — Check DNS resolution**
```bash
dig myservice.example.com +short
# Must return an IP; empty = DNS failure
# In Kubernetes
kubectl exec -it debug-pod -- nslookup my-grpc-service.default.svc.cluster.local
```
**Step 4 — Inspect TLS certificate**
```bash
openssl s_client -connect myservice.example.com:443 -alpn h2 </dev/null 2>&1 \
| openssl x509 -noout -text | grep -E 'Subject|Issuer|Not After|DNS:'
# Verify CN/SAN matches the hostname the client is dialing
```
**Step 5 — Enable gRPC channel tracing**
```python
import os
os.environ['GRPC_VERBOSITY'] = 'DEBUG'
os.environ['GRPC_TRACE'] = 'all'
# Re-run your client to see detailed channel state transitions
```
```bash
# On the server host
ss -tlnp | grep 50051
# or
netstat -tlnp | grep 50051
# Expected: server process listening on 0.0.0.0:50051 or :::50051
```
**Step 2 — Test connectivity with grpcurl**
```bash
# Install: https://github.com/fullstorydev/grpcurl
# Plain text
grpcurl -plaintext localhost:50051 list
# TLS (insecure for self-signed certs)
grpcurl -insecure myservice.example.com:443 list
# With CA cert
grpcurl -cacert ca.crt myservice.example.com:443 list
```
**Step 3 — Check DNS resolution**
```bash
dig myservice.example.com +short
# Must return an IP; empty = DNS failure
# In Kubernetes
kubectl exec -it debug-pod -- nslookup my-grpc-service.default.svc.cluster.local
```
**Step 4 — Inspect TLS certificate**
```bash
openssl s_client -connect myservice.example.com:443 -alpn h2 </dev/null 2>&1 \
| openssl x509 -noout -text | grep -E 'Subject|Issuer|Not After|DNS:'
# Verify CN/SAN matches the hostname the client is dialing
```
**Step 5 — Enable gRPC channel tracing**
```python
import os
os.environ['GRPC_VERBOSITY'] = 'DEBUG'
os.environ['GRPC_TRACE'] = 'all'
# Re-run your client to see detailed channel state transitions
```
解决
**Fix 1 — Start the gRPC server**
```bash
# Check service status
sudo systemctl status my-grpc-server
sudo systemctl start my-grpc-server
```
**Fix 2 — Fix DNS / service discovery**
```yaml
# Kubernetes Service example
apiVersion: v1
kind: Service
metadata:
name: my-grpc-service
spec:
selector:
app: my-grpc-server
ports:
- port: 50051
targetPort: 50051
protocol: TCP
```
**Fix 3 — Fix TLS credential mismatch**
```python
import grpc
# Client: load matching CA certificate
with open('ca.crt', 'rb') as f:
ca_cert = f.read()
creds = grpc.ssl_channel_credentials(root_certificates=ca_cert)
channel = grpc.secure_channel('myservice.example.com:443', creds)
```
**Fix 4 — Configure load balancer for gRPC (HTTP/2)**
```nginx
# Nginx upstream for gRPC
upstream grpc_backend {
server localhost:50051;
}
server {
listen 443 ssl http2;
location / {
grpc_pass grpc://grpc_backend;
}
}
```
**Fix 5 — Add retry policy for transient failures**
```python
import json, grpc
service_config = json.dumps({
'methodConfig': [{
'name': [{'service': 'mypackage.MyService'}],
'retryPolicy': {
'maxAttempts': 4,
'initialBackoff': '0.1s',
'maxBackoff': '1s',
'backoffMultiplier': 2,
'retryableStatusCodes': ['UNAVAILABLE'],
},
}]
})
channel = grpc.insecure_channel(
'localhost:50051',
options=[('grpc.service_config', service_config)],
)
```
```bash
# Check service status
sudo systemctl status my-grpc-server
sudo systemctl start my-grpc-server
```
**Fix 2 — Fix DNS / service discovery**
```yaml
# Kubernetes Service example
apiVersion: v1
kind: Service
metadata:
name: my-grpc-service
spec:
selector:
app: my-grpc-server
ports:
- port: 50051
targetPort: 50051
protocol: TCP
```
**Fix 3 — Fix TLS credential mismatch**
```python
import grpc
# Client: load matching CA certificate
with open('ca.crt', 'rb') as f:
ca_cert = f.read()
creds = grpc.ssl_channel_credentials(root_certificates=ca_cert)
channel = grpc.secure_channel('myservice.example.com:443', creds)
```
**Fix 4 — Configure load balancer for gRPC (HTTP/2)**
```nginx
# Nginx upstream for gRPC
upstream grpc_backend {
server localhost:50051;
}
server {
listen 443 ssl http2;
location / {
grpc_pass grpc://grpc_backend;
}
}
```
**Fix 5 — Add retry policy for transient failures**
```python
import json, grpc
service_config = json.dumps({
'methodConfig': [{
'name': [{'service': 'mypackage.MyService'}],
'retryPolicy': {
'maxAttempts': 4,
'initialBackoff': '0.1s',
'maxBackoff': '1s',
'backoffMultiplier': 2,
'retryableStatusCodes': ['UNAVAILABLE'],
},
}]
})
channel = grpc.insecure_channel(
'localhost:50051',
options=[('grpc.service_config', service_config)],
)
```
预防
- Implement gRPC health checking protocol (grpc.health.v1) and wire it to your load balancer
- Use client-side retry policies with exponential backoff for UNAVAILABLE status codes
- Monitor channel connectivity state with channel.subscribe() or Prometheus gRPC metrics
- Test TLS configuration in CI using grpcurl before deploying to production
- In Kubernetes, use readiness probes with grpc health check to prevent traffic before server is ready
- Use client-side retry policies with exponential backoff for UNAVAILABLE status codes
- Monitor channel connectivity state with channel.subscribe() or Prometheus gRPC metrics
- Test TLS configuration in CI using grpcurl before deploying to production
- In Kubernetes, use readiness probes with grpc health check to prevent traffic before server is ready