FAILED_PRECONDITION — Stale State in Distributed System

Symptoms

- gRPC calls return `StatusCode.FAILED_PRECONDITION` with a message like "Order has already shipped — cannot cancel" or "Version mismatch"
- The error is non-deterministic: the same request sometimes succeeds and sometimes fails
- Concurrent operations on the same resource consistently trigger the error
- After a read-replica lag event, writes fail because the client read stale data from the replica and is now attempting an invalid state transition
- A background job modified the resource between the client's read (for example `GetOrder`) and its subsequent update (`UpdateOrder`)

Root Causes

- Client operating on stale data due to eventual consistency (read-replica lag)
- Optimistic concurrency control rejecting the update because another writer modified the resource since the client's last read
- State machine transition not valid from the resource's current state (e.g., cancelling an already-shipped order)
- Race condition between two concurrent requests modifying the same resource
- Missing or incorrect etag/version field in the update request causing the server to reject it as a precondition violation
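The optimistic-concurrency cause listed above can be reproduced without a server. The following is a minimal in-memory sketch, not any particular service's implementation; `VersionedStore`, `PreconditionFailed`, and the integer version counter are hypothetical names chosen for illustration.

```python
import threading


class PreconditionFailed(Exception):
    """Hypothetical stand-in for a precondition-style rejection from a server."""


class VersionedStore:
    """Toy key-value store that rejects writes carrying a stale version."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def read(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        # Compare-and-set: accept the write only if nobody else wrote first.
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise PreconditionFailed(
                    f'expected version {expected_version}, found {current_version}'
                )
            self._data[key] = (current_version + 1, value)
            return current_version + 1


store = VersionedStore()
store.write('order-42', 'PENDING', expected_version=0)

# Two clients read the same version of the order.
version_a, _ = store.read('order-42')
version_b, _ = store.read('order-42')

# The first write wins and bumps the version.
store.write('order-42', 'CANCELLED', expected_version=version_a)

# The second write still carries the old version, so it is rejected as stale.
try:
    store.write('order-42', 'SHIPPED', expected_version=version_b)
    second_write_rejected = False
except PreconditionFailed:
    second_write_rejected = True
```

Neither client did anything wrong in isolation; the second one simply acted on a version that was no longer current, which is the situation the etag check on the server is meant to catch.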

Diagnosis

**Step 1 — Read the error details field**

```python
import grpc
from google.rpc import error_details_pb2, status_pb2

try:
    stub.UpdateOrder(request)
except grpc.RpcError as e:
    if e.code() == grpc.StatusCode.FAILED_PRECONDITION:
        print(e.details())  # human-readable status message
        # Rich error details arrive as a serialized google.rpc.Status trailer
        for key, value in e.trailing_metadata():
            if key == 'grpc-status-details-bin':
                status = status_pb2.Status()
                status.ParseFromString(value)
                for detail in status.details:
                    failure = error_details_pb2.PreconditionFailure()
                    if detail.Unpack(failure):  # True only when the type matches
                        print(failure.violations)  # which precondition failed
```

**Step 2 — Identify the race condition in logs**

```bash
# Enable distributed tracing and find concurrent requests
# Look for two UpdateOrder requests for the same order_id within a short window
grep 'UpdateOrder.*order_id=42' app.log | head -20
```
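When the log is too large to eyeball, the same search can be scripted. This is a rough sketch that assumes each log line starts with an ISO-8601 timestamp and contains `UpdateOrder` and `order_id=<id>`; the line format, the one-second window, and the `find_close_updates` helper are all assumptions for illustration, so adjust the regular expression to your own log layout. The sample lines are made up.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed layout: "<iso-timestamp> ... UpdateOrder ... order_id=<id> ..."
LINE_RE = re.compile(r'^(\S+) .*UpdateOrder.*order_id=(\w+)')


def find_close_updates(lines, window=timedelta(seconds=1)):
    """Return (order_id, earlier, later) for updates spaced within `window`."""
    by_order = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.fromisoformat(match.group(1))
            by_order[match.group(2)].append(stamp)

    suspects = []
    for order_id, stamps in by_order.items():
        stamps.sort()
        # Compare each update with the next one for the same order.
        for earlier, later in zip(stamps, stamps[1:]):
            if later - earlier <= window:
                suspects.append((order_id, earlier, later))
    return suspects


# Made-up sample lines in the assumed format.
sample = [
    '2024-01-01T12:00:00.100 INFO UpdateOrder order_id=42 status=CANCELLED',
    '2024-01-01T12:00:00.350 INFO UpdateOrder order_id=42 status=SHIPPED',
    '2024-01-01T12:00:09.000 INFO UpdateOrder order_id=7 status=CANCELLED',
]
suspects = find_close_updates(sample)
```

Pairs reported here are only candidates; two updates close in time are not necessarily conflicting, so confirm against the trace or the etag values before concluding there was a race.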

**Step 3 — Check if optimistic locking is the cause**

```python
# In the request, include the etag/version from the last read
request = UpdateOrderRequest(
    order_id='42',
    status='CANCELLED',
    etag=order.etag,  # must match server's current etag
)
# If the server's etag differs, FAILED_PRECONDITION is correct behavior
```

Resolution

**Fix 1 — Retry with re-fetch (correct pattern for optimistic locking)**

```python
import grpc

MAX_RETRIES = 3

for attempt in range(MAX_RETRIES):
    # Always re-fetch to get the latest state and etag
    order = stub.GetOrder(GetOrderRequest(order_id='42'))

    if order.status != 'PENDING':
        raise ValueError(f'Cannot cancel order in state: {order.status}')

    try:
        stub.UpdateOrder(UpdateOrderRequest(
            order_id=order.id,
            status='CANCELLED',
            etag=order.etag,  # send back the etag from the read
        ))
        break  # success
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.FAILED_PRECONDITION and attempt < MAX_RETRIES - 1:
            continue  # retry with fresh data
        raise
```

**Fix 2 — Use ABORTED for retryable conflicts, FAILED_PRECONDITION for invalid transitions**

```python
# Server-side: distinguish retryable from non-retryable precondition failures
import grpc

if order.version != request.etag:
    # Concurrent modification: client should retry with new data
    context.abort(grpc.StatusCode.ABORTED, 'Concurrent modification — retry')

if order.status == 'SHIPPED':
    # Invalid state transition: retrying won't help
    context.abort(grpc.StatusCode.FAILED_PRECONDITION, 'Order already shipped')
```
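On the client, that split lets retry logic key off the status code alone. The sketch below illustrates the idea with stand-ins so it runs without a server: `Code` mirrors three `grpc.StatusCode` members, and `RpcFailure` and `call_with_retry` are hypothetical names, not part of the gRPC library. In real code you would catch `grpc.RpcError` and compare `e.code()` instead.

```python
import enum
import random
import time


class Code(enum.Enum):
    """Stand-in for the grpc.StatusCode members used in this guide."""
    OK = 0
    FAILED_PRECONDITION = 9
    ABORTED = 10


class RpcFailure(Exception):
    """Hypothetical stand-in for grpc.RpcError carrying a status code."""

    def __init__(self, code):
        super().__init__(code.name)
        self.code = code


def call_with_retry(operation, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run operation(); retry only on ABORTED, with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RpcFailure as err:
            last_attempt = attempt == max_attempts - 1
            if err.code is not Code.ABORTED or last_attempt:
                raise  # FAILED_PRECONDITION or retries exhausted: surface it
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))


# Demo: an operation that is aborted twice, then succeeds.
calls = {'n': 0}


def flaky_update():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RpcFailure(Code.ABORTED)
    return 'updated'


result = call_with_retry(flaky_update, sleep=lambda _: None)
```

The `operation` callable should perform the re-fetch and the update together, so every retry sends a fresh etag rather than replaying the stale one.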

**Fix 3 — Read from the primary database before writes**

```python
# For write operations, bypass the read replica.
# Django: route the read-before-write to the primary with .using();
# 'primary' must be a database alias defined in settings.DATABASES.
order = Order.objects.using('primary').get(id=order_id)
```

Prevention

- Return the resource's `etag` or `version` on every read — require clients to send it back on updates
- Distinguish ABORTED (retry-safe) from FAILED_PRECONDITION (don't retry) in your error responses so clients know whether to retry
- For write operations, read state from the primary database, not a read replica; replica lag is a common source of stale state
- Model your domain as explicit state machines and validate transitions server-side — invalid state transitions should always return FAILED_PRECONDITION
- Use distributed tracing (Jaeger, Zipkin) to detect and analyze concurrent requests that trigger race conditions
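The explicit state machine mentioned above can be as small as a transition table. Here is a minimal sketch using only the three order states that appear in this guide; the allowed edges, `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `validate_transition` are assumptions for illustration, and a real handler would call `context.abort(grpc.StatusCode.FAILED_PRECONDITION, ...)` where this raises.

```python
# Hypothetical order lifecycle; adjust states and edges to your own domain.
ALLOWED_TRANSITIONS = {
    'PENDING': {'SHIPPED', 'CANCELLED'},
    'SHIPPED': set(),      # terminal in this sketch
    'CANCELLED': set(),    # terminal in this sketch
}


class InvalidTransition(Exception):
    """Raised where a gRPC handler would abort with FAILED_PRECONDITION."""


def validate_transition(current, target):
    """Raise InvalidTransition unless current -> target is an allowed edge."""
    allowed = ALLOWED_TRANSITIONS.get(current)
    if allowed is None:
        raise ValueError(f'unknown state: {current}')
    if target not in allowed:
        raise InvalidTransition(f'cannot move order from {current} to {target}')


validate_transition('PENDING', 'CANCELLED')  # allowed, returns None

try:
    validate_transition('SHIPPED', 'CANCELLED')
    shipped_cancel_rejected = False
except InvalidTransition:
    shipped_cancel_rejected = True
```

Keeping the table in one place means every handler rejects the same transitions, and the error message can name both the current and the requested state.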
