
Symptoms

- gRPC calls return `StatusCode.FAILED_PRECONDITION` with a message like "Order has already shipped — cannot cancel" or "Version mismatch"
- The error is non-deterministic: the same request sometimes succeeds and sometimes fails
- Concurrent operations on the same resource consistently trigger the error
- After a read-replica lag event, write operations fail because the client read stale data from the replica and is now making an invalid state transition
- A background job modified a resource between the client's GET and subsequent PUT

Root Causes

- Client operating on stale data due to eventual consistency (read-replica lag)
- Optimistic concurrency control rejecting the update because another writer modified the resource since the client's last read
- State machine transition not valid from the resource's current state (e.g., cancelling an already-shipped order)
- Race condition between two concurrent requests modifying the same resource
- Missing or incorrect etag/version field in the update request, causing the server to reject it as a precondition violation
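
To make the optimistic-concurrency cause concrete, here is a minimal in-memory sketch. The `OrderStore`, `Order`, and `VersionMismatch` names are hypothetical and stand in for a real service, which would typically compare a database row version or etag instead. Two clients read the same snapshot, and the second write is rejected because the first one already bumped the version.

```python
from dataclasses import dataclass


class VersionMismatch(Exception):
    """Raised when the caller's version differs from the stored one."""


@dataclass
class Order:
    status: str
    version: int = 1


class OrderStore:
    """Hypothetical in-memory store used only for illustration."""

    def __init__(self):
        self._orders = {}

    def put(self, order_id, order):
        self._orders[order_id] = order

    def get(self, order_id):
        stored = self._orders[order_id]
        # Return a copy so the caller holds a snapshot, as a remote client would.
        return Order(stored.status, stored.version)

    def update(self, order_id, new_status, expected_version):
        current = self._orders[order_id]
        if current.version != expected_version:
            raise VersionMismatch(
                f"expected version {expected_version}, found {current.version}"
            )
        current.status = new_status
        current.version += 1


store = OrderStore()
store.put("42", Order(status="PENDING"))

# Two clients read the same snapshot (version 1).
snapshot_a = store.get("42")
snapshot_b = store.get("42")

# Client A writes first and bumps the version to 2.
store.update("42", "CANCELLED", expected_version=snapshot_a.version)

# Client B still holds version 1, so its write is rejected.
try:
    store.update("42", "SHIPPED", expected_version=snapshot_b.version)
    conflict = None
except VersionMismatch as exc:
    conflict = str(exc)
```

In a gRPC service, the `VersionMismatch` branch is where the server would abort the call with a status code rather than raise a local exception.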

Diagnosis

**Step 1 — Read the error details field**

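```python
import grpc
from google.rpc import status_pb2, error_details_pb2

# `stub` and `request` are assumed to exist already.
try:
    stub.UpdateOrder(request)
except grpc.RpcError as e:
    if e.code() == grpc.StatusCode.FAILED_PRECONDITION:
        # Rich error details arrive as a serialized google.rpc.Status
        # in the 'grpc-status-details-bin' trailing metadata entry.
        for key, value in e.trailing_metadata():
            if key == 'grpc-status-details-bin':
                status = status_pb2.Status()
                status.ParseFromString(value)
                print(status)  # includes which precondition failed
                # If the server attached a PreconditionFailure, unpack it.
                for detail in status.details:
                    failure = error_details_pb2.PreconditionFailure()
                    if detail.Unpack(failure):
                        print(failure.violations)
```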

**Step 2 — Identify the race condition in logs**

```bash
# Enable distributed tracing and find concurrent requests
# Look for two UpdateOrder requests for the same order_id within a short window
grep 'UpdateOrder.*order_id=42' app.log | head -20
```
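
When the log is too large to eyeball, the same search can be scripted. The sketch below assumes each line starts with an ISO 8601 timestamp and contains `UpdateOrder` and `order_id=<id>`, and that lines are in chronological order; that log shape, the sample lines, and the `find_overlapping_updates` helper are all hypothetical and need adjusting to your own format.

```python
import re
from datetime import datetime, timedelta

# Assumed log line shape: "<ISO timestamp> ... UpdateOrder ... order_id=<id>"
LINE_RE = re.compile(r"^(\S+)\s.*UpdateOrder.*order_id=(\w+)")


def find_overlapping_updates(lines, window=timedelta(milliseconds=500)):
    """Return (order_id, earlier, later) for consecutive updates to the
    same order that arrive within `window` of each other."""
    last_seen = {}
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        ts = datetime.fromisoformat(match.group(1))
        order_id = match.group(2)
        previous = last_seen.get(order_id)
        if previous is not None and ts - previous <= window:
            hits.append((order_id, previous, ts))
        last_seen[order_id] = ts
    return hits


# Made-up sample lines in the assumed format.
sample = [
    "2024-05-01T12:00:00.100 INFO UpdateOrder order_id=42 status=CANCELLED",
    "2024-05-01T12:00:00.250 INFO UpdateOrder order_id=42 status=SHIPPED",
    "2024-05-01T12:00:05.000 INFO UpdateOrder order_id=42 status=CANCELLED",
    "2024-05-01T12:00:05.100 INFO GetOrder order_id=42",
]
hits = find_overlapping_updates(sample)
```

Only the first two sample lines fall inside the 500 ms window, so they are the pair worth correlating with the failing request.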

**Step 3 — Check if optimistic locking is the cause**

```python
# In the request, include the etag/version from the last read
request = UpdateOrderRequest(
    order_id='42',
    status='CANCELLED',
    etag=order.etag,  # must match server's current etag
)
# If the server's etag differs, FAILED_PRECONDITION is correct behavior
```

Resolution

**Fix 1 — Retry with re-fetch (correct pattern for optimistic locking)**

```python
import grpc

MAX_RETRIES = 3

for attempt in range(MAX_RETRIES):
    # Always re-fetch to get the latest state and etag
    order = stub.GetOrder(GetOrderRequest(order_id='42'))

    if order.status != 'PENDING':
        raise ValueError(f'Cannot cancel order in state: {order.status}')

    try:
        stub.UpdateOrder(UpdateOrderRequest(
            order_id=order.id,
            status='CANCELLED',
            etag=order.etag,  # send back the etag from the read
        ))
        break  # success
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.FAILED_PRECONDITION and attempt < MAX_RETRIES - 1:
            continue  # retry with fresh data
        raise
```
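
Under heavy contention, retrying immediately can make several clients collide again. One common mitigation is a short randomized delay between attempts. The helper below is a minimal sketch of that idea, not part of any gRPC API: `retry_on_conflict` and `ConflictError` are hypothetical names, and `ConflictError` stands in for whatever retryable status your client maps to an exception. The `operation` callable is expected to re-fetch the resource on every call.

```python
import random
import time


class ConflictError(Exception):
    """Stand-in for a retryable conflict such as a version mismatch."""


def retry_on_conflict(operation, max_retries=3, base_delay=0.05, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_retries` attempts are used."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ConflictError:
            if attempt == max_retries - 1:
                raise
            # Full jitter: wait a random time up to base_delay * 2**attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))


calls = []
delays = []


def flaky_cancel():
    """Simulated operation that conflicts twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConflictError("version mismatch")
    return "CANCELLED"


# Record the delays instead of sleeping so the example runs instantly.
result = retry_on_conflict(flaky_cancel, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper easy to test; production code would leave it at the default `time.sleep`.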

**Fix 2 — Use ABORTED for retryable conflicts, FAILED_PRECONDITION for invalid transitions**

```python
# Server-side: distinguish retryable from non-retryable precondition failures.
# This runs inside a servicer method where `order`, `request`, and `context` exist.
import grpc

if order.version != request.etag:
    # Concurrent modification: client should retry with new data
    context.abort(grpc.StatusCode.ABORTED, 'Concurrent modification, retry')

if order.status == 'SHIPPED':
    # Invalid state transition: retrying won't help
    context.abort(grpc.StatusCode.FAILED_PRECONDITION, 'Order already shipped')
```

**Fix 3 — Read from the primary database before writes**

```python
# For write operations, bypass the read replica.
# Django: route the read-before-write to the primary with .using();
# 'primary' is assumed to be an alias defined in settings.DATABASES.
order = Order.objects.using('primary').get(id=order_id)
```

Prevention

- Return the resource's `etag` or `version` on every read — require clients to send it back on updates
- Distinguish ABORTED (retry-safe) from FAILED_PRECONDITION (don't retry) in your error responses so clients know whether to retry
- For write operations, read state from the primary database, not a read replica. Replica lag is a common source of stale state
- Model your domain as explicit state machines and validate transitions server-side — invalid state transitions should always return FAILED_PRECONDITION
- Use distributed tracing (Jaeger, Zipkin) to detect and analyze concurrent requests that trigger race conditions
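
The state-machine recommendation above can be sketched as a transition table that the server checks before applying any update. The states and edges below are illustrative assumptions, not the lifecycle of any particular system, and `validate_transition` and `InvalidTransition` are hypothetical names.

```python
# Hypothetical order lifecycle; the states and edges are illustrative only.
ALLOWED_TRANSITIONS = {
    "PENDING": {"CONFIRMED", "CANCELLED"},
    "CONFIRMED": {"SHIPPED", "CANCELLED"},
    "SHIPPED": {"DELIVERED"},
    "DELIVERED": set(),
    "CANCELLED": set(),
}


class InvalidTransition(Exception):
    """Raised when the requested state change is not in the table."""


def validate_transition(current, target):
    """Raise InvalidTransition unless `current -> target` is allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"cannot move from {current} to {target}")


# A pending order may be cancelled.
validate_transition("PENDING", "CANCELLED")

# A shipped order may not.
try:
    validate_transition("SHIPPED", "CANCELLED")
    rejected = None
except InvalidTransition as exc:
    rejected = str(exc)
```

In a servicer, the `InvalidTransition` branch is the natural place to return FAILED_PRECONDITION, since retrying the same request cannot succeed.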

Related Status Codes

- gRPC 6 (ALREADY_EXISTS)
- gRPC 10 (ABORTED)
- gRPC 11 (OUT_OF_RANGE)

Related Terms

- Idempotency
- Conditional Request
- ETag (Entity Tag)
- Retry Strategy
- Circuit Breaker
- gRPC Streaming

FAILED_PRECONDITION — Stale State in Distributed System