The Problem with Distributed Transactions
In a monolith with a single database, atomicity is free — wrap operations in a transaction and the database guarantees all-or-nothing. In microservices, each service owns its own database. Spanning a transaction across service boundaries requires a distributed protocol.
The classic solution is Two-Phase Commit (2PC):
Phase 1 (Prepare):
Coordinator → OrderService: 'Prepare to commit order creation'
Coordinator → InventoryService: 'Prepare to commit stock deduction'
Coordinator → PaymentService: 'Prepare to commit charge'
Phase 2 (Commit or Abort):
All services replied 'ready' → Coordinator sends 'Commit'
Any service replied 'abort' → Coordinator sends 'Rollback'
2PC has fundamental problems in microservice environments:
- Availability: all participating services must be up simultaneously. If Payment is down, you cannot create any order.
- Latency: two network round trips before any data is committed.
- Lock contention: resources are locked during the prepare phase, blocking concurrent operations.
- Coordinator SPOF: if the coordinator crashes after Phase 1, participants block indefinitely waiting for Phase 2.
The Saga pattern abandons distributed atomicity entirely. Instead of one atomic transaction, a saga is a sequence of local transactions with compensating transactions for rollback.
Choreography vs Orchestration
There are two ways to coordinate a saga:
Choreography — Event-Driven
Each service listens for events and publishes its own events. No central coordinator. Services react to each other:
OrderService InventoryService PaymentService
| | |
[Create Order] | |
PUBLISH OrderCreated ---->| |
[Reserve Stock] |
PUBLISH StockReserved ----->|
[Charge Card]
PUBLISH PaymentSuccess
|<------------------- PaymentSuccess ----------|
[Confirm Order] | |
On failure:
[Charge Card — FAILED]
PUBLISH PaymentFailed
[Receive PaymentFailed]
[Release Stock reservation]
PUBLISH StockReleased
|<--------------- StockReleased -----------|
[Cancel Order] |
Choreography pros: no single point of failure, each service is independent, easy to add new participants by subscribing to events.
Choreography cons: difficult to understand the overall flow, no single place to monitor saga state, cyclic dependencies can emerge.
Orchestration — Central Coordinator
A saga orchestrator manages the sequence explicitly. It knows the entire workflow and sends commands to each service:
# Simplified saga orchestrator
from enum import Enum
from dataclasses import dataclass
class SagaState(Enum):
STARTED = 'started'
STOCK_RESERVED = 'stock_reserved'
PAYMENT_TAKEN = 'payment_taken'
COMPLETED = 'completed'
COMPENSATING = 'compensating'
FAILED = 'failed'
class OrderSagaOrchestrator:
def __init__(self, order_id: str):
self.order_id = order_id
self.state = SagaState.STARTED
async def execute(self) -> bool:
try:
# Step 1: Reserve stock
reservation_id = await self._reserve_stock()
self.state = SagaState.STOCK_RESERVED
# Step 2: Charge payment
payment_id = await self._charge_payment()
self.state = SagaState.PAYMENT_TAKEN
# Step 3: Confirm order
await self._confirm_order()
self.state = SagaState.COMPLETED
return True
except StockUnavailableError:
# No compensation needed — stock never reserved
self.state = SagaState.FAILED
return False
except PaymentDeclinedError:
# Compensate: release the stock reservation
self.state = SagaState.COMPENSATING
await self._release_stock(reservation_id)
self.state = SagaState.FAILED
return False
Orchestration pros: single place to see saga state, easy to add logging/monitoring, straightforward to debug.
Orchestration cons: orchestrator can become a bottleneck or SPOF, creates coupling between orchestrator and all participant services.
Designing Compensating Actions
A compensating transaction undoes the effect of a completed transaction. Designing good compensations requires careful thought:
Compensations are not rollbacks. A SQL rollback is instant and invisible — as if the transaction never happened. A compensation is a new business transaction that reverses the effect. Between the original action and the compensation, other processes may have read the intermediate state.
# Original action: reserve stock
async def reserve_stock(item_id: str, quantity: int) -> str:
reservation = await InventoryReservation.objects.acreate(
item_id=item_id,
quantity=quantity,
status='reserved',
)
return str(reservation.id)
# Compensating action: release the reservation
async def release_stock(reservation_id: str) -> None:
# Idempotent: safe to call multiple times
await InventoryReservation.objects.filter(
id=reservation_id,
status='reserved', # Only release if still reserved
).aupdate(status='cancelled')
Compensation ordering must be reverse of the forward sequence. If steps were A → B → C, compensations must execute C' → B' → A' to maintain consistency.
Some steps have no compensation. Sending an email, calling an external API, or firing a physical actuator cannot be truly undone. Design these as the last step in the saga (after all reversible steps complete) or accept that the compensation is a countermeasure (send a cancellation email, not an un-send).
Error Scenarios
Compensation Failure
What if the compensation itself fails? The saga is now stuck in a partially compensated state. Solutions:
- Retry with exponential backoff: compensations should be idempotent and retried indefinitely (or until a dead letter queue threshold).
- Dead letter queue: failed compensations go to a DLQ for manual review.
- Alert and intervene: if after N retries the compensation still fails, page on-call for manual data correction.
Timeout and Partial Completion
If a step times out, did it complete or not? Always use idempotency keys so the retry of a step is safe regardless of what happened before:
async def charge_payment(order_id: str, amount_cents: int) -> str:
# Idempotency key: if this request is retried, return the same result
return await payment_client.charge(
idempotency_key=f'order-{order_id}',
amount=amount_cents,
)
Production Implementations
AWS Step Functions provides a managed saga orchestrator. Define your workflow as a state machine in JSON, including catch/retry/compensate blocks. Execution history is persisted — you can inspect exactly where a saga failed.
Temporal is an open-source workflow engine that persists saga state to a database, replays it on worker restart, and handles retries, timeouts, and compensation in code (not config):
# Temporal workflow — saga with compensation
from temporalio import workflow, activity
@workflow.defn
class OrderSaga:
@workflow.run
async def run(self, order_id: str) -> str:
reservation_id = None
try:
reservation_id = await workflow.execute_activity(
reserve_stock, order_id, start_to_close_timeout=timedelta(seconds=10)
)
await workflow.execute_activity(
charge_payment, order_id, start_to_close_timeout=timedelta(seconds=30)
)
return 'completed'
except Exception:
if reservation_id:
await workflow.execute_activity(
release_stock, reservation_id,
start_to_close_timeout=timedelta(seconds=10),
retry_policy=RetryPolicy(maximum_attempts=10),
)
return 'failed'
Event sourcing integration makes sagas natural: each step appends an event to an event log. Saga state is derived by replaying events. Compensations append compensating events rather than mutating data.