Saga Pattern: Managing Distributed Transactions with Compensating Actions

The Problem with Distributed Transactions

In a monolith with a single database, atomicity is free — wrap operations in a transaction and the database guarantees all-or-nothing. In microservices, each service owns its own database. Spanning a transaction across service boundaries requires a distributed protocol.

The classic solution is Two-Phase Commit (2PC):

Phase 1 (Prepare):
  Coordinator → OrderService:   'Prepare to commit order creation'
  Coordinator → InventoryService: 'Prepare to commit stock deduction'
  Coordinator → PaymentService:  'Prepare to commit charge'

Phase 2 (Commit or Abort):
  All services replied 'ready' → Coordinator sends 'Commit'
  Any service replied 'abort'  → Coordinator sends 'Rollback'

2PC has fundamental problems in microservice environments:

Availability: all participating services must be up simultaneously. If Payment is down, you cannot create any order.
Latency: two network round trips before any data is committed.
Lock contention: resources are locked during the prepare phase, blocking concurrent operations.
Coordinator SPOF: if the coordinator crashes after Phase 1, participants block indefinitely waiting for Phase 2.

The Saga pattern abandons distributed atomicity entirely. Instead of one atomic transaction, a saga is a sequence of local transactions with compensating transactions for rollback.

Choreography vs Orchestration

There are two ways to coordinate a saga:

Choreography — Event-Driven

Each service listens for events and publishes its own events. No central coordinator. Services react to each other:

OrderService         InventoryService        PaymentService
     |                      |                       |
  [Create Order]            |                       |
  PUBLISH OrderCreated ---->|                       |
                        [Reserve Stock]             |
                        PUBLISH StockReserved ----->|
                                                [Charge Card]
                                                PUBLISH PaymentSuccess
     |<------------------- PaymentSuccess ----------|
  [Confirm Order]           |                       |

On failure:

                                                [Charge Card — FAILED]
                                                PUBLISH PaymentFailed
                        [Receive PaymentFailed]
                        [Release Stock reservation]
                        PUBLISH StockReleased
     |<--------------- StockReleased -----------|
  [Cancel Order]            |

Choreography pros: no single point of failure, each service is independent, easy to add new participants by subscribing to events.

Choreography cons: difficult to understand the overall flow, no single place to monitor saga state, cyclic dependencies can emerge.

Orchestration — Central Coordinator

A saga orchestrator manages the sequence explicitly. It knows the entire workflow and sends commands to each service:

# Simplified saga orchestrator
from enum import Enum
from dataclasses import dataclass

class SagaState(Enum):
    STARTED = 'started'
    STOCK_RESERVED = 'stock_reserved'
    PAYMENT_TAKEN = 'payment_taken'
    COMPLETED = 'completed'
    COMPENSATING = 'compensating'
    FAILED = 'failed'

class OrderSagaOrchestrator:
    def __init__(self, order_id: str):
        self.order_id = order_id
        self.state = SagaState.STARTED

    async def execute(self) -> bool:
        try:
            # Step 1: Reserve stock
            reservation_id = await self._reserve_stock()
            self.state = SagaState.STOCK_RESERVED

            # Step 2: Charge payment
            payment_id = await self._charge_payment()
            self.state = SagaState.PAYMENT_TAKEN

            # Step 3: Confirm order
            await self._confirm_order()
            self.state = SagaState.COMPLETED
            return True

        except StockUnavailableError:
            # No compensation needed — stock never reserved
            self.state = SagaState.FAILED
            return False

        except PaymentDeclinedError:
            # Compensate: release the stock reservation
            self.state = SagaState.COMPENSATING
            await self._release_stock(reservation_id)
            self.state = SagaState.FAILED
            return False

Orchestration pros: single place to see saga state, easy to add logging/monitoring, straightforward to debug.

Orchestration cons: orchestrator can become a bottleneck or SPOF, creates coupling between orchestrator and all participant services.

Designing Compensating Actions

A compensating transaction undoes the effect of a completed transaction. Designing good compensations requires careful thought:

Compensations are not rollbacks. A SQL rollback is instant and invisible — as if the transaction never happened. A compensation is a new business transaction that reverses the effect. Between the original action and the compensation, other processes may have read the intermediate state.

# Original action: reserve stock
async def reserve_stock(item_id: str, quantity: int) -> str:
    reservation = await InventoryReservation.objects.acreate(
        item_id=item_id,
        quantity=quantity,
        status='reserved',
    )
    return str(reservation.id)

# Compensating action: release the reservation
async def release_stock(reservation_id: str) -> None:
    # Idempotent: safe to call multiple times
    await InventoryReservation.objects.filter(
        id=reservation_id,
        status='reserved',   # Only release if still reserved
    ).aupdate(status='cancelled')

Compensation ordering must be reverse of the forward sequence. If steps were A → B → C, compensations must execute C' → B' → A' to maintain consistency.

Some steps have no compensation. Sending an email, calling an external API, or firing a physical actuator cannot be truly undone. Design these as the last step in the saga (after all reversible steps complete) or accept that the compensation is a countermeasure (send a cancellation email, not an un-send).

Error Scenarios

Compensation Failure

What if the compensation itself fails? The saga is now stuck in a partially compensated state. Solutions:

Retry with exponential backoff: compensations should be idempotent and retried indefinitely (or until a dead letter queue threshold).
Dead letter queue: failed compensations go to a DLQ for manual review.
Alert and intervene: if after N retries the compensation still fails, page on-call for manual data correction.

Timeout and Partial Completion

If a step times out, did it complete or not? Always use idempotency keys so the retry of a step is safe regardless of what happened before:

async def charge_payment(order_id: str, amount_cents: int) -> str:
    # Idempotency key: if this request is retried, return the same result
    return await payment_client.charge(
        idempotency_key=f'order-{order_id}',
        amount=amount_cents,
    )

Production Implementations

AWS Step Functions provides a managed saga orchestrator. Define your workflow as a state machine in JSON, including catch/retry/compensate blocks. Execution history is persisted — you can inspect exactly where a saga failed.

Temporal is an open-source workflow engine that persists saga state to a database, replays it on worker restart, and handles retries, timeouts, and compensation in code (not config):

# Temporal workflow — saga with compensation
from temporalio import workflow, activity

@workflow.defn
class OrderSaga:
    @workflow.run
    async def run(self, order_id: str) -> str:
        reservation_id = None
        try:
            reservation_id = await workflow.execute_activity(
                reserve_stock, order_id, start_to_close_timeout=timedelta(seconds=10)
            )
            await workflow.execute_activity(
                charge_payment, order_id, start_to_close_timeout=timedelta(seconds=30)
            )
            return 'completed'
        except Exception:
            if reservation_id:
                await workflow.execute_activity(
                    release_stock, reservation_id,
                    start_to_close_timeout=timedelta(seconds=10),
                    retry_policy=RetryPolicy(maximum_attempts=10),
                )
            return 'failed'

Event sourcing integration makes sagas natural: each step appends an event to an event log. Saga state is derived by replaying events. Compensations append compensating events rather than mutating data.