
Saga Pattern: Managing Distributed Transactions with Compensating Actions

How the saga pattern replaces distributed transactions with a sequence of local transactions and compensating actions for rollback.

The Problem with Distributed Transactions

In a monolith with a single database, atomicity is free — wrap operations in a transaction and the database guarantees all-or-nothing. In microservices, each service owns its own database. Spanning a transaction across service boundaries requires a distributed protocol.

The classic solution is Two-Phase Commit (2PC):

Phase 1 (Prepare):
  Coordinator → OrderService:   'Prepare to commit order creation'
  Coordinator → InventoryService: 'Prepare to commit stock deduction'
  Coordinator → PaymentService:  'Prepare to commit charge'

Phase 2 (Commit or Abort):
  All services replied 'ready' → Coordinator sends 'Commit'
  Any service replied 'abort'  → Coordinator sends 'Rollback'
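
The two phases above can be sketched as a small in-memory coordinator. This is an illustration only: the Participant class and the two_phase_commit function are hypothetical names, there is no networking, and nothing here handles crash recovery or timeouts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Participant:
    """Hypothetical in-memory stand-in for a service taking part in 2PC."""
    name: str
    can_commit: bool = True
    log: List[str] = field(default_factory=list)

    def prepare(self) -> bool:
        # Phase 1: vote. A real participant would also acquire locks here.
        self.log.append("prepare")
        return self.can_commit

    def commit(self) -> None:
        self.log.append("commit")

    def rollback(self) -> None:
        self.log.append("rollback")


def two_phase_commit(participants: List[Participant]) -> bool:
    """Return True if every participant committed, False if all rolled back."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous 'ready'; otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

A single participant that votes 'abort' (can_commit=False) causes every participant to roll back, which is the all-or-nothing behaviour the protocol exists to provide.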

2PC has fundamental problems in microservice environments:

  • Availability: all participating services must be up simultaneously. If Payment is down, you cannot create any order.
  • Latency: two network round trips before any data is committed.
  • Lock contention: resources are locked during the prepare phase, blocking concurrent operations.
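  • Coordinator SPOF: if the coordinator crashes after Phase 1, participants that voted 'ready' cannot decide on their own and keep holding their locks until the coordinator recovers and delivers the Phase 2 decision.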

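The Saga pattern gives up atomic commit across services. Instead of one atomic transaction, a saga is a sequence of local transactions, each committed independently by the service that owns the data, with a compensating transaction defined for every step that may later need to be undone. The result is eventual consistency: intermediate states are visible to other processes until the saga either completes or finishes compensating.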

Choreography vs Orchestration

There are two ways to coordinate a saga:

Choreography — Event-Driven

Each service listens for events and publishes its own events. No central coordinator. Services react to each other:

OrderService         InventoryService        PaymentService
     |                      |                       |
  [Create Order]            |                       |
  PUBLISH OrderCreated ---->|                       |
                        [Reserve Stock]             |
                        PUBLISH StockReserved ----->|
                                                [Charge Card]
                                                PUBLISH PaymentSuccess
     |<------------------- PaymentSuccess ----------|
  [Confirm Order]           |                       |

On failure:

                                                [Charge Card — FAILED]
                                                PUBLISH PaymentFailed
                        [Receive PaymentFailed]
                        [Release Stock reservation]
                        PUBLISH StockReleased
     |<--------------- StockReleased -----------|
  [Cancel Order]            |

Choreography pros: no central coordinator that can fail (although the message broker remains a shared dependency), each service is independent, easy to add new participants by subscribing to events.

Choreography cons: difficult to understand the overall flow, no single place to monitor saga state, cyclic dependencies can emerge.
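
As a rough illustration of the choreographed flow in the diagrams, the sketch below wires three services to a synchronous in-memory event bus. EventBus and wire_services are hypothetical names, a real system would use a message broker with asynchronous delivery, and the stock-unavailable path is left out for brevity.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical synchronous in-memory bus; a real system would use a broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # event types in publication order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


def wire_services(bus, stock, card_ok):
    """Register each service's reactions. No component knows the whole flow."""
    orders = {}

    # OrderService: starts the saga and reacts to its two possible endings.
    def create_order(order_id, item_id):
        orders[order_id] = "pending"
        bus.publish("OrderCreated", {"order_id": order_id, "item_id": item_id})

    def on_payment_success(evt):
        orders[evt["order_id"]] = "confirmed"

    def on_stock_released(evt):
        orders[evt["order_id"]] = "cancelled"

    # InventoryService: reserves stock, and releases it if payment fails.
    def on_order_created(evt):
        stock[evt["item_id"]] -= 1
        bus.publish("StockReserved", evt)

    def on_payment_failed(evt):
        stock[evt["item_id"]] += 1  # compensating action
        bus.publish("StockReleased", evt)

    # PaymentService: card_ok simulates the outcome of the charge.
    def on_stock_reserved(evt):
        bus.publish("PaymentSuccess" if card_ok else "PaymentFailed", evt)

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("StockReserved", on_stock_reserved)
    bus.subscribe("PaymentSuccess", on_payment_success)
    bus.subscribe("PaymentFailed", on_payment_failed)
    bus.subscribe("StockReleased", on_stock_released)
    return create_order, orders
```

Calling create_order with card_ok=False leaves the order cancelled and the stock count unchanged. The only way to see the whole flow is to read bus.history, which mirrors the monitoring difficulty listed among the cons.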

Orchestration — Central Coordinator

A saga orchestrator manages the sequence explicitly. It knows the entire workflow and sends commands to each service:

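# Simplified saga orchestrator
from enum import Enum


class StockUnavailableError(Exception):
    """Raised by _reserve_stock when the item cannot be reserved."""


class PaymentDeclinedError(Exception):
    """Raised by _charge_payment when the charge is declined."""


class SagaState(Enum):
    STARTED = 'started'
    STOCK_RESERVED = 'stock_reserved'
    PAYMENT_TAKEN = 'payment_taken'
    COMPLETED = 'completed'
    COMPENSATING = 'compensating'
    FAILED = 'failed'


class OrderSagaOrchestrator:
    # The _reserve_stock, _charge_payment, _confirm_order and _release_stock
    # methods wrap calls to the participant services and are omitted here.
    # A production orchestrator would also persist self.state after every
    # transition so that it can resume after a crash.

    def __init__(self, order_id: str):
        self.order_id = order_id
        self.state = SagaState.STARTED

    async def execute(self) -> bool:
        try:
            # Step 1: Reserve stock
            reservation_id = await self._reserve_stock()
            self.state = SagaState.STOCK_RESERVED

            # Step 2: Charge payment
            # payment_id would be needed to refund if a later step failed
            payment_id = await self._charge_payment()
            self.state = SagaState.PAYMENT_TAKEN

            # Step 3: Confirm order
            # Note: a failure here would require refunding the payment and
            # releasing the stock; that path is omitted in this simplified version.
            await self._confirm_order()
            self.state = SagaState.COMPLETED
            return True

        except StockUnavailableError:
            # No compensation needed: stock was never reserved
            self.state = SagaState.FAILED
            return False

        except PaymentDeclinedError:
            # Compensate: release the stock reservation made in step 1
            self.state = SagaState.COMPENSATING
            await self._release_stock(reservation_id)
            self.state = SagaState.FAILED
            return False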

Orchestration pros: single place to see saga state, easy to add logging/monitoring, straightforward to debug.

Orchestration cons: orchestrator can become a bottleneck or SPOF, creates coupling between orchestrator and all participant services.

Designing Compensating Actions

A compensating transaction undoes the effect of a completed transaction. Designing good compensations requires careful thought:

Compensations are not rollbacks. A SQL rollback is instant and invisible — as if the transaction never happened. A compensation is a new business transaction that reverses the effect. Between the original action and the compensation, other processes may have read the intermediate state.

# Original action: reserve stock
async def reserve_stock(item_id: str, quantity: int) -> str:
    reservation = await InventoryReservation.objects.acreate(
        item_id=item_id,
        quantity=quantity,
        status='reserved',
    )
    return str(reservation.id)

# Compensating action: release the reservation
async def release_stock(reservation_id: str) -> None:
    # Idempotent: safe to call multiple times
    await InventoryReservation.objects.filter(
        id=reservation_id,
        status='reserved',   # Only release if still reserved
    ).aupdate(status='cancelled')

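Compensation ordering must be the reverse of the forward sequence. If steps were A → B → C, compensations must execute C' → B' → A', because later steps may depend on the effects of earlier ones and those effects must still be in place while the later step is undone.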
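
One way to get reverse ordering without hand-writing each failure path is to push a compensation onto a stack as each step completes. The sketch below is a minimal illustration under stated assumptions: run_saga and make_step are hypothetical names, a failed step is assumed to have committed nothing, and compensation failures are not handled.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

AsyncFn = Callable[[], Awaitable[None]]
Step = Tuple[AsyncFn, AsyncFn]  # (forward action, its compensation)


async def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[AsyncFn] = []
    for action, compensation in steps:
        try:
            await action()
        except Exception:
            # Assumption: the failed step committed nothing, so only the
            # steps that completed before it are undone, newest first.
            for undo in reversed(completed):
                await undo()
            return False
        completed.append(compensation)
    return True


def make_step(name: str, trace: List[str], fail: bool = False) -> Step:
    """Demo helper: records the step name in trace, or raises if fail is set."""

    async def action() -> None:
        if fail:
            raise RuntimeError(f"{name} failed")
        trace.append(name)

    async def compensation() -> None:
        trace.append(name + "'")

    return action, compensation
```

With steps A, B and a failing C, the recorded trace is A, B, B', A': the forward steps in order, then their compensations in reverse.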

Some steps have no compensation. Sending an email, calling a third-party API that has irreversible side effects, or firing a physical actuator cannot be truly undone. Design these as the last step in the saga (after all reversible steps complete) or accept that the compensation is a countermeasure (send a cancellation email, not an un-send).

Error Scenarios

Compensation Failure

What if the compensation itself fails? The saga is now stuck in a partially compensated state. Solutions:

  • Retry with exponential backoff: compensations should be idempotent and retried indefinitely (or until a dead letter queue threshold).
  • Dead letter queue: failed compensations go to a DLQ for manual review.
  • Alert and intervene: if after N retries the compensation still fails, page on-call for manual data correction.
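
The first two options can be combined in a few lines. The following is a sketch, not a production retry loop: compensate_with_retry is a hypothetical helper, the dead letter queue is modelled as a plain list, the delay parameters are arbitrary, and jitter is omitted.

```python
import asyncio


async def compensate_with_retry(compensation, dead_letters, *, max_attempts=5,
                                base_delay=0.5, max_delay=30.0,
                                sleep=asyncio.sleep):
    """Retry an idempotent compensation with exponential backoff.

    Returns True on success. After max_attempts failures the compensation
    is recorded in dead_letters for manual review and False is returned.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            await compensation()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # The delay doubles on each attempt, capped at max_delay.
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                await sleep(delay)
    dead_letters.append({
        "compensation": getattr(compensation, "__name__", repr(compensation)),
        "error": repr(last_error),
        "attempts": max_attempts,
    })
    return False
```

The sleep parameter exists only so that tests can substitute a fake and avoid real waiting. Because the same compensation may run several times, this approach is only safe when the compensation is idempotent.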

Timeout and Partial Completion

If a step times out, did it complete or not? Always use idempotency keys so the retry of a step is safe regardless of what happened before:

async def charge_payment(order_id: str, amount_cents: int) -> str:
    # Idempotency key: if this request is retried, return the same result
    return await payment_client.charge(
        idempotency_key=f'order-{order_id}',
        amount=amount_cents,
    )
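
To show why the key makes a retry safe, here is a sketch of what the receiving side might do with it. FakePaymentGateway is a hypothetical in-memory stand-in, not the API of any real payment provider.

```python
import uuid


class FakePaymentGateway:
    """Hypothetical in-memory gateway that deduplicates by idempotency key."""

    def __init__(self):
        self._results = {}     # idempotency_key -> payment_id
        self.charges_made = 0  # how many times money actually moved

    def charge(self, idempotency_key: str, amount: int) -> str:
        # A repeated key returns the stored result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        payment_id = f"pay-{uuid.uuid4().hex[:8]}"
        self._results[idempotency_key] = payment_id
        return payment_id
```

If the first call timed out on the client after the gateway had already recorded the charge, the retry with the same key returns the original payment id and charges_made stays at 1. A real gateway would also need to store the key and the charge atomically, and might reject a reused key whose amount differs.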

Production Implementations

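AWS Step Functions provides a managed saga orchestrator. Define your workflow as a state machine in JSON (Amazon States Language), using Retry and Catch fields on each task. There is no dedicated compensate construct: compensation is modelled by routing a Catch to ordinary states that run the compensating actions. Execution history is persisted, so you can inspect exactly where a saga failed.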
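
As a sketch of what such a definition might look like, the snippet below builds an Amazon States Language document as a Python dictionary and serialises it to JSON. The state names, the Lambda ARNs and the account ID are placeholders invented for illustration, and the retry settings are arbitrary.

```python
import json

# Hypothetical Lambda ARN prefix; substitute your own resources.
ARN = "arn:aws:lambda:us-east-1:123456789012:function:"

state_machine = {
    "Comment": "Order saga sketch: compensation is modelled as ordinary states",
    "StartAt": "ReserveStock",
    "States": {
        "ReserveStock": {
            "Type": "Task",
            "Resource": ARN + "reserve-stock",
            # Nothing to undo yet, so a failure goes straight to the end state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SagaFailed"}],
            "Next": "ChargePayment",
        },
        "ChargePayment": {
            "Type": "Task",
            "Resource": ARN + "charge-payment",
            # Retry timeouts with backoff before giving up.
            "Retry": [{
                "ErrorEquals": ["States.Timeout"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            # Any remaining error triggers the compensating state.
            "Catch": [{
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": "ReleaseStock",
            }],
            "Next": "ConfirmOrder",
        },
        "ConfirmOrder": {
            "Type": "Task",
            "Resource": ARN + "confirm-order",
            "End": True,
        },
        "ReleaseStock": {
            "Type": "Task",
            "Resource": ARN + "release-stock",
            "Retry": [{
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 1,
                "MaxAttempts": 10,
                "BackoffRate": 2.0,
            }],
            "Next": "SagaFailed",
        },
        "SagaFailed": {
            "Type": "Fail",
            "Error": "OrderSagaFailed",
            "Cause": "A step failed and completed steps were compensated",
        },
    },
}

definition = json.dumps(state_machine, indent=2)
```

The definition string is what would be supplied when creating the state machine. A failure in ConfirmOrder is not caught in this sketch, so it would fail the execution without compensating.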

Temporal is an open-source workflow engine that persists each workflow's event history to a database, replays that history to rebuild saga state after a worker restart, and handles retries, timeouts, and compensation in code (not config):

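# Temporal workflow: saga with compensation
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

# reserve_stock, charge_payment and release_stock are assumed to be
# activities defined elsewhere with @activity.defn and imported here.

@workflow.defn
class OrderSaga:
    @workflow.run
    async def run(self, order_id: str) -> str:
        reservation_id = None
        try:
            reservation_id = await workflow.execute_activity(
                reserve_stock, order_id, start_to_close_timeout=timedelta(seconds=10)
            )
            await workflow.execute_activity(
                charge_payment, order_id, start_to_close_timeout=timedelta(seconds=30)
            )
            return 'completed'
        except ActivityError:
            # Raised when an activity fails after exhausting its retries.
            # Compensate only if the reservation was actually made.
            if reservation_id is not None:
                await workflow.execute_activity(
                    release_stock, reservation_id,
                    start_to_close_timeout=timedelta(seconds=10),
                    retry_policy=RetryPolicy(maximum_attempts=10),
                )
            return 'failed'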

Event sourcing integration makes sagas natural: each step appends an event to an event log. Saga state is derived by replaying events. Compensations append compensating events rather than mutating data.
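
A minimal sketch of deriving saga state by replay, assuming a simple list of event dictionaries as the log. The event type names, the SagaView class and the apply_event and replay functions are all hypothetical; the state names follow the SagaState values used earlier.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SagaView:
    """State derived from the log; never stored or mutated directly."""
    state: str = "not_started"
    reservation_id: Optional[str] = None


def apply_event(view: SagaView, event: Dict[str, str]) -> SagaView:
    """Fold one event into the current view and return the new view."""
    kind = event["type"]
    if kind == "SagaStarted":
        return SagaView("started")
    if kind == "StockReserved":
        return SagaView("stock_reserved", event["reservation_id"])
    if kind == "PaymentTaken":
        return SagaView("payment_taken", view.reservation_id)
    if kind == "OrderConfirmed":
        return SagaView("completed", view.reservation_id)
    if kind == "PaymentDeclined":
        return SagaView("compensating", view.reservation_id)
    if kind == "StockReleased":
        # A compensating event: appended to the log, nothing is deleted.
        return SagaView("failed", None)
    raise ValueError(f"unknown event type: {kind}")


def replay(events: List[Dict[str, str]]) -> SagaView:
    """Rebuild the saga state from scratch by replaying the whole log."""
    view = SagaView()
    for event in events:
        view = apply_event(view, event)
    return view
```

Replaying a prefix of the log yields the state at that point in time. For example, a log that ends with PaymentDeclined shows a saga that is compensating and still holds its reservation, which is exactly what a recovering orchestrator needs to know.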
