
API Performance Benchmarking: Establishing and Tracking Baselines

How to establish API performance baselines, track them over time, and catch regressions — benchmarking methodology, statistical significance, and CI integration.

Why Benchmark

Performance regressions are silent bugs. A database query that used to take 5ms now takes 250ms because the migration that should have added an index was never written. The feature passes all functional tests but degrades under load. Without benchmarks, you discover this in production when users complain.

Benchmarking serves three purposes:

  • Regression detection: Catch performance degradation before deployment
  • Capacity planning: Know how many requests/second your API can handle before you need to scale
  • SLO validation: Verify that p99 latency stays within your Service Level Objective

Benchmarking Methodology

Raw benchmark numbers are meaningless without methodology. A benchmark that doesn't control for warm-up, sample size, and environmental variables produces noise, not data.

Warm-Up Phase

APIs are slower on first requests — JIT compilation, connection pool initialization, cold caches. Always discard warm-up results:

// k6: run a 30s warm-up scenario whose samples are tagged and excluded
import http from 'k6/http';

export const options = {
  scenarios: {
    warmup: {
      executor: 'constant-vus',
      vus: 10,
      duration: '30s',
      tags: { phase: 'warmup' },      // not measured
    },
    steady: {
      executor: 'constant-vus',
      vus: 10,
      duration: '2m',
      startTime: '30s',               // begins when warm-up ends
      tags: { phase: 'steady' },      // measured
    },
  },
  thresholds: {
    // Only check the threshold against steady-state samples
    'http_req_duration{phase:steady}': ['p(95)<200'],
  },
};

export default function () {
  http.get('https://api.example.com/products');
}

Sample Size

More samples = more statistical confidence. For latency benchmarks, aim for at least:

  • 1,000 requests for p95 accuracy
  • 10,000 requests for p99 accuracy
  • 100,000 requests for p99.9 accuracy

p99 calculated from 100 samples is nearly meaningless — the 99th percentile of 100 samples is just the maximum.
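The instability is easy to demonstrate: draw repeated small and large samples from the same synthetic latency distribution and see how much the measured p99 swings. A stdlib-only sketch (the exponential distribution and sample counts are illustrative, not a model of any real API):

```python
import random

def p99(samples: list[float]) -> float:
    # nearest-rank p99; clamp the index so it cannot run past the end
    s = sorted(samples)
    return s[min(int(len(s) * 0.99), len(s) - 1)]

random.seed(42)
# Synthetic latency population: exponential with ~50ms mean (illustrative only)
population = [random.expovariate(1 / 50) for _ in range(1_000_000)]

# Measure p99 twenty times at each sample size and compare the spread
small = [p99(random.sample(population, 100)) for _ in range(20)]
large = [p99(random.sample(population, 10_000)) for _ in range(20)]
spread_small = max(small) - min(small)
spread_large = max(large) - min(large)
print(f"p99 spread at n=100:    {spread_small:.1f}ms")
print(f"p99 spread at n=10,000: {spread_large:.1f}ms")
```

The n=100 estimates swing over a far wider range than the n=10,000 estimates, because the p99 of 100 samples is an extreme order statistic with high variance.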

Controlling Variables

# Bad: benchmark on developer laptop with other processes running
wrk -t4 -c100 -d60s http://localhost:8000/api/products

# Good: benchmark on dedicated CI runner with consistent hardware
# - Same instance type (e.g., c5.xlarge) every time
# - Benchmark against staging, not production
# - Run at the same time of day to avoid traffic-driven DB cache differences
# - Use the same dataset (seed the DB before each benchmark run)

Tool: wrk

wrk is a C-based HTTP benchmarking tool that produces minimal overhead, making it good for measuring raw throughput:

# Basic benchmark: 12 threads, 400 connections, 30 seconds
wrk -t12 -c400 -d30s http://staging.example.com/api/products

# Output:
# Running 30s test @ http://staging.example.com/api/products
#   12 threads and 400 connections
#   Thread Stats   Avg      Stdev     Max   +/- Stdev
#     Latency    87.32ms   22.41ms  350.12ms   74.23%
#     Req/Sec   374.22     82.15    780.00     68.50%
#   Latency Distribution
#      50%   81.44ms
#      75%   95.22ms
#      90%  117.31ms
#      99%  182.54ms
#   134120 requests in 30.03s, 22.50MB read
# Requests/sec:   4466.28

# With custom Lua script (POST with auth)
wrk -t4 -c50 -d30s -s benchmark.lua http://staging.example.com/api/orders
-- benchmark.lua
wrk.method = 'POST'
wrk.headers['Content-Type'] = 'application/json'
wrk.headers['Authorization'] = 'Bearer bench-token'
wrk.body = '{"item_id": 1, "quantity": 1}'
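wrk prints its latency distribution as plain text, so storing results for later comparison means parsing that block into numbers. A minimal sketch (the `WRK_OUTPUT` sample is the distribution block from the run above; `parse_wrk_latency` is a hypothetical helper, not part of wrk):

```python
import re

# Latency Distribution block copied from the wrk run above
WRK_OUTPUT = """\
  Latency Distribution
     50%   81.44ms
     75%   95.22ms
     90%  117.31ms
     99%  182.54ms
"""

def parse_wrk_latency(text: str) -> dict[str, float]:
    # Pull each "NN%  XX.XXms" pair out of wrk's Latency Distribution block
    return {
        f"p{pct}": float(ms)
        for pct, ms in re.findall(r"(\d+)%\s+([\d.]+)ms", text)
    }

print(parse_wrk_latency(WRK_OUTPUT))
# {'p50': 81.44, 'p75': 95.22, 'p90': 117.31, 'p99': 182.54}
```

The resulting dict can be dumped to JSON and checked into a baselines directory for the comparison step described later.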

Tool: hey (Go)

hey is a simpler alternative to wrk with friendlier output:

# 10,000 requests, 50 concurrent (enough samples for a stable p95)
hey -n 10000 -c 50 http://staging.example.com/api/products

# With headers and POST body
hey -n 1000 -c 20 \
  -m POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer bench-token' \
  -d '{"item_id": 1, "qty": 1}' \
  http://staging.example.com/api/orders

# Output includes:
# Response time histogram
# Latency distribution (p50, p75, p90, p95, p99)
# Status code distribution

Statistical Analysis

Do not compare benchmark runs using mean latency alone. Mean is sensitive to outliers. Use percentiles:

import statistics
import json

def percentile(sorted_vals: list[float], q: float) -> float:
    # nearest-rank percentile; clamp the index so it can never run past the end
    return sorted_vals[min(int(len(sorted_vals) * q), len(sorted_vals) - 1)]

def analyze_benchmark(latencies_ms: list[float]) -> dict:
    sorted_lat = sorted(latencies_ms)
    return {
        'count': len(sorted_lat),
        'mean': statistics.mean(sorted_lat),
        'median': statistics.median(sorted_lat),
        'stdev': statistics.stdev(sorted_lat),  # requires at least 2 samples
        'p50': percentile(sorted_lat, 0.50),
        'p90': percentile(sorted_lat, 0.90),
        'p95': percentile(sorted_lat, 0.95),
        'p99': percentile(sorted_lat, 0.99),
        'max': sorted_lat[-1],
    }

baseline = analyze_benchmark(load_baseline_results())  # your own result loaders
current  = analyze_benchmark(load_current_results())

regression_pct = (current['p99'] - baseline['p99']) / baseline['p99'] * 100
if regression_pct > 20:
    print(f'REGRESSION: p99 latency increased by {regression_pct:.1f}%')
    print(f'  Baseline p99: {baseline["p99"]:.1f}ms')
    print(f'  Current  p99: {current["p99"]:.1f}ms')
    exit(1)
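A fixed percentage threshold can still flag noise, since two runs of the same code rarely produce identical percentiles. One stdlib-only way to gauge whether a p99 gap is real is a bootstrap interval: resample both runs with replacement and see whether the interval for the difference excludes zero. A sketch under those assumptions (the synthetic data and `bootstrap_p99_diff` are illustrative):

```python
import random

def bootstrap_p99_diff(baseline, current, iters=1000, seed=0):
    """95% bootstrap interval for the p99 difference (current - baseline)."""
    rng = random.Random(seed)

    def p99(xs):
        s = sorted(xs)
        return s[min(int(len(s) * 0.99), len(s) - 1)]

    diffs = sorted(
        p99([rng.choice(current) for _ in current])
        - p99([rng.choice(baseline) for _ in baseline])
        for _ in range(iters)
    )
    # 2.5th and 97.5th percentiles of the resampled differences
    return diffs[int(iters * 0.025)], diffs[int(iters * 0.975)]

# Synthetic example: the current run is ~30% slower across the board
baseline = [float(i) for i in range(1, 1001)]
current = [v * 1.3 for v in baseline]
lo, hi = bootstrap_p99_diff(baseline, current)
print(lo > 0)  # the whole interval is above zero, so this is unlikely to be noise
```

If the interval straddles zero, treat the run-to-run difference as noise rather than failing the build.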

CI Integration

Store benchmark results and compare against a baseline on every PR:

# .github/workflows/benchmark.yml
on:
  push:
    branches: [main]
  pull_request:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start staging server
        run: docker compose up -d

      - name: Run k6 benchmark
        run: |
          k6 run --out json=benchmark-results.json \
            tests/benchmark.js

      - name: Compare with baseline
        run: |
          # fail if p99 regresses by more than 20%
          python scripts/compare_benchmarks.py \
            --baseline benchmarks/baseline.json \
            --current benchmark-results.json \
            --threshold 20

      - name: Update baseline (main branch only)
        if: github.ref == 'refs/heads/main'
        run: |
          cp benchmark-results.json benchmarks/baseline.json
          git config user.name 'github-actions[bot]'
          git config user.email 'github-actions[bot]@users.noreply.github.com'
          git add benchmarks/baseline.json
          git commit -m 'chore: update benchmark baseline'
          git push
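The workflow above assumes a `scripts/compare_benchmarks.py`. A hedged sketch of what that script's core could look like, given that `k6 run --out json` writes one JSON object per line and latency samples arrive as `Point` events for the `http_req_duration` metric (the function names and CLI contract here are hypothetical):

```python
import json

def p99_from_k6_json(path: str) -> float:
    """Extract http_req_duration samples from k6's JSON-lines output
    and return the nearest-rank p99 in milliseconds."""
    samples = []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "Point" and event.get("metric") == "http_req_duration":
                samples.append(event["data"]["value"])
    samples.sort()
    return samples[min(int(len(samples) * 0.99), len(samples) - 1)]

def compare(baseline_path: str, current_path: str, threshold_pct: float = 20.0) -> int:
    baseline = p99_from_k6_json(baseline_path)
    current = p99_from_k6_json(current_path)
    change_pct = (current - baseline) / baseline * 100
    print(f"p99: {baseline:.1f}ms -> {current:.1f}ms ({change_pct:+.1f}%)")
    return 1 if change_pct > threshold_pct else 0  # non-zero exit fails the CI step
```

A thin argparse wrapper around `compare` would accept the `--baseline`, `--current`, and `--threshold` flags the workflow passes and call `sys.exit` with the return value.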

Tool: Bencher (Continuous Benchmarking)

Bencher is a purpose-built platform for continuous benchmarking that tracks results over time and visualizes regressions:

# Install Bencher CLI
curl https://bencher.dev/download/install-cli.sh | bash

# Run benchmark and report to Bencher
bencher run \
  --project my-api \
  --token $BENCHER_API_TOKEN \
  --branch main \
  --testbed ci-runner \
  --adapter json \
  'k6 run --out json=results.json tests/benchmark.js && cat results.json'

Benchmarking Database Queries

API latency is often dominated by database queries. Benchmark queries directly:

import time

from django.db import connection, reset_queries
from django.test import TestCase, override_settings

from tests.factories import OrderFactory  # wherever your factories are defined

# connection.queries is only recorded when DEBUG=True; override it for this
# test class rather than mutating settings at import time
@override_settings(DEBUG=True)
class QueryBenchmarkTests(TestCase):
    def test_order_list_query_count_and_time(self):
        # Seed 1000 orders
        for i in range(1000):
            OrderFactory()

        reset_queries()
        start = time.monotonic()
        response = self.client.get('/api/orders/?page=1&page_size=20')
        elapsed_ms = (time.monotonic() - start) * 1000

        self.assertEqual(response.status_code, 200)
        # Assert N+1 queries are not present
        self.assertLessEqual(len(connection.queries), 5, 
            f'Too many queries: {len(connection.queries)}')
        # Assert response time
        self.assertLess(elapsed_ms, 100, 
            f'Order list took {elapsed_ms:.1f}ms — expected < 100ms')

Setting Meaningful Thresholds

Benchmarking thresholds should be derived from real SLOs, not arbitrary numbers:

Endpoint        p50 Target   p95 Target   p99 Target   Rationale
GET /products   <50ms        <100ms       <200ms       Frequently cached, read-heavy
POST /orders    <200ms       <500ms       <1000ms      Write path, DB transaction
GET /search     <100ms       <300ms       <500ms       Index scan, weighted scoring

When a benchmark exceeds its threshold, the CI job fails and the PR cannot merge. This creates a forcing function to investigate and fix performance regressions before they compound.
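A threshold table like this can double as an executable check in the comparison script. A minimal sketch, where `SLOS` mirrors the table above and `check_slos` plus the result-dict shape are hypothetical (the measured values would come from whatever benchmark parser you use):

```python
# Per-endpoint SLO targets, in milliseconds, mirroring the table above
SLOS = {
    "GET /products": {"p50": 50, "p95": 100, "p99": 200},
    "POST /orders":  {"p50": 200, "p95": 500, "p99": 1000},
    "GET /search":   {"p50": 100, "p95": 300, "p99": 500},
}

def check_slos(results: dict[str, dict[str, float]]) -> list[str]:
    """Return one violation message per (endpoint, percentile) over target."""
    violations = []
    for endpoint, targets in SLOS.items():
        measured = results.get(endpoint, {})
        for pct, limit in targets.items():
            value = measured.get(pct)
            if value is not None and value >= limit:
                violations.append(f"{endpoint} {pct}: {value}ms >= {limit}ms")
    return violations

print(check_slos({"GET /products": {"p50": 40, "p95": 90, "p99": 250}}))
# ['GET /products p99: 250ms >= 200ms']
```

An empty return value means every measured endpoint is within its SLO; a non-empty one can be printed and turned into a non-zero exit code in CI.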
