API p99 latency spike within 30 min of a deploy. Default action is rollback first, debug second.

- **Version**: v1.0.0
- **Confidence**: 95%
- **Last verified**: 2026-04-30
- **Owners**: @deepak

**Extracted from**

- `#incidents` · 2026-04-30 · deepak, maya, sam

# Slow API — Bad Deploy

## Symptoms

- Datadog: API p99 latency 3–5× baseline
- One specific endpoint or family (`/api/shipments/*`, `/api/orders/*`, etc.)
- Started within 30 min of a recent API deploy

## Decision rule (THE ONE THING TO REMEMBER)

> **If a p99 spike happens within 30 min of a deploy → ROLLBACK FIRST, debug second.**

Do not profile on prod. Do not try to identify the bad query. Do not "wait and see if it stabilizes." Roll back.

The cost of a 4-minute rollback is far less than the cost of a 30-minute investigation while customers are timing out.

## Resolution (5 min)

1. **Identify the prior good tag**:
   ```bash
   gh release list --limit 5
   # Look for the deploy *before* the suspect one
   ```
2. **Roll back**:
   ```bash
   ./scripts/deploy.sh api --version v3.4.0  # the prior tag
   ```
3. **Wait for full rollout** (~3–4 min for ECS to drain old tasks).
4. **Verify** — p99 should drop to baseline within 60s of full rollout (quick shell check below).
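
If the Datadog dashboard isn't already open, a crude shell loop is enough to watch recovery. A minimal sketch, assuming `curl`; the host and route are hypothetical, so substitute the endpoint that spiked:

```bash
# Hypothetical host and route -- point this at the affected endpoint.
# Prints total response time every 5s; expect it to fall back to
# baseline within ~60s of the rollback finishing.
while true; do
  curl -s -o /dev/null -w '%{time_total}s\n' \
    'https://api.example.com/api/shipments/search?q=smoke'
  sleep 5
done
```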

## Diagnosis (AFTER rollback, in staging)

Now that you're not on fire:
1. Pull the diff between the suspect tag and the prior tag.
2. Run the affected endpoint under load against staging with `k6` or `wrk`.
3. **Check for N+1 queries** — most common cause. Use `pg_stat_statements` or APM traces (see the sketch after this list).
4. Fix, ship, monitor.
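
A minimal sketch of steps 2–3, assuming `wrk` is installed and `pg_stat_statements` is enabled on the staging database; the host, route, and `$STAGING_DB_URL` are placeholders:

```bash
# Start the load run from a clean slate.
psql "$STAGING_DB_URL" -c "SELECT pg_stat_statements_reset();"

# Hypothetical staging host/route -- hit the endpoint the suspect
# release changed: 4 threads, 64 connections, 60 seconds.
wrk -t4 -c64 -d60s 'https://staging.example.com/api/shipments/search?q=test'

# The N+1 signature: one small SELECT whose call count dwarfs the
# number of requests wrk just sent. (Column is mean_time, not
# mean_exec_time, on Postgres < 13.)
psql "$STAGING_DB_URL" -c "
  SELECT calls,
         round(mean_exec_time::numeric, 2) AS mean_ms,
         left(query, 80)                   AS query
  FROM pg_stat_statements
  ORDER BY calls DESC
  LIMIT 10;"
```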

## Common root causes (from past incidents)

- **N+1 query** introduced in a new endpoint (April 30 — shipments-search)
- **Missing index** after schema change (March 14; see the `EXPLAIN` check below)
- **Unbounded `LIMIT`** on a list endpoint (February 22)
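
For the missing-index case, the first check is to run the affected endpoint's hot query through `EXPLAIN`. A sketch with a hypothetical table and columns; substitute the real query:

```bash
# Hypothetical table/columns. A 'Seq Scan' in the plan where you
# expected an 'Index Scan' after a schema change is the
# missing-index signature.
psql "$STAGING_DB_URL" -c "
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT *
  FROM shipments
  WHERE carrier_id = 42
  ORDER BY created_at DESC
  LIMIT 50;"
```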

## Escalation

- **Primary**: deploy author (whoever shipped the suspect release)
- **If deploy author unavailable**: `@deepak` (backend lead)
- **If the spike is in payment-related endpoints**: also notify `@maya` (CS will get tickets fast)

## Why staging didn't catch it

Our load testing seeds **50 carriers**; production has **800**. N+1 queries that are linear in row count don't surface until you hit prod scale.

→ **Open ticket**: seed staging with prod-scale data (anonymized).
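
Until that ticket lands, padding row counts by hand is a workable stopgap. A minimal sketch, assuming a `carriers` table with a `name` column (both hypothetical; adapt to the real schema):

```bash
# Hypothetical table/column -- pad staging to prod-scale carrier
# counts (800 vs the current 50-carrier seed) so linear N+1 costs
# surface under load.
psql "$STAGING_DB_URL" -c "
  INSERT INTO carriers (name)
  SELECT 'seed-carrier-' || g
  FROM generate_series(1, 800) AS g
  ON CONFLICT DO NOTHING;"
```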

## Source

Extracted from `#incidents` thread on **2026-04-30 16:22–16:41 UTC**. 9 messages. Resolved in ~15 minutes (rollback only — fix shipped next day).

## See also

- [[companies/acme-logistics]] — tenant context
- [[people/deepak]] — primary owner; rollback decision authority
- [[skills/db-connection-pool-exhausted.skill]] — common p99 root cause
- [[skills/redis-cache-eviction.skill]] — cache regression after deploy
- [[skills/auth-jwks-rotation-spike.skill]] — distinguish from auth issues