`wiki/skills/slow-api-bad-deploy.skill.md`
API p99 latency spike within 30 min of a deploy. Default action is rollback first, debug second.
- **Version**: v1.0.0
- **Confidence**: 95%
- **Last verified**: 2026-04-30
- **Owners**: @deepak
- **Extracted from**: #incidents · 2026-04-30 · deepak, maya, sam
# Slow API — Bad Deploy

## Symptoms

- Datadog: API p99 latency 3–5× baseline
- One specific endpoint or family (`/api/shipments/*`, `/api/orders/*`, etc.)
- Started within 30 min of a recent API deploy

## Decision rule (THE ONE THING TO REMEMBER)

> **If a p99 spike happens within 15 min of a deploy → ROLLBACK FIRST, debug second.**

Do not profile on prod. Do not try to identify the bad query. Do not "wait and see if it stabilizes." Roll back. The cost of a 4-minute rollback is far less than the cost of a 30-minute investigation while customers are timing out.

## Resolution (5 min)

1. **Identify the prior good tag**:

   ```bash
   gh release list --limit 5
   # Look for the deploy *before* the suspect one
   ```

2. **Roll back**:

   ```bash
   ./scripts/deploy.sh api --version v3.4.0  # the prior tag
   ```

3. **Wait for full rollout** (~3–4 min for ECS to drain old tasks).

4. **Verify** — p99 should drop to baseline within 60s of full rollout (see the verification sketch in the appendix below).

## Diagnosis (AFTER rollback, in staging)

Now that you're not on fire:

1. Pull the diff between the suspect tag and the prior tag.
2. Run the affected endpoint under load against staging with `k6` or `wrk` (load-test sketch in the appendix below).
3. **Check for N+1 queries** — most common cause. Use `pg_stat_statements` or an APM trace (sketch in the appendix below).
4. Fix, ship, monitor.

## Common root causes (from past incidents)

- **N+1 query** introduced in a new endpoint (April 30 — shipments-search)
- **Missing index** after a schema change (March 14)
- **Unbounded `LIMIT`** on a list endpoint (February 22)

## Escalation

- **Primary**: deploy author (whoever shipped the suspect release)
- **If deploy author unavailable**: `@deepak` (backend lead)
- **If the spike is in payment-related endpoints**: also notify `@maya` (CS will get tickets fast)

## Why staging didn't catch it

Our load testing seeds **50 carriers**; production has **800**. N+1 queries that are linear in row count don't surface until you hit prod scale.

→ **Open ticket**: seed staging with prod-scale data (anonymized).

## Source

Extracted from `#incidents` thread on **2026-04-30 16:22–16:41 UTC**. 9 messages. Resolved in ~15 minutes (rollback only — fix shipped next day).

## See also

- [[companies/acme-logistics]] — tenant context
- [[people/deepak]] — primary owner; rollback decision authority
- [[skills/db-connection-pool-exhausted.skill]] — common p99 root cause
- [[skills/redis-cache-eviction.skill]] — cache regression after deploy
- [[skills/auth-jwks-rotation-spike.skill]] — distinguish from auth issues
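
## Appendix: command sketches

The snippets below are illustrative sketches, not tested runbook commands; cluster, service, host, and tag names are placeholders to adapt.

For resolution steps 3–4, a rollout and latency check. This assumes the API runs as an ECS service (cluster `prod`, service `api`) and that the affected endpoint is the shipments search; all of those names are assumptions.

```bash
# Watch ECS drain the old tasks: rollout is complete when a single
# deployment remains and runningCount == desiredCount.
aws ecs describe-services \
  --cluster prod --services api \
  --query 'services[0].deployments[].{status:status,running:runningCount,desired:desiredCount}'

# Spot-check endpoint latency after rollout (hypothetical host and path).
for _ in $(seq 1 10); do
  curl -s -o /dev/null -w '%{time_total}s\n' \
    "https://api.example.com/api/shipments/search?q=test"
done
```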
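
For diagnosis steps 1–2, diff the releases and replay load against staging. `v3.4.0` is the prior tag from the rollback example; the suspect tag is written as `v3.5.0` here and is a placeholder, as is the staging host.

```bash
# Which files changed between the good tag and the suspect one?
git diff v3.4.0..v3.5.0 --stat

# Hammer the affected endpoint: 4 threads, 64 connections, 60 seconds,
# with a latency distribution printout.
wrk -t4 -c64 -d60s --latency \
  "https://staging.example.com/api/shipments/search?q=acme"
```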
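
For diagnosis step 3, one way to surface an N+1 with `pg_stat_statements`: reset the counters, run the load test, then sort by call count. A per-row query issued once per parent row shows up with a call count that dwarfs everything else. This assumes the extension is installed and PostgreSQL 13+ (older versions name the columns `total_time`/`mean_time`); `$STAGING_DB_URL` is a placeholder.

```bash
# 1. Reset statistics so the load test is the only signal.
psql "$STAGING_DB_URL" -c "SELECT pg_stat_statements_reset();"

# 2. Run the wrk command from the previous sketch.

# 3. Sort by call count; an N+1 dominates this list.
psql "$STAGING_DB_URL" -c "
  SELECT calls,
         round(mean_exec_time::numeric, 2) AS mean_ms,
         left(query, 80)                   AS query
  FROM pg_stat_statements
  ORDER BY calls DESC
  LIMIT 10;"
```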