wiki/skills/db-connection-pool-exhausted.skill.md
Postgres connection pool > 95% — typically caused by long-running transactions in worker code. Use when API p99 spikes alongside high pool utilization on prod-rds.
- **Version:** v1.0.0
- **Confidence:** 92%
- **Last verified:** 2026-05-01
- **Owners:** @maya @deepak

**Extracted from:**

- slack://#incidents · 2026-04-12 · maya, deepak, sam
- github://releases (gh release list pattern)
- github://prs/#5089 (cache TTL diff)
- notion://runbooks/db-pool-exhausted
- linear://EN-2118 (in-progress webhook subscription)
# DB Connection Pool Exhausted
## Symptoms
- PagerDuty fires: `db.connection.pool > 95%` on `prod-rds`
- API p99 latency climbs above 4s (baseline ~800ms)
- `pg_stat_activity` shows many `idle in transaction` sessions
- Almost always traceable to a recent worker deploy
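To confirm the `pg_stat_activity` symptom quickly (assuming you can reach `prod-rds` with a role that can read system views), a session-state breakdown is enough — a spike in `idle in transaction` relative to `active` is the tell:

```sql
-- Count sessions per state; a large 'idle in transaction' bucket
-- confirms the symptom before you start digging into individual pids.
SELECT state, count(*)
FROM pg_stat_activity
WHERE state IS NOT NULL
GROUP BY state
ORDER BY count(*) DESC;
```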
## Diagnosis (5 min)
1. **Confirm scope** — is this a read replica or the primary? `prod-rds` = primary.
2. **Connect to RDS** and query:
```sql
SELECT pid, state, wait_event, query_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY query_start ASC
LIMIT 20;
```
3. **Identify the worker** — look at the `query` text. It's almost always one specific worker pod (the recent deploy).
4. **Check recent deploys** — `gh release list --limit 10` and look for anything in the last 60 min affecting workers.
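For step 3, grouping the stuck sessions can be faster than eyeballing `query` text. A sketch — it assumes the workers set `application_name` in their connection string; if they don't, `client_addr` still narrows it to a pod IP:

```sql
-- Group idle-in-transaction sessions by application_name / client_addr
-- to spot which worker owns them. application_name is only populated
-- when the client sets it, so client_addr is the fallback.
SELECT application_name,
       client_addr,
       count(*) AS stuck,
       min(query_start) AS oldest
FROM pg_stat_activity
WHERE state = 'idle in transaction'
GROUP BY application_name, client_addr
ORDER BY stuck DESC;
```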
## Resolution (10–15 min)
1. **Roll back the offending worker** to its previous tag — this is the fast fix and almost always correct.
2. **Kill the long-running connections** if they don't drain cleanly:
```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND query_start < now() - interval '60 seconds';
```
3. **Optional: temporarily raise** `max_connections` via the RDS parameter group if traffic is still climbing. This is a stopgap, not a fix.
4. **Verify** — pool utilization should drop within 2–3 min after worker rollback.
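If you'd rather verify from the psql session you already have open than from the dashboard, the pool math can be read straight off the server (a sketch; the monitor's exact utilization formula may differ slightly):

```sql
-- Live connection count vs. the server's hard limit; pct_used should
-- drop well below 95 within a few minutes of the worker rollback.
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```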
## Escalation
- **Primary owner** (worker code): `@deepak` (backend lead)
- **Database owner**: `@sam` (platform)
- **If unable to identify the worker after 10 min** → page `@deepak` even if asleep. The pattern is reliable enough that he'll see it instantly.
## Common root causes (from past incidents)
- `SELECT FOR UPDATE` in a worker loop without `COMMIT` (April 12 incident)
- Long-running migration accidentally run on prod (March 2)
- Worker crash leaving connections orphaned (rare, ~once per quarter)
## Prevention follow-ups
- Add Datadog alert on `idle_in_transaction count > 20` (gives ~10 min earlier signal than pool %)
- Code review checklist: any `SELECT FOR UPDATE` in a worker must have explicit timeout + commit
- Worker liveness probe should restart if a worker holds a transaction > 60s
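A server-side backstop worth weighing alongside the liveness probe (an assumption on my part, not something the incident thread decided on): Postgres's built-in `idle_in_transaction_session_timeout`, which terminates any session that sits idle inside a transaction past the limit. On RDS this lives in the DB parameter group; the plain-SQL form for a self-managed Postgres looks like:

```sql
-- Abort any session idle inside a transaction for more than 60s.
-- On RDS, set idle_in_transaction_session_timeout in the parameter
-- group instead of ALTER SYSTEM (which RDS does not allow).
ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
SELECT pg_reload_conf();
```

Note the trade-off: the timeout aborts the stuck transaction outright, which protects the pool but can hide the worker bug that the April 12 pattern otherwise makes obvious.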
## Source
Extracted from `#incidents` thread on **2026-04-12 03:14–03:27 UTC**. 10 messages. Resolved in ~12 minutes.
## See also
- [[companies/acme-logistics]] — tenant context
- [[people/maya]] — co-owner
- [[people/deepak]] — co-owner; backend-lead escalation
- [[skills/payment-stuck-pending.skill]] — related infra (Stripe webhook backlog)
- [[skills/slow-api-bad-deploy.skill]] — overlapping symptom (p99 spike on prod-rds)