Postgres connection pool > 95% — typically caused by long-running transactions in worker code. Use when API p99 spikes alongside high pool utilization on prod-rds.

- **Version:** v1.0.0
- **Confidence:** 92%
- **Last verified:** 2026-05-01
- **Owners:** @maya @deepak

**Extracted from:**

- slack://#incidents · 2026-04-12 · maya, deepak, sam
- github://releases (`gh release list` pattern)
- github://prs/#5089 (cache TTL diff)
- notion://runbooks/db-pool-exhausted
- linear://EN-2118 (in-progress webhook subscription)

# DB Connection Pool Exhausted

## Symptoms

- PagerDuty fires: `db.connection.pool > 95%` on `prod-rds`
- API p99 latency climbs above 4s (baseline ~800ms)
- `pg_stat_activity` shows many `idle in transaction` sessions
- Almost always traceable to a recent worker deploy

## Diagnosis (5 min)

1. **Confirm scope** — is this the read replica or the primary? `prod-rds` is the primary.
2. **Connect to RDS** and query:
   ```sql
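   -- List the oldest sessions stuck 'idle in transaction'; these are holding pool slots.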
   SELECT pid, state, wait_event, query_start, query
   FROM pg_stat_activity
   WHERE state = 'idle in transaction'
   ORDER BY query_start ASC
   LIMIT 20;
   ```
3. **Identify the worker** — look at the `query` text; it's almost always one specific worker pod (the recent deploy). If that alone isn't conclusive, the grouping query after this list narrows it down.
4. **Check recent deploys** — `gh release list --limit 10` and look for anything in the last 60 min affecting workers.
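
If the `query` text alone doesn't point at the culprit, grouping the stuck sessions usually does. A minimal sketch, assuming the workers set `application_name` on their connections (not confirmed for this fleet); group by `usename` or `client_addr` instead if they don't:

```sql
-- Count idle-in-transaction sessions per application/user to spot the offender.
SELECT application_name, usename, count(*) AS stuck_sessions
FROM pg_stat_activity
WHERE state = 'idle in transaction'
GROUP BY application_name, usename
ORDER BY stuck_sessions DESC;
```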

## Resolution (10–15 min)

1. **Roll back the offending worker** to its previous tag — this is the fast fix and almost always correct.
2. **Kill the long-running connections** if they don't drain cleanly:
   ```sql
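   -- Force-close idle-in-transaction sessions whose last query started over 60 seconds ago.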
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle in transaction'
     AND query_start < now() - interval '60 seconds';
   ```
3. **Optional: temporarily raise** the RDS connection limit via parameter group if traffic is still climbing. This is a stopgap, not a fix.
4. **Verify** — pool utilization should drop within 2–3 min after the worker rollback (spot-check query below).
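
For a quick spot-check straight from the database, a rough sketch (note the paging metric may be measured at an application or proxy pool that caps out below `max_connections`):

```sql
-- Server-side view of connection usage vs. the configured limit.
SELECT count(*) AS backends,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```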

## Escalation

- **Primary owner** (worker code): `@deepak` (backend lead)
- **Database owner**: `@sam` (platform)
- **If unable to identify the worker after 10 min** → page `@deepak` even if asleep. The pattern is reliable enough that he'll see it instantly.

## Common root causes (from past incidents)

- `SELECT FOR UPDATE` in a worker loop without `COMMIT` (April 12 incident; sketched after this list)
- Long-running migration accidentally run on prod (March 2)
- Worker crash leaving connections orphaned (rare, ~once per quarter)
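
For reference, a minimal sketch of that first anti-pattern (illustrative only; the `jobs` table and id are made up, not the actual April 12 code):

```sql
-- Anti-pattern sketch: a row lock is taken inside an open transaction, then the
-- worker loops over slow application work without committing. The connection sits
-- 'idle in transaction' and holds a pool slot (plus the row lock) the whole time.
BEGIN;
SELECT * FROM jobs WHERE id = 42 FOR UPDATE;
-- ... slow per-item work in the application loop; the COMMIT below is never reached ...
COMMIT;

-- Fix: do the slow work outside the transaction, COMMIT immediately after the
-- locked read/write, and bound lock waits with e.g. SET LOCAL lock_timeout = '5s'.
```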

## Prevention follow-ups

- Add Datadog alert on `idle_in_transaction` session count > 20 (gives ~10 min earlier signal than pool %); candidate query after this list
- Code review checklist: any `SELECT FOR UPDATE` in a worker must have explicit timeout + commit
- Worker liveness probe should restart the pod if it holds a transaction open for more than 60s
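
A candidate query for that earlier-warning alert, as a sketch; it would need to be wired into whatever custom-metric mechanism our Datadog Postgres integration uses, which isn't confirmed here:

```sql
-- Sessions stuck in an open transaction; the 60-second filter matches the
-- liveness-probe threshold above and ignores transactions merely between statements.
SELECT count(*) AS idle_in_transaction_count
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '60 seconds';
```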

## Source

Extracted from `#incidents` thread on **2026-04-12 03:14–03:27 UTC**. 10 messages. Resolved in ~12 minutes.

## See also

- [[companies/acme-logistics]] — tenant context
- [[people/maya]] — co-owner
- [[people/deepak]] — co-owner; backend-lead escalation
- [[skills/payment-stuck-pending.skill]] — related infra (Stripe webhook backlog)
- [[skills/slow-api-bad-deploy.skill]] — overlapping symptom (p99 spike on prod-rds)