wiki/skills/auth-jwks-rotation-spike.skill.md

Auth JWKS Rotation Spike

Sudden burst of 401 errors with no recent deploy. Usually caused by an Auth0 signing-key rotation that leaves cached JWKS public keys stale. Trigger on 401 rate >5× baseline.

Version: v1.0.0
Confidence: 88%
Last verified: 2026-04-19
Owners: @sam

Extracted from

- #incidents · 2026-04-19 · maya, sam, deepak
- #incidents · 2026-03-08 (referenced)

# Auth0 JWKS Rotation Spike

## Symptoms

- Datadog: `auth.401 rate > 10× baseline`
- 5–15% of `/api/*` requests return `401 Unauthorized`
- No recent deploys (last one was more than 12h ago)
- Sudden onset (spike developed within 2 min)

## Diagnosis (3 min)

1. **Rule out a deploy** — `gh release list --limit 5`. If a release went out in the last hour, this is likely *not* JWKS — escalate to that release owner.
2. **Check Auth0 dashboard** → Security → Keys → look at "Last rotated" timestamp. If it matches your spike onset within ~5 min, you've found it.
3. **Confirm by checking the JWT verifier logs** — look for `kid mismatch` or `unable to find a signing key that matches`.
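
Step 3 can be cross-checked from a failing request. The sketch below is a minimal illustration in Python, not the workers' actual verifier: it reads the `kid` from a rejected token's header and tests it against a JWKS document you have already fetched (for Auth0 tenants this is normally served at `https://<tenant-domain>/.well-known/jwks.json`). The function names `jwt_kid` and `kid_is_published` are hypothetical, and no signature verification is performed.

```python
import base64
import json
from typing import Optional


def jwt_kid(token: str) -> Optional[str]:
    """Return the `kid` from a JWT header. Does NOT verify the signature."""
    header_b64 = token.split(".")[0]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    header = json.loads(base64.urlsafe_b64decode(padded))
    return header.get("kid")


def kid_is_published(token: str, jwks: dict) -> bool:
    """True if the token's kid appears in the given JWKS document."""
    published = {key.get("kid") for key in jwks.get("keys", [])}
    kid = jwt_kid(token)
    return kid is not None and kid in published
```

If a rejected token's `kid` is present in the currently published JWKS while the worker still logs a key-lookup failure, the worker's cached key set is probably stale, which is consistent with a rotation.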

## Resolution (5 min)

1. **Force JWKS cache refresh** in the API workers:
   ```bash
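   # Easiest: rolling restart of api-worker (the in-process jwks-rsa cache is rebuilt on first request)
   kubectl rollout restart deployment/api-worker -n prod
   # Block until the rollout has finished before checking the 401 rate
   kubectl rollout status deployment/api-worker -n prod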
   ```
2. **Reduce JWKS cache TTL** in the API config if it is still 24h (the default was lowered to 1h in v3.4.0):
   ```yaml
   auth:
     jwks:
       cache_ttl_seconds: 3600   # was: 86400
   ```
3. **Verify** — 401 rate should return to baseline within 60s of the rolling restart.
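
To see why the TTL change in step 2 matters, the following sketch estimates how long a single worker would keep rejecting new-key tokens after a rotation. It assumes the worker refreshes its JWKS only when the TTL expires and that tokens issued after the rotation are signed with the new key; the function name is hypothetical.

```python
def stale_window_seconds(last_refresh: float, rotated_at: float, ttl_seconds: float) -> float:
    """Seconds a worker serves a stale key set after a rotation.

    Assumes the cache is refreshed only on TTL expiry (no restart,
    no explicit invalidation). All arguments share one time base.
    """
    if rotated_at <= last_refresh:
        # The last refresh already picked up the rotated key.
        return 0.0
    expires_at = last_refresh + ttl_seconds
    return max(0.0, expires_at - rotated_at)


# Worker refreshed at t=0, rotation 30 minutes later:
print(stale_window_seconds(0, 1800, 86400))  # 24h TTL -> 84600
print(stale_window_seconds(0, 1800, 3600))   # 1h TTL  -> 1800
```

A shorter TTL narrows the worst case but does not remove it, which is why the rolling restart is still the immediate fix.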

## Escalation

- **Primary**: `@sam` (platform, owns auth)
- **If Auth0 dashboard is also down** → check status.auth0.com, escalate to vendor support
- **If 401s persist after restart** → this is *not* JWKS rotation; treat as a real auth bug, page `@deepak` (backend lead)

## Why this keeps happening

This was the second occurrence (first: **2026-03-08**). Underlying cause: we *poll* JWKS on a TTL instead of subscribing to Auth0's webhook for key rotation events.
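
The difference between polling on a TTL and invalidating on an event can be sketched as follows. This is an illustration in Python under stated assumptions, not the workers' jwks-rsa code: `JwksCache`, its `fetch` callback (assumed to return a `{kid: jwk}` mapping) and the injected `clock` are all hypothetical names.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """TTL cache of signing keys with an explicit invalidation hook."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, dict]],  # hypothetical: returns {kid: jwk}
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._expires_at = float("-inf")  # forces a fetch on first use

    def get(self, kid: str) -> Optional[dict]:
        """Return the key for `kid`, refetching only when the TTL has expired."""
        if self._clock() >= self._expires_at:
            self._keys = self._fetch()
            self._expires_at = self._clock() + self._ttl
        return self._keys.get(kid)

    def invalidate(self) -> None:
        """Drop freshness so the next get() refetches, e.g. on a rotation event."""
        self._expires_at = float("-inf")
```

With TTL polling alone, a key published just after a refresh stays invisible until the TTL runs out; calling `invalidate()` from a rotation-event handler closes that gap without a restart.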

## Prevention follow-ups

- **Subscribe to Auth0 webhook** for `client.credentials.rotated` events → invalidate JWKS cache on receipt (ticket open since March)
- **Reduce default JWKS TTL** to 1h (done in v3.4.0)
- **Add specific alert** on `kid mismatch` log lines → fires earlier than the generic 401 rate alert
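
The log-based alert in the last item could start from a matcher like the one below. It is a rough sketch: the two patterns are the messages named in the Diagnosis section, while the function names and the threshold of 5 lines per evaluation window are assumptions to tune against real traffic.

```python
import re
from typing import Iterable

# Messages taken from the Diagnosis section; matched case-insensitively.
KID_ERROR = re.compile(
    r"kid mismatch|unable to find a signing key that matches",
    re.IGNORECASE,
)


def count_kid_errors(lines: Iterable[str]) -> int:
    """Count log lines that look like a signing-key lookup failure."""
    return sum(1 for line in lines if KID_ERROR.search(line))


def should_alert(lines: Iterable[str], threshold: int = 5) -> bool:
    """True when one window of log lines has at least `threshold` matches.

    The default threshold is a placeholder, not a tuned value.
    """
    return count_kid_errors(lines) >= threshold
```

In practice this logic would more likely live in a log-monitor query than in application code; the sketch only pins down what the monitor should match.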

## Source

Extracted from `#incidents` thread on **2026-04-19 11:42–12:03 UTC**. 8 messages. Resolved in ~10 minutes. Cross-references **2026-03-08** thread (not in current corpus, manually annotated by @maya).

## See also

- [[companies/acme-logistics]] — tenant context
- [[people/sam]] — primary owner; cache-layer escalations
- [[people/maya]] — co-author from the 2026-04-19 incident
- [[skills/slow-api-bad-deploy.skill]] — alternate hypothesis when 401s coincide with a deploy
- [[skills/redis-cache-eviction.skill]] — adjacent runbook (cache patterns)