wiki/skills/auth-jwks-rotation-spike.skill.md
Auth Jwks Rotation Spike
Sudden burst of 401 errors with no recent deploy — usually Auth0 JWKS key rotation invalidating cached public keys. Trigger on 401 rate >5x baseline.
- Version: v1.0.0
- Confidence: 88%
- Last verified: 2026-04-19
- Owners: @sam
Extracted from
- → #incidents · 2026-04-19 · maya, sam, deepak
- → #incidents · 2026-03-08 (referenced)
# Auth0 JWKS Rotation Spike
## Symptoms
- Datadog: `auth.401 rate > 10× baseline`
- 5–15% of `/api/*` requests return `401 Unauthorized`
- No recent deploys (last deploy was more than 12h ago)
- Started suddenly (within 2 min)
## Diagnosis (3 min)
1. **Rule out a deploy** — `gh release list --limit 5`. If a release went out in the last hour, this is likely *not* JWKS — escalate to that release owner.
2. **Check Auth0 dashboard** → Security → Keys → look at "Last rotated" timestamp. If it matches your spike onset within ~5 min, you've found it.
3. **Confirm by checking the JWT verifier logs** — look for `kid mismatch` or `unable to find a signing key that matches`.
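
The log check in step 3 can be scripted roughly as follows. This is a minimal sketch, not an established tool: `count_kid_errors` is a hypothetical helper name, and the usage line assumes the JWT verifier writes to the `api-worker` pods' stdout, which the incident thread does not confirm. The two match strings are the ones quoted in step 3.

```bash
# Count log lines that point at a signing-key lookup failure.
# Reads log text on stdin and prints the number of matching lines.
count_kid_errors() {
  # grep -c prints 0 but exits 1 when nothing matches;
  # "|| true" keeps that from aborting scripts that use `set -e`.
  grep -c -i -E 'kid mismatch|unable to find a signing key that matches' || true
}

# Usage (assumption: verifier logs go to pod stdout):
#   kubectl logs deployment/api-worker -n prod --since=15m | count_kid_errors
```

A count well above zero that lines up with the spike onset supports the JWKS hypothesis. Note that `kubectl logs deployment/...` reads from a single pod, so a zero from one pod is weaker evidence than a non-zero count.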
## Resolution (5 min)
1. **Force JWKS cache refresh** in the API workers:
```bash
# Easiest: rolling restart api-worker (the in-process jwks-rsa cache rebuilds on first request)
kubectl rollout restart deployment/api-worker -n prod
```
2. **Reduce JWKS cache TTL** in the API config (it's likely 24h by default — set to 1h):
```yaml
auth:
  jwks:
    cache_ttl_seconds: 3600  # was: 86400
```
3. **Verify** — 401 rate should return to baseline within 60s of the rolling restart.
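
Because the 60s window in step 3 is measured from the rolling restart, it helps to know when the rollout has actually finished. The sketch below wraps the restart from step 1 with `kubectl rollout status`, which blocks until the rollout completes. `restart_and_wait` is a hypothetical helper name and the `180s` timeout is an assumed value, not something from the incident thread.

```bash
# Restart a deployment and block until the rollout completes or times out.
# All arguments are optional; the first two default to the names used in this runbook.
restart_and_wait() {
  local deploy=${1:-api-worker} ns=${2:-prod} timeout=${3:-180s}
  kubectl rollout restart "deployment/$deploy" -n "$ns" || return 1
  # `rollout status` exits non-zero if the rollout does not finish in time.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"
}

# Usage:
#   restart_and_wait && echo "rollout done; check the 401 rate in Datadog now"
```

If the function returns non-zero, the restart itself did not complete, so a persisting 401 rate at that point does not yet rule out JWKS rotation.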
## Escalation
- **Primary**: `@sam` (platform, owns auth)
- **If Auth0 dashboard is also down** → check status.auth0.com, escalate to vendor support
- **If 401s persist after restart** → this is *not* JWKS rotation; treat as a real auth bug, page `@deepak` (backend lead)
## Why this keeps happening
This was the second occurrence (first: **2026-03-08**). Underlying cause: we *poll* JWKS on a TTL instead of subscribing to Auth0's webhook for key rotation events.
## Prevention follow-ups
- **Subscribe to Auth0 webhook** for `client.credentials.rotated` events → invalidate JWKS cache on receipt (ticket open since March)
- **Reduce default JWKS TTL** to 1h (done in v3.4.0)
- **Add specific alert** on `kid mismatch` log lines → fires earlier than the generic 401 rate alert
## Source
Extracted from `#incidents` thread on **2026-04-19 11:42–12:03 UTC**. 8 messages. Resolved in ~10 minutes. Cross-references **2026-03-08** thread (not in current corpus, manually annotated by @maya).
## See also
- [[companies/acme-logistics]] — tenant context
- [[people/sam]] — primary owner; cache-layer escalations
- [[people/maya]] — co-author from the 2026-04-19 incident
- [[skills/slow-api-bad-deploy.skill]] — alternate hypothesis when 401s coincide with a deploy
- [[skills/redis-cache-eviction.skill]] — adjacent runbook (cache patterns)