wiki/skills/redis-cache-eviction.skill.md
Redis Cache Eviction
Redis cache miss rate spikes alongside memory > 90% — usually a recent change to cache TTL or key prefix.
- Version: v1.0.0
- Confidence: 90%
- Last verified: 2026-05-01
- Owners: @sam
Extracted from
- → #incidents · 2026-05-01 · sam, deepak, maya
# Redis Cache Eviction Storm
## Symptoms
- Datadog: `redis cache miss rate > 50%` (baseline 10–15%)
- Datadog: `redis instance memory > 90%`
- Eviction rate (`evicted_keys`) climbing
- API p95 latency 2–3× baseline (cache misses fall through to Postgres)
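When Datadog is lagging or unavailable, the miss rate can be approximated straight from Redis. The sketch below assumes two `redis-cli INFO stats` snapshots saved to files; `miss_rate_pct` is a hypothetical helper name, and the result may not match the Datadog monitor exactly because that monitor's formula is not documented here. `keyspace_hits` and `keyspace_misses` are cumulative counters, so the function diffs two snapshots rather than reading one.

```bash
# Prints the cache miss rate (%) between two `redis-cli INFO stats` snapshots.
# $1 = older snapshot file, $2 = newer snapshot file.
miss_rate_pct() {
  awk -F: '
    # "+ 0" forces a numeric value and drops the trailing \r on INFO lines.
    $1 == "keyspace_hits"   { hits[FILENAME]   = $2 + 0 }
    $1 == "keyspace_misses" { misses[FILENAME] = $2 + 0 }
    END {
      dh = hits[ARGV[2]]   - hits[ARGV[1]]
      dm = misses[ARGV[2]] - misses[ARGV[1]]
      if (dh + dm <= 0) { print "n/a"; exit }   # no lookups between snapshots
      printf "%.1f\n", 100 * dm / (dh + dm)
    }' "$1" "$2"
}

# Usage (paths are placeholders):
#   redis-cli INFO stats > /tmp/stats.0; sleep 60; redis-cli INFO stats > /tmp/stats.1
#   miss_rate_pct /tmp/stats.0 /tmp/stats.1
```

A value well above the 10–15% baseline over a one-minute window points the same way as the Datadog alert.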
## Diagnosis (5 min)
1. **Confirm in ElastiCache console** — open prod-redis cluster, check Memory Usage and Eviction Count.
2. **Find what's filling memory** — most common: someone deployed a new cache prefix with a bad TTL.
```bash
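# SSH to a redis-cli capable host, then:
# Sample the keyspace with SCAN and report the largest key of each type.
redis-cli --bigkeys
# Peek at key names from the first SCAN page to spot an unfamiliar prefix.
redis-cli SCAN 0 MATCH "*" COUNT 100 | head -20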
```
3. **Look at recent deploys** affecting cache code (last 24h):
```bash
# `date -v-1d` is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 day ago' +%Y-%m-%d
gh pr list --state merged --search "merged:>=$(date -u -v-1d +%Y-%m-%d)"
```
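For step 2, `--bigkeys` surfaces individual large keys, but a bad TTL usually shows up as a very large number of keys under one prefix. A minimal sketch for counting keys per prefix follows. It assumes prefixes are the part of the key before the first `:` (as in `shipment-events:*`); `prefix_histogram` is a hypothetical helper name.

```bash
# Reads key names on stdin, prints "count prefix" lines, largest count first.
prefix_histogram() {
  awk -F: '
    { p = ($0 ~ /:/) ? $1 : "(no prefix)"; count[p]++ }
    END { for (p in count) print count[p], p }
  ' | sort -rn
}

# Usage: sample the first 100k keys instead of walking the whole keyspace.
#   redis-cli --scan | head -n 100000 | prefix_histogram | head
```

This counts keys, not bytes, so read it alongside the `--bigkeys` output rather than instead of it.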
## Resolution (5–10 min)
1. **Flush the offending prefix only** (do NOT FLUSHALL):
```bash
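# Batches of 500 keep each DEL short; on Redis >= 4.0, UNLINK frees the memory in the background instead.
redis-cli --scan --pattern "shipment-events:*" | xargs -n 500 redis-cli DEL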
```
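…or on older Redis versions whose `redis-cli` lacks `--scan` (note that `KEYS` blocks the server while it walks the entire keyspace, so expect a brief stall):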
```bash
redis-cli EVAL "for _,k in ipairs(redis.call('keys', ARGV[1])) do redis.call('del', k) end" 0 "shipment-events:*"
```
2. **Verify memory drops** in ElastiCache console.
3. **Patch the TTL** in code immediately and ship — cache will refill correctly with sane TTL.
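Step 2 can also be checked from the CLI if the console is slow to refresh. The sketch below parses `redis-cli INFO memory` and prints `used_memory` as a percentage of `maxmemory`; `memory_used_pct` is a hypothetical helper name, and the figure may differ slightly from the one the ElastiCache console reports.

```bash
# Reads `redis-cli INFO memory` output on stdin and prints used_memory as a
# percentage of maxmemory.
memory_used_pct() {
  awk -F: '
    # "+ 0" forces a numeric value and drops the trailing \r on INFO lines.
    $1 == "used_memory" { used  = $2 + 0 }
    $1 == "maxmemory"   { limit = $2 + 0 }
    END {
      if (limit == 0) { print "n/a"; exit }   # maxmemory 0 means no limit is set
      printf "%.1f\n", 100 * used / limit
    }'
}

# Usage: run before and after the flush and compare.
#   redis-cli INFO memory | memory_used_pct
```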
## Escalation
- **Primary**: `@sam` (platform, owns ElastiCache)
- **Code owner**: depends on the offending prefix — find via `git blame` on the cache code
## Common root causes
- **Wrong TTL units** (April 30: `30d` instead of `30m` — 1440× larger)
- **Cache stampede** after a key invalidation (March 11)
- **Customer-driven cardinality** — single customer hitting a per-tenant cache key with thousands of unique values (rare)
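The wrong-units case is easy to sanity-check with arithmetic. The sketch below converts TTL strings such as `30m` and `30d` to seconds and reproduces the 1440× ratio; `ttl_seconds` is a hypothetical helper, and the suffix format is an assumption about how TTLs are written in the cache config.

```bash
# Converts a TTL like 45s, 30m, 12h, or 30d to seconds.
# Expects a plain decimal integer (no leading zeros) plus one unit suffix.
ttl_seconds() {
  local num unit mult
  num="${1%[smhd]}"      # strip one trailing unit letter, if present
  unit="${1#"$num"}"     # whatever was stripped (empty if no unit)
  case "$num" in
    ''|*[!0-9]*) echo "bad TTL: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    s) mult=1 ;;
    m) mult=60 ;;
    h) mult=3600 ;;
    d) mult=86400 ;;
    *) echo "bad TTL unit in: $1" >&2; return 1 ;;
  esac
  echo $(( num * mult ))
}

long=$(ttl_seconds 30d)    # 2592000
short=$(ttl_seconds 30m)   # 1800
echo "ratio: $(( long / short ))"   # ratio: 1440
```

With a TTL 1440× longer than intended, keys that should expire within half an hour stay resident for a month, which is what pushes memory past the eviction threshold.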
## Prevention follow-ups
- **Code guard**: reject any cache TTL > 24h without an explicit `--allow-long-ttl` flag (shipped in v3.4.2)
- **Memory budget per prefix** — nobody owns this yet, ticket #5102
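To illustrate the shape of the code guard, here is a minimal sketch of a 24h TTL check with an explicit opt-in. It is not the guard shipped in v3.4.2, whose implementation is not shown in the source thread; `check_ttl` and `MAX_TTL_SECONDS` are hypothetical names, and only the 24h limit and the `--allow-long-ttl` flag come from the item above.

```bash
MAX_TTL_SECONDS=86400   # 24h, the limit named in the code-guard item

# $1: TTL in whole seconds; $2: optional literal "--allow-long-ttl" opt-in.
# Returns 0 if the TTL is acceptable, 1 if it is too long, 2 if it is malformed.
check_ttl() {
  local ttl="$1" flag="${2:-}"
  case "$ttl" in
    ''|*[!0-9]*) echo "TTL must be whole seconds, got: $ttl" >&2; return 2 ;;
  esac
  if [ "$ttl" -gt "$MAX_TTL_SECONDS" ] && [ "$flag" != "--allow-long-ttl" ]; then
    echo "TTL ${ttl}s exceeds 24h; pass --allow-long-ttl to override" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_ttl 1800                       # ok (30m)
#   check_ttl 2592000                    # rejected (30d)
#   check_ttl 2592000 --allow-long-ttl   # ok, explicit opt-in
```

A check like this would have rejected the April 30 value at review or deploy time, since 30 days is far above the 24h ceiling.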
## Source
Extracted from `#incidents` thread on **2026-05-01 09:11–09:30 UTC**. 11 messages. Resolved in ~10 minutes.
## See also
- [[companies/acme-logistics]] — tenant context
- [[people/sam]] — primary owner; cache layer
- [[skills/slow-api-bad-deploy.skill]] — cache miss → API slowness causal chain
- [[skills/auth-jwks-rotation-spike.skill]] — adjacent (JWKS cache TTL pattern)