wiki/skills/redis-cache-eviction.skill.md

Redis Cache Eviction

Redis cache miss rate spikes while instance memory sits above 90%. The usual cause is a recent change to a cache TTL or key prefix.

- **Version**: v1.0.0
- **Confidence**: 90%
- **Last verified**: 2026-05-01
- **Owners**: @sam

Extracted from: #incidents · 2026-05-01 · sam, deepak, maya

# Redis Cache Eviction Storm

## Symptoms

- Datadog: `redis cache miss rate > 50%` (baseline 10–15%)
- Datadog: `redis instance memory > 90%`
- Eviction rate climbing (the `evicted_keys` counter in `INFO stats` rising quickly)
- API p95 latency 2–3× baseline (cache misses fall through to Postgres)
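
For a quick cross-check of the Datadog numbers from a shell, the miss rate can be derived from the `keyspace_hits` and `keyspace_misses` counters in `INFO stats`. The sketch below is one way to do that; the function name `miss_rate` and the `REDIS_HOST` variable are placeholders, not part of the original thread. Both counters are cumulative since the last restart or `CONFIG RESETSTAT`, so the result is a lifetime average rather than the windowed rate a dashboard shows.

```shell
# Print the cache miss rate (%) computed from `INFO stats` output on stdin.
miss_rate() {
  # INFO lines end in \r\n, so strip carriage returns before parsing.
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { hits = $2 }
    $1 == "keyspace_misses" { misses = $2 }
    END {
      total = hits + misses
      if (total == 0) { print "0.0"; exit }   # no lookups recorded yet
      printf "%.1f\n", 100 * misses / total
    }'
}

# Usage against a live instance (REDIS_HOST is a placeholder endpoint):
#   redis-cli -h "$REDIS_HOST" INFO stats | miss_rate
```

To approximate the current rate instead, take two snapshots about a minute apart and compare the deltas of the two counters.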

## Diagnosis (5 min)

1. **Confirm in ElastiCache console** — open prod-redis cluster, check Memory Usage and Eviction Count.
2. **Find what's filling memory** — most common: someone deployed a new cache prefix with a bad TTL.
   ```bash
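   # SSH to a host that has redis-cli and can reach the cluster, then:
   redis-cli --bigkeys                                # walks the keyspace with SCAN; reports the largest key per type
   redis-cli SCAN 0 MATCH "*" COUNT 100 | head -20    # first output line is the next cursor; the rest are sample key names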
   ```
3. **Look at recent deploys** affecting cache code (last 24h):
   ```bash
   # `date -v-1d` is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 day ago' +%Y-%m-%d
   gh pr list --state merged --search "merged:>=$(date -u -v-1d +%Y-%m-%d)"
   ```
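
When `--bigkeys` does not make the culprit obvious, counting sampled keys per prefix can help narrow it down. This is a minimal sketch, assuming keys follow a `prefix:rest` naming scheme (as `shipment-events:*` does); `prefix_counts` and `REDIS_HOST` are hypothetical names. Key count is only a proxy for memory; `MEMORY USAGE <key>` (Redis 4.0+) reports bytes for an individual key.

```shell
# Read key names on stdin and print "<count> <prefix>" lines, largest first.
# The prefix is everything before the first ":"; keys without ":" count as their own prefix.
prefix_counts() {
  awk -F: 'NF { count[$1]++ } END { for (p in count) print count[p], p }' |
    sort -k1,1nr -k2,2
}

# Usage (REDIS_HOST is a placeholder; head caps the sample size):
#   redis-cli -h "$REDIS_HOST" --scan | head -n 100000 | prefix_counts | head
```

A prefix that dominates the sample and also shows up in a PR merged in the last 24 hours is a strong candidate for the offending one.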

## Resolution (5–10 min)

1. **Flush the offending prefix only** (do NOT FLUSHALL):
   ```bash
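   # Batches of 500 keep each DEL short; on Redis >= 4.0, UNLINK can replace DEL to free memory in the background
   redis-cli --scan --pattern "shipment-events:*" | xargs -n 500 redis-cli DEL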
   ```
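   Or, with an older redis-cli that lacks `--scan`, use a Lua script. Note that `KEYS` blocks the server while it walks the whole keyspace, so expect a brief stall on a large instance: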
   ```bash
   redis-cli EVAL "for _,k in ipairs(redis.call('keys', ARGV[1])) do redis.call('del', k) end" 0 "shipment-events:*"
   ```
2. **Verify memory drops** in ElastiCache console.
3. **Patch the TTL** in code and ship it immediately. Until the fix is deployed, new writes recreate the oversized keys; afterwards the cache refills with a sane TTL.
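
The memory check in step 2 can also be done from the shell rather than the console. The sketch below reads `used_memory` and `maxmemory` from `INFO memory` and prints the percentage in use; `mem_pct` and `REDIS_HOST` are illustrative names, and it assumes the instance reports a non-zero `maxmemory`.

```shell
# Print used memory as a percentage of maxmemory, from `INFO memory` output on stdin.
mem_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }    # exact match, so used_memory_rss etc. are ignored
    $1 == "maxmemory"   { limit = $2 }
    END {
      if (limit + 0 == 0) { print "n/a"; exit }   # maxmemory 0 means no limit is configured
      printf "%.1f\n", 100 * used / limit
    }'
}

# Usage (REDIS_HOST is a placeholder endpoint):
#   redis-cli -h "$REDIS_HOST" INFO memory | mem_pct
```

Running it before and after the flush gives a direct before/after comparison against the 90% alert threshold.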

## Escalation

- **Primary**: `@sam` (platform, owns ElastiCache)
- **Code owner**: depends on the offending prefix — find via `git blame` on the cache code

## Common root causes

- **Wrong TTL units** (April 30: `30d` instead of `30m` — 1440× larger)
- **Cache stampede** after a key invalidation (March 11)
- **Customer-driven cardinality** — single customer hitting a per-tenant cache key with thousands of unique values (rare)

## Prevention follow-ups

- **Code guard**: reject any cache TTL > 24h without an explicit `--allow-long-ttl` flag (shipped in v3.4.2)
- **Memory budget per prefix**: currently unowned; tracked in ticket #5102
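
To illustrate the kind of check the code guard performs, here is a sketch in shell. It is not the guard shipped in v3.4.2, whose implementation is not shown in the thread; `ttl_to_seconds` and `check_ttl` are hypothetical helpers, and passing `--allow-long-ttl` as a second argument is an assumed calling convention.

```shell
# Convert a TTL such as 90s, 30m, 12h or 30d to seconds.
# Returns 1 (printing nothing) for anything that is not <digits><unit>.
ttl_to_seconds() {
  case $1 in
    ''|*[!0-9smhd]*) return 1 ;;
  esac
  _n=${1%?}          # everything except the last character
  _unit=${1#"$_n"}   # the last character
  case $_n in
    ''|*[!0-9]*|0?*) return 1 ;;   # empty, non-numeric, or leading zero
  esac
  case $_unit in
    s) _mult=1 ;;
    m) _mult=60 ;;
    h) _mult=3600 ;;
    d) _mult=86400 ;;
    *) return 1 ;;
  esac
  echo $(( _n * _mult ))
}

# Succeed if the TTL is acceptable; fail if it is invalid or longer than 24h
# without the explicit --allow-long-ttl override.
check_ttl() {
  _secs=$(ttl_to_seconds "$1") || { echo "invalid TTL: $1" >&2; return 1; }
  if [ "$_secs" -gt 86400 ] && [ "${2:-}" != "--allow-long-ttl" ]; then
    echo "TTL $1 exceeds 24h; pass --allow-long-ttl to override" >&2
    return 1
  fi
  return 0
}
```

With these helpers, `check_ttl 30m` succeeds while `check_ttl 30d` fails unless `--allow-long-ttl` is passed, which is the class of mistake behind the April 30 incident: `30d` is 2592000 seconds against 1800 for `30m`, a factor of 1440.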

## Source

Extracted from `#incidents` thread on **2026-05-01 09:11–09:30 UTC**. 11 messages. Resolved in ~10 minutes.

## See also

- [[companies/acme-logistics]] — tenant context
- [[people/sam]] — primary owner; cache layer
- [[skills/slow-api-bad-deploy.skill]] — cache miss → API slowness causal chain
- [[skills/auth-jwks-rotation-spike.skill]] — adjacent (JWKS cache TTL pattern)