wiki/skills/payment-stuck-pending.skill.md
Payment Stuck Pending
Stripe webhook backlog growing — orders stuck in pending state. Almost always one customer sending oversized webhook payloads.
- Version: v1.0.0
- Confidence: 85%
- Last verified: 2026-04-25
- Owners: @maya @deepak
- Extracted from: #incidents · 2026-04-25 · maya, deepak
# Payment Stuck — Pending
## Symptoms
- Webhook bot fires: `webhook backlog: NN events queued > 5min`
- Customer-facing: orders stay in `pending` state past the expected ~30s window
- Mostly affects `payment_intent.succeeded` events
- Customer Support starts getting tickets within ~10 min if not resolved
## Diagnosis (5 min)
1. **Check the queue depth** in the order-finalizer worker dashboard. Confirm it's growing, not draining.
2. **Look at the head of the queue** — what's the `payment_intent.id` and customer? Tail logs:
```bash
kubectl logs -n prod deployment/order-finalizer --tail=20 | grep -i "processing"
```
3. **Check the metadata size** of the stuck event. If it's > 1MB, you've found it — single-customer issue.
4. **Identify the customer** by `cust_*` ID. Check whether it's a high-volume customer, since that means more affected orders.
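Steps 3 and 4 can be sketched as a small helper for inspecting the event at the head of the queue. This is a minimal sketch, not the worker's actual code: it assumes the event is available as a parsed dict in the standard Stripe shape (`data.object.metadata`, `data.object.customer`), and the function names and the reading of "1MB" as 10^6 bytes are illustrative assumptions.

```python
import json

ONE_MB = 1_000_000  # assumption: "> 1MB" from step 3 read as 10^6 bytes


def metadata_size_bytes(metadata: dict) -> int:
    """Return the size of the metadata once serialized to UTF-8 JSON."""
    return len(json.dumps(metadata).encode("utf-8"))


def describe_head(event: dict, limit: int = ONE_MB) -> dict:
    """Summarize the stuck event: which intent, which customer, how big."""
    obj = event.get("data", {}).get("object", {})
    size = metadata_size_bytes(obj.get("metadata", {}))
    return {
        "payment_intent_id": obj.get("id"),
        "customer": obj.get("customer"),
        "metadata_bytes": size,
        "oversized": size > limit,
    }
```

If `oversized` comes back true, the `customer` field is the ID to look up in step 4.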
## Resolution (10 min)
1. **Truncate the offending metadata** at the worker level — deploy this guard:
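```python
import json
original_size = len(json.dumps(event.metadata))  # serialized size, in characters
if original_size > 4096:  # replace oversized metadata with a small marker
    event.metadata = {"_truncated": True, "_original_size": original_size}
```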
2. **Replay the stuck queue** from the last successfully-processed event ID.
3. **Email the customer** about the metadata format issue. Use the standard template `templates/customer-webhook-oversized.txt`.
4. **Verify** — queue should drain to 0 within 5 min. Check that affected orders move from `pending` → `confirmed`.
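Steps 1 and 2 could be expanded along these lines. This is a hedged sketch rather than the deployed worker code. `guard_metadata` and `events_after` are hypothetical names, and the backlog is assumed to be an ordered sequence of dicts that each carry an `id`. The guard also catches metadata that `json.dumps` cannot serialize at all, such as a circular reference, which is one possible reading of the recursive-object incident listed under Common pattern.

```python
import json
from typing import Iterable, Iterator

MAX_METADATA_CHARS = 4096  # same threshold as the guard in step 1


def guard_metadata(metadata: dict, limit: int = MAX_METADATA_CHARS) -> dict:
    """Return metadata unchanged if small enough, else a truncation marker."""
    try:
        size = len(json.dumps(metadata))
    except (TypeError, ValueError, RecursionError):
        # json.dumps raises ValueError on circular references, TypeError on
        # unserializable values, RecursionError on extremely deep nesting.
        return {"_truncated": True, "_original_size": None}
    if size > limit:
        return {"_truncated": True, "_original_size": size}
    return metadata


def events_after(events: Iterable[dict], last_processed_id: str) -> Iterator[dict]:
    """Yield the events that follow last_processed_id in an ordered backlog."""
    seen = False
    for event in events:
        if seen:
            yield event
        elif event["id"] == last_processed_id:
            seen = True
```

Note that `events_after` yields nothing when the ID is not found in the backlog, so an empty replay is worth double-checking against the logs before assuming the queue is clean.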
## Escalation
- **Primary**: `@maya` (on-call) for triage
- **Backend owner**: `@deepak` for queue/worker code
- **Customer comms**: notify CS lead before emailing the customer (they'll want to know)
## Common pattern
Every time this has happened, it's been **one customer** doing something unusual:
- Northwind Logistics (April 25): 5MB shipment manifest in metadata
- Rivertown Co (Feb 18): Base64-encoded PDF in metadata
- Atlas Freight (Jan 7): Recursive object causing parse loop
→ **First check is always: which customer? what changed for them recently?**
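One way to answer "which customer?" quickly is to group the queued events by customer and rank by total metadata size. The sketch below assumes the backlog can be read as a list of parsed Stripe-shaped event dicts; `backlog_by_customer` is a hypothetical helper, not an existing tool.

```python
import json
from collections import defaultdict


def backlog_by_customer(events):
    """Summarize queued events per customer: event count and total metadata size."""
    summary = defaultdict(lambda: {"events": 0, "metadata_bytes": 0})
    for event in events:
        obj = event["data"]["object"]
        entry = summary[obj.get("customer") or "unknown"]
        entry["events"] += 1
        # json.dumps emits ASCII by default, so character count equals byte count.
        entry["metadata_bytes"] += len(json.dumps(obj.get("metadata", {})))
    # Largest total metadata first, so the likely culprit is at the top.
    return sorted(
        summary.items(), key=lambda kv: kv[1]["metadata_bytes"], reverse=True
    )
```

If the pattern above holds, the first entry of the result should stand out by orders of magnitude.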
## Prevention follow-ups
- **Validate webhook payload size at edge** before queueing (ticket #4382, in progress)
- **Per-customer queue isolation** so one bad actor can't block everyone
- **Document the Stripe metadata size limit** in our public API contract
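The first two follow-ups could look roughly like this. It is an illustrative sketch only: the source names no edge limit, so `MAX_PAYLOAD_BYTES` is a made-up value, and `accept_payload` and `PerCustomerQueues` are hypothetical names rather than the design tracked in ticket #4382.

```python
from collections import OrderedDict, deque
from typing import Optional

MAX_PAYLOAD_BYTES = 256 * 1024  # hypothetical edge limit; the runbook names no value


def accept_payload(raw_body: bytes, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """Edge check: decide whether a webhook body is small enough to queue."""
    return len(raw_body) <= limit


class PerCustomerQueues:
    """One FIFO per customer, served round-robin across customers."""

    def __init__(self) -> None:
        self._queues: "OrderedDict[str, deque]" = OrderedDict()

    def enqueue(self, customer_id: str, event: dict) -> None:
        self._queues.setdefault(customer_id, deque()).append(event)

    def next_event(self) -> Optional[dict]:
        """Pop one event from the front customer, then rotate that customer to the back."""
        if not self._queues:
            return None
        customer_id, queue = next(iter(self._queues.items()))
        event = queue.popleft()
        if queue:
            self._queues.move_to_end(customer_id)
        else:
            del self._queues[customer_id]
        return event
```

Round-robin only fixes ordering: a backlog from one customer no longer delays everyone else's events. An event that hangs the worker outright would still need a processing timeout or separate workers.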
## Source
Extracted from `#incidents` thread on **2026-04-25 14:08–14:30 UTC**. 10 messages. Resolved in ~22 minutes. Affected ~200 orders for one customer.
## See also
- [[companies/acme-logistics]] — tenant context
- [[people/maya]] — co-owner
- [[people/deepak]] — co-owner
- [[skills/db-connection-pool-exhausted.skill]] — related infra (DB backpressure)
- [[skills/slow-api-bad-deploy.skill]] — adjacent eng runbook