LLM Ops Runbook

Role-Based · Chain-of-Thought

Prompt

**Role:** SRE specialized in LLM-powered services.

**Context:** Team is on-call for LLM features but has no runbook. Incidents take hours to diagnose because everyone's debugging from scratch.

**Task:** Build the runbook:
1. **Alert: high latency.** Diagnosis steps, common causes (provider outage / queue backlog / model upgrade), mitigations.
2. **Alert: high cost.** Diagnosis (runaway loop / new traffic / model drift), mitigations.
3. **Alert: refusal rate spike.** Causes (prompt change / model change / hostile traffic), mitigations.
4. **Alert: hallucination rate spike.** Causes, mitigations.
5. **Alert: tool failure rate spike.** Causes, mitigations.
6. **Alert: provider 5xx surge.** Failover protocol.
7. **Alert: customer-visible regression.** Rollback protocol, comms template.
8. **Postmortem template** for LLM-specific incidents.
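
To make item 6 concrete, a failover protocol can be sketched as an ordered list of providers tried in turn. This is a minimal illustration, not a prescribed design: `ProviderError`, `Provider`, and `complete_with_failover` are hypothetical names, and a real client would raise its own exception types and add timeouts and retries.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider returns an HTTP error status."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


@dataclass
class Provider:
    name: str                    # illustrative label, e.g. "primary"
    call: Callable[[str], str]   # takes a prompt, returns a completion


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; move to the next one only on a 5xx."""
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider.name, provider.call(prompt)
        except ProviderError as err:
            if err.status < 500:
                raise  # a 4xx points at the request, not an outage; do not fail over
            last_error = err
    raise RuntimeError("all providers failed") from last_error
```

Returning the provider name alongside the completion lets the caller log which provider served each request, which is useful evidence during the postmortem.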

**Constraints:**
- Every entry has named owners + escalation chain.
- Mitigations have time bounds (try X for 5 min, then escalate).
- Rollback procedures are tested quarterly.
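
The time-bound constraint ("try X for 5 min, then escalate") can be expressed as data plus a small loop. The sketch below is one possible shape under stated assumptions: `Mitigation`, `run_mitigations`, and the field names are hypothetical, and `attempt` stands in for whatever check confirms the alert has cleared.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Mitigation:
    name: str
    owner: str                    # named owner for this step
    budget_s: float               # time bound before escalating
    attempt: Callable[[], bool]   # returns True once the alert has cleared


def run_mitigations(steps, clock=time.monotonic, sleep=time.sleep, poll_s=1.0):
    """Retry each step until it succeeds or its budget expires, then escalate."""
    log = []
    for step in steps:
        deadline = clock() + step.budget_s
        while True:
            if step.attempt():
                log.append((step.name, step.owner, "resolved"))
                return log
            if clock() >= deadline:
                break
            sleep(poll_s)
        # Budget exhausted: record the escalation and hand off to the next step.
        log.append((step.name, step.owner, "escalated"))
    return log
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting, which also makes it practical to exercise the escalation path during the quarterly rollback tests.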

**Output format:** Runbook markdown + on-call cheatsheet + postmortem template.

Recommended models

claude · gpt-4o
