Engineering · 5.0 · 152 ratings

Blameless Postmortem

Outage report that reduces fear of speaking up and produces real action items.

Role-Based · Constraints · Chain-of-Thought

Prompt

**Role:** Site Reliability lead. You've run 30+ postmortems and learned that the best ones surface latent risks the team has been quietly avoiding.

**Context:** Incident: [title]. Severity: [SEV-N]. Duration: [start → restore]. Customer impact: [users affected, $ at risk]. Detected by: [monitoring | customer report | engineer].

**Task:** Write a postmortem that an engineer who wasn't on call can read and learn from. Strict blameless tone: every action by every human is assumed to have been the most reasonable response given what they knew at the time.

1. Timeline: precise UTC timestamps for every detection, action, and state change. Quote chat logs verbatim where relevant.
2. Root cause: walk the causal chain step by step, and distinguish the triggering cause from the contributing factors.
3. What went well: 2-3 specific things — not "the team responded quickly" but "Alice paged Bob within 90 seconds of the alert."
4. What went poorly: 2-3 specific things, framed as systems failures (missing runbook, unclear ownership) NOT human failures.
5. Action items: each one has an owner, a due date, and a clear acceptance criterion. Three buckets: Prevent, Detect Earlier, Mitigate Faster.

**Constraints:**
- NEVER name an individual as the cause
- Quantify customer impact ($ revenue, # affected, P99 latency change)
- For each action item: specify the test that would prove it's done
- No "improve monitoring" without naming the specific alert and threshold

**Output format:** Markdown · sections above · ≤1500 words · include a link to the incident channel.
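
The action-item rules in the prompt (owner, due date, one of three buckets, a test that proves completion, no vague monitoring items) can also be checked mechanically once the postmortem is drafted. The following is a minimal sketch, not part of the prompt itself: the `ActionItem` fields, the `lint_action_item` helper, and the keyword heuristic for vague monitoring items are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date

# The three buckets named in the prompt's action-item section.
BUCKETS = {"Prevent", "Detect Earlier", "Mitigate Faster"}


@dataclass
class ActionItem:
    """Hypothetical record for one postmortem action item."""
    title: str
    owner: str
    due: date
    bucket: str
    acceptance_test: str  # the test that would prove the item is done


def lint_action_item(item: ActionItem) -> list[str]:
    """Return a list of rule violations; an empty list means the item passes."""
    problems = []
    if not item.owner.strip():
        problems.append("missing owner")
    if item.bucket not in BUCKETS:
        problems.append(f"unknown bucket: {item.bucket!r}")
    if not item.acceptance_test.strip():
        problems.append("missing acceptance test")
    # Assumed heuristic: a monitoring item must mention both an alert and a threshold.
    title = item.title.lower()
    if "monitoring" in title and not ("alert" in title and "threshold" in title):
        problems.append("vague monitoring item: name the specific alert and threshold")
    return problems


if __name__ == "__main__":
    draft = ActionItem(
        title="Improve monitoring",
        owner="",
        due=date(2030, 1, 15),
        bucket="Follow-up",
        acceptance_test="",
    )
    for problem in lint_action_item(draft):
        print(problem)
```

A check like this is deliberately narrow: it catches missing fields and the most common vague phrasing, but whether an acceptance test is meaningful still needs a human reviewer.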

Recommended models

claude · gpt-4o
