Structured root cause analysis and prevention protocol for production incidents.
Write the report in the user's language: a Chinese request gets a Chinese report, an English request gets an English report.
Before writing anything, collect raw evidence. Do NOT rely on memory or assumptions.
Required evidence (use tools to retrieve):
exec("grep -i 'error\|fatal\|exception' {logfile} | tail -50")
exec("git log --oneline -10"), exec("git diff HEAD~1 --stat")
exec("systemctl status {service}"), exec("ps aux | grep {process}")
Read any CSVs, configs, or state files involved.
Rule: Every factual claim in the postmortem must cite a source. Format: [Source: {filepath}:{line} or {command output}]
If evidence is unavailable (logs rotated, service restarted), explicitly mark: [Evidence unavailable: {reason}]
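The evidence rule above can be sketched in Python as a helper that runs one command and returns its output together with a citation tag. This is only an illustration: the function name `collect`, the `timeout` parameter, and the choice to cite the command line itself are assumptions, not part of the protocol.

```python
import subprocess


def collect(argv, timeout=30):
    """Run one evidence command; return (output, citation tag).

    Hypothetical helper: if the command cannot run, times out, or
    prints nothing, the tag marks the evidence as unavailable instead
    of leaving the gap silent.
    """
    label = " ".join(argv)
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        reason = f"{label} failed ({type(exc).__name__})"
        return "", f"[Evidence unavailable: {reason}]"
    output = result.stdout.strip()
    if not output:
        return "", f"[Evidence unavailable: {label} returned no output]"
    return output, f"[Source: {label}]"
```

For example, `collect(["git", "log", "--oneline", "-10"])` would return the recent commits plus a `[Source: git log --oneline -10]` tag that can be pasted next to the claim it supports.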
Construct a minute-by-minute (or hour-by-hour) timeline from evidence:
HH:MM UTC — {what happened} [Source: {evidence}]
HH:MM UTC — {what happened} [Source: {evidence}]
...
HH:MM UTC — Incident resolved / mitigated
Include: first symptom, detection, escalation, diagnosis, fix, verification.
Mark the detection gap: time between first symptom and human awareness. This is often the real problem.
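A minimal sketch of how the detection gap might be computed from two timeline entries. The function name and the example times are hypothetical, and the sketch assumes both timestamps are HH:MM in UTC and at most one midnight apart.

```python
from datetime import datetime, timedelta


def detection_gap(first_symptom: str, detected: str) -> timedelta:
    """Time between the first symptom and human awareness.

    Both arguments are "HH:MM" strings in UTC. If the detection time
    is earlier on the clock than the first symptom, the incident is
    assumed to have crossed midnight once.
    """
    fmt = "%H:%M"
    start = datetime.strptime(first_symptom, fmt)
    end = datetime.strptime(detected, fmt)
    if end < start:
        end += timedelta(days=1)
    return end - start


# Hypothetical timeline: first error at 02:14, on-call paged at 03:40.
gap = detection_gap("02:14", "03:40")
print(f"Detection gap: {gap}")
```

Recording this number explicitly in the timeline makes it harder to overlook a monitoring or alerting problem behind the technical fault.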
Drill down from symptom to root cause:
Why 1: {symptom happened} → Because {direct cause}
Why 2: {direct cause} → Because {deeper cause}
Why 3: {deeper cause} → Because {systemic issue}
Why 4: {systemic issue} → Because {process/design gap}
Why 5: {process/design gap} → Because {root cause}
Stop when you reach something you can change. If you reach "the model hallucinated" — that's not actionable. Go deeper: why was the output trusted without verification? Why was there no checkpoint?
Save to: ~/incidents/INC{NNN}_{TOPIC}_{YYYYMMDD}.md
Create ~/incidents/ if it doesn't exist.
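The naming scheme above could be implemented along these lines. This is a sketch under assumptions: `{NNN}` is taken to be a zero-padded three-digit number, the topic is used verbatim, and the function name and `base` parameter are hypothetical.

```python
from datetime import date
from pathlib import Path


def incident_path(number: int, topic: str, day: date, base=None) -> Path:
    """Build the report path and make sure its directory exists.

    Defaults to ~/incidents/ when no base directory is given.
    """
    directory = Path(base) if base is not None else Path.home() / "incidents"
    directory.mkdir(parents=True, exist_ok=True)
    name = f"INC{number:03d}_{topic}_{day:%Y%m%d}.md"
    return directory / name
```

Calling `incident_path(7, "DISK_FULL", date(2024, 3, 5))` with these illustrative values would yield a path ending in `incidents/INC007_DISK_FULL_20240305.md`.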
See references/report-template.md for the full template with all required sections.
Report must include all 8 sections:
The most valuable output of a postmortem is new rules that prevent recurrence.
Pattern: Incident → Rule → Enforcement mechanism
Examples from real incidents:
See references/patterns.md for a library of incident-to-rule patterns.
| Level | Criteria | Response Time |
|---|---|---|
| SEV1 | Money lost, data corrupted, security breach | Immediate |
| SEV2 | Service down, wrong actions taken from bad data | Within 1 hour |
| SEV3 | Degraded performance, near-miss, wasted time >2h | Within 24 hours |
| SEV4 | Minor issue, caught before impact | Next convenient time |