Post-Mortem & Incident Review

Guide structured, blameless post-mortems with root cause analysis, action tracking, and prevention steps to reduce repeat production incidents and outages.
Author: 1kalin
Category: Data Analysis | clawhub v1.0.0 | Key: not required

Overview

Post-Mortem & Incident Review Framework

Run structured post-mortems that help prevent repeat failures through blameless analysis, root cause identification, and action tracking.

When to Use

  • After any production incident, outage, or service degradation
  • After a missed deadline, failed launch, or lost deal
  • After any event costing >$5K or >4 hours of team time
  • Quarterly review of recurring incident patterns

Post-Mortem Template

1. Incident Summary (Complete Within 24 Hours)

Incident ID: [AUTO-GENERATED]
Date/Time: [Start] → [End] (Duration: X hours)
Severity: SEV-1 (revenue impact) | SEV-2 (customer impact) | SEV-3 (internal impact)
Impact: [Users affected] | [Revenue lost] | [SLA breached Y/N]
Detection: How was it found? (Monitoring / Customer report / Internal discovery)
Detection Delay: Time from incident start → first alert

2. Timeline (Minute-by-Minute for SEV-1, 15-min blocks for SEV-2/3)

HH:MM - Event description
HH:MM - First alert triggered
HH:MM - Team notified
HH:MM - Investigation started
HH:MM - Root cause identified
HH:MM - Fix deployed
HH:MM - Confirmed resolved
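
As an illustration, a timeline kept in the `HH:MM - Event` form above can be parsed to show how long each phase took, including the detection delay from the summary section. This is a minimal sketch, not part of the original template: the function names and sample entries are hypothetical, and it assumes all entries fall on the same calendar day and are listed in order.

```python
from datetime import datetime


def parse_timeline(lines):
    """Parse 'HH:MM - description' lines into (time, description) pairs."""
    entries = []
    for line in lines:
        stamp, _, description = line.partition(" - ")
        # strptime with only %H:%M yields a fixed placeholder date, which is
        # fine here because we only compare entries from the same day.
        entries.append((datetime.strptime(stamp.strip(), "%H:%M"), description.strip()))
    return entries


def gaps_in_minutes(entries):
    """Return (from_event, to_event, minutes) for each consecutive pair."""
    gaps = []
    for (t0, d0), (t1, d1) in zip(entries, entries[1:]):
        gaps.append((d0, d1, int((t1 - t0).total_seconds() // 60)))
    return gaps


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = [
        "14:02 - Error rate spikes",
        "14:09 - First alert triggered",
        "14:40 - Fix deployed",
    ]
    for start, end, minutes in gaps_in_minutes(parse_timeline(sample)):
        print(f"{start} -> {end}: {minutes} min")
```

The first gap in this sample is the detection delay. A large gap between any two phases is a candidate for the contributing factors in the next section.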

3. Root Cause Analysis — 5 Whys

Why 1: [Direct cause]
Why 2: [Why did that happen?]
Why 3: [Why did THAT happen?]
Why 4: [Systemic cause]
Why 5: [Organizational/cultural root]

4. Contributing Factors

Score each factor 0-3 (0=not a factor, 3=primary contributor):

| Factor | Score | Notes |
|---|---|---|
| Missing/inadequate monitoring | | |
| Insufficient testing | | |
| Documentation gaps | | |
| Process not followed | | |
| Knowledge concentration (bus factor) | | |
| Capacity/scaling limits | | |
| Third-party dependency | | |
| Communication breakdown | | |
| Change management failure | | |
| Technical debt | | |
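
Once the factors are scored, they can be validated and ranked so the primary contributors surface first. The sketch below is one possible way to do that. The function name `rank_factors` and the sample scores are hypothetical, and the 0-3 range check simply mirrors the scoring rule stated above.

```python
def rank_factors(scores: dict[str, int]) -> list[tuple[str, int]]:
    """Validate 0-3 scores and return contributing factors, highest first."""
    for name, score in scores.items():
        if score not in (0, 1, 2, 3):
            raise ValueError(f"score for {name!r} must be 0-3, got {score!r}")
    # Drop factors scored 0 (not a factor); the sort is stable, so ties keep
    # the order in which they were entered.
    contributing = [(name, score) for name, score in scores.items() if score > 0]
    return sorted(contributing, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    example = {
        "Insufficient testing": 2,
        "Technical debt": 0,
        "Missing/inadequate monitoring": 3,
    }
    for name, score in rank_factors(example):
        print(score, name)
```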

5. What Went Well

List 3-5 things that worked during the response:

  • Fast detection? Good runbooks? Strong communication? Quick escalation?

6. Action Items

Every action MUST have an owner and deadline:

| # | Action | Owner | Deadline | Priority | Status |
|---|---|---|---|---|---|
| 1 | | | | P0/P1/P2 | Open |

Priority definitions:

  • P0: Must complete before next business day
  • P1: Must complete within 1 week
  • P2: Must complete within 1 sprint/month
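
The priority definitions can be turned into concrete deadlines when an action item is opened. The following is a rough sketch under stated assumptions: business days are Monday to Friday with no holiday calendar, and P2 is approximated as 30 days because the source allows either a sprint or a month. The function names are hypothetical.

```python
from datetime import date, timedelta


def next_business_day(d: date) -> date:
    """Return the next Monday-Friday date after d (holidays are ignored)."""
    nxt = d + timedelta(days=1)
    while nxt.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        nxt += timedelta(days=1)
    return nxt


def due_date(priority: str, opened: date) -> date:
    """Map a P0/P1/P2 priority to a deadline based on the open date."""
    if priority == "P0":
        return next_business_day(opened)
    if priority == "P1":
        return opened + timedelta(weeks=1)
    if priority == "P2":
        # Assumption: 30 days stands in for "1 sprint/month".
        return opened + timedelta(days=30)
    raise ValueError(f"unknown priority: {priority!r}")


if __name__ == "__main__":
    opened = date(2024, 3, 1)  # a Friday
    for p in ("P0", "P1", "P2"):
        print(p, due_date(p, opened))
```

Teams with shorter sprints may prefer to replace the 30-day constant with their own sprint length.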

7. Recurrence Prevention

  • [ ] Monitoring added/improved for this failure mode
  • [ ] Runbook created/updated
  • [ ] Test coverage added
  • [ ] Architecture change needed? (If yes, create RFC)
  • [ ] Training needed for team?

Blameless Post-Mortem Rules

  1. Focus on systems, not individuals
  2. Ask "what happened," not "who did it"
  3. Assume everyone acted with best intentions and available information
  4. The goal is learning, not punishment
  5. If you find yourself writing someone's name next to a mistake, rewrite it as a process gap

Incident Cost Calculator

Direct costs:
  Revenue lost during downtime: $___
  SLA credits issued: $___
  Emergency vendor/contractor costs: $___

Indirect costs:
  Engineering hours × loaded rate: ___ hrs × $___/hr = $___
  Customer churn risk (affected users × churn probability × LTV): $___
  Brand/reputation (estimate): $___

Total incident cost: $___
Cost per minute of downtime: $___
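
The calculator above is plain arithmetic, so it can be sketched as a small data class. The class and field names below are hypothetical, chosen to mirror the template lines, and the figures in the usage example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncidentCost:
    # Direct costs
    revenue_lost: float
    sla_credits: float
    emergency_vendor: float
    # Indirect costs
    engineering_hours: float
    loaded_hourly_rate: float
    affected_users: int
    churn_probability: float  # between 0 and 1
    lifetime_value: float     # LTV per user
    reputation_estimate: float
    # Needed for the per-minute figure
    downtime_minutes: float

    def direct(self) -> float:
        return self.revenue_lost + self.sla_credits + self.emergency_vendor

    def indirect(self) -> float:
        engineering = self.engineering_hours * self.loaded_hourly_rate
        churn_risk = self.affected_users * self.churn_probability * self.lifetime_value
        return engineering + churn_risk + self.reputation_estimate

    def total(self) -> float:
        return self.direct() + self.indirect()

    def per_minute(self) -> float:
        if self.downtime_minutes <= 0:
            raise ValueError("downtime_minutes must be positive")
        return self.total() / self.downtime_minutes


if __name__ == "__main__":
    cost = IncidentCost(
        revenue_lost=10_000, sla_credits=2_000, emergency_vendor=500,
        engineering_hours=40, loaded_hourly_rate=100,
        affected_users=80, churn_probability=0.25, lifetime_value=500,
        reputation_estimate=1_000, downtime_minutes=110,
    )
    print(cost.total(), cost.per_minute())
```

Keeping direct and indirect costs as separate methods makes it easier to report them side by side, as the template does.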

Quarterly Incident Review

Every quarter, analyze patterns across all post-mortems:

  1. Top 3 root cause categories — Where should you invest in prevention?
  2. Mean time to detect (MTTD) — Is monitoring improving?
  3. Mean time to resolve (MTTR) — Is response getting faster?
  4. Action item completion rate — Are you actually fixing things?
  5. Repeat incidents — Same root cause twice = systemic failure
  6. Cost trend — Total incident cost per quarter (should decrease)
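
The quarterly metrics above can be computed from a list of post-mortem records. This is a sketch only: the record keys (`detect_minutes`, `resolve_minutes`, `root_cause`, `cost`, `actions_total`, `actions_done`) are hypothetical, and it assumes both detection and resolution times are measured from incident start.

```python
from collections import Counter
from statistics import mean


def quarterly_summary(incidents: list[dict]) -> dict:
    """Aggregate one quarter of post-mortem records into review metrics."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    causes = Counter(i["root_cause"] for i in incidents)
    actions_total = sum(i["actions_total"] for i in incidents)
    actions_done = sum(i["actions_done"] for i in incidents)
    return {
        "mttd_minutes": mean([i["detect_minutes"] for i in incidents]),
        "mttr_minutes": mean([i["resolve_minutes"] for i in incidents]),
        "top_root_causes": causes.most_common(3),
        # Any root cause seen at least twice counts as a repeat incident.
        "repeat_root_causes": sorted(c for c, n in causes.items() if n >= 2),
        "action_completion_rate": (
            actions_done / actions_total if actions_total else None
        ),
        "total_cost": sum(i["cost"] for i in incidents),
    }


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    quarter = [
        {"detect_minutes": 10, "resolve_minutes": 60, "root_cause": "monitoring",
         "cost": 1000, "actions_total": 4, "actions_done": 2},
        {"detect_minutes": 20, "resolve_minutes": 90, "root_cause": "monitoring",
         "cost": 2000, "actions_total": 3, "actions_done": 3},
        {"detect_minutes": 30, "resolve_minutes": 120, "root_cause": "testing",
         "cost": 3000, "actions_total": 3, "actions_done": 1},
    ]
    print(quarterly_summary(quarter))
```

Comparing the returned dictionary across quarters gives the trend lines the review asks about.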

Industry-Specific Post-Mortem Considerations

| Industry | Key Focus | Regulatory Requirement |
|---|---|---|
| Fintech | Transaction integrity, audit trail | SOX, PCI-DSS incident reporting |
| Healthcare | PHI exposure, patient safety | HIPAA breach notification (60 days) |
| SaaS | SLA compliance, data integrity | SOC 2 incident management |
| E-commerce | Order integrity, payment processing | PCI-DSS, consumer protection |
| Manufacturing | Safety incidents, production loss | OSHA reporting requirements |

Go Deeper

Your post-mortems reveal where AI agents should be deployed first — the repetitive failures, the manual monitoring gaps, the processes that break under load.

Built by AfrexAI — turning incident patterns into automation opportunities.

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-29 10:27
