Incident Response Lifecycle

Structured process management for network incidents from detection through post-incident review. This skill covers the organizational coordination layer: severity classification, escalation, role assignment, stakeholder communication, recovery coordination, and root cause analysis. It does not cover technical evidence collection, device forensics, or containment execution; use the incident-response-network skill for network-level evidence gathering and forensic analysis.

The procedure follows the operational lifecycle: detect and classify the incident, triage and escalate to the right people, coordinate the investigation across teams, manage communications to all audiences, drive resolution and recovery, then conduct a blameless post-incident review.

See references/communication-templates.md for notification templates by audience and severity level. See references/rca-framework.md for the 5-whys methodology, fishbone diagram guidance, and post-mortem document structure.

When to Use

- Service-affecting incident declared: a P1 or P2 event requires formal incident management with role assignment and communications.
- Escalation decision needed: you must determine who to notify at each severity level and when to engage vendor support or management.
- Multi-team coordination required: the investigation spans network, security, application, and infrastructure teams that need a single command structure.
- Customer or regulatory notification required: the incident has external communication obligations (SLA breach, data exposure, regulatory reporting).
- Post-incident review facilitation: scheduling, structuring, and running blameless post-mortems with 5-whys root cause analysis.
- Incident metrics reporting: collecting MTTD, MTTI, MTTR, and recurrence data for continuous improvement.

Prerequisites

- Incident management authority: the person initiating this process must be authorized to declare incidents and assign roles within the organization.
- Contact directory: current on-call rosters, management escalation contacts, vendor TAC numbers, and regulatory notification contacts must be accessible.
- Communication channels: bridge call infrastructure (conference line or collaboration tool), status page access, and email distribution lists for each stakeholder group must be established.
- Incident tracking system: a ticketing system to record the incident, track actions, and maintain the timeline of events.
- Defined severity criteria: organizational agreement on what constitutes P1 through P4 severity (see Threshold Tables below for a reference framework).

Procedure

Follow these six steps in order, with one exception: Steps 3 and 4 run in parallel once roles are assigned, so investigation coordination and communication management proceed simultaneously. Steps reference templates from references/communication-templates.md and methodology from references/rca-framework.md where applicable.

Step 1: Detection and Classification

Classify the incident by severity, type, and scope to determine the appropriate response level.

- Severity assignment: apply the P1–P4 taxonomy from the Threshold Tables section. Base severity on the highest-impact criterion met; when criteria apply at different levels, the highest governs.
- Incident type classification: categorize as outage (service unavailable), degradation (reduced capacity), security (unauthorized access or data exposure), or data loss (corruption or deletion).
- Scope determination: assess whether the incident affects a single device, a network segment, an entire site, or multiple sites. Scope drives staffing, communication breadth, and recovery complexity.
- Initial impact assessment: estimate affected user count, impacted services and their business criticality, data at risk, and revenue impact per hour. Record estimates in the incident ticket.
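
The "highest governs" rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the skill itself: the function name `classify_severity`, its parameters, and the string labels are hypothetical, and the cut-offs come from the Severity Classification Matrix under Threshold Tables (the workaround and VIP-user criteria are left out for brevity).

```python
def classify_severity(pct_users_affected, service_impact, data_risk):
    """Return "P1".."P4" using the highest-impact criterion met.

    service_impact: "outage" (complete outage of a revenue-generating service),
                    "major", "partial", or "cosmetic"   (assumed labels)
    data_risk:      "confirmed", "suspected", or "none"  (assumed labels)
    """
    # Rate each criterion independently; a lower number is more severe.
    if pct_users_affected > 50:
        user_level = 1
    elif pct_users_affected >= 10:
        user_level = 2
    elif pct_users_affected > 0:
        user_level = 3
    else:
        user_level = 4

    service_level = {"outage": 1, "major": 2, "partial": 3, "cosmetic": 4}[service_impact]
    data_level = {"confirmed": 1, "suspected": 2, "none": 4}[data_risk]

    # The most severe (numerically lowest) level governs.
    return f"P{min(user_level, service_level, data_level)}"
```

For example, a partial degradation touching 5% of users would rate P3 on its own, but a confirmed data breach on the same incident raises it to P1.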

Step 2: Triage and Escalation

Assign roles, notify stakeholders, and set response timeline expectations based on the severity classification from Step 1.

- Role assignment: every P1 or P2 incident needs four named roles. For P3, the IC and Technical Lead roles may be combined. P4 uses normal operations workflows.
  - Incident Commander (IC): owns the incident end-to-end and makes escalation decisions.
  - Technical Lead: coordinates diagnostics and synthesizes findings.
  - Communications Lead: drafts stakeholder notifications and manages the status page.
  - Scribe: maintains the real-time timeline and records bridge call decisions.
- Escalation matrix execution: notify by severity.
  - P1: all four roles plus engineering management, VP/director on-call, and vendor TAC if vendor equipment is involved; executive notification within 30 minutes.
  - P2: all four roles plus engineering management within 1 hour.
  - P3: Technical Lead plus team lead within 4 hours.
  - P4: assigned engineer via normal ticket queue.
- Response timeline expectations:
  - P1: bridge in 15 minutes, first update in 30 minutes, then every 30 minutes.
  - P2: bridge in 30 minutes, first update in 1 hour, then every 2 hours.
  - P3: initial assessment in 4 hours, daily updates.
  - P4: acknowledgment within 1 business day.
- Vendor engagement criteria: engage vendor TAC when the incident involves hardware failure or software defects requiring patches, or when internal triage has not identified root cause within the severity time window.
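
The P1 and P2 response timelines above translate directly into deadlines once the declaration time is known. The sketch below is illustrative only: `RESPONSE_TIMELINE` and `response_deadlines` are hypothetical names, and P3/P4 are omitted because their expectations (assessment windows, business days) have a different shape.

```python
from datetime import datetime, timedelta

# Values copied from the response timeline expectations; the layout is an assumption.
RESPONSE_TIMELINE = {
    "P1": {"bridge": timedelta(minutes=15),
           "first_update": timedelta(minutes=30),
           "cadence": timedelta(minutes=30)},
    "P2": {"bridge": timedelta(minutes=30),
           "first_update": timedelta(hours=1),
           "cadence": timedelta(hours=2)},
}

def response_deadlines(severity, declared_at):
    """Return absolute deadlines for the bridge call and the first status update."""
    timeline = RESPONSE_TIMELINE[severity]
    return {
        "bridge_by": declared_at + timeline["bridge"],
        "first_update_by": declared_at + timeline["first_update"],
        "then_every": timeline["cadence"],
    }

deadlines = response_deadlines("P1", datetime(2024, 1, 1, 12, 0))
```

Here a P1 declared at 12:00 needs a bridge by 12:15 and a first update by 12:30, with further updates every 30 minutes.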

Step 3: Investigation Coordination

Coordinate the technical investigation across teams and evidence sources. For network-level evidence collection (device state, routing tables, interface data, log retrieval), use the incident-response-network skill; this step focuses on organizing the investigation, not executing forensic commands.

- Evidence collection tasking: assign team members to collect evidence from each relevant domain, including network devices (via incident-response-network procedures), application logs, infrastructure metrics, and security tooling alerts. Each assignee reports findings to the Technical Lead.
- Parallel investigation streams: for complex incidents, run multiple investigation threads simultaneously. Common parallel tracks are (1) symptom analysis: what is failing and for whom; (2) change correlation: what changed recently (deployments, config modifications, maintenance); and (3) external factors: upstream provider issues, DDoS, DNS resolution failures.
- Hypothesis tracking: maintain a running list of hypotheses with current status (investigating, confirmed, ruled out). Each hypothesis should have an owner and a validation method. Update the list on every bridge call.
- Timeline of events (ToE): the Scribe maintains a running chronological log of when events occurred, when they were detected, what actions were taken, and what was discovered. The ToE becomes the foundation for the post-incident review in Step 6.
- Subject matter expert engagement: when the investigation stalls or enters an unfamiliar domain, escalate to specialists. Define a clear handoff: what has been tried, what data is available, and what specific question needs answering.
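
One way to keep the hypothesis list consistent is a small record type with the three statuses named above. The sketch below is a hypothetical data model (the class, field, and function names are not defined by this skill), and the sample entries are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    INVESTIGATING = "investigating"
    CONFIRMED = "confirmed"
    RULED_OUT = "ruled out"

@dataclass
class Hypothesis:
    statement: str
    owner: str               # every hypothesis needs a named owner
    validation_method: str   # how the owner will confirm or rule it out
    status: Status = Status.INVESTIGATING

def open_hypotheses(hypotheses):
    """Return hypotheses still under investigation, for review on the next bridge call."""
    return [h for h in hypotheses if h.status is Status.INVESTIGATING]

# Invented sample data.
log = [
    Hypothesis("Recent ACL change blocks traffic", "alice", "diff config against last backup"),
    Hypothesis("Upstream provider outage", "bob", "check provider status page", Status.RULED_OUT),
]
```

Calling `open_hypotheses(log)` on each bridge call yields the items that still need an update from their owners.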

Step 4: Communication Management

Manage stakeholder communications throughout the incident. Use the templates in references/communication-templates.md for consistent messaging across audiences.

- Stakeholder notification by audience:
  - Executive summary: business impact, estimated resolution, and customer exposure, with no technical detail.
  - Technical detail: root cause hypothesis, diagnostics, and remediation plan, delivered on the bridge call.
  - Customer-facing: service impact, workaround if available, and estimated resolution, delivered via the status page.
  - Regulatory: formal notification per the applicable compliance framework when required.
- Status update cadence: follow the severity-based cadence from Step 2. Each update includes current status, progress since the last update, next planned action, and a revised time-to-resolution estimate.
- Bridge call management: the IC runs each call with a fixed agenda of (1) technical status from the Technical Lead, (2) communication status from the Communications Lead, (3) hypothesis updates, (4) decisions needed, and (5) action items with owners and deadlines. Keep calls focused and park side discussions as action items.
- External notification requirements: track regulatory reporting deadlines, law enforcement notification when criminal activity is suspected, customer SLA breach notification per contractual terms, and vendor escalation for ongoing support.
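
The four required elements of a status update can be enforced with a small formatter. This is a sketch only: `format_status_update` and its output layout are hypothetical and do not reproduce the templates in references/communication-templates.md.

```python
def format_status_update(severity, current_status, progress, next_action, eta):
    """Build a plain-text status update containing the four required elements."""
    fields = {
        "Current status": current_status,
        "Progress since last update": progress,
        "Next planned action": next_action,
        "Estimated time to resolution": eta,
    }
    # Reject updates that leave any required element blank.
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"status update is missing: {', '.join(missing)}")
    lines = [f"[{severity}] Incident status update"]
    lines += [f"{name}: {value}" for name, value in fields.items()]
    return "\n".join(lines)
```

Rejecting blank fields pushes the sender to write something explicit, such as "No new findings since last update", rather than skipping an element.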

Step 5: Resolution and Recovery

Drive service restoration through validated recovery steps, with monitoring to confirm the fix holds.

- Recovery validation criteria: before declaring the incident resolved, confirm that (1) service health checks return normal for all affected components, (2) monitoring dashboards show green for at least 15 minutes (P1) or 30 minutes (P2), (3) no new related alerts fire during the observation window, and (4) affected users confirm restoration (a sample check suffices for large populations).
- Phased restoration: for multi-layer network incidents, restore in order: core infrastructure → distribution layer → access layer → end-to-end verification. Verify each phase before proceeding. Do not restore all layers simultaneously; a cascading failure during recovery costs more than the extra time a phased approach takes.
- Back-out plan execution: if the fix causes new issues, execute the pre-defined rollback. Every remediation action should have a documented rollback method before execution.
- Enhanced monitoring period: maintain heightened monitoring after resolution: P1 for 24 hours, P2 for 12 hours, P3 through the next business day. For P1 and P2 this means reduced alert thresholds on affected systems (see Enhanced Monitoring Duration under Threshold Tables); in all cases it means active watch by on-call and immediate re-escalation if symptoms recur.
- Incident closure: send a closure notification to all stakeholders (template in references/communication-templates.md). Update the ticket with the resolution summary, total duration, and final impact. Schedule the post-incident review.
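
The recovery validation criteria amount to a checklist that must pass in full before the incident is declared resolved. The sketch below assumes the caller already knows when health checks and dashboards last turned normal; `can_declare_resolved` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta

# Observation windows from the recovery validation criteria.
OBSERVATION_WINDOW = {"P1": timedelta(minutes=15), "P2": timedelta(minutes=30)}

def can_declare_resolved(severity, healthy_since, now, new_related_alerts, users_confirmed):
    """Return True only when all validation criteria hold.

    healthy_since: time from which health checks and dashboards have been
    continuously normal, or None if they are not currently normal.
    """
    if healthy_since is None:
        return False
    observed_long_enough = (now - healthy_since) >= OBSERVATION_WINDOW[severity]
    return bool(observed_long_enough and new_related_alerts == 0 and users_confirmed)
```

With 15 minutes of clean observation, no new alerts, and user confirmation, a P1 passes while a P2 still needs another 15 minutes.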

Step 6: Post-Incident Review

Conduct a blameless post-incident review to identify root cause, contributing factors, and improvement actions. See references/rca-framework.md for the full methodology.

- Scheduling: hold the post-mortem within 72 hours of incident resolution, while details are fresh. Invite all incident participants plus relevant stakeholders. Send the invitation using the template in references/communication-templates.md.
- 5-whys root cause analysis: apply iteratively. For each "why" answer, ask "why" again until reaching a systemic root cause (typically 3–5 iterations). See references/rca-framework.md for worked examples and facilitation guidance.
- Contributing factor categorization: classify each contributing factor as process (missing runbook, unclear escalation path), people (training gap, staffing shortage), or technology (monitoring gap, single point of failure, software defect). The category guides the type of remediation action needed.
- Action item classification: assign each action item one of four dispositions: fix (eliminate the root cause), mitigate (reduce likelihood or impact), accept (risk is within tolerance; document the rationale), or transfer (assign to another team or vendor). Every fix or mitigate action must have an owner, due date, and verification method.
- Incident metrics: collect and record Mean Time to Detect (MTTD), Mean Time to Investigate (MTTI), Mean Time to Resolve (MTTR), total incident duration, number of customers affected, and whether this is a recurrence of a previous incident. Track these metrics over time to measure improvement trends.
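
The per-incident inputs to these metrics are simple timestamp differences, using the definitions from the Report Template (occurrence to detection, detection to root cause identified, detection to resolution). The function names below are hypothetical and the timestamps are invented.

```python
from datetime import datetime, timedelta

def incident_intervals(occurred_at, detected_at, root_cause_at, resolved_at):
    """Per-incident intervals; averaging them across incidents gives MTTD, MTTI, MTTR."""
    return {
        "time_to_detect": detected_at - occurred_at,
        "time_to_investigate": root_cause_at - detected_at,
        "time_to_resolve": resolved_at - detected_at,
    }

def mean_duration(durations):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)

# Invented example: fault at 09:00, detected 09:10, root cause 10:00, resolved 11:10.
example = incident_intervals(
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 10),
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 11, 10),
)
```

Storing the raw intervals per incident, rather than only the running means, makes it possible to recompute trends by severity or time period later.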

Threshold Tables

Severity Classification Matrix

Severity	User Impact	Service Impact	Data Risk	Response SLA
----------	-----------	----------------	-----------	-------------
P1 Critical	>50% of users or all VIP users	Complete outage of revenue-generating service	Confirmed data breach or loss	Bridge in 15 min, updates every 30 min
P2 High	10–50% of users affected	Major degradation or redundancy loss on critical path	Suspected data exposure	Bridge in 30 min, updates every 2 hr
P3 Medium	<10% of users, workaround exists	Partial degradation, non-critical service	No data risk identified	Assessment in 4 hr, updates daily
P4 Low	Minimal or no user impact	Cosmetic, non-production, or fully redundant	None	Ack in 1 business day

Escalation and Role Matrix

Severity	Incident Commander	Technical Lead	Comms Lead	Scribe	Management	Executive
----------	-------------------	---------------	------------	--------	-----------	-----------
P1	Required	Required	Required	Required	Immediate	Within 30 min
P2	Required	Required	Required	Required	Within 1 hr	If SLA breached
P3	Combined with Tech Lead	Required	Optional	No	Within 4 hr	No
P4	No	Assigned engineer	No	No	Normal reporting	No

Enhanced Monitoring Duration

Severity	Monitoring Period	Alert Threshold	Re-escalation Trigger
----------	------------------	----------------	-----------------------
P1	24 hours	Reduced by 20%	Any recurrence symptom
P2	12 hours	Reduced by 10%	Same failure signature
P3	Next business day	Normal thresholds	Identical alert
P4	None	Normal	Normal process
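
As a worked example of the "Reduced by" column, the sketch below lowers an alert threshold by the table's percentage. It assumes thresholds are upper limits, so a lower value alerts sooner; `ENHANCED_MONITORING` and `enhanced_threshold` are hypothetical names.

```python
from datetime import timedelta

# Periods and reductions from the Enhanced Monitoring Duration table.
ENHANCED_MONITORING = {
    "P1": {"period": timedelta(hours=24), "reduction_pct": 20},
    "P2": {"period": timedelta(hours=12), "reduction_pct": 10},
}

def enhanced_threshold(severity, normal_threshold):
    """Return the alert threshold to use during the enhanced monitoring period.

    Assumes an upper-limit threshold (for example, CPU percent), so reducing
    it makes the alert more sensitive. P3 and P4 keep the normal threshold.
    """
    reduction_pct = ENHANCED_MONITORING.get(severity, {}).get("reduction_pct", 0)
    return normal_threshold * (100 - reduction_pct) / 100
```

A 90% utilization alert would therefore fire at 72% during the 24 hours after a P1 and at 81% during the 12 hours after a P2.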

Decision Trees

Incident Severity Assignment

Event detected or reported
├── Is the service completely unavailable?
│   ├── Yes → Is it a revenue-generating or safety-critical service?
│   │   ├── Yes → P1 Critical
│   │   └── No → P2 High
│   └── No → Service is partially available
│       ├── Are more than 10% of users affected without workaround?
│       │   ├── Yes → P2 High
│       │   └── No → Is there a workaround available?
│       │       ├── Yes → P3 Medium
│       │       └── No, but fewer than 10% of users → P3 Medium
│       └── Is this a non-production or cosmetic issue?
│           └── Yes → P4 Low
├── Is there confirmed or suspected data exposure?
│   ├── Confirmed breach → P1 Critical (regardless of service status)
│   └── Suspected exposure → P2 High minimum
└── Has redundancy been lost on a critical path?
    ├── Yes, no failover remaining → P2 High
    └── Yes, failover still available → P3 Medium

Escalation Decision

Severity assigned
├── P1 or P2?
│   ├── Yes → Assign all four roles immediately
│   │   ├── Is vendor equipment involved in the failure?
│   │   │   ├── Yes → Open vendor TAC case immediately
│   │   │   └── No → Internal investigation first
│   │   └── Has root cause been identified within time window?
│   │       ├── P1: not identified within 30 min → Escalate to next tier
│   │       └── P2: not identified within 2 hr → Escalate to next tier
│   └── No → P3 or P4
│       ├── P3 → Assign Technical Lead, monitor for escalation
│       │   └── Impact worsening? → Re-classify severity upward
│       └── P4 → Normal ticket queue, no escalation
└── At any point: if scope expands beyond initial classification
    └── Re-evaluate severity from Step 1, escalate if needed
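
The escalation tree above can be read as a function from the current incident state to a list of actions. The sketch below is one possible encoding: `escalation_actions`, its parameters, and the action strings are hypothetical, and the scope re-evaluation branch is left to the caller.

```python
from datetime import timedelta

# Root-cause time windows from the Escalation Decision tree.
ROOT_CAUSE_WINDOW = {"P1": timedelta(minutes=30), "P2": timedelta(hours=2)}

def escalation_actions(severity, vendor_equipment_involved, elapsed, root_cause_identified):
    """Return the actions the tree prescribes for the current state."""
    if severity == "P4":
        return ["route to normal ticket queue"]
    if severity == "P3":
        return ["assign Technical Lead", "monitor for worsening impact"]

    # P1 and P2 share the same branch with different time windows.
    actions = ["assign all four roles"]
    if vendor_equipment_involved:
        actions.append("open vendor TAC case")
    else:
        actions.append("investigate internally first")
    if not root_cause_identified and elapsed >= ROOT_CAUSE_WINDOW[severity]:
        actions.append("escalate to next tier")
    return actions
```

Re-running the function as time passes shows when the next-tier escalation becomes due: at 45 minutes without a root cause, a P1 escalates while a P2 does not.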

Report Template

INCIDENT REPORT
=====================================
Incident ID:          [ticket/tracking number]
Severity:             [P1/P2/P3/P4]
Incident Commander:   [name]
Duration:             [detection time] — [resolution time] ([total hours])
Status:               [Resolved / Monitoring / Under Review]

IMPACT SUMMARY:
  Users Affected:     [count or percentage]
  Services Affected:  [list of impacted services]
  Revenue Impact:     [estimated or confirmed]
  Data Impact:        [none / suspected / confirmed — description]

TIMELINE OF EVENTS:
| # | Time (UTC) | Event | Actor | Notes |
|---|-----------|-------|-------|-------|
| 1 | [time] | [event description] | [person/system] | [context] |

ROOT CAUSE:
  Category:           [Process / People / Technology]
  Root Cause:         [description from 5-whys analysis]
  Contributing Factors:
    - [factor 1 — category]
    - [factor 2 — category]

RESOLUTION:
  Fix Applied:        [description of what resolved the incident]
  Validated By:       [how resolution was confirmed]
  Back-out Available: [yes/no — description]

METRICS:
  MTTD:               [time from occurrence to detection]
  MTTI:               [time from detection to root cause identified]
  MTTR:               [time from detection to resolution]
  Recurrence:         [yes/no — reference to prior incident if yes]

ACTION ITEMS:
| # | Action | Type | Owner | Due Date | Status |
|---|--------|------|-------|----------|--------|
| 1 | [action] | [Fix/Mitigate/Accept/Transfer] | [name] | [date] | [status] |

POST-MORTEM STATUS:
  Scheduled:          [date/time or "pending"]
  Attendees:          [roles invited]
  Document Location:  [link to post-mortem document]

Troubleshooting

Severity Disagreement Between Teams

Symptom: Teams classify the same incident at different severity levels, causing confusion about response urgency.

Resolution: The IC makes the final determination using the Threshold Tables criteria. The highest applicable severity governs. Document the rationale in the ticket. If the IC is not yet assigned, the first responder sets the initial severity and the IC may adjust it later.

Escalation Fatigue and Alert Noise

Symptom: Frequent P1/P2 declarations for issues that resolve quickly, eroding trust in severity classification.

Resolution: Review severity criteria quarterly. Track the false-positive rate (incidents downgraded after initial classification). If the P1 downgrade rate exceeds 30%, tighten the P1 criteria. Ensure P3/P4 incidents are not over-classified.
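
The 30% check can be computed from the initial and final severity of each incident in the review period. This is a sketch with hypothetical function names and an assumed input shape of (initial, final) pairs.

```python
def p1_downgrade_rate(incidents):
    """Fraction of incidents first declared P1 that were later downgraded.

    incidents: iterable of (initial_severity, final_severity) tuples.
    """
    p1 = [(initial, final) for initial, final in incidents if initial == "P1"]
    if not p1:
        return 0.0
    downgraded = sum(1 for _, final in p1 if final != "P1")
    return downgraded / len(p1)

def should_tighten_p1_criteria(incidents, limit=0.30):
    """True when the P1 downgrade rate exceeds the limit (30% by default)."""
    return p1_downgrade_rate(incidents) > limit
```

For instance, if two of three P1 declarations in a quarter ended as P2 or P3, the rate is about 67% and the P1 criteria are due for tightening.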

Post-Mortem Action Items Not Completed

Symptom: Action items accumulate but are not completed, leading to recurring incidents from known causes.

Resolution: Assign every action item an owner and due date at the review. Track completion in the incident system, not in separate documents. Review open items in weekly standups. Escalate overdue items to management and report completion rates alongside MTTD/MTTR.

Communication Gaps During Extended Incidents

Symptom: Status updates become infrequent during long incidents (>4 hours), leaving stakeholders uninformed.

Resolution: The Communications Lead maintains cadence regardless of investigation progress. If no new findings exist, state that explicitly in the update. For incidents exceeding 8 hours, rotate the Communications Lead role to prevent fatigue.

Incident Recurrence After Resolution

Symptom: The same incident recurs after being marked resolved.

Resolution: Check whether prior post-mortem action items were completed. If they were, the root cause analysis was incomplete; reconvene with a broader scope. If they were not, escalate the completion failure. Tag the new incident as a recurrence and increase its severity by one level to reflect accumulated impact.
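
The one-level increase for recurrences is easy to get backwards, since a higher severity has a lower P number. A minimal sketch, with a hypothetical function name:

```python
def recurrence_severity(assessed_severity):
    """Raise severity one level for a recurrence (P3 becomes P2); P1 stays P1."""
    level = int(assessed_severity.lstrip("P"))
    return f"P{max(1, level - 1)}"
```

Apply it to the severity the recurrence would otherwise receive from Step 1, then follow the escalation matrix for the resulting level.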
