← 返回
未分类 Key

pager-triage

AI-powered incident triage for PagerDuty. List active incidents, deep-dive with timeline and alert correlation, check on-call schedules, acknowledge, resolve...
AI驱动的PagerDuty事件分诊:列出活跃事件,深入时间线和告警关联,查看值班表,确认并解决。
tkuehnl
未分类 clawhub v0.1.2 1 版本 100000 Key: 需要
★ 0
Stars
📥 278
下载
💾 2
安装
1
版本
#latest

概述

PagerDuty Incident Triage

AI-powered incident triage for PagerDuty. Read-only by default. Write operations require explicit confirmation.

When to Activate

Use this skill when the user:

  • Asks "what's firing?", "any incidents?", "what's on fire?", or anything about active alerts
  • Mentions PagerDuty, pager, on-call, or incidents
  • Asks "who's on call?" or "who's oncall right now?"
  • Wants to acknowledge, resolve, or add a note to an incident
  • Says "triage", "incident response", or "what happened last night?"
  • Asks about service health or recent incidents for a service

Quick Setup

1. Create a PagerDuty API Key

  1. Go to PagerDuty → Settings → API Access Keys
  2. Click "Create New API Key"
  3. Name it OpenClaw Agent
  4. Select Read-only (recommended for triage; choose Full Access only if you need ack/resolve)
  5. Copy the key

2. Set Environment Variables

# Required — your PagerDuty REST API v2 token
export PAGERDUTY_API_KEY="u+your_key_here"

# Optional — required only for write operations (ack, resolve, note)
export PAGERDUTY_EMAIL="you@company.com"

3. Verify

Ask your agent: "What's firing on PagerDuty?"

Subcommands Reference

Read Operations (always safe, no confirmation needed)


incidents — List Active Incidents

Lists all triggered and acknowledged incidents, sorted by urgency.

pager-triage incidents

Example output:

{
  "tool": "pd_incidents",
  "provider": "pagerduty",
  "timestamp": "2026-02-16T03:45:00Z",
  "total_incidents": 3,
  "incidents": [
    {
      "id": "P123ABC",
      "incident_number": 4521,
      "title": "High CPU on prod-web-03",
      "status": "triggered",
      "urgency": "high",
      "service": { "id": "PSVC123", "name": "Production Web" },
      "created_at": "2026-02-16T03:00:00Z",
      "assignments": [{ "name": "Jane Doe", "email": "..." }],
      "alert_count": 3,
      "escalation_level": 1,
      "last_status_change": "2026-02-16T03:05:00Z"
    }
  ],
  "summary": "3 active incident(s): high (triggered) x1, high (acknowledged) x1, low (triggered) x1"
}

detail — Incident Deep Dive

Full incident details including timeline (log entries), related alerts, notes, and automated analysis.

pager-triage detail P123ABC

Example output (abbreviated):

{
  "tool": "pd_incident_detail",
  "incident": {
    "id": "P123ABC",
    "title": "High CPU on prod-web-03",
    "status": "triggered",
    "urgency": "high",
    "service": { "id": "PSVC123", "name": "Production Web" },
    "escalation_policy": { "name": "Production Escalation" },
    "assignments": [{ "name": "Jane Doe", "escalation_level": 1 }]
  },
  "timeline": [
    { "type": "trigger_log_entry", "created_at": "...", "summary": "Incident triggered via Prometheus Alertmanager" },
    { "type": "escalate_log_entry", "created_at": "...", "summary": "Escalated to Jane Doe (Level 1)" }
  ],
  "alerts": [
    { "id": "A456DEF", "severity": "critical", "summary": "CPU > 95% on prod-web-03", "source": "Prometheus Alertmanager" }
  ],
  "notes": [],
  "analysis": {
    "alert_count": 3,
    "escalation_count": 1,
    "acknowledged": false,
    "trigger_source": "Prometheus Alertmanager"
  }
}

oncall — On-Call Schedules

Shows who's currently on call across all schedules and escalation policies.

pager-triage oncall

Example output:

{
  "tool": "pd_oncall",
  "oncalls": [
    {
      "user": { "name": "Jane Doe", "email": "jane@company.com" },
      "schedule": { "name": "Primary SRE", "id": "PSCHED1" },
      "escalation_policy": { "name": "Production Escalation" },
      "escalation_level": 1,
      "start": "2026-02-15T17:00:00Z",
      "end": "2026-02-16T17:00:00Z"
    }
  ],
  "summary": "2 on-call assignment(s). Primary SRE: Jane Doe. Secondary SRE: Bob Smith."
}

services — List Services

Lists all PagerDuty services with their current operational status.

pager-triage services

Example output:

{
  "tool": "pd_services",
  "services": [
    {
      "id": "PSVC123",
      "name": "Production Web",
      "status": "critical",
      "description": "Production web application servers",
      "escalation_policy": "Production Escalation",
      "integrations": ["Prometheus Alertmanager", "CloudWatch"]
    }
  ],
  "summary": "12 services: 1 critical, 1 warning, 10 active, 0 disabled"
}

recent — Recent Incident History

Shows recent incidents for a service or across all services with summary statistics.

pager-triage recent                          # All services, last 24h
pager-triage recent --service PSVC123        # Specific service
pager-triage recent --since 7d               # Last 7 days
pager-triage recent --service PSVC123 --since 30d

Flags:

FlagDefaultDescription
----------------------------
--service allFilter to a specific PagerDuty service ID
--since 24hTime window: 24h, 7d, or 30d

Example output:

{
  "tool": "pd_recent",
  "period": "last 24 hours",
  "service": "PSVC123",
  "incidents": [ ... ],
  "stats": {
    "total": 5,
    "by_urgency": { "high": 2, "low": 3 },
    "by_status": { "resolved": 4, "triggered": 1 },
    "mean_time_to_resolve_minutes": 42
  }
}

Write Operations (⚠️ require --confirm)

These operations modify state in PagerDuty. They require the --confirm flag AND PAGERDUTY_EMAIL to be set. Without --confirm, the tool displays what it would do and exits.


ack --confirm — Acknowledge Incident

Acknowledges a triggered incident, stopping further escalation.

pager-triage ack P123ABC --confirm

Without --confirm, displays:

{
  "error": "confirmation_required",
  "message": "⚠️ ACKNOWLEDGE INCIDENT — --confirm flag is required to proceed.",
  "incident": { "id": "P123ABC", "title": "High CPU on prod-web-03", "urgency": "high" },
  "hint": "Re-run with --confirm to acknowledge this incident."
}

With --confirm:

{
  "tool": "pd_incident_ack",
  "incident_id": "P123ABC",
  "status": "acknowledged",
  "acknowledged_at": "2026-02-16T03:46:00Z",
  "acknowledged_by": "jane@company.com"
}

resolve --confirm — Resolve Incident

Resolves an incident, marking it as fixed.

pager-triage resolve P123ABC --confirm

Same confirmation pattern as ack. Without --confirm, shows incident details and exits. With --confirm, resolves and returns confirmation JSON.


note --content "text" --confirm — Add Incident Note

Adds a permanent note to an incident's timeline.

pager-triage note P123ABC --content "Root cause: memory leak in auth-service v2.14.3. Rolling back." --confirm

Flags:

FlagRequiredDescription
-----------------------------
--content YesThe note text to add
--confirmYesConfirmation gate

Example output:

{
  "tool": "pd_incident_note",
  "incident_id": "P123ABC",
  "note_id": "PNOTE456",
  "content": "Root cause: memory leak in auth-service v2.14.3. Rolling back.",
  "created_at": "2026-02-16T04:00:00Z",
  "user": "Jane Doe"
}

Incident Triage Workflow

When the user needs help triaging an incident, follow this workflow:

Step 1: Assess the Situation

→ pager-triage incidents

List all active incidents. Prioritize by urgency (high first) and duration (oldest first).

Step 2: Deep-Dive the Critical One

→ pager-triage detail <incident_id>

Get full timeline, alerts, and notes. Identify the trigger source and escalation history.

Step 3: Correlate with Other Skills

If available, use companion skills to investigate root cause:

  • prom-query → Query Prometheus for the underlying metrics (CPU, memory, latency, error rate)
  • kube-medic → Check pod health, restarts, OOMKills, node status in Kubernetes
  • log-dive → Search application logs for errors around the incident timeframe

Step 4: Act

→ pager-triage ack <incident_id> --confirm        # Stop escalation while investigating
→ pager-triage note <incident_id> --content "..." --confirm   # Document findings
→ pager-triage resolve <incident_id> --confirm     # Mark as fixed

Agent Guidance

  • When the user says "what's wrong?" → start with incidents
  • When they mention a specific incident → use detail
  • When triaging → show incidents first, then detail on the most urgent
  • When correlating → suggest prom-query / kube-medic if installed
  • ALWAYS show the confirmation preview before executing write operations
  • NEVER ack/resolve without the user explicitly asking to do so

Discord v2 Delivery Mode (OpenClaw v2026.2.14+)

When the conversation is happening in a Discord channel:

  • Send a compact first response (active incident count, highest urgency incident, recommended next step), then ask if the user wants full detail.
  • Keep the first response under ~1200 characters and avoid full timeline dumps in the first message.
  • If Discord components are available, include quick actions:
  • Deep Dive Incident
  • Acknowledge Incident
  • Add Incident Note
  • If components are not available, provide the same follow-ups as a numbered list.
  • Prefer short follow-up chunks (<=15 lines per message) for incident timelines and alert lists.

Security Notes

  • API keys are read from environment variables only — never logged, displayed, or included in output
  • Read-only by default — 5 read commands work with any API key; 3 write commands require --confirm
  • Confirmation gates — Write operations show full incident context and refuse to proceed without --confirm
  • We recommend starting with a read-only PagerDuty API key for triage workflows
  • See SECURITY.md for the full threat model and RBAC recommendations

OpsGenie Support (Planned)

OpsGenie integration is planned for a future release. When OPSGENIE_API_KEY is set, the same subcommands will map to OpsGenie's REST API with normalized output schemas.


Powered by Anvil AI · Built for the engineer who gets paged at 3am

版本历史

共 1 个版本

  • v0.1.2 当前
    2026-05-12 05:18 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

security-compliance

dep-audit

tkuehnl
审计项目依赖中的已知漏洞(CVE)。支持 npm、pip、Cargo、Go。无需 API 密钥。默认安全:仅报告模式,修复...
★ 0 📥 1,323
security-compliance

vibe-check

tkuehnl
审计代码中的“氛围编程原罪”——即未经适当审查即采纳AI生成代码的模式。生成带评分的评分卡及修复建议...
★ 0 📥 1,263
productivity

log-dive

tkuehnl
统一日志搜索,支持 Loki、Elasticsearch 和 CloudWatch。将自然语言查询转换为 LogQL、ES DSL 或 CloudWatch 过滤模式。只读模式。
★ 0 📥 1,177