
Watchdog Heartbeat

Monitor service health, heartbeat freshness, stuck workflows, and trigger recovery or degraded mode. Use on: high-frequency schedule, after system startup, w...
sunbinnju-star
Uncategorized clawhub v1.0.0 1 version 100000 Key: not required

Overview

Provide observability and recovery awareness for a resident OpenClaw system. Verify process aliveness, heartbeat freshness, and workflow integrity.

Input

Required:

  • service_list — list of monitored services and their expected health states
  • health_endpoints — map of service → health check endpoint or method
  • heartbeat_records — recent heartbeat timestamps per agent/skill
  • workflow_status_records — current status of all active workflows
  • restart_records — history of service restarts and recovery events

Output Schema

service_health_summary: {
  service: string
  status: "healthy" | "degraded" | "down" | "unknown"
  last_check: string      # ISO-8601
  latency_ms: number | null
  error: string | null
}[]

expired_heartbeat_list: {
  agent_or_skill: string
  last_heartbeat: string  # ISO-8601
  seconds_expired: number
  severity: "warning" | "critical"
}[]

stuck_workflow_list: {
  workflow_id: string
  workflow_name: string
  stuck_since: string     # ISO-8601
  stuck_duration_min: number
  last_progress: string | null
  severity: "warning" | "critical"
}[]

recovery_recommendation: {
  action: "restart" | "notify" | "escalate" | "no_action" | "degraded_mode"
  target: string
  reason: string
}[]

degraded_mode_recommendation: {
  affected_services: string[]
  degraded_features: string[]
  estimated_recovery_time: string | null
  user_impact: string
}

watchdog_log: {
  check_id: string
  check_time: string     # ISO-8601
  services_checked: number
  heartbeats_checked: number
  workflows_checked: number
  issues_found: number
  observability_gap: string[] | null
}

Rules

  1. Process alive ≠ healthy. Check recent success, not just process existence.
  2. Expired heartbeat triggers attention. Do not ignore stale heartbeats.
  3. Stuck workflows must be explicitly surfaced. Don't let them disappear into silence.
  4. Silent failure is unacceptable. If something fails and no one is notified, that's a system failure.
  5. Distinguish warning from critical. Warning = may self-recover. Critical = requires intervention.
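
Rule 1 can be illustrated with a minimal Python sketch. It assumes each service reports whether its process exists and when it last completed a successful operation. The function name `classify_service`, the `max_success_age_s` parameter, and its 300-second default are hypothetical and not part of the skill.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_service(
    process_alive: Optional[bool],
    last_success: Optional[datetime],
    now: datetime,
    max_success_age_s: float = 300.0,  # assumed freshness window, not from the skill
) -> str:
    """Map process existence plus recent success to a service status string."""
    if process_alive is None:
        return "unknown"  # no data: do not guess a health state
    if not process_alive:
        return "down"
    if last_success is None:
        return "unknown"  # alive, but nothing shows it is doing useful work
    age_s = (now - last_success).total_seconds()
    # A live process whose last success is stale is not reported as healthy.
    return "healthy" if age_s <= max_success_age_s else "degraded"
```

Under these assumptions, a process that exists but has not succeeded for 20 minutes is reported as `"degraded"` rather than `"healthy"`.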

Heartbeat Expiry Thresholds

Seconds Expired    Severity
---------------------------
< 60s              healthy
60s – 300s         warning
> 300s             critical
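
The thresholds above could be applied as in the following Python sketch. It assumes that `seconds_expired` means seconds elapsed since the last heartbeat, that the boundary values 60 and 300 count as warning, and that `heartbeat_records` arrives as a mapping from agent or skill name to an ISO-8601 timestamp. The function names are hypothetical.

```python
from datetime import datetime, timezone

WARNING_AFTER_S = 60    # from the table: 60s to 300s is a warning
CRITICAL_AFTER_S = 300  # from the table: more than 300s is critical


def heartbeat_severity(seconds_expired: float) -> str:
    """Classify a heartbeat age in seconds (boundary handling is an assumption)."""
    if seconds_expired < WARNING_AFTER_S:
        return "healthy"
    if seconds_expired <= CRITICAL_AFTER_S:
        return "warning"
    return "critical"


def expired_heartbeats(records: dict[str, str], now: datetime) -> list[dict]:
    """Build expired_heartbeat_list entries from {agent_or_skill: timestamp}."""
    expired = []
    for name, last_heartbeat in records.items():
        # fromisoformat accepts offsets such as +00:00; a trailing "Z"
        # is only accepted from Python 3.11 onward.
        age_s = (now - datetime.fromisoformat(last_heartbeat)).total_seconds()
        severity = heartbeat_severity(age_s)
        if severity != "healthy":  # the output schema lists only warning/critical
            expired.append({
                "agent_or_skill": name,
                "last_heartbeat": last_heartbeat,
                "seconds_expired": age_s,
                "severity": severity,
            })
    return expired
```

Healthy heartbeats are left out of the result because the `expired_heartbeat_list` schema only allows the `warning` and `critical` severities.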

Workflow Stuck Thresholds

Duration       Severity
-----------------------------------
< 10 min       healthy (in progress)
10 – 30 min    warning
> 30 min       critical
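
A similar Python sketch for the workflow thresholds follows. It assumes each workflow record already carries a `stuck_since` ISO-8601 timestamp and that the boundary values 10 and 30 minutes count as warning. The input record shape and function names are hypothetical.

```python
from datetime import datetime, timezone


def stuck_severity(stuck_duration_min: float) -> str:
    """Classify how long a workflow has gone without progress, in minutes."""
    if stuck_duration_min < 10:
        return "healthy"
    if stuck_duration_min <= 30:
        return "warning"
    return "critical"


def stuck_workflows(workflows: list[dict], now: datetime) -> list[dict]:
    """Build stuck_workflow_list entries; workflows still in progress are skipped."""
    stuck = []
    for wf in workflows:
        since = datetime.fromisoformat(wf["stuck_since"])
        minutes = (now - since).total_seconds() / 60
        severity = stuck_severity(minutes)
        if severity == "healthy":
            continue
        stuck.append({
            "workflow_id": wf["workflow_id"],
            "workflow_name": wf["workflow_name"],
            "stuck_since": wf["stuck_since"],
            "stuck_duration_min": round(minutes, 1),
            "last_progress": wf.get("last_progress"),  # None when not recorded
            "severity": severity,
        })
    return stuck
```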

Recovery Actions

  • no_action — within normal parameters
  • notify — alert human, no automatic restart
  • restart — attempt automatic restart
  • escalate — human intervention required
  • degraded_mode — reduce functionality, maintain partial service
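
The skill does not state how a severity maps to one of these actions. The Python sketch below shows one possible policy, purely as an assumption: warnings notify, critical issues are restarted until a restart budget is used up, and then they are escalated. The function `recommend_action`, the `recent_restarts` count (imagined as derived from `restart_records`), and the `max_restarts` default of 3 are all hypothetical.

```python
def recommend_action(
    target: str,
    severity: str,
    recent_restarts: int,
    max_restarts: int = 3,  # assumed restart budget, not from the skill
) -> dict:
    """Return one recovery_recommendation entry under a hypothetical policy."""
    if severity == "healthy":
        action, reason = "no_action", "within normal parameters"
    elif severity == "warning":
        action, reason = "notify", "may self-recover; alerting a human"
    elif severity == "critical":
        if recent_restarts < max_restarts:
            attempt = recent_restarts + 1
            action = "restart"
            reason = f"critical; restart attempt {attempt} of {max_restarts}"
        else:
            action = "escalate"
            reason = f"critical after {recent_restarts} restarts"
    else:
        raise ValueError(f"unexpected severity: {severity!r}")
    return {"action": action, "target": target, "reason": reason}
```

`degraded_mode` is left out of this sketch because the source does not say when it should be preferred over a restart or an escalation.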

Failure Handling

If monitoring data is incomplete:

  • Set observability_gap with missing field names
  • Report status = "unknown" for affected services
  • Do not fabricate health states
  • Recommend escalate if critical services have observability gaps
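
The failure-handling steps above can be sketched in Python as follows. The sketch assumes `service_list` is a list of service names, that a service without an entry in `health_endpoints` counts as unobservable, and that the caller supplies the set of critical services. The function names and the error text are hypothetical.

```python
from typing import Optional

REQUIRED_INPUTS = (
    "service_list",
    "health_endpoints",
    "heartbeat_records",
    "workflow_status_records",
    "restart_records",
)


def find_observability_gap(inputs: dict) -> Optional[list]:
    """Return the names of required inputs that are missing, or None if complete."""
    missing = [name for name in REQUIRED_INPUTS if inputs.get(name) is None]
    return missing or None


def report_unobserved(services, health_endpoints, critical_services, check_time):
    """Mark services without a health endpoint as unknown; escalate critical ones."""
    summary, recommendations = [], []
    for service in services:
        if service in health_endpoints:
            continue  # observable services are checked elsewhere
        summary.append({
            "service": service,
            "status": "unknown",  # never fabricate a health state
            "last_check": check_time,
            "latency_ms": None,
            "error": "no health endpoint configured",
        })
        if service in critical_services:
            recommendations.append({
                "action": "escalate",
                "target": service,
                "reason": "critical service has an observability gap",
            })
    return summary, recommendations
```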

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 05:56 Safe Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
