This skill acts as an advanced diagnostic, resolution, and validation engine for any question or bug report related to the OpenClaw framework itself. v6.0 adds four layers: Runtime Health Check (M1), API Key Validation + Resource Monitor (M2), Unified Diagnosis Report + Regression Check (M3), and Interactive Health Dashboard (M4).
All knowledge storage (memory, logs) and final reports must follow these rules:
sk-).

Use this skill when the user:
gateway tool fails with error X").

For the vast majority of OpenClaw issues, this sequence provides the fastest path to resolution. Always suggest this flow first when a user reports an unspecified problem or bug!
1. Run `python scripts/diagnosis_formatter.py`, which auto-collects all three sources (`openclaw doctor`, `runtime_health_check`, and `api_key_validator`) into one severity-sorted report.
2. Run `python scripts/health_dashboard.py --canvas` to render the report as an interactive HTML dashboard (embed with `[embed ref="health_dashboard" height="740"]`).
3. Run `python scripts/diagnosis_formatter.py --save-baseline` before making any fix.
4. Run `openclaw doctor --fix` or apply the suggested fixes manually.
5. Run `python scripts/diagnosis_formatter.py --compare` to validate what was fixed, what's new, and what's unchanged.

The skill operates by strictly following these steps in sequence, enhanced by proactive layers:
Overview:
A background daemon that periodically polls the Gateway health status. Runs independently from user requests, providing real-time monitoring for anomalies such as Gateway downtime, RPC failures, and configuration drift.
v6.1 Feature Highlights:
| Feature | Description |
|---|---|
| 🎯 Real Health Check | Calls openclaw gateway status --json, parses service.runtime.status + rpc.ok |
| 🔇 Noise Filtering | Alerts only after ≥3 consecutive failures; resets after ≥3 consecutive successes |
| 📊 Severity Levels | Four-tier classification (🟢/🟡/🟠/🔴) with auto-escalation |
| 📡 Dual-Channel Alerting | Feishu DM (instant, primary) + WebChat (async thread, secondary) |
| 🔄 Single Instance | Windows Mutex ensures only one daemon runs at a time |
| 📦 Log Rotation | Auto-rotates at 5MB, keeps 3 backup files |
| ⏰ Precise Scheduling | Fixed-minute schedule eliminates cumulative drift |
| 🔐 Hot-Reload Config | Monitors openclaw.json changes and reloads automatically |
| 🖥️ Auto-Start | Registers in HKCU\Run for auto-launch on user login |
| 👋 Startup Confirmation | Sends status to both channels on startup |
| 🐛 Config Cache Fix | Fixed load_gateway_config() returning token=None on cache hit (v6.1) |
| ⏱️ Async WebChat | Runs in a background thread with a 60s timeout to allow for model loading (fixed in v6.1) |
| 📝 Detailed Error Logs | Includes full stack traces in Feishu + WebChat notifications (fixed in v6.1) |
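
The "Real Health Check" row can be illustrated with a minimal sketch. It assumes the JSON printed by `openclaw gateway status --json` nests its fields exactly as the dotted paths `service.runtime.status` and `rpc.ok` suggest. The helper names `dig` and `is_gateway_healthy` are hypothetical and are not taken from `watchdog_monitor.py`.

```python
import json
from typing import Any


def dig(data: Any, *keys: str) -> Any:
    """Walk nested dicts, returning None as soon as a key is missing."""
    for key in keys:
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def is_gateway_healthy(status_json: str) -> bool:
    """Healthy only if the runtime is running AND the RPC probe succeeded."""
    try:
        payload = json.loads(status_json)
    except json.JSONDecodeError:
        # Unparseable CLI output is treated as a failed check.
        return False
    running = dig(payload, "service", "runtime", "status") == "running"
    rpc_ok = dig(payload, "rpc", "ok") is True
    return running and rpc_ok
```

Treating malformed or partial output as unhealthy keeps the check conservative: a missing field counts as a failure rather than a silent pass.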
📊 Severity & Alert Rules:
| Consecutive Failures | Level | Behavior |
|---|---|---|
| < 2 | 🟢 Level 1 — Normal | Silent, no notification |
| 2 | 🟡 Level 2 — Notice | Silent, continue monitoring |
| ≥ 3 | 🟠 Level 3 — Warning | Trigger notification (first time) |
| ≥ 5 | 🔴 Level 4 — Critical | Trigger notification + repeat every 5 failures |
| Gateway stopped | 🔴 Level 4 — Critical | Immediate notification |
| Recovered for 3 cycles | ✅ Recovered | Send recovery notification |
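
The table above can be read as a small decision function. The sketch below is one possible interpretation, not the daemon's actual code: `classify` and `should_notify` are hypothetical names, and "repeat every 5 failures" is assumed to mean a notification at 5, 10, 15, and so on.

```python
def classify(consecutive_failures: int, gateway_stopped: bool = False) -> int:
    """Map the failure streak to a severity level from 1 to 4."""
    if gateway_stopped or consecutive_failures >= 5:
        return 4  # Critical
    if consecutive_failures >= 3:
        return 3  # Warning
    if consecutive_failures == 2:
        return 2  # Notice
    return 1      # Normal


def should_notify(consecutive_failures: int, gateway_stopped: bool = False) -> bool:
    """Decide whether this check cycle should send an alert."""
    if gateway_stopped:
        return True  # a stopped Gateway alerts immediately
    if consecutive_failures == 3:
        return True  # first warning
    # Assumed reading of "repeat every 5 failures": 5, 10, 15, ...
    return consecutive_failures >= 5 and consecutive_failures % 5 == 0
```

Keeping the level and the notification decision separate lets the silent levels (1 and 2) still be recorded in the state file without triggering a message.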
🚨 Notification Triggers:
- Run `openclaw channels add feishu`.
- Set `WATCHDOG_FEISHU_USER_ID` to your Feishu open_id:

```powershell
$env:WATCHDOG_FEISHU_USER_ID = "ou_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
```
```bash
openclaw config set gateway.http.endpoints.chatCompletions.enabled true
openclaw gateway restart
```
> ⚠️ WebChat timeout: The model inference takes ~40s on first load.
> The watchdog uses a background thread with 60s timeout so it doesn't block the main monitoring loop.
Run the Watchdog as a standalone background process:
```powershell
# Start
python scripts\watchdog_monitor.py

# Install auto-start (launches on user login)
python scripts\watchdog_monitor.py --install

# Remove auto-start
python scripts\watchdog_monitor.py --uninstall
```
Or use Start-Process for a hidden window:
```powershell
$py = (Get-Command python).Source
$script = "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\watchdog_monitor.py"
Start-Process -FilePath $py -ArgumentList $script `
    -WindowStyle Hidden `
    -WorkingDirectory "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts"
```
Check process status:
```powershell
Get-WmiObject Win32_Process -Filter "Name like 'python%'" |
    Where-Object { $_.CommandLine -match 'watchdog_monitor' } |
    Select-Object ProcessId, @{n="Start";e={$_.CreationDate}}
```
Stop the Watchdog:
```powershell
# Find the PID first, then
Stop-Process -Id <PID> -Force
```
View live logs:
```powershell
Get-Content "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\gateway_watchdog.log" -Tail 10 -Wait
```
View state file:
```powershell
Get-Content "$env:USERPROFILE\.openclaw\workspace\skills\autofix\scripts\watchdog_state.json" -Raw | ConvertFrom-Json
```
```text
Watchdog (background daemon, 60s interval)
│
├─ [Channel A - PRIMARY] openclaw message send --channel feishu
│     → Feishu direct message (ou_xxx)
│     → Instant delivery, zero token cost
│     → Includes full error stack traces
│
├─ [Channel B - SECONDARY] Gateway HTTP API (/v1/chat/completions)
│     → WebChat live session (agent:main:main)
│     → Async background thread (doesn't block monitoring)
│     → 60s timeout for model loading (~40s typical)
│     → token cost: minimal (max_tokens=10)
│
└─ [Log] watchdog_state.json (local check history, last 1440)
         gateway_watchdog.log (rotating, 5MB)
```
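
The two local stores in the diagram map naturally onto Python standard-library primitives. The sketch below is an illustration under that assumption, not the watchdog's actual implementation: a bounded `deque` keeps the last 1440 checks (24 hours at one check per 60 s), and `RotatingFileHandler` rotates at 5 MB with 3 backups. The function names are hypothetical.

```python
from collections import deque
from logging.handlers import RotatingFileHandler

MAX_RECORDS = 1440               # one check per 60 s -> 24 hours of history
MAX_LOG_BYTES = 5 * 1024 * 1024  # rotate the log at 5 MB
BACKUP_COUNT = 3                 # keep 3 rotated backup files


def make_history() -> deque:
    """Bounded history: appending past the limit drops the oldest record."""
    return deque(maxlen=MAX_RECORDS)


def make_log_handler(path: str) -> RotatingFileHandler:
    """Size-based rotating handler; delay=True opens the file on first write."""
    return RotatingFileHandler(
        path,
        maxBytes=MAX_LOG_BYTES,
        backupCount=BACKUP_COUNT,
        encoding="utf-8",
        delay=True,
    )
```

A bounded `deque` avoids manual trimming of the history list before each write to `watchdog_state.json`.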
Channel priority changed in v6.1: Feishu is now the primary channel (instant, reliable via CLI), and WebChat is secondary (async thread, requires model inference).
- run autofix self-check
- check what's wrong with Gateway
- auto repair

The Watchdog forms a Proactive Stability Layer, independent of the standard diagnostic flow (Steps 0-5). When an anomaly is detected:
a. The daemon logs the event and generates a System Health Warning (SHW) report.
b. It sends a real-time alert (with diagnostic guidance + context JSON).
c. It auto-repairs low-risk known issues (e.g., CLI path problems), then verifies the result.
d. For high-risk operations, it only provides repair suggestions and awaits user confirmation.
Repair Script Library: scripts/auto_repair.py
Matches repair plans based on the diagnostic context from Watchdog alerts:
| Issue | Match Condition | Repair Action | Risk |
|---|---|---|---|
| Gateway stopped | status: stopped | Restart Gateway | 🟡 Needs confirmation |
| RPC connection failed | rpc_ok: false | Restart Gateway | 🟡 Needs confirmation |
| CLI unavailable | status: cli_error | Check installation path | 🟢 Auto-execute |
| HTTP unreachable | status: unreachable | Check port + restart | 🟡 Needs confirmation |
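
The matching rules in the table can be sketched as a lookup over the alert context. This is a hedged illustration, not the contents of `scripts/auto_repair.py`: the `RepairPlan` type, the `match_repair_plan` name, the dict-shaped context, and the order in which the conditions are tested are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RepairPlan:
    issue: str
    action: str
    auto_execute: bool  # False means the plan needs user confirmation


def match_repair_plan(context: dict) -> Optional[RepairPlan]:
    """Pick a repair plan from the diagnostic context of a Watchdog alert."""
    status = context.get("status")
    if status == "stopped":
        return RepairPlan("Gateway stopped", "Restart Gateway", False)
    if status == "cli_error":
        return RepairPlan("CLI unavailable", "Check installation path", True)
    if status == "unreachable":
        return RepairPlan("HTTP unreachable", "Check port + restart", False)
    if context.get("rpc_ok") is False:
        return RepairPlan("RPC connection failed", "Restart Gateway", False)
    return None  # nothing known matches; fall back to manual diagnosis
```

Only the plan marked `auto_execute=True` would run without asking; every other plan is surfaced as a suggestion, matching the risk column.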
Repair Verification Loop:
Health Trend Tracking:
- `watchdog_state.json` retains the last 1440 check records (24 hours)
- `service.runtime.status = running` + `rpc.ok = true`
- `openclaw message send --channel feishu`, zero token cost, instant delivery
- `/v1/chat/completions`, async background thread, 60s timeout
- `openclaw.json` for changes
- `HKCU\Run` registry, launches on user login
- `--status` command: shows real-time state and exits cleanly (no longer starts daemon by accident)
- `--status` processes on daemon startup
- `load_gateway_config()` no longer returns `token=None` on subsequent calls
- `scripts/watchdog_state.json`
- `scripts/gateway_watchdog.log`

This skill strictly follows these steps in sequence, enhanced by proactive layers:
Before any resource-intensive external search or service call, proactively check API quotas, rate limits, and budget consumption for the current active session. If quota-low alerts or known rate-limit thresholds are hit, pause all execution steps and notify the user with a clear "resource warning," requesting they wait or switch to a low-cost / local alternative.
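
A pre-flight guard of this kind might look like the sketch below. Everything in it is hypothetical: the source does not specify how quotas are read, so `QuotaSnapshot`, `resource_warning`, and the 10% low-quota threshold are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaSnapshot:
    """Hypothetical view of the session's quota and budget state."""
    remaining_requests: int
    request_limit: int
    budget_spent: float
    budget_limit: float


def resource_warning(q: QuotaSnapshot, low_ratio: float = 0.10) -> Optional[str]:
    """Return a warning message if execution should pause, else None."""
    if q.request_limit > 0 and q.remaining_requests / q.request_limit <= low_ratio:
        return "resource warning: request quota is low; wait or switch to a local alternative"
    if q.budget_limit > 0 and q.budget_spent >= q.budget_limit:
        return "resource warning: budget exhausted; wait or switch to a low-cost alternative"
    return None
```

A caller would run this check before each external search or service call and stop the step sequence as soon as a message is returned.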
- `docs/MODULE_02_SearchChain.md`, Step 1)
- `docs.openclaw.ai`) for official solutions
- `docs/MODULE_02_SearchChain.md`, Step 2)
- `docs/MODULE_03_ValidationAction.md`, Step 3)
- `docs/MODULE_03_ValidationAction.md`, Step 4 + `docs/MODULE_03_Enhancement_Reports.md`)
- `openclaw doctor --fix`, exec/write), follow these safety steps:
- `/approve`
- `docs/MODULE_04_Finalization.md`)

> 💡 Golden Path (Recommended Flow): For most OpenClaw issues, the fastest resolution path is: `openclaw doctor` → `openclaw doctor --fix`
When MRE validation fails, generate an interactive diagnostic report using canvas.snapshot() with:
When MRE fails, use LLM-powered analysis to extract root causes from exec output:
Consult the following categorized sub-documents for detailed process explanations:
- `docs/`: Core Module Documentation
- `docs/enhancement/`: v5.0 Enhancement Features
- `docs/tutorials/`: Usage Examples
- `docs/reports/`: Summary Reports
- `scripts/`: Python/JS Tools

This file is the master skill document. It defines the complete problem-solving blueprint and integrates all capability layers.