← 返回
未分类 中文

Services Watchdog

Set up a systemd-based watchdog that keeps long-running Node.js services (Telegram bots, Express dashboards, etc.) alive across shell exits, ssh disconnects,...
设置基于 systemd 的看门狗,使 Node.js 长运行服务(如 Telegram 机器人、Express 仪表盘等)在 shell 退出、SSH 断开等情况下保持运行。
dyagil
未分类 clawhub v1.0.0 1 版本 100000 Key: 无需
★ 0
Stars
📥 257
下载
💾 0
安装
1
版本
#latest

概述

Services Watchdog

Problem

Long-running Node services launched from a parent shell (or as children of an agent runtime) die when the parent exits. Runtime restarts are especially aggressive — they tend to take down everything they spawned as collateral damage. Manual nohup/setsid rituals survive an ssh disconnect but not a reboot.

Architecture

my-watchdog.timer          (systemd --user; OnUnitActiveSec=2min)
    ↓
my-watchdog.service        (Type=oneshot; KillMode=process)
    ↓
services-watchdog.sh
    ↓
for each service: check → if down → systemd-run --user --scope → exec node

Two non-obvious details make this actually work:

  1. KillMode=process + systemd-run --user --scope — without this, systemd kills the children of a Type=oneshot service as soon as the service exits. The combination puts each restarted service in its own transient scope, outside the watchdog's cgroup.
  2. .env is loaded INSIDE the new scope. The watchdog wraps the start command in bash -c 'cd && set -a && . ./.env; set +a && exec node '. This propagates every env var without the watchdog having to know which ones the service needs (TELEGRAM_BOT_TOKEN, OPENAI_API_KEY, …).

Files

Rename the unit files to match your own prefix (e.g. mybot-watchdog.*) when adopting.

Install

WORKSPACE="$HOME/.openclaw/workspace"     # or wherever your projects live
mkdir -p "$WORKSPACE/scripts" "$WORKSPACE/logs" ~/.config/systemd/user

cp scripts/services-watchdog.sh   "$WORKSPACE/scripts/"
cp scripts/sahi-watchdog.service  ~/.config/systemd/user/
cp scripts/sahi-watchdog.timer    ~/.config/systemd/user/
chmod +x "$WORKSPACE/scripts/services-watchdog.sh"

systemctl --user daemon-reload
systemctl --user enable --now sahi-watchdog.timer
loginctl enable-linger "$USER"   # keeps the timer running when not logged in

Verify

# State after most recent run:
cat ~/.openclaw/workspace/memory/watchdog-state.json
# Recent recoveries / failures:
tail ~/.openclaw/workspace/logs/watchdog.log
# Schedule:
systemctl --user list-timers sahi-watchdog.timer --no-pager

End-to-end test (replace 4321 with the port your service listens on):

PID=$(ss -tlnp 2>/dev/null | awk '/:4321 /{print $NF}' | grep -oP 'pid=\K[0-9]+' | head -1)
kill "$PID"
systemctl --user start sahi-watchdog.service   # don't wait 2 min
ss -tln | grep 4321                            # should be listening again

(Do NOT use pkill -f "myservice/server.js" to kill the test target — your own exec shell often matches the same regex and gets SIGTERM'd.)

Adapt to a New Service

In services-watchdog.sh, add three things and append the service name to the services=() array:

check_myservice() {
  pgrep -f "<unique-marker-in-cmdline>" >/dev/null 2>&1
}

restart_myservice() {
  cd "$WORKSPACE/projects/myservice" || return 1
  systemd-run --user --scope --quiet --unit="myservice-$(date +%s%N)" \
    --setenv=PATH="$PATH" --setenv=HOME="$HOME" \
    bash -c 'cd '"$WORKSPACE"'/projects/myservice && set -a && [ -f .env ] && . ./.env; set +a && exec nohup node src/index.js >> logs/svc.log 2>&1 < /dev/null' &
  disown 2>/dev/null || true
  sleep 3
  check_myservice
}

labels_myservice="My Service"

Gotchas (Learned the Hard Way)

  • Don't use Type=simple for the systemd service — that keeps the watchdog itself alive long after it should have exited, and it re-enters every 2 minutes.
  • PATH inside systemd-run --user --scope is minimal. Always pass --setenv=PATH="$PATH" if a child relies on ~/.npm-global/bin or similar; or call binaries by absolute path.
  • pgrep -f matches the watchdog shell itself. Use a unique marker (file path) when defining check_*, e.g. pgrep -f "myservice/src/index", not just pgrep -f "node src/index.js" which can collide with other projects.
  • Type=oneshot with default KillMode=control-group kills the children you just spawned. Always set KillMode=process AND launch via systemd-run --user --scope so the new process lives outside the watchdog's cgroup.

See Also

  • A taskflow or cron skill for one-shot scheduled tasks. The watchdog is for "always-on" services, not periodic jobs.

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-05-21 14:30 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

Oganim Deploy

dyagil
开发、部署并验证对 Vercel + Supabase 项目(营销网站、CRM 和客户门户)的更改。适用于代理需要发布代码时。
★ 0 📥 295

Magic Link Bridge

dyagil
生成 Supabase 魔术链接,直接跳转到自定义门户子路径(例如 /portal/),而不是被静默重写为项目站点 URL
★ 0 📥 229

Firecrawl

dyagil
通过 Firecrawl API 为 AI 代理抓取、搜索、映射和爬取网页,适用于需要从 JS 重度或 SPA 站点获取干净 Markdown 的场景。
★ 0 📥 232