> "The agent knew it was wrong. The knowledge didn't matter." — PocketOS log, 2026
A mandatory safety skill that intercepts destructive AI agent operations before execution. It employs a Context-Aware Risk Scoring (CARS) system to balance security with operational velocity.
This skill is mandatory. No opt-out. No override by the executing agent.
It is based on the principle that reasoning is not a guardrail: an agent can recognize that an operation is wrong and still execute it.
BEFORE any tool call:
1. SCAN operation against DESTRUCTIVE taxonomy
2. IF destructive → ENTER Guardian Protocol
3. EVALUATE Risk Level via CARS Matrix
4. EXECUTE Decision Path:
- LOW: Auto-Approve (Log only)
- MEDIUM: Fast-Track (Verify Backup → Proceed)
- HIGH: Hard Block (Verify Backup → Human Approval)
5. IF JIT Window Active → High may be downgraded to Medium, but Human Approval is still required for High-risk operations
| Risk Level | Trigger Criteria | Action | Verification Required |
|---|---|---|---|
| Low | Files in /tmp, sandbox/, or .cache; Single file deletions in non-critical paths. | Auto-Approve | None (Log only) |
| Medium | Edits to .config or .env files; Deletions of < 5 files in a Git-tracked directory. | Fast-Track | Verified backup required (Git, snapshot, or cloud sync) |
| High | rm -rf on root/home; DROP TABLE; Edits to system files; Mass file deletions (>10). | Hard Block | Mandatory backup verification + Human Approval required regardless of backup status |
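A minimal sketch of how a few rows of the matrix above might be turned into a classifier. The regular expressions, the `classify` function, the `git_tracked` flag, and the choice to fail closed (anything unmatched becomes High) are assumptions for illustration; the full taxonomy is not reproduced here, and deletions of 5 to 10 files, which the matrix does not assign, also fall through to High.

```python
import re
from enum import Enum
from typing import Sequence


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical patterns standing in for the full destructive-operation taxonomy.
HIGH_RISK_COMMANDS = (
    re.compile(r"\brm\s+-rf\s+(/|~|/home)(\s|/|$)"),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
)
LOW_RISK_PREFIXES = ("/tmp/", "sandbox/", ".cache/")
MASS_DELETE_THRESHOLD = 10  # "Mass file deletions (>10)"


def classify(command: str, paths: Sequence[str], git_tracked: bool = False) -> Risk:
    """Map a destructive operation to a CARS risk level (partial sketch)."""
    if any(rx.search(command) for rx in HIGH_RISK_COMMANDS):
        return Risk.HIGH
    if len(paths) > MASS_DELETE_THRESHOLD:
        return Risk.HIGH
    if paths and all(p.startswith(LOW_RISK_PREFIXES) for p in paths):
        return Risk.LOW
    config_edit = bool(paths) and all(p.endswith((".env", ".config")) for p in paths)
    small_tracked_delete = git_tracked and 0 < len(paths) < 5
    if config_edit or small_tracked_delete:
        return Risk.MEDIUM
    # Assumption: fail closed when no row of the matrix clearly applies.
    return Risk.HIGH
```

Checking the High criteria first means a mass deletion inside `/tmp` is still High, so a low-risk path cannot mask a high-risk pattern.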
| Scenario | Action |
|---|---|
| ANY destructive operation | Backup verification required |
| Low risk + verified backup | PROCEED |
| Low risk + no backup | PROCEED with warning |
| Medium risk + verified backup | PROCEED |
| Medium risk + no backup | HALT + Human approval required |
| High risk | ALWAYS HALT + Human approval required |
| Repeated same pattern | Flag pattern, require operator review |
A JIT (Just-In-Time) window can temporarily downgrade High to Medium risk, but never eliminates the human approval requirement for High risk. Human approval is always required for High-risk destructive operations.
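The scenario table above can be read as a small pure function. The sketch below is one possible encoding, with `Risk`, `Decision`, and `decide` as illustrative names. It leaves the JIT window out of the decision, since the text states that JIT never removes human approval for High risk, and it does not model the "repeated same pattern" row, which would need history.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Decision(Enum):
    PROCEED = "PROCEED"
    PROCEED_WITH_WARNING = "PROCEED with warning"
    HALT = "HALT + Human approval required"


def decide(risk: Risk, backup_verified: bool) -> Decision:
    """Return the action for one destructive operation."""
    if risk is Risk.HIGH:
        # High risk always halts, regardless of backup status.
        return Decision.HALT
    if risk is Risk.MEDIUM:
        return Decision.PROCEED if backup_verified else Decision.HALT
    # Low risk proceeds either way; a missing backup only adds a warning.
    return Decision.PROCEED if backup_verified else Decision.PROCEED_WITH_WARNING
```

For example, `decide(Risk.MEDIUM, backup_verified=False)` yields `Decision.HALT`, matching the "Medium risk + no backup" row.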
Every tool call is scanned against the taxonomy above. No agent discretion. No "I know what I'm doing."
VERIFY-BACKUP(target):
1. Check if target is covered by active backup system
2. Common indicators:
- .git repository with clean status
- Time Machine / File History active on target volume
- Cloud sync (OneDrive, Dropbox, Google Drive, iCloud) with recent sync
- Explicit backup tool (restic, duplicity, rsnapshot) with recent snapshot
- Versioned storage (ZFS snapshots, S3 versioning)
3. IF any indicator active AND recent → RETURN VERIFIED
4. ELSE → RETURN UNVERIFIED
Fast path: Backup verification must complete in <2 seconds. No long-running checks.
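As a rough illustration of VERIFY-BACKUP, the sketch below implements only the first indicator (a Git repository with clean status) and enforces the 2-second budget with a subprocess timeout. The function names are hypothetical and this is not the contents of the referenced `verify-backup` scripts. It also assumes that a clean `git status --porcelain` is good enough evidence, which ignores files excluded by `.gitignore`.

```python
import subprocess
from pathlib import Path

TIMEOUT_SECONDS = 2.0  # "Backup verification must complete in <2 seconds"


def git_clean(target: Path) -> bool:
    """True if target sits in a Git work tree with no uncommitted changes."""
    directory = target if target.is_dir() else target.parent
    try:
        result = subprocess.run(
            ["git", "-C", str(directory), "status", "--porcelain"],
            capture_output=True,
            text=True,
            timeout=TIMEOUT_SECONDS,
        )
    except (OSError, subprocess.TimeoutExpired):
        # git not installed, or the check exceeded the time budget.
        return False
    # Exit code 0 with empty output means a repository with nothing to commit.
    return result.returncode == 0 and result.stdout.strip() == ""


def verify_backup(target: str) -> str:
    """Return VERIFIED if any indicator passes, else UNVERIFIED."""
    # Other indicators (snapshots, cloud sync, backup tools) would be added here.
    indicators = [git_clean]
    if any(check(Path(target)) for check in indicators):
        return "VERIFIED"
    return "UNVERIFIED"
```

Any failure, including a timeout, resolves to UNVERIFIED, so a slow or broken check can never be mistaken for a backup.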
| Backup Status | Risk Level | Action |
|---|---|---|
| VERIFIED ACTIVE | Low / Medium | PROCEED with execution |
| VERIFIED ACTIVE | High | HALT and ESCALATE to human |
| UNVERIFIED | Low | PROCEED with warning |
| UNVERIFIED | Medium / High | HALT and ESCALATE to human |
| UNKNOWN | Any | Treat as UNVERIFIED |
Sidenote: If a JIT Window is active, High-risk operations may be treated as Medium for classification, but human approval is still required before execution.
When escalation is required, Guardian MUST output:
🛡️ GUARDIAN HALT
Operation: [specific tool call]
Target: [file/path/database/endpoint]
Category: [taxonomy category]
Risk Level: [CRITICAL/HIGH/MEDIUM]
Backup Status: [UNVERIFIED / last backup: X hours ago]
Proposed Action: [what the agent wants to do]
Potential Impact: [what could go wrong]
Options:
1. APPROVE — Proceed with execution (human responsibility)
2. DENY — Cancel operation
3. SNAPSHOT — Create quick backup first, then proceed
4. REVIEW — Agent provides additional justification
Guardian awaits human decision.
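One way the human's reply to the four options above might be parsed is sketched below. `HumanChoice` and `parse_choice` are illustrative names. The sketch assumes the reply is either the option number or the option keyword, and that anything else leaves Guardian waiting rather than defaulting to approval.

```python
from enum import Enum
from typing import Optional


class HumanChoice(Enum):
    APPROVE = 1
    DENY = 2
    SNAPSHOT = 3
    REVIEW = 4


def parse_choice(reply: str) -> Optional[HumanChoice]:
    """Map a human reply to one of the four options, or None if unrecognized."""
    token = reply.strip().upper()
    for choice in HumanChoice:
        if token in (choice.name, str(choice.value)):
            return choice
    # Unrecognized input is never treated as approval.
    return None
```

Returning `None` for ambiguous replies such as "yes" or "ok" keeps the halt in place until the human picks an explicit option.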
Guardian operates at the tool-call layer, between the agent's decision and the tool's execution:
Agent Decision → Guardian Intercept → [Verify Backup] → Execute OR Escalate
If the runtime doesn't support interception, Guardian operates as a mandatory pre-flight check:
BEFORE calling any tool:
1. Agent MUST call Guardian check
2. Guardian returns PROCEED or HALT
3. Agent respects HALT, awaits escalation resolution
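Where the runtime cannot intercept tool calls, the pre-flight steps above could be approximated by wrapping each tool. The decorator below is a sketch under that assumption; `guarded`, `GuardianHalt`, `demo_check`, and the two sample tools are hypothetical, and a real check would apply the taxonomy and backup verification rather than a name match.

```python
import functools
from typing import Any, Callable, Dict


class GuardianHalt(Exception):
    """Raised when the Guardian check returns HALT; the tool is not invoked."""


def guarded(check: Callable[[str, Dict[str, Any]], str]):
    """Wrap a tool so the Guardian check always runs before it."""
    def decorator(tool: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(tool)
        def wrapper(**arguments: Any) -> Any:
            verdict = check(tool.__name__, arguments)
            if verdict != "PROCEED":
                # Anything other than an explicit PROCEED is treated as HALT.
                raise GuardianHalt(f"{tool.__name__} halted pending escalation")
            return tool(**arguments)
        return wrapper
    return decorator


def demo_check(tool_name: str, arguments: Dict[str, Any]) -> str:
    # Hypothetical policy for illustration only: block every delete.
    return "HALT" if tool_name.startswith("delete") else "PROCEED"


@guarded(demo_check)
def read_file(path: str) -> str:
    return f"contents of {path}"


@guarded(demo_check)
def delete_file(path: str) -> str:
    return f"deleted {path}"
```

Because the check runs inside the wrapper, the agent cannot reach the underlying tool without passing through it, which is closer to interception than relying on the agent to remember step 1.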
Every Guardian decision is logged:
[Timestamp] [Operation] [Category] [Backup Status] [Decision] [Approver]
Logs are append-only. No deletion by the executing agent.
Sidenote: All operations within a JIT window are tagged with [JIT-GRANTED] in the audit log.
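The log line format above might be produced as follows. This is a sketch with hypothetical function names. Opening the file in append mode only prevents this writer from overwriting earlier entries; actually stopping the executing agent from deleting the log would need file permissions or an external log sink, which is outside this example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def format_entry(
    operation: str,
    category: str,
    backup_status: str,
    decision: str,
    approver: str = "-",
    jit: bool = False,
    now: Optional[datetime] = None,
) -> str:
    """Build one audit line in the bracketed field order used by Guardian."""
    stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    fields = [stamp, operation, category, backup_status, decision, approver]
    if jit:
        fields.append("JIT-GRANTED")  # tag for operations inside a JIT window
    return " ".join(f"[{field}]" for field in fields)


def append_entry(log_path: Path, entry: str) -> None:
    """Append a single line; existing content is never rewritten."""
    with log_path.open("a", encoding="utf-8") as log:
        log.write(entry + "\n")
```

The `approver` field defaults to `-` for auto-approved operations, an assumption made here so that every line has the same number of fields.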
Vanilla: This skill is generic. Not specific to any agent, platform, or deployment.
Mandatory: Once enabled, all sessions load this skill. No opt-out.
Non-Blocking (when safe): Backup-verified Low- and Medium-risk operations proceed without delay. No human wait for routine maintenance with verified backups.
- references/OPERATION-TAXONOMY.md: Full destructive operation classification
- references/DECISION-MATRIX.md: Detailed backup verification logic and escalation rules
- scripts/verify-backup.ps1: Windows backup detection script
- scripts/verify-backup.sh: Linux/macOS backup detection script