Hallucination Guard — 4-Layer AI Fabrication Defense

Detect and prevent AI agent hallucinations during task execution. Use when: (1) an agent claims to have created files, commits, or artifacts — verify them, (...
Author: scytheshan-pixel · Source: clawhub · Category: Uncategorized · Version: v1.0.0 · Key: not required

Hallucination Guard

4-layer defense against agent fabrication. Each layer is independent — use one or combine.

When Hallucinations Happen

Highest-risk conditions (apply more layers when these are present):

  • Extended sessions (>50 turns or >30 minutes of continuous work)
  • Tasks involving file creation, code, git, or data analysis
  • Agent reporting quantitative results (numbers, metrics, PnL)
  • Multiple sequential "successes" with no errors or retries

Layer 0: Context Hygiene (Prevention)

Reduce hallucination probability before it starts.

For long tasks (>10 steps):

  1. Break into segments of ≤8 steps each
  2. Between segments: flush working state to a file, reload from file (not from in-context memory)
  3. Each segment starts by reading the state file; never trust carried-over context for facts
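
The segment-and-reload discipline above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the JSON state format, the function names, and the `run_step` callback are all assumptions made for the example.

```python
import json
from pathlib import Path

MAX_STEPS_PER_SEGMENT = 8  # from the "segments of <=8 steps" rule above


def split_into_segments(steps, size=MAX_STEPS_PER_SEGMENT):
    """Split a list of steps into consecutive segments of at most `size` steps."""
    return [steps[i:i + size] for i in range(0, len(steps), size)]


def flush_state(state, path):
    """Write the working state to disk so the next segment can reload it."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_state(path):
    """Read the working state back from disk (the only trusted source of facts)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def run_segmented(steps, state_path, run_step):
    """Run steps segment by segment, reloading state from the file each time.

    `run_step(step, state)` is a hypothetical callback returning the updated state.
    """
    flush_state({"completed": []}, state_path)
    for segment in split_into_segments(steps):
        state = load_state(state_path)   # start every segment from the file
        for step in segment:
            state = run_step(step, state)
        flush_state(state, state_path)   # flush before the next segment begins
    return load_state(state_path)
```

A 20-step task would be split into segments of 8, 8, and 4 steps, with one file round-trip between each pair of segments.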

For data-intensive tasks:

  • Load source data from files at point of use, not from earlier context
  • If a number was mentioned 20+ turns ago, re-read the source before citing it
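
One way to apply the 20-turn rule mechanically is to record the turn at which each value was last read and re-read once it is stale. The sketch below assumes a hypothetical `read_source(key)` loader and a simple in-memory `remembered` dict; neither comes from the skill.

```python
STALE_AFTER_TURNS = 20  # from the "20+ turns ago" rule above


def needs_reread(mentioned_turn, current_turn, threshold=STALE_AFTER_TURNS):
    """True when a value was last read `threshold` or more turns ago."""
    return current_turn - mentioned_turn >= threshold


def cite_value(key, remembered, current_turn, read_source):
    """Return a value to cite, re-reading the source when the memory is stale.

    `remembered` maps key -> (value, turn_when_read).
    `read_source(key)` is a hypothetical loader that reads from the source file.
    """
    entry = remembered.get(key)
    if entry is None or needs_reread(entry[1], current_turn):
        value = read_source(key)
        remembered[key] = (value, current_turn)
        return value
    return entry[0]
```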

Cost: Zero. This is a workflow discipline, not an API call.

Layer 1: Claim-Evidence Protocol (Detection)

Every agent claim of physical action must include tool-verified evidence.

The Rule

CLAIM:    "I created/modified/committed X"
EVIDENCE: Tool output proving X exists and matches the claim
STATUS:   VERIFIED (evidence confirms) or UNVERIFIED (no evidence yet)
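
As a rough illustration, the rule can be modeled as a small record whose status is derived from its evidence rather than asserted. The `Claim` class and its `confirmed` flag are assumptions for this sketch; the skill only defines the three fields shown above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str                       # e.g. "I created report.md"
    evidence: Optional[str] = None  # raw tool output, if any was captured
    confirmed: bool = False         # set only after comparing evidence to the claim

    @property
    def status(self):
        """VERIFIED only when evidence exists and confirms the claim."""
        return "VERIFIED" if bool(self.evidence) and self.confirmed else "UNVERIFIED"

    def render(self):
        """Format the claim in the CLAIM / EVIDENCE / STATUS layout."""
        return (
            f"CLAIM:    {self.text}\n"
            f"EVIDENCE: {self.evidence or '(none)'}\n"
            f"STATUS:   {self.status}"
        )
```

Deriving the status from the evidence fields means a claim cannot be marked VERIFIED by simply writing the word.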

Verification Commands by Claim Type

Claim         | Verify With
--------------|--------------------------------------------------
Created file  | ls -la {path} && head -20 {path}
Modified file | grep -n '{expected_content}' {path}
Git commit    | git log --oneline -3
Git push      | git log --oneline origin/{branch} -3
Ran tests     | Show actual test output (pass AND fail counts)
API response  | Show raw response body
Data analysis | Show wc -l of source + sample rows
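
For harnesses that verify claims from Python rather than a shell, the file and git checks could look roughly like this. The function names and return shapes are assumptions; note that `verify_modified_file` does a plain substring match, whereas `grep` interprets its pattern as a regular expression.

```python
import subprocess
from pathlib import Path


def verify_created_file(path, preview_lines=20):
    """Python stand-in for `ls -la {path} && head -20 {path}`."""
    p = Path(path)
    if not p.is_file():
        return {"verified": False, "evidence": f"{path}: no such file"}
    lines = p.read_text(encoding="utf-8", errors="replace").splitlines()
    evidence = {"size_bytes": p.stat().st_size, "head": lines[:preview_lines]}
    return {"verified": True, "evidence": evidence}


def verify_modified_file(path, expected_content):
    """Python stand-in for `grep -n '{expected_content}' {path}` (substring match)."""
    p = Path(path)
    if not p.is_file():
        return {"verified": False, "evidence": []}
    lines = p.read_text(encoding="utf-8", errors="replace").splitlines()
    hits = [(n, line) for n, line in enumerate(lines, start=1) if expected_content in line]
    return {"verified": bool(hits), "evidence": hits}


def recent_commits(repo_dir, ref="HEAD", count=3):
    """Run `git log --oneline -<count> <ref>` and return its raw output.

    Pass ref="origin/<branch>" to check a push. The caller still has to
    compare the returned log lines against the claimed commit.
    """
    result = subprocess.run(
        ["git", "log", "--oneline", f"-{count}", ref],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout
```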

Red Flags (claim likely fabricated)

  • Claim references a file but no read/exec tool was called
  • Suspiciously precise figures (187 trades, +$126.50) with no cited source
  • "All tests passed" with no test output shown
  • Multiple consecutive successes with zero errors
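
These red flags lend themselves to a simple rule-based check. The sketch below assumes a hypothetical `report` dict that a harness would fill in per agent report; the key names are invented for illustration, and the streak threshold of 8 is borrowed from the Layer 3 zero-error signal because no number is given here.

```python
def red_flags(report, min_streak=8):
    """Return the red flags raised by one agent report.

    `report` is a hypothetical dict with optional keys:
      text             - the agent's message
      referenced_files - files the message mentions
      tool_calls       - names of tools actually called (e.g. "read", "exec")
      figures          - quantitative results quoted in the message
      source           - where those figures came from
      test_output      - raw test runner output
      recent_outcomes  - list of "success" / "error" for recent steps
    """
    flags = []
    tools = set(report.get("tool_calls", []))
    if report.get("referenced_files") and not tools & {"read", "exec"}:
        flags.append("file referenced without a read/exec tool call")
    if report.get("figures") and not report.get("source"):
        flags.append("figures quoted without a source")
    claims_pass = "all tests passed" in report.get("text", "").lower()
    if claims_pass and not report.get("test_output"):
        flags.append('"all tests passed" without test output')
    tail = report.get("recent_outcomes", [])[-min_streak:]
    if len(tail) == min_streak and all(o == "success" for o in tail):
        flags.append("consecutive successes with zero errors")
    return flags
```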

Cost: ~50 tokens per claim. One exec call per physical claim.

Layer 2: Cross-Model Audit (Verification)

Spawn a second agent (different model) to independently verify claims.

When to Use

  • Critical outputs: financial reports, deployment decisions, data analysis
  • When L1 evidence exists but numbers need independent validation
  • After any task where the agent reported unusually perfect results

How to Run

See references/audit-prompt.md for the spawn template.

Key principles:

  1. Auditor receives ONLY the evidence (files, outputs) — not the original agent's conclusions
  2. Auditor independently extracts facts from evidence and compares to claims
  3. Auditor uses the cheapest model that can do the verification (flash for file checks, sonnet for logic)
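
The first two principles can be sketched as follows. This is not the spawn template from references/audit-prompt.md; the function names, the packet layout, and the CONFIRMED / MISMATCH / UNVERIFIED labels are assumptions, and the row count stands in for whatever facts an auditor would extract.

```python
import csv
import io


def build_audit_packet(evidence_files, tool_outputs):
    """Bundle raw evidence only; the original agent's conclusions are left out."""
    return {"files": dict(evidence_files), "outputs": list(tool_outputs)}


def count_data_rows(csv_text):
    """Example of a fact the auditor extracts itself: CSV rows minus the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return max(len(rows) - 1, 0)


def compare_claims(claims, audited_facts):
    """Compare the agent's claimed values with independently extracted facts."""
    findings = {}
    for key, claimed in claims.items():
        if key not in audited_facts:
            findings[key] = "UNVERIFIED"   # auditor found no evidence for this claim
        elif audited_facts[key] == claimed:
            findings[key] = "CONFIRMED"
        else:
            findings[key] = "MISMATCH"
    return findings
```

Keeping `claims` out of `build_audit_packet` is the point of the design: the comparison happens only after the auditor has produced its own facts.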

Cost: 1 subagent spawn. Use flash/gemini for simple checks (~$0.001). Reserve sonnet/opus for complex logic verification.

Layer 3: Drift Detection (Monitoring)

Monitor long-running agent tasks for hallucination patterns.

When to Use

  • Tasks expected to take >15 minutes
  • Agent is working autonomously (coding agent, research agent)
  • High-stakes tasks where undetected fabrication causes real damage

Setup

See references/drift-monitor.md for implementation.

Core signals:

  • Claim/Tool Ratio: If claims > 3× tool calls → alert
  • Zero-Error Streak: 8+ consecutive "successes" with 0 errors → suspicious
  • Phantom References: Agent references files/branches never created → critical alert
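
The three signals can be computed from an ordered event log, as in this sketch. It is not the implementation in references/drift-monitor.md: the event schema (`type`, `outcome`, `refs`, `created`) and the `preexisting` parameter are assumptions, while the thresholds 3 and 8 come from the signals above.

```python
def drift_alerts(events, preexisting=(), ratio_limit=3, streak_limit=8):
    """Scan an ordered event log and return (severity, message) alerts.

    Hypothetical event schema:
      {"type": "tool_call", "created": [names of files/branches it created]}
      {"type": "claim", "outcome": "success" | "error", "refs": [names mentioned]}
    `preexisting` lists files/branches known to exist before the session.
    """
    known = set(preexisting)
    n_claims = n_tools = 0
    streak = longest = 0
    phantoms = []
    for event in events:
        if event["type"] == "tool_call":
            n_tools += 1
            known.update(event.get("created", []))
        elif event["type"] == "claim":
            n_claims += 1
            if event.get("outcome") == "success":
                streak += 1
                longest = max(longest, streak)
            else:
                streak = 0
            # A reference is phantom if nothing created it before this claim.
            phantoms.extend(r for r in event.get("refs", []) if r not in known)

    alerts = []
    if n_claims > ratio_limit * n_tools:
        alerts.append(("alert", f"claim/tool ratio: {n_claims} claims vs {n_tools} tool calls"))
    if longest >= streak_limit:
        alerts.append(("suspicious", f"zero-error streak of {longest}"))
    if phantoms:
        alerts.append(("critical", f"phantom references: {sorted(set(phantoms))}"))
    return alerts
```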

Cost: Periodic check via sessions_history. No extra model calls unless alert triggers.

Choosing Layers

Scenario                      | Recommended
------------------------------|----------------
Quick file creation           | L1 only
Data report from CSV          | L0 + L1
Multi-step coding task        | L0 + L1 + L2
Autonomous long-running agent | All four layers
Routine conversation          | None needed

Integration with Other Skills

  • War Room: Add L1 verification to each agent's output (verify cited data)
  • Coding agents: Wrap with L3 drift monitor for long sessions
  • Any task with sessions_spawn: Add L2 audit as a final verification step

References

Version History

1 version total

  • v1.0.0 (current)
    2026-05-12 06:12, security scan: safe
