Overview

Hallucination Guard

A four-layer defense against agent fabrication. The layers are independent: use any one alone or combine several.

When Hallucinations Happen

Highest risk conditions (apply more layers when these are present):

Extended sessions (>50 turns or >30 min of continuous work)
Tasks involving file creation, code, git, or data analysis
Agent reporting quantitative results (numbers, metrics, PnL)
Multiple sequential "successes" with no errors or retries

Layer 0: Context Hygiene (Prevention)

Reduce hallucination probability before it starts.

For long tasks (>10 steps):

Break into segments of ≤8 steps each
Between segments: flush working state to a file, reload from file (not from in-context memory)
Each segment starts with read of the state file — never trust carried-over context for facts
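The segment discipline above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function names, the JSON state format, and the file name `state.json` are all assumptions made for the example.

```python
import json
from pathlib import Path


def split_into_segments(steps: list[str], max_len: int = 8) -> list[list[str]]:
    """Chunk a long task into segments of at most max_len steps."""
    return [steps[i:i + max_len] for i in range(0, len(steps), max_len)]


def flush_state(state: dict, path: Path) -> None:
    """Write the working state to disk at the end of a segment."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")
    # Rename over the old file so a crash is unlikely to leave a half-written state.
    tmp.replace(path)


def load_state(path: Path) -> dict:
    """Read the state back from disk at the start of the next segment."""
    return json.loads(path.read_text(encoding="utf-8"))
```

A driver loop would call `load_state(Path("state.json"))` at the top of each segment and `flush_state(...)` at the bottom, so facts always come from the file rather than from carried-over context.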

For data-intensive tasks:

Load source data from files at point of use, not from earlier context
If a number was mentioned 20+ turns ago, re-read the source before citing it
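One way to make the 20-turn rule mechanical is to record the turn at which each fact was last read from its source. The sketch below is illustrative only; `CachedFact` and `needs_reread` are hypothetical names, and the threshold of 20 is taken from the rule above.

```python
from dataclasses import dataclass


@dataclass
class CachedFact:
    value: str        # the number or string as last read
    source_path: str  # file the value came from
    turn_read: int    # conversation turn at which it was read


def needs_reread(fact: CachedFact, current_turn: int, max_age: int = 20) -> bool:
    """True when the fact is old enough that the source should be re-read before citing it."""
    return current_turn - fact.turn_read >= max_age
```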

Cost: Zero. This is a workflow discipline, not an API call.

Layer 1: Claim-Evidence Protocol (Detection)

Every agent claim of physical action must include tool-verified evidence.

The Rule

CLAIM:    "I created/modified/committed X"
EVIDENCE: Tool output proving X exists and matches the claim
STATUS:   VERIFIED (evidence confirms) or UNVERIFIED (no evidence yet)
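The claim/evidence/status triple can be modeled as a small record. The following is a sketch under stated assumptions: the class names are hypothetical, and "evidence confirms" is approximated by a zero exit code plus an expected substring in the tool output, which is a simplification rather than the skill's own definition. Because the rule defines only two statuses, contradicting evidence leaves the claim UNVERIFIED.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    command: str    # the tool call that produced the output
    exit_code: int  # 0 means the command itself succeeded
    output: str     # raw tool output, unedited


@dataclass
class ClaimRecord:
    claim: str                          # e.g. "I created report.md"
    expected: str                       # substring the evidence must contain
    evidence: Optional[Evidence] = None

    @property
    def status(self) -> str:
        ev = self.evidence
        if ev is None:
            return "UNVERIFIED"  # no tool output attached yet
        confirmed = ev.exit_code == 0 and self.expected in ev.output
        return "VERIFIED" if confirmed else "UNVERIFIED"
```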

Verification Commands by Claim Type

| Claim | Verify With |
|-------|-------------|
| Created file | `ls -la {path} && head -20 {path}` |
| Modified file | `grep -n '{expected_content}' {path}` |
| Git commit | `git log --oneline -3` |
| Git push | `git log --oneline origin/{branch} -3` |
| Ran tests | Show actual test output (pass AND fail counts) |
| API response | Show raw response body |
| Data analysis | Show `wc -l` of source + sample rows |
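The shell commands in the table can be generated from templates so that paths and search strings are always quoted. This is a minimal sketch: the dictionary keys and function names are assumptions for illustration, and only the four rows that map to a single command are covered.

```python
import shlex
import subprocess

# Templates mirror the table rows; placeholders are filled with shell-quoted values.
VERIFY_TEMPLATES = {
    "created_file": "ls -la {path} && head -20 {path}",
    "modified_file": "grep -n {expected_content} {path}",
    "git_commit": "git log --oneline -3",
    "git_push": "git log --oneline origin/{branch} -3",
}


def build_command(claim_type: str, **params: str) -> str:
    """Render the verification command for a claim type, quoting every parameter."""
    quoted = {name: shlex.quote(value) for name, value in params.items()}
    return VERIFY_TEMPLATES[claim_type].format(**quoted)


def run_verification(claim_type: str, **params: str) -> subprocess.CompletedProcess:
    """Run the command and capture raw output to attach as evidence."""
    return subprocess.run(
        build_command(claim_type, **params),
        shell=True,  # needed for the '&&' in the created_file template
        capture_output=True,
        text=True,
    )
```

For example, `build_command("created_file", path="my file.txt")` yields `ls -la 'my file.txt' && head -20 'my file.txt'`. The returned `stdout` and `returncode` are what should be shown alongside the claim.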

Red Flags (claim likely fabricated)

Claim references a file but no read/exec tool was called
Suspiciously precise figures in data (187 trades, +$126.50) with no cited source
"All tests passed" with no test output shown
Multiple consecutive successes with zero errors

Cost: ~50 tokens per claim. One exec call per physical claim.

Layer 2: Cross-Model Audit (Verification)

Spawn a second agent (different model) to independently verify claims.

When to Use

Critical outputs: financial reports, deployment decisions, data analysis
When L1 evidence exists but numbers need independent validation
After any task where the agent reported unusually perfect results

How to Run

See references/audit-prompt.md for the spawn template.

Key principles:

Auditor receives ONLY the evidence (files, outputs) — not the original agent's conclusions
Auditor independently extracts facts from evidence and compares to claims
Auditor uses the cheapest model that can do the verification (flash for file checks, sonnet for logic)
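The first two principles amount to an information barrier: the auditor gets raw evidence, and the comparison against the original claims happens afterwards. The sketch below illustrates that separation only. It does not reproduce the template in references/audit-prompt.md; all names are hypothetical, and it assumes the orchestrator (not the auditor) performs the final comparison.

```python
from dataclasses import dataclass, field


@dataclass
class AgentReport:
    claimed_facts: dict[str, str]  # e.g. {"trade_count": "187"}
    conclusions: str               # the original agent's narrative
    evidence: dict[str, str] = field(default_factory=dict)  # name -> raw content


def build_audit_input(report: AgentReport) -> dict:
    """Return only what the auditor may see: raw evidence, no claims or conclusions."""
    return {"evidence": dict(report.evidence)}


def compare(claimed: dict[str, str], extracted: dict[str, str]) -> dict[str, str]:
    """Label each claimed fact against what the auditor extracted independently."""
    result = {}
    for key, value in claimed.items():
        if key not in extracted:
            result[key] = "missing"
        elif extracted[key] == value:
            result[key] = "match"
        else:
            result[key] = "mismatch"
    return result
```

Any `mismatch` or `missing` label is a reason to send the claim back to Layer 1 for fresh evidence.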

Cost: 1 subagent spawn. Use flash/gemini for simple checks (~$0.001). Reserve sonnet/opus for complex logic verification.

Layer 3: Drift Detection (Monitoring)

Monitor long-running agent tasks for hallucination patterns.

When to Use

Tasks expected to take >15 minutes
Agent is working autonomously (coding agent, research agent)
High-stakes tasks where undetected fabrication causes real damage

Setup

See references/drift-monitor.md for implementation.

Core signals:

Claim/Tool Ratio: If claims > 3× tool calls → alert
Zero-Error Streak: 8+ consecutive "successes" with 0 errors → suspicious
Phantom References: Agent references files/branches never created → critical alert
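The three core signals reduce to simple checks over session counters. This is a sketch with assumed inputs, not the implementation in references/drift-monitor.md: `SessionStats` and its fields are hypothetical, the streak is measured as the current run of trailing successes, and `known` is assumed to contain every file or branch the agent created or observed through a tool call.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    claims: int            # number of action claims made so far
    tool_calls: int        # number of tool invocations so far
    outcomes: list[str]    # chronological, each "success" or "error"
    referenced: set[str]   # files/branches the agent has mentioned
    known: set[str]        # files/branches created or seen via tools


def drift_alerts(s: SessionStats, ratio_limit: float = 3.0, streak_limit: int = 8) -> list[str]:
    """Return the names of the drift signals that currently fire."""
    alerts = []
    if s.claims > ratio_limit * s.tool_calls:
        alerts.append("claim_tool_ratio")

    # Count successes backwards from the most recent outcome.
    streak = 0
    for outcome in reversed(s.outcomes):
        if outcome != "success":
            break
        streak += 1
    if streak >= streak_limit:
        alerts.append("zero_error_streak")

    if s.referenced - s.known:
        alerts.append("phantom_reference")
    return alerts
```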

Cost: Periodic check via sessions_history. No extra model calls unless alert triggers.

Choosing Layers

| Scenario | Recommended |
|----------|-------------|
| Quick file creation | L1 only |
| Data report from CSV | L0 + L1 |
| Multi-step coding task | L0 + L1 + L2 |
| Autonomous long-running agent | All four layers |
| Routine conversation | None needed |

Integration with Other Skills

War Room: Add L1 verification to each agent's output (verify cited data)
Coding agents: Wrap with L3 drift monitor for long sessions
Any task with sessions_spawn: Add L2 audit as a final verification step

References

references/audit-prompt.md — Cross-model audit spawn template
references/drift-monitor.md — Drift detection implementation
references/taxonomy.md — Hallucination types with real-world examples

Version History

1 version in total

v1.0.0 (current)

2026-05-12 06:12, security status: safe

Security Check

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
