> Universal token audit & optimization framework for OpenClaw agents.
> Based on real-world practice (2026-05-04).
uses expensive ones. Don't mix the two.
noise to an agent. Strip them.
three calls for 3+3+4 results. Combine.
registered, every past message injected — all have a token price.
nothing. Configure active hours.
After each full execution, write a report (token-audit-report-YYYY-MM-DD.md)
containing: before/after comparison table, estimated weekly savings per change,
items deferred and why, recommended next step.
Not every audit needs the full Phase 1-5 treatment. Use these shortcuts
based on your goal:
Trigger: "快速 token 审计"
Trigger: "快速省 token"
Go straight to these high-impact, zero-risk techniques:
Apply directly without audit preamble.
| Indicator | Action |
|---|---|
| ----------- | -------- |
| First time using this skill | Full Phase 1-5 — establish baseline |
| Cron tasks changed significantly | Phase 1-3 — re-discover + re-optimize |
| New provider/API added | Phase 1B + 3 — check config + optimize |
| >30 days since last audit | Phase 1-2 — measure drift, then re-optimize |
| Just want a quick check | Quick Wins or Express Audit |
Read your cron/scheduled task configuration (e.g. ~/.openclaw/cron/jobs.json).
For each task record:
namemodel (or "default" if unset)message / prompt length in charsschedule frequency (daily / weekly / other)delivery.mode (announce / none)sessionTarget (isolated / main)Inspect your gateway config (e.g. openclaw.json):
agents.defaults.heartbeat.* — interval, active hours, isolated session,light context flag
agents.defaults.compaction.mode — message retention aggressivenessagents.list[].tools.profile — full, coding, or customagents.list[].model — per-agent model overrideList every file that is injected at session start (typically files in the
workspace root directory). Measure each in chars and estimate token cost
(~3 chars per token for CJK-heavy text, ~4 for English-heavy).
If LCM (Lossless Context Management) is active, note the number and average
size of compacted summary blocks injected per turn.
If tool schemas are accessible, estimate total schema chars:
(count of registered tools × average schema size in chars).
Categorize all available models into three tiers based on capability and cost:
Map each task from 1A to its current model tier.
> ⚠️ Checkpoint: Before moving to Phase 2, present your Phase 1 findings
> (task inventory, file sizes, model tier map) to the user.
> Confirm that the inventory is complete and the measurements are correct.
> This prevents optimizing the wrong things.
Score each finding from Phase 1 along three independent dimensions:
| Dimension | Scale | Assessment |
|---|---|---|
| ----------- | ------- | ------------ |
| Token Impact 🎯 | High / Med / Low | Tokens per occurrence × occurrences per period |
| Risk ⚠️ | Safe / Moderate / High | Can you undo it? Does it affect core function? |
| Effort 🔧 | Easy / Med / Hard | Single config change? Multi-file edit? Needs research? |
Compute a relative priority for each finding by inverting Risk and Effort:
Priority = ImpactWeight × (1 / RiskWeight) × (1 / EffortWeight)
Where each dimension maps to a simple numeric weight:
Focus on items scoring ≥ 1.5 first. Skip items < 1.0 unless they are
trivially easy (effort=1) and safe (risk=1).
These patterns tend to score high across most deployments:
| Pattern | Typical Impact | Typical Risk | Typical Effort |
|---|---|---|---|
| --------- | --------------- | ------------- | ---------------- |
| Overly verbose task prompts | High | Safe | Easy |
| Heavy models on simple tasks | High | Safe | Easy |
| No active hours on heartbeat | Med-High | Safe | Easy |
| Duplicated content across bootstrap files | Med-High | Safe | Easy-Med |
| Full tool profile on task-specific agents | High | Moderate | Easy |
| Idle-time session not configured | Med | Safe | Easy |
| Outdated tool/plugin configs still loaded | Low-Med | Safe | Easy |
> ⚠️ Checkpoint: Show your top-3 priority items to the user.
> Confirm direction before starting optimization.
> If the highest-score items seem wrong, revisit Phase 1 measurements.
> ⚠️ User confirmation gate: Techniques marked Moderate or High risk
> involve config changes, profile switches, or task merging. Before applying them,
> present the proposed change using this template and get explicit approval:
>
> ```
> ## Proposed Change
> Technique: [category/technique name]
> Target: [file/config path]
> Before: [current state, chars/tokens if measurable]
> After: [proposed state, estimated savings]
> Risk: [Moderate/High]
> Rollback: [how to undo]
> ```
>
> Techniques marked Safe can be applied directly.
Each category below contains a set of techniques. Apply them in priority
order from Phase 2 — start with the highest-score items first, regardless
of which category they fall into.
If a technique causes a problem:
or re-create it from the original prompt.
pre-optimization prompt versions in a prompts/backup/ directory).
Match your Phase 2 findings to the best starting category:
| Finding | Start With |
|---|---|
| --------- | ----------- |
| Verbose task prompts (background context, philosophy) | A Prompt Simplicity |
| Heavy models on simple automation tasks | B Model Tiering |
| Bootstrap files >2K chars each, duplicated content | C Context Slimming |
| Full tool profile, rarely-used tools registered | D Tool Profile Optimization |
| Verbose agent output, too many turns per task | E Output Discipline |
| No active hours, co-located tasks running separately | F Session Lifecycle |
| Repeated system prompts without caching structure | G Provider-Side Caching |
| Agent retries failed approaches instead of switching | H Behavioral Discipline |
| Simple/complex tasks both use premium model | J Intelligent Model Routing |
If you're not sure which category to start with, follow this tree from top
to bottom — the first match tells you your likely best starting category:
1. Is the main session slow or expensive?
→ Check B (tiering) and J (routing)
→ Also check D (too many tools loaded?)
2. Are cron jobs consuming more than expected?
→ Check A (prompts too wordy?), then B (wrong model?)
→ If F (same-tier jobs not batched?)
3. Is context getting cut off mid-task?
→ Check C (bootstrap too large?) → I (progressive disclosure?)
→ Then J3 (incremental delivery?)
4. Are agent outputs too verbose?
→ Check E (output discipline) → H (behavioral discipline)
5. Is the same heavy prompt repeated across tasks?
→ Check G (provider-side caching: fixed prefix first?)
6. Are you seeing the same errors repeatedly?
→ Check H2 (fail once, switch) → H4 (fix root cause)
7. Default (no obvious symptom):
Run Phase 1 from scratch → Phase 2 will tell you where to go
> Pro tip: Start with G (Provider-Side Caching) if you use DeepSeek.
> Cache pricing is 0.83% of uncached — fixing prefix structure alone
> can cut token costs by 90%+.
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| A1 Strip preamble | Remove background/rationale paragraphs from task prompts. Keep only: trigger, action, output format.
Before: "你是系统监控助手。每天检查服务器状态:CPU使用率>80%告警、内存>90%告警、磁盘>85%告警、SSL证书<30天告警。每个告警按严重程度分别处理:严重→立即通知值班、一般→发运维邮件、提示→记录日志。"
After: "系统监控。检查:CPU(>80%) Mem(>90%) Disk(>85%) SSL(<30d)。告警:严重→立即、一般→邮件、提示→日志。" (360→110 chars, -69%) | Safe |
| A2 Bullet points > prose | Replace multi-sentence descriptions with keyword checklists. | Safe |
|---|---|---|
| A3 Constrain output | Add "Answer concisely in ≤3 lines" or equivalent to reduce generated tokens. | Safe |
| A4 Remove redundancy | Delete "What NOT to do" sections — proper instructions make negatives implicit. | Safe |
| A5 Reference > inline | Replace full instructions for sub-tasks with file references ("See X.md") when the referenced file is always loaded. | Safe |
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| B1 Right-size each task | Map every automated task to the cheapest model that can do it adequately. Test borderline cases. | Safe |
| B2 Define tier boundaries | Document which model(s) belong to each tier so new tasks are assigned correctly. | Safe |
| B3 Batch same-tier runs | Schedule same-tier tasks back-to-back to reuse the same session (single context load). | Moderate |
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| C1 Measure every boot file | List all files loaded at session start and identify those > 2K chars for potential trimming. | Safe |
| C2 Cross-reference dedup | When the same content appears in 2+ files (e.g. "Core Principles" in SOUL.md and IDENTITY.md), keep it in one authoritative file and replace the others with a 详见 reference. | Safe |
| C3 Archive aged-out content | Move old diary entries, superseded milestones, and historical promoted entries to a dedicated archive directory. | Safe |
| C4 Trim to one-liner | Convert verbose descriptions to single-line summaries.
Before: "This project's coding conventions were established after three code reviews revealed inconsistent patterns: use 2-space indent for HTML/CSS, 4-space for Python, tabs for Go. Prefix private methods with underscore. No Hungarian notation. Import order: stdlib, third-party, local."
After: "Coding conventions (see CONTRIBUTING.md) — 6 rules, numbered."
Actionable instructions stay; background context goes. | Safe |
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| D1 Size your tool schema | Count all registered tools and estimate total schema chars. This is typically the single largest per-turn overhead. | Safe (measure only) |
| D2 Switch profile per agent | Use "coding" profile for sub-agents/cron jobs (excludes browser, canvas, media generation, feishu tools). Use "full" only where those tools are actually needed. | Moderate (test on sub-agents first) |
| D3 Disable unused tools | If you have disabled skills or orphaned plugin tools still registering schemas, disable or remove them from the registry. Check skills.entries and plugins.load.paths. | Safe |
| D4 Create custom profile | If neither "full" nor "coding" fits, define a custom profile with exactly the 15-25 tools your use-case needs. Requires config reload. | High |
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| E1 No operation narration | Remove "I'll...", "Let me check..." patterns. Do the action directly. | Safe (behavioral) |
| E2 Lead with conclusion | Put the answer first. Add explanation only when needed. | Safe (behavioral) |
| E3 Batch turns | Read → plan → apply all changes in as few turns as possible, instead of read→think→edit→think→verify per-item. Each extra turn adds LCM context overhead. | Safe (behavioral) |
| E4 Sub-agent conciseness | When spawning sub-agents, specify a concise return format. Their full output is injected into context if returned. | Safe |
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| F1 Set active hours | Configure heartbeat.activeHours so no work runs during idle time (overnight, weekends). | Safe |
| F2 Isolated sessions | Set heartbeat.isolatedSession: true so periodic checks don't accumulate in the main session. | Safe |
| F3 Light context | Set heartbeat.lightContext: true to skip loading all bootstrap files — only HEARTBEAT.md is injected. | Safe |
| F4 Merge co-located tasks | If two cron jobs run within minutes of each other (e.g. both at 23:xx), merge them into one session with a combined prompt. Copy both prompts into one job's message field separated by a blank line, then remove the later job. Saves one full startup context per day. | Moderate |
| F5 Merge example | Before: Job A at 23:00 (System health check), Job B at 23:10 (Log cleanup). After: Single job at 23:00 with prompt "Do A then B.~A: ...~B: ..." | Moderate |
| F6 Configure queue | If the platform supports message queue settings (debounce, collect), tune them to prevent rapid-turn accumulation during tool execution. | Safe |
> Impact is 10× any other category. DeepSeek V4 Pro cached price is 0.83% of
> uncached. Cache hit rates of 91-96% are achievable with proper prompt structure.
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| G1 Fixed prefix first | Design all prompts as [static prefix] + [dynamic suffix]. Static prefix includes system instructions, bootstrap summary, and tool schemas. Dynamic suffix includes runtime instruction. This maximizes KV cache hits on the provider side.
Wrong: "Analyze this code for memory leaks...你是代码审查助手,审查规则如下:..."
Right: "你是代码审查助手,审查规则如下:...现在分析这段代码的内存泄漏:..." | Safe |
| G2 Session contiguity | Don't insert unrelated messages between consecutive calls to the same model — this breaks the KV cache prefix. Batch related calls into a single turn instead. | Safe |
|---|---|---|
| G3 Monitor cache rate | Check provider dashboards for cache hit rate. If <80%, your prefix structure likely has variability. Fix it. | Safe |
| G4 Route to best caching provider | Different providers have wildly different cached prices. DeepSeek V4 Pro: 0.83% of uncached. MiniMax: ~20%. Route routine tasks to the provider with the best cache economics. | Moderate |
> These are zero-config, zero-cost techniques. The savings come from how you use
> the system, not how it's configured.
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| H1 Default to working path | Use known-working tools before alternatives. Don't retry tools known to be broken in the current deployment — each retry is a wasted tool call + error response.
Bad: web_search (broken) → error → web_search again → error → baidu-search → works
Good: baidu-search → works (first attempt) | Safe |
| H2 Fail once, switch | If a method fails, switch immediately to a known alternative. Don't retry the same approach with slightly different parameters. Each retry costs full tool-call tokens. | Safe |
|---|---|---|
| H3 Batch > Poll | Gather all data before acting instead of incrementally. One exec or read call that returns 10 results costs less than 5 separate calls returning 2 each. | Safe |
| H4 Fix root cause | If a tool works inconsistently due to a known config issue (API key expired, wrong provider), fix the config. Working around it each time costs more in accumulated failed calls. | Safe |
> Context Engineering 是 2026 年从 Prompt Engineering 演进出的上层方法论。
> 核心原则:渐进式披露 (Progressive Disclosure) — 仅在任务需要时加载特定模块。
> 北京大学论文《Meta Context Engineering via Agentic Skill Evolution》实测:
> Token 消耗降低 60%,任务成功率提升 45%。
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| I1 Progressive disclosure | 按任务类型分级加载系统能力描述。不需要的技能说明不加载。与 token-saver 的 C1-C4 互补:C 系列负责文件级裁剪,I1 负责能力级裁剪。
Wrong: Agent 启动时加载全部技能描述
Right: Agent 启动时按任务类型加载对应技能(搜索类任务只加载搜索相关技能描述) | Safe |
| I2 Semantic context scoring | 用语义重要性评分代替固定滑动窗口。计算历史交互与当前 query 的语义相似度,只保留 top-3 + 最近 1 轮。
推荐实现:Sentence-BERT 计算余弦相似度,按得分排序裁剪到 token budget。
参考开源项目:Agent Skills for Context Engineering | Safe |
| I3 Fixed prefix pattern | 设计所有 prompt 为「static prefix + dynamic suffix」。静态前缀包括系统指令、bootstrap 摘要、工具 schema 描述。动态后缀包括本次运行指令。最大化 provider 端 KV cache 命中。
Wrong: 每次运行时先写任务指令再补系统描述
Right: 系统描述在前,任务指令在后 | Safe |
| I4 Tool call circuit breaking | 在调用高成本外部工具(SQL 查询、图像生成、文件写入)前增加预检代理。校验参数合法性 + 资源预估。超限或非法直接拒绝,不触发实际调用。
适用场景:web_fetch 超长 URL、exec 高风险命令、文件写入大文件。 | Safe |
| I5 Cost-aware context pruning | 按 taskType 和 budget tokens 动态决定上下文精度。简单任务用精简 prompt,复杂任务才加载完整能力描述。
实现:在 cron payload 中嵌入 task_type 字段,根据该字段调度上下文模板。 | Safe |
> 基于 OpenSquilla(开源 Token 优化引擎)方案。
> Core insight: 路由决策本身不消耗 Token — 在本地判定任务复杂度后决定走哪个模型。
> 与 B 系列(Model Tiering)的区别:B 说"每个任务应该固定分配一个 tier",
> J 说"同一个任务的每一次调用动态判断走哪个 tier"。
| Technique | Description | Risk |
|---|---|---|
| ----------- | ------------- | ------ |
| J1 Dynamic inbound grading | 对每次入站请求做轻量复杂度判定(prompt length + expected reasoning depth + tool count),自动路由到对应 tier 的模型。
判定规则示例:
实现:在 cron/子代理 payload 中嵌入 model 字段,或使用中间路由代理做分类。
Before: 所有每日 cron 统一用 deepseek-v4-pro,包括简单版本检查
After: 版本检查→sensenova-6.7-flash-lite,领域探针→deepseek-v4-flash,
深度分析→deepseek-v4-pro。按任务特征分级,非一刀切。 | Safe |
| J2 Local routing decision | 路由判定在本地完成(不需要发 request 到远程判断路由)。
方案:在任务 payload 中预置 routing_hint 字段(economy/standard/premium),
由编排层基于任务特征(prompt length、任务类型、预期工具调用数)决定。
Bad: 每次调用先发一个轻量请求问"这个走哪个模型?"(浪费 Token)
Good: 任务描述本身包含路由线索,编排层本地匹配 | Safe |
| J3 Incremental context delivery | OpenSquilla 核心方案之一:先发送最小必要上下文,不足时增量补充。
与 C 系列(Context Slimming)区别:C 是文件级裁剪(减少启动加载量),
J3 是调用级递进(先发主干,模型需要时再发细节)。
实现:prompt 中先给出高度压缩的摘要版本,后面用"如需详情请见"标记展开点。
结合 OpenClaw 的 lossless-claw 机制:LCM 摘要本身已经是一个渐进式实现。 | Safe |
| J4 Cache-aware routing | 统计各 provider 的 KV cache 命中率,优先路由到命中率高的 provider。
当前已知:DeepSeek V4 Pro cache price = 0.83% of uncached。
如果命中了缓存的 prompt prefix(含 bootstrap + 技能描述),成本降低 99%+。
实现:在路由时优先选择上次使用同一模型+同一 prefix 结构的 provider。 | Moderate |
| J5 Fallback chain | 当分配的路由模型不可用(API 超时/限流/报错),自动降级到备用模型。
与 H2(Fail once, switch)的区别:H2 是行为纪律(失败了就换方法),
J5 是配置级的自动 failover 链。
示例 Fallback Chain:
1️⃣ deepseek-v4-pro → 2️⃣ deepseek-v4-flash → 3️⃣ sensenova-6.7-flash-lite
实现:在 cron/子代理 payload 中使用 fallbacks 字段配置降级顺序。
Before: cron job 用 minimax-m2.7,卡死后无降级 → 任务永远失败
After: 同一 cron 配置 fallbacks: ["deepseek-v4-flash", "deepseek-v4-pro"] →
minimax 卡死后自动切到 deepseek-v4-flash 执行 | Safe |
> 权衡:智能路由的收益上限取决于任务复杂度分布。如果 80% 的任务是简单查询,
> 智能路由的 Token 节省可达 60-80%。如果大部分任务已经是标准/经济 tier,
> 额外收益有限。建议做一次任务复杂度分布扫描后再选择启用哪些 J 技术。
Before/after comparison of all modified prompts and files. Include total
chars and estimated tokens saved.
After editing JSON configuration files, validate:
python3 -c "import json; json.load(open('<config-path>')); print('OK')"
cron action=runs or nextscheduled trigger)
Write token-audit-report-YYYY-MM-DD.md summarizing:
Log each optimization cycle in results.tsv (see skill directory for
format reference). This creates an audit trail for the quarterly deep audit (5B).
Optionally create a weekly cron (cheapest available model) that checks
prompt lengths haven't crept back:
{
"name": "token-watch-weekly",
"schedule": { "kind": "cron", "expr": "0 10 * * 1", "tz": "Asia/Shanghai" },
"payload": {
"kind": "agentTurn",
"model": "<cheapest-model>",
"message": "Check all cron prompt lengths. Flag any that grew >20% since last baseline.",
"timeoutSeconds": 120
},
"sessionTarget": "isolated",
"delivery": { "mode": "none" }
}
Run the full Phase 1-4 cycle every quarter using the cheapest available
model. Compare results against previous reports to spot regrowth trends.
The quarterly audit MUST include:
When token consumption suddenly spikes, it's usually one of three causes.
Know which one:
| Symptom | Likely Cause | Fix |
|---|---|---|
| --------- | ------------- | ----- |
| A specific cron doubled in token count | Prompt crept back (someone added preamble) | A1 Strip preamble |
| All crons increased proportionally | Model changed (e.g. dev switched back to pro from flash) | B1 Right-size, check default model |
| Session-level spikes, not cron | Tool profile expanded / new plugin registered | D1 Size tool schemas, check profile |
| Intermittent spikes | Prefix changed → KV cache missed | G1 Fixed prefix, check for variation |
If you have access to provider dashboards:
Some configuration paths require a gateway restart to take effect:
agents.defaults.heartbeat.* (edit config file + restart)agents.list[].tools.profilegateway., auth.plugins.* — certain sub-fieldsThese core mechanisms must be preserved even in an aggressive token budget:
openclaw gateway config CLIagents.list[].tools.profile in openclaw.jsontest-prompts.json in skill directoryresults.tsv in skill directoryThis section is populated by the first execution of the Token Saver in a
specific deployment. Replace the example values below with real ones.
| Item | Example Path |
|---|---|
| ------ | ------------- |
| Cron jobs | ~/.openclaw/cron/jobs.json |
| Gateway config | ~/.openclaw/openclaw.json |
| Workspace root | ~/.openclaw/workspace/ |
| Bootstrap files | AGENTS.md, SOUL.md, USER.md, MEMORY.md, HEARTBEAT.md, IDENTITY.md, TOOLS.md, STANDING-ORDERS.md |
| File | Initial Size | After First Pass | Reduction | Techniques Used |
|---|---|---|---|---|
| ------ | ------------- | ------------------ | ----------- | ---------------- |
| SOUL.md | 7,034 | 3,521 | -50% | C2 (cross-ref), C4 (one-liner), A2 |
| STANDING-ORDERS.md | 10,960 | 3,816 | -65% | C2 (cross-ref), A4 (remove redundancy) |
| IDENTITY.md | 6,228 | 4,313 | -31% | C2 (dedup with SOUL.md), C4 |
| AGENTS.md | 5,072 | 2,691 | -47% | C2 (ref to STANDING-ORDERS), C4 |
| TOOLS.md | 8,893 | 7,488 | -16% | C4 (remove stale entries) |
| MEMORY.md | 30,224 | 26,420 | -13% | C3 (archive promoted entries) |
| Total | 68,411 | 48,249 | -29% | — |
Per-session token savings from bootstrap compression: ~6,720 tokens.
| File Type | Typical Savings | Best Technique |
|---|---|---|
| ----------- | ---------------- | ---------------- |
| Program/Protocol (STANDING-ORDERS.md) | 55-65% | A4 (remove boilerplate sections) |
| Guide/Identity (SOUL.md, IDENTITY.md) | 30-50% | C2 (cross-reference dedup) |
| Instructions (AGENTS.md) | 40-50% | C2 (replace lists with file refs) |
| Knowledge base (MEMORY.md) | 10-20% | C3 (archive old entries only) |
| Config/state table (TOOLS.md) | 10-20% | C4 (remove stale entries only) |
| Task | Model Tier | Model |
|---|---|---|
| ------ | ----------- | ------- |
| Version check | Economy | minimax-m2.7 |
| Demand scanning | Standard | deepseek-v4-pro (needs search) |
| Domain probe | Economy | minimax-m2.7 |
| Dreaming (memory integration) | Economy | minimax-m2.7 |
| Doc maintenance | Economy | minimax-m2.7 |
| WaveCap daily expansion | Standard | deepseek-v4-pro (needs reasoning) |
| Weekly review | Premium | deepseek-v4-pro |
| Friday topic selection | Premium | deepseek-v4-pro |
| Main session | Standard | deepseek-v4-flash |
| Item | Reason | Condition to Revisit |
|---|---|---|
| ------ | -------- | --------------------- |
| Tool profile for main agent | High risk (may break unexpected features) | After sub-agent coding profile proven in production for 1 week |
| Cron task merging | Needs user confirmation; may affect reliability | Next token audit cycle |
| Compaction mode change (safeguard→normal) | Needs config reload | When gateway restarted for other reasons |
minimax-m2.7 (economy).
~/.openclaw/.共 2 个版本