Cut token usage without cutting quality. Every technique below is battle-tested in production Claude Code, OpenClaw, and agentic systems.
Target: under 500 tokens for primary config files.
# CLAUDE.md / AGENTS.md template (good — ~150 tokens)
## Rules
- TypeScript strict mode
- Test every new function
- Follow existing patterns
## Key Files
- API routes: src/api/README.md
- DB schema: docs/schema.md
- Style guide: docs/style-guide.md
Principles:
Never load everything upfront. Use a three-tier system:
| Tier | When loaded | Token cost |
|---|---|---|
| ------ | ------------ | ------------ |
| Metadata (name+description) | Always | ~100 words |
| Core instructions (SKILL.md body) | On trigger | <5K words |
| Reference files | On demand | Unlimited |
Implementation patterns:
File pointers over inline content:
# Bad: inline everything
## API Reference
[2000 lines of API docs]
# Good: pointer + on-demand load
## API Reference
See docs/api.md — load only when working on API endpoints.
Domain-split references:
skill/
├── SKILL.md (core workflow only)
└── references/
├── aws.md # load only for AWS tasks
├── gcp.md # load only for GCP tasks
└── azure.md # load only for Azure tasks
Conditional loading via grep patterns:
For large reference files, include search patterns in SKILL.md:
# BigQuery Metrics
See references/metrics.md. Search patterns:
- Revenue queries: grep "revenue|billing" references/metrics.md
- User analytics: grep "cohort|retention|churn" references/metrics.md
When to compact vs clear:
/compact — Context is long but thread is still relevant. Summarizes and restarts from summary./clear — Switching tasks entirely. Wipes everything. Clean slate.Rules:
Tools are context too. Every tool definition loads on every request.
Principles:
head -20 file.log (10 tokens) vs MCP JSON response (1000+ tokens)Tool output compression:
# Bad: full JSON dump
{"status":"success","data":{"items":[...500 lines...],"meta":{"page":1,"total":847}}}
# Good: structured summary
"847 items found. First 5: [names]. See full results in .cache/search.json"
Select relevant sentences, discard the rest. Best for narrative documents.
# Before: 500 tokens
Customer John reported unstable internet for 3 days with video call disruptions.
Support ticket #4521 opened on 2026-04-15. Multiple attempts to reset router failed.
# After (extractive): 30 tokens
John: unstable internet 3 days, video call disruptions, router reset failed.
Keep or discard entire chunks. Best for factual/citation-heavy content. Zero rewrite cost.
# Approach: filter chunks by relevance score, only pass top-k to model
relevant_chunks = [c for c in retrieved if c.score > 0.75][:3]
Uses a small model to remove low-information tokens. Up to 20x compression with <2% quality loss.
llmlingua (Python), LLMLingua2Across context:
Across sessions:
memory/*.md) for cross-session continuity instead of re-explainingAcross tools:
jq to extract specific fields instead of loading full JSONgrep ERROR log.txt | head -5 instead of cat log.txtPrompt caching (provider-level):
Sub-agent architecture:
Main agent (clean context, ~2K tokens)
├── Sub-agent 1: deep research (uses 50K tokens, returns 1K summary)
├── Sub-agent 2: code generation (uses 30K tokens, returns diff)
└── Sub-agent 3: testing (uses 20K tokens, returns pass/fail + details)
Each sub-agent explores extensively but returns only distilled results. Main agent stays lean.
Model-tiering:
OpenClaw:
references/.jq, head, tail, grep to scope output before it enters context.Claude Code:
.claudeignore — Exclude node_modules, build artifacts, lock files, data filescontext-mode MCP plugin — Automatically compresses MCP tool outputs/compact after 20-30 messages or when switching sub-tasks/model to switch tiers mid-session (Haiku for lookups, Sonnet for implementation, Opus for architecture)MAX_THINKING_TOKENS=8000 env var to cap extended thinking budgetGeneral (applies to all agents):
.gitignore pattern for AI: exclude binaries, media, large generated files--no-ask-user / --allow-all flags reduce confirmation round-tripsRun this against any AI project:
Token Audit:
□ System prompt / config files under 500 tokens?
□ Reference docs in separate files, not inline?
□ Tool outputs scoped (jq/head/grep), not raw dumps?
□ Sessions reset between topics?
□ MCP servers limited to active ones only?
□ Few-shot examples: 3 diverse > 10 similar?
□ Sub-agents used for deep exploration work?
□ Model tier matches task complexity?
□ Caching enabled for static prompt prefixes?
□ Deduplication: no repeated instructions across context layers?
□ .claudeignore / .gitignore excluding non-essential files?
□ Extended thinking budget capped for simple tasks?
| Strategy | Effort | Savings | Best For |
|---|---|---|---|
| ---------- | -------- | --------- | ---------- |
| Slash system prompt | Low | 10-30% baseline | Every project |
| Selective loading | Medium | 40-70% per-query | Multi-domain agents |
| Compaction | Low | 50-80% long sessions | Coding agents |
| Efficient tools | Medium | 50-90% MCP usage | Tool-heavy workflows |
| Prompt compression | High | Up to 20x on docs | RAG, research agents |
| Deduplication | Low | 10-25% per session | All agents |
| Caching & sub-agents | High | 30-60% overall | Production systems |
| Model tiering | Low | 3-5x per-query cost | All multi-model setups |
For deeper implementation details — LLMLingua integration code, LangChain compression pipelines, TikToken measurement, sub-agent architecture patterns, semantic deduplication, and dollar savings formulas — see references/compression-deep-dive.md.
共 1 个版本