Reduce your LLM API costs by 20-35% with three proven mechanisms: pre-send token estimation, structured memory extraction, and context compression. Model-agnostic, zero dependencies.
Estimate token count before sending a request. If the payload exceeds a threshold, compress or truncate it. Never pay for tokens you could have avoided.
tokens ≈ character_count / 4tokens ≈ character_count / 2tokens ≈ character_count / 3.5tokens ≈ 2000 (flat per asset, regardless of size)Input: 24,000 characters of plain text
Estimated tokens: 24000 / 4 = 6,000 → under budget, send as-is.
Input: 40,000 characters of JSON
Estimated tokens: 40000 / 2 = 20,000 → over budget.
Action: strip null fields, remove redundant nested objects → 14,000 chars → 7,000 tokens → send.
See references/token-formula.md for the full formula breakdown with worked examples.
Instead of re-reading the entire conversation history every turn, extract and persist key information into structured memory files. On subsequent turns, load only the memory index — not the raw history.
MEMORY.md — index file, max 200 lines. Contains only pointers: - topic-name — one-line description.memory/topic-name.md — full content for each topic with frontmatter (name, description, type).user — who the user is, their preferences, expertise level.feedback — corrections and confirmed approaches (what to do / not do).project — current goals, deadlines, decisions, constraints.reference — pointers to external resources (URLs, dashboards, issue trackers).You are a memory extraction agent. Read the following new messages (since cursor position {cursor}).
For each piece of non-obvious information, output a JSON object:
{
"topic": "short-kebab-case-name",
"type": "user | feedback | project | reference",
"description": "one-line summary for the index",
"content": "full memory content, structured with Why and How-to-apply"
}
Rules:
- Max 5 memories per pass.
- Skip anything derivable from code, git, or existing memory.
- Convert relative dates to absolute (today is {date}).
- If a memory already exists for this topic, output an update, not a duplicate.
See references/memory-extraction-pattern.md for the full pattern with prompt templates.
As conversations grow, compress older exchanges into dense summaries. Keep only the last N messages in full fidelity. This prevents context windows from filling with stale reasoning.
block at the top of the conversation. Format:```
## Decisions Made
## Current State
## Key Constraints
```
Before compression:
42 messages, ~32,000 tokens total.
After compression:
Compressed block: ~2,000 tokens.
Last 6 messages: ~4,500 tokens.
Total: ~6,500 tokens.
Savings: 32,000 - 6,500 = 25,500 tokens (80% reduction on history).
Per-request savings (ongoing): ~25,500 tokens × $0.003/1K = $0.077 per request.
| Mechanism | Typical Savings | When It Hits |
|---|---|---|
| --- | --- | --- |
| Pre-send estimation | 10-15% | Every request with large payloads |
| Memory extraction | 5-10% | Multi-session workflows |
| Context compression | 15-25% | Long conversations (>20 messages) |
| Combined | 20-35% | Sustained usage over a session |
These are conservative estimates based on real-world agent workflows. Actual savings depend on conversation length, payload sizes, and how aggressively you compress.
SKILL.md into your system prompt).No code to install. No dependencies. Just rules your agent follows.
共 1 个版本