Generic context skills assume a reliable, large-context cloud provider.
Local backends (llama.cpp, Ollama) have a different failure profile:
the KV cache is bounded by VRAM, the server can return 503 before
OpenClaw's compaction logic triggers, and the compaction model is the
same overloaded local model. This skill addresses that reality.
| Cloud provider | Local llama.cpp / Ollama |
|---|---|
| Context limit is a soft API error, OpenClaw retries after compaction | KV-cache fills up, server returns 503 or context length exceeded mid-request |
| Compaction uses same model, which is always available | Compaction uses same overloaded local model — may also fail |
| Context window is exactly what the API reports | Effective context = min(configured --ctx-size, tokens that fit in the VRAM left for the KV cache) |
| No idle slot eviction | Idle slots can be evicted; server returns "Loading model" 503 on next request |
The practical consequence: on a GPU-constrained setup (e.g. a 24 GB card
running a 27B Q5 model), the usable KV-cache budget is roughly 5–8 GB.
With a 32k-token configured context, that budget can be exhausted before
the configured limit is reached. Treat 50 % fill as amber and 70 % as red,
rather than the usual 60/80 %.
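
To make the gap between configured and effective context concrete, here is a rough sizing sketch. It assumes an f16 KV cache and a grouped-query-attention model; the layer count, KV-head count and head dimension are hypothetical placeholders, not the values of any particular 27B model, and compute buffers are ignored.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies.

    Each layer stores one key and one value vector of
    n_kv_heads * head_dim elements (hence the factor 2).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def effective_context(ctx_size: int, kv_budget_bytes: int, per_token: int) -> int:
    """Tokens actually usable: the smaller of the configured context
    and the number of tokens that fit in the KV-cache budget."""
    return min(ctx_size, kv_budget_bytes // per_token)


# Hypothetical architecture values, for illustration only.
per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)
usable = effective_context(ctx_size=32768,
                           kv_budget_bytes=6 * 1024**3,
                           per_token=per_token)
print(per_token, usable)
```

With these placeholder numbers a 6 GiB budget holds fewer tokens than the configured 32k, which is the situation the conservative thresholds are meant to cover.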
Before a long session, run this once to understand your actual headroom:
```bash
# Check VRAM headroom
nvidia-smi --query-gpu=memory.used,memory.free,memory.total \
  --format=csv,noheader,nounits

# Check llama.cpp slot state
curl -s http://localhost:8081/slots | python3 -m json.tool
```
If memory.free is less than 4 GB (about 4096 in the MiB figures that
nvidia-smi prints with nounits), treat the session as already amber
regardless of what /status reports. Log the result to memory:
VRAM free: X MB — effective context budget: reduced
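
The headroom check can be scripted. The sketch below parses one line of the nvidia-smi CSV output shown above and applies the 4 GB rule; the function names are hypothetical, and the threshold of 4096 assumes that nvidia-smi reports MiB when `nounits` is set.

```python
def parse_vram(csv_line: str) -> dict:
    """Parse one 'used, free, total' line from
    nvidia-smi --format=csv,noheader,nounits (values in MiB)."""
    used, free, total = (int(field.strip()) for field in csv_line.split(","))
    return {"used": used, "free": free, "total": total}


def starts_amber(free_mib: int, threshold_mib: int = 4096) -> bool:
    """True when free VRAM is below the threshold, meaning the session
    should be treated as amber regardless of the reported fill level."""
    return free_mib < threshold_mib


# Sample line in the shape nvidia-smi prints; the numbers are made up.
vram = parse_vram("20480, 3584, 24564")
print(vram["free"], starts_amber(vram["free"]))
```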
| Fill level | State | Action |
|---|---|---|
| < 50 % | Green | Proceed normally |
| 50–69 % | Amber | Trim tool outputs, flush key facts to memory |
| 70–84 % | Red | Checkpoint, offer /compact before continuing |
| ≥ 85 % | Critical | Stop expanding. Compact or /new before next tool call |
Check /status at session start and after any tool call that returns
more than ~200 lines of output.
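
The threshold table maps directly onto a small lookup. This is a minimal sketch; `fill_state` is a hypothetical helper name, and the fill percentage is assumed to be whatever /status reports.

```python
def fill_state(fill_pct: float) -> str:
    """Map a context fill percentage to the state names in the table above."""
    if fill_pct >= 85:
        return "critical"
    if fill_pct >= 70:
        return "red"
    if fill_pct >= 50:
        return "amber"
    return "green"


for pct in (12, 50, 69.9, 70, 85):
    print(pct, fill_state(pct))
```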
These are server-side errors, not OpenClaw compaction events.
They require a different response than a normal context overflow:
| Signal | Meaning |
|---|---|
| HTTP 503 with body "loading model" | Idle slot was evicted; model is reloading. Wait 10–30 s, then retry once. |
| HTTP 503 with body "no slot available" | All slots busy or KV cache full. Do NOT retry immediately; compact first. |
| `context length exceeded` in error | Hard KV-cache overflow. Compact or start /new before any retry. |
| Sudden very slow response then timeout | KV cache thrashing — reduce context before next request. |
**Never retry a 503 "no slot available" or context overflow without first
reducing context.** Retrying makes the problem worse by sending the same
oversized payload again.
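
One way to keep the retry rule mechanical is to classify the error before deciding anything. The sketch below matches on the substrings listed in the signal table; the exact wording a server returns may differ between versions, so treat the patterns and the action names as assumptions.

```python
def recovery_action(status: int, body: str) -> str:
    """Choose a recovery action from an HTTP status code and error body.

    Substring matching is case-insensitive because the server's
    capitalisation is not guaranteed.
    """
    text = body.lower()
    if "context length exceeded" in text:
        return "compact_first"
    if status == 503 and "no slot available" in text:
        return "compact_first"
    if status == 503 and "loading model" in text:
        return "wait_then_retry_once"
    return "inspect_manually"


print(recovery_action(503, "Loading model"))
print(recovery_action(503, "no slot available"))
```

Only the "loading model" case ever leads to a retry without reducing context first.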
Before any task you expect to span more than 4 turns (file edits, debugging
sessions, multi-step setups):
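1. Run `/status` and note the current fill %.
2. Run `nvidia-smi` if fill is already above 40 %.
3. Estimate the token cost of the planned tool calls:
   - exec result ≈ 200–1500 tokens
   - web_fetch ≈ 1000–4000 tokens
4. If the task is unlikely to fit in the remaining budget, warn the user upfront.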
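
The pre-flight estimate can be sketched as simple arithmetic over the per-call ranges given above (exec result and web_fetch). The function names and the example plan are hypothetical; the projection uses the upper end of each range so that it errs on the cautious side.

```python
# Token ranges per tool call, taken from the estimates above.
COST_RANGES = {"exec": (200, 1500), "web_fetch": (1000, 4000)}


def estimate_tokens(planned_calls: list) -> tuple:
    """Return (low, high) total token estimates for a list of tool names."""
    low = sum(COST_RANGES[name][0] for name in planned_calls)
    high = sum(COST_RANGES[name][1] for name in planned_calls)
    return low, high


def projected_fill(current_pct: float, ctx_tokens: int, planned_calls: list) -> float:
    """Worst-case fill percentage after the planned calls."""
    _, high = estimate_tokens(planned_calls)
    return current_pct + 100 * high / ctx_tokens


plan = ["exec"] * 4 + ["web_fetch"] * 2
print(estimate_tokens(plan))
print(round(projected_fill(40, 32768, plan), 1))
```

If the projected fill lands in the red or critical band, that is the point at which to tell the user before starting.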
Apply these habits to every tool call in amber state:
```bash
# Instead of reading entire files:
sed -n '1,50p' /path/to/file          # first 50 lines
grep -n "error\|warn\|fail" logfile   # targeted grep
tail -100 /var/log/syslog             # recent entries only

# Instead of verbose exec output:
some-command 2>&1 | tail -30
systemctl status service --no-pager --lines=20

# Summarise large outputs in one sentence, then discard them:
# "Command succeeded. Key values: port=8081, pid=12345"
```
Write key values to memory immediately after each tool call — do not
rely on them surviving a compaction summary intact.
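
Where tool output is post-processed in code rather than in the shell, the same trimming habit can be applied there. This is a minimal sketch with a hypothetical helper, mirroring `tail -30` and leaving a marker so it is clear that lines were dropped.

```python
def tail_lines(output: str, keep: int = 30) -> str:
    """Keep only the last `keep` lines of a tool output.

    Short outputs are returned unchanged; longer ones get a one-line
    marker stating how many lines were omitted.
    """
    lines = output.splitlines()
    if len(lines) <= keep:
        return output
    dropped = len(lines) - keep
    return f"[{dropped} earlier lines omitted]\n" + "\n".join(lines[-keep:])


long_output = "\n".join(f"line {i}" for i in range(100))
print(tail_lines(long_output, keep=5))
```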
```markdown
## Checkpoint [timestamp]
Status: [what is done]
Pending: [what is next]
Critical values: [file paths, ports, error codes, config keys]
```
> ⚠️ Context at ~N % (local backend — conservative threshold).
> I've saved progress to memory. Recommend /compact Focus on [task]
> before continuing. Or /new for a clean session.
Use `/compact Focus on [task]`, not bare `/compact`. The local model needs a
focused instruction to produce a useful summary under memory pressure.
Do not issue any more tool calls that expand context.
```text
🛑 Context critical (~N %). Stopping to prevent a server error.

Done: [X]
Pending: [Y]
Key info: [Z]

Options:
  /compact Focus on [task]: summarise and continue
  /new: fresh session (I'll reload from memory)
```
If the server already returned an error before you could act:

- `"loading model"` → wait 15–30 s, then retry once with a minimal message.
- `"no slot available"` or `context length exceeded` → compact first with `/compact Focus on [what you were doing]`.
- Check that the server is healthy before sending more requests:

```bash
curl -s http://localhost:8081/health
# expect: {"status":"ok"}
```
- After compaction, re-read critical values from memory; do not trust the compaction summary to have preserved them verbatim.
- If errors persist, start a `/new` session before loading more context.
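
The wait-then-retry-once rule for a reloading model can be sketched as a bounded poll of the health endpoint used above. The helper names are hypothetical, the URL is the one from the curl example, and the defaults (six probes, five seconds apart) are just one choice that stays within the suggested waiting window.

```python
import json
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8081/health") -> bool:
    """Return True if the health endpoint answers with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        # Connection errors, HTTP errors such as 503, and invalid JSON.
        return False
    return isinstance(data, dict) and data.get("status") == "ok"


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 6,
                       delay_s: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` up to `attempts` times, sleeping between tries.

    Returns True as soon as a probe succeeds, False if none does.
    """
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(delay_s)
    return False
```

A caller would run `wait_until_healthy(http_probe)` and send the single minimal retry only if it returns True. The probe and sleep function are injectable so the loop can be exercised without a running server.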
Without a dedicated compaction model, OpenClaw uses the same local model
for summarisation — the identical model whose KV cache just caused the
overflow. This means compaction will likely fail or produce a degraded
summary. **A separate compaction model is a prerequisite for this skill
to work reliably, not an optional optimisation.**
The compaction model should run on a different machine or a second
inference instance with its own memory budget. It does not need to be
powerful — it only needs to summarise text faithfully and follow
instructions. A 7B–8B model is sufficient.
Recommended model: qwen2.5:7b via Ollama (fits in ~5 GB RAM/VRAM,
fast, excellent at summarisation and instruction-following).
Fallback if speed is critical: llama3.2:3b (~2 GB).
```json
{
  "agents": {
    "defaults": {
      "compaction": {
        "model": "ollama/qwen2.5:7b",
        "notifyUser": true,
        "memoryFlush": {
          "model": "ollama/qwen2.5:7b"
        }
      }
    }
  },
  "providers": {
    "ollama": {
      "baseUrl": "http://<COMPACTION-SERVER-IP>:11434"
    }
  }
}
```
Without this configuration, the skill provides partial benefit only:
the conservative thresholds and lean tool habits reduce overflow frequency,
but cannot recover reliably once an overflow occurs.
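
Because a missing compaction model is easy to overlook, a quick sanity check over the parsed configuration can help. The sketch below only inspects the keys that appear in the example configuration above; `compaction_warnings` and the main-model identifier passed to it are hypothetical, and nothing here is part of OpenClaw itself.

```python
def compaction_warnings(config: dict, main_model: str) -> list:
    """Return human-readable warnings about the compaction setup.

    `config` is the parsed JSON configuration; `main_model` is the
    identifier of the model that serves normal requests.
    """
    compaction = config.get("agents", {}).get("defaults", {}).get("compaction", {})
    warnings = []

    model = compaction.get("model")
    if not model:
        warnings.append("no dedicated compaction model configured")
    elif model == main_model:
        warnings.append("compaction model is the same as the main model")

    if not compaction.get("memoryFlush", {}).get("model"):
        warnings.append("memoryFlush has no model configured")
    return warnings


example = {"agents": {"defaults": {"compaction": {
    "model": "ollama/qwen2.5:7b",
    "memoryFlush": {"model": "ollama/qwen2.5:7b"},
}}}}
print(compaction_warnings(example, main_model="local/main-model"))
```

An empty list means the two settings this skill relies on are present; it does not verify that the compaction server is reachable.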
| Command | When to use |
|---|---|
| `/status` | Check fill %. Use at session start and after large tool outputs |
| `/context list` | See which injected files and skills consume the most tokens |
| `/compact Focus on [task]` | Guided compaction. Always specify a focus on a local backend |
| `/new` | Clean slate. Fastest recovery when context is critical |
| `/usage tokens` | Per-reply token counter. Useful for calibrating estimates |
| Skill | When to use instead |
|---|---|
| context-recovery | After compaction on any backend: recovers lost context via channel history |
| context-budgeting | Cloud providers or stable local setups: heartbeat-based GC at >80 % |
| context-clean-up | Diagnosing chronic context bloat: ranked offender audit |
| context-anchor | Post-compaction orientation via memory file scan |
Use local-inference-context before problems occur and context-recovery
after compaction if context was lost.