Local Inference Context

Context management for self-hosted LLM backends (llama.cpp, Ollama). Prevents mid-task 503 errors and context overflows caused by VRAM-limited KV caches. Use...
Author: joekravelli · Category: uncategorized · Source: clawhub · Version: v1.0.0 (1 version) · API key: not required

Generic context skills assume a reliable, large-context cloud provider. Local backends (llama.cpp, Ollama) have a different failure profile: the KV cache is bounded by VRAM, the server can return 503 before OpenClaw's compaction logic triggers, and the compaction model is the same overloaded local model. This skill addresses that reality.


Why local backends fail differently

| Cloud provider | Local llama.cpp / Ollama |
| --- | --- |
| Context limit is a soft API error; OpenClaw retries after compaction | KV cache fills up; server returns 503 or context length exceeded mid-request |
| Compaction uses the same model, which is always available | Compaction uses the same overloaded local model, which may also fail |
| Context window is exactly what the API reports | Effective context = min(configured --ctx-size, available VRAM for KV cache) |
| No idle slot eviction | Idle slots can be evicted; server returns "Loading model" 503 on next request |

The practical consequence: on a GPU-constrained setup (e.g. a 24 GB card running a 27B Q5 model), the usable KV-cache budget is roughly 5–8 GB. With a configured context of 32k tokens, that budget fills up faster than the configured limit suggests. Treat 50 % fill as amber and 70 % as red, not 60 % and 80 %.
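To make the budget arithmetic concrete, the sketch below estimates KV-cache bytes per token and the resulting effective context. The formula (K and V tensors × layers × KV heads × head dimension × bytes per element) is the usual size of an unquantised KV cache. The function names are illustrative, and the layer and head counts in the usage note are hypothetical, not the figures of any particular 27B model.

```bash
# KV-cache bytes needed per token: one K and one V tensor for every layer.
# Args: n_layers n_kv_heads head_dim bytes_per_element
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# Effective context = min(configured ctx size, tokens that fit in the KV budget).
# Args: configured_ctx kv_budget_mib bytes_per_token
effective_ctx_tokens() {
  ect_fit=$(( $2 * 1024 * 1024 / $3 ))
  if [ "$ect_fit" -lt "$1" ]; then
    echo "$ect_fit"
  else
    echo "$1"
  fi
}
```

With hypothetical values of 48 layers, 8 KV heads, head dimension 128 and an f16 cache (2 bytes per element), `kv_bytes_per_token 48 8 128 2` gives 196608 bytes per token, so a full 32768-token context would need 6144 MiB. With only 5120 MiB left for the cache, `effective_ctx_tokens 32768 5120 196608` yields 27306 tokens, noticeably below the configured limit.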


Calibrating your effective context budget

Before a long session, run this once to understand your actual headroom:

```bash
# Check VRAM headroom
nvidia-smi --query-gpu=memory.used,memory.free,memory.total \
  --format=csv,noheader,nounits

# Check llama.cpp slot state
curl -s http://localhost:8081/slots | python3 -m json.tool
```

If memory.free is less than 4 GB (about 4096 in the nvidia-smi output, which reports MiB), treat the session as already amber regardless of what /status reports. Log the result to memory:

VRAM free: X MB — effective context budget: reduced
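The 4 GB rule can be expressed as a small helper. This is a minimal sketch: `vram_precheck` is a hypothetical name, and the commented wiring reads only the first GPU.

```bash
# Classify VRAM headroom from the free-memory figure in MiB
# (the unit nvidia-smi prints when "nounits" is set).
# Args: free_mib [threshold_mib, default 4096 for the 4 GB rule]
vram_precheck() {
  if [ "$1" -lt "${2:-4096}" ]; then
    echo "amber"
  else
    echo "ok"
  fi
}

# Example wiring (first GPU only):
# free=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)
# vram_precheck "$free"
```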

Thresholds for local backends

| Fill level | State | Action |
| --- | --- | --- |
| < 50 % | Green | Proceed normally |
| 50–69 % | Amber | Trim tool outputs, flush key facts to memory |
| 70–84 % | Red | Checkpoint, offer /compact before continuing |
| ≥ 85 % | Critical | Stop expanding. Compact or /new before next tool call |

Check /status at session start and after any tool call that returns more than ~200 lines of output.
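As a sketch, the thresholds map onto a single function. `classify_fill` is an illustrative name and it assumes the fill level is available as an integer percentage.

```bash
# Map an integer fill percentage to the local-backend states.
classify_fill() {
  if [ "$1" -ge 85 ]; then
    echo "critical"
  elif [ "$1" -ge 70 ]; then
    echo "red"
  elif [ "$1" -ge 50 ]; then
    echo "amber"
  else
    echo "green"
  fi
}
```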


Recognising a local backend failure

These are server-side errors, not OpenClaw compaction events. They require a different response than a normal context overflow:

| Signal | Meaning |
| --- | --- |
| HTTP 503 with body "loading model" | Idle slot was evicted; model is reloading. Wait 10–30 s, then retry once. |
| HTTP 503 with body "no slot available" | All slots busy or KV cache full. Do NOT retry immediately; compact first. |
| context length exceeded in error | Hard KV-cache overflow. Compact or start /new before any retry. |
| Sudden very slow response then timeout | KV cache thrashing; reduce context before next request. |

**Never retry a 503 "no slot available" or context overflow without first reducing context.** Retrying makes the problem worse by sending the same oversized payload again.
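The signals above can be sketched as a decision helper. The function name and the three result labels are hypothetical. The match is a case-insensitive substring search on the response body, since the exact error wording may vary between server versions.

```bash
# Decide how to react to a failed request.
# Args: http_status response_body
# Prints: wait_retry_once | compact_first | other
classify_backend_error() {
  cbe_body=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$cbe_body" in
    *"context length exceeded"*) echo "compact_first"; return ;;
  esac
  if [ "$1" = "503" ]; then
    case "$cbe_body" in
      *"loading model"*)     echo "wait_retry_once"; return ;;
      *"no slot available"*) echo "compact_first"; return ;;
    esac
  fi
  echo "other"
}
```

Only `wait_retry_once` permits a retry without first reducing context. The timeout signal has no status code or body to match, so it falls outside this helper.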


Pre-task checklist for long operations

Before any task you expect to span more than 4 turns (file edits, debugging sessions, multi-step setups):

  1. Run /status and note the current fill %.
  2. Check nvidia-smi if fill is already above 40 %.
  3. Estimate token cost of the task:
    • Each file read ≈ 500–3000 tokens depending on file size
    • Each exec result ≈ 200–1500 tokens
    • Each web_fetch ≈ 1000–4000 tokens
  4. If the estimated total would push fill past 70 %, split into phases and tell the user upfront.
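The estimate in step 3 can be sketched as follows. `projected_fill` is a hypothetical helper. It deliberately uses the upper end of each range (3000, 1500 and 4000 tokens) and rounds up, so the projection errs on the cautious side.

```bash
# Project the fill percentage after a task.
# Args: current_pct ctx_tokens n_file_reads n_execs n_web_fetches
projected_fill() {
  # Worst-case tokens added by the planned tool calls.
  pf_added=$(( $3 * 3000 + $4 * 1500 + $5 * 4000 ))
  # Convert to a percentage of the context, rounding up.
  echo $(( $1 + (pf_added * 100 + $2 - 1) / $2 ))
}
```

For example, at 40 % fill on a 32768-token context, three file reads, four exec calls and one web_fetch give `projected_fill 40 32768 3 4 1` = 98, far past the 70 % line, so the task should be split into phases.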


Amber state (50–69 %): lean tool hygiene

Apply these habits to every tool call in amber state:

```bash
# Instead of reading entire files:
sed -n '1,50p' /path/to/file          # first 50 lines
grep -n "error\|warn\|fail" logfile   # targeted grep
tail -100 /var/log/syslog             # recent entries only

# Instead of verbose exec output:
some-command 2>&1 | tail -30
systemctl status service --no-pager --lines=20

# Summarise large outputs in one sentence, then discard them:
# "Command succeeded. Key values: port=8081, pid=12345"
```

Write key values to memory immediately after each tool call; do not rely on them surviving a compaction summary intact.


Red state (70–84 %): checkpoint before continuing

  1. Write a checkpoint to memory now:
## Checkpoint [timestamp]
Status: [what is done]
Pending: [what is next]
Critical values: [file paths, ports, error codes, config keys]
  2. Tell the user:

> ⚠️ Context at ~N % (local backend, conservative threshold).
> I've saved progress to memory. Recommend /compact Focus on [task]
> before continuing. Or /new for a clean session.

  3. If continuing, use /compact Focus on [task], not bare /compact. The local model needs a focused instruction to produce a useful summary under memory pressure.
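Where memory is kept as a plain text file that can be appended to from the shell (an assumption; the skill does not say how memory is stored), the checkpoint format could be written by a helper like this. `write_checkpoint` and the file path passed to it are hypothetical.

```bash
# Append a checkpoint in the skill's checkpoint format to a memory file.
# Args: memory_file status pending critical_values
# The memory file path is a placeholder; use whatever file your setup
# treats as persistent memory.
write_checkpoint() {
  {
    printf '## Checkpoint %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'Status: %s\n' "$2"
    printf 'Pending: %s\n' "$3"
    printf 'Critical values: %s\n\n' "$4"
  } >> "$1"
}
```

Appending rather than overwriting keeps earlier checkpoints available if a later one turns out to be incomplete.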


Critical state (≥ 85 %): stop and recover

Do not issue any more tool calls that expand context.

  1. Write the checkpoint (see above).
  2. Send the user a recovery message:
🛑 Context critical (~N %). Stopping to prevent a server error.

Done: [X]
Pending: [Y]
Key info: [Z]

Options:
  /compact Focus on [task]   — summarise and continue
  /new                       — fresh session (I'll reload from memory)
  3. Wait for the user to choose. Do not attempt to continue on your own.

After a 503 or context-overflow error

If the server already returned an error before you could act:

  1. Do not panic and do not retry the same request.
  2. Check the error type:
    • "loading model" → wait 10–30 s, then retry once with a minimal message.
    • "no slot available" or context length exceeded → compact first.
  3. Run /compact Focus on [what you were doing].
  4. After compaction, verify the slot is ready:

```bash
curl -s http://localhost:8081/health
# expect: {"status":"ok"}
```

  5. Re-read any file paths or config values from memory or disk; do not trust the compaction summary to have preserved them verbatim.
  6. Resume with a short, targeted first message to re-establish the session before loading more context.
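Because a reloading model can take a while to become ready, the health check can be wrapped in a bounded poll. This is a minimal sketch: `wait_for_health`, its default attempt count and its default delay are assumptions, and it treats any response without `"status":"ok"` as not ready.

```bash
# Poll the health endpoint until it reports ok or the attempts run out.
# Args: url [attempts, default 5] [delay_seconds, default 5]
wait_for_health() {
  wfh_i=0
  while [ "$wfh_i" -lt "${2:-5}" ]; do
    if curl -s --max-time 5 "$1" | grep -q '"status" *: *"ok"'; then
      return 0
    fi
    wfh_i=$(( wfh_i + 1 ))
    sleep "${3:-5}"
  done
  return 1
}
```

For example, `wait_for_health http://localhost:8081/health 6 5` gives the server up to roughly 30 seconds before reporting failure. The loop is bounded so that a stuck backend produces an error instead of an endless retry.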


Compaction model — required, not optional

Without a dedicated compaction model, OpenClaw uses the same local model for summarisation: the identical model whose KV cache just caused the overflow. This means compaction will likely fail or produce a degraded summary. **A separate compaction model is a prerequisite for this skill to work reliably, not an optional optimisation.**

The compaction model should run on a different machine or a second inference instance with its own memory budget. It does not need to be powerful; it only needs to summarise text faithfully and follow instructions. A 7B–8B model is sufficient.

Recommended model: qwen2.5:7b via Ollama (fits in ~5 GB RAM/VRAM, fast, excellent at summarisation and instruction-following).

Fallback if speed is critical: llama3.2:3b (~2 GB).

```json
{
  "agents": {
    "defaults": {
      "compaction": {
        "model": "ollama/qwen2.5:7b",
        "notifyUser": true,
        "memoryFlush": {
          "model": "ollama/qwen2.5:7b"
        }
      }
    }
  },
  "providers": {
    "ollama": {
      "baseUrl": "http://<COMPACTION-SERVER-IP>:11434"
    }
  }
}
```

Without this configuration, the skill provides partial benefit only: the conservative thresholds and lean tool habits reduce overflow frequency, but recovery remains unreliable once an overflow occurs.
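Before relying on the compaction server, it may help to confirm that it is reachable and has the model pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models. `ollama_has_model` is a hypothetical helper, and the model tag is passed without the `ollama/` provider prefix used in the config.

```bash
# Check that an Ollama server lists a given model tag.
# Args: base_url model_tag   (e.g. http://host:11434 qwen2.5:7b)
ollama_has_model() {
  curl -s --max-time 5 "$1/api/tags" \
    | grep -qF -e "\"name\":\"$2\"" -e "\"name\": \"$2\""
}
```

For example, `ollama_has_model "http://<COMPACTION-SERVER-IP>:11434" qwen2.5:7b || echo "compaction model missing"`. The check is a plain string match on the JSON response, so it is a quick sanity test and not a full parse.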


Slash command reference

| Command | When to use |
| --- | --- |
| /status | Check fill %; use at session start and after large tool outputs |
| /context list | See which injected files and skills consume the most tokens |
| /compact Focus on [task] | Guided compaction; always specify a focus on a local backend |
| /new | Clean slate; fastest recovery when context is critical |
| /usage tokens | Per-reply token counter; useful for calibrating estimates |

Relationship to other skills

| Skill | When to use instead |
| --- | --- |
| context-recovery | After compaction on any backend; recovers lost context via channel history |
| context-budgeting | Cloud providers or stable local setups; heartbeat-based GC at >80 % |
| context-clean-up | Diagnosing chronic context bloat; ranked offender audit |
| context-anchor | Post-compaction orientation via memory file scan |

Use local-inference-context before problems occur and context-recovery after compaction if context was lost.

Version history

1 version in total

  • v1.0.0 (current)
    2026-05-08 02:58
