
Ollama Model Pilot

Use this skill when the user wants to test, compare, promote, replace, or clean up local Ollama models with a repeatable two-round real-task benchmark, no-th...
Author: patmenciu (source)
Uncategorized clawhub v1.5.0 4 versions 100000 Key: not required

Overview

ModelPilot

ModelPilot is a local-only protocol for testing, comparing, promoting, replacing, and cleaning up Ollama models. It is designed for real work decisions, not leaderboard claims.

Safety Boundary

Always keep the workflow local unless the user explicitly authorizes otherwise.

  • Do not call cloud model APIs.
  • Do not upload files, prompts, logs, paths, configs, or benchmark outputs.
  • Do not download, pull, install, upgrade, or delete models without explicit user approval.
  • Do not use real private documents as benchmark samples unless the user explicitly names the file for this task.
  • Use fictional examples for tests, documentation, and demos.
  • Treat model cleanup as a workflow dependency audit, not a disk-space optimization task.
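As an illustration of the last rule, a cleanup audit can start from which workflows reference each model rather than from model size. The sketch below is a minimal example under stated assumptions: the function name `audit_cleanup`, the workflow-to-model mapping, and all model names are hypothetical and fictional, and nothing is deleted by this code.

```python
def audit_cleanup(installed, workflow_models):
    """Group installed models by whether any workflow still depends on them.

    installed: list of model names present locally (hypothetical input).
    workflow_models: dict mapping workflow name -> model name it uses.
    """
    in_use = {}
    for workflow, model in workflow_models.items():
        in_use.setdefault(model, []).append(workflow)

    return {
        # Models that at least one workflow depends on: never cleanup candidates.
        "referenced": {m: sorted(in_use[m]) for m in installed if m in in_use},
        # Models no workflow names: candidates to raise with the user, not to delete.
        "unreferenced": [m for m in installed if m not in in_use],
        # Models a workflow expects but that are not installed.
        "missing": sorted(m for m in in_use if m not in installed),
    }


# Fictional example data.
installed = ["demo-coder:7b", "demo-writer:8b", "demo-old:3b"]
workflows = {
    "code-review": "demo-coder:7b",
    "weekly-report": "demo-writer:8b",
    "rag-index": "demo-embed:latest",
}
audit = audit_cleanup(installed, workflows)
```

The result is a list to discuss with the user. Removing any model still requires explicit approval.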

Trigger Conditions

Use this skill when the user asks to:

  • test an Ollama model
  • compare local models
  • decide whether a new model can replace an existing model
  • verify no-think behavior
  • build a local model benchmark report
  • audit installed models before cleanup
  • choose local models for coding, writing, RAG, automation, or structured output

Test Levels

Classify the task before running anything.

  1. Smoke Test

Confirm the model is installed, runnable, and responsive.

  2. Speed Benchmark

Measure startup time, generation time, output length, and failure rate.

  3. Real-Task Benchmark

Use task-like prompts that match the user's actual workflow. Prefer fixed prompt sets so results are comparable across models.

  4. Promotion Test

Decide whether a model can replace an existing workflow model. A promotion test requires two independent benchmark rounds.
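The first two levels can be sketched against Ollama's local HTTP API. This is a minimal sketch, not the skill's own script: it assumes the default local address `http://localhost:11434`, uses the non-streaming `/api/generate` endpoint, and treats `load_duration` as a rough proxy for startup time. The helper names `is_local`, `summarize_timing`, and `smoke_test` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlparse

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def is_local(url):
    """Return True only for loopback hosts, to keep the workflow local."""
    return urlparse(url).hostname in {"localhost", "127.0.0.1", "::1"}


def summarize_timing(reply):
    """Convert the nanosecond timing fields of an /api/generate reply to seconds."""
    ns = 1e9
    eval_s = reply.get("eval_duration", 0) / ns
    tokens = reply.get("eval_count", 0)
    return {
        "load_s": reply.get("load_duration", 0) / ns,    # rough startup proxy
        "eval_s": eval_s,                                 # generation time
        "total_s": reply.get("total_duration", 0) / ns,
        "tokens": tokens,
        "tokens_per_s": tokens / eval_s if eval_s else 0.0,
        "chars": len(reply.get("response", "")),          # output length
    }


def smoke_test(model, prompt="Reply with the single word: ready",
               url=OLLAMA_URL, timeout=120):
    """Send one short prompt to a local model and return its timing summary."""
    if not is_local(url):
        raise ValueError("refusing to call a non-local endpoint")
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize_timing(json.load(response))
```

Calling `smoke_test("some-installed-model")` requires a running local Ollama server and a model that is already installed. The function never pulls a model, so a missing model surfaces as an HTTP error, which counts as a smoke-test failure.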

Two-Round Replacement Rule

Do not recommend replacing a working model after a single run.

  • Round 1 checks: runnable, speed, output format, obvious quality failures, no-think leakage.
  • Round 2 checks: same prompt set, same model, repeatability, quality consistency, failure modes.
  • A model is only replacement-ready when both rounds pass the required tasks.
  • Keep the previous model and configuration available for rollback.
  • If structured output, no-think behavior, or long-context handling is unstable, do not use the model in automation.
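The rule above can be expressed as a small gate. This is a sketch under assumptions: each round is represented as a dict mapping a task name to `True` (passed) or `False` (failed), and the names `failed_tasks` and `replacement_ready` are hypothetical.

```python
def failed_tasks(round_result, required_tasks):
    """List required tasks that did not explicitly pass in one round."""
    return [t for t in required_tasks if round_result.get(t) is not True]


def replacement_ready(round1, round2, required_tasks):
    """A model is replacement-ready only if both rounds pass every required task."""
    if round1 is None or round2 is None:
        return False  # a single run is never enough
    return (not failed_tasks(round1, required_tasks)
            and not failed_tasks(round2, required_tasks))


# Fictional example: round 2 exposes an unstable structured-output task.
required = ["short_qa", "structured_output", "no_think"]
round1 = {"short_qa": True, "structured_output": True, "no_think": True}
round2 = {"short_qa": True, "structured_output": False, "no_think": True}
```

A missing task counts as a failure, so a round that silently skipped a required prompt cannot pass.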

Fixed Prompt Set

Prefer a stable prompt file with fictional data. Include at least:

  • short Chinese or English Q&A
  • long-document summary
  • structured JSON or Markdown output
  • real-role workflow simulation
  • no-think verification prompt

The benchmark prompt set should be reused across candidate models. Do not compare models using different tasks unless the report clearly says so.
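One possible shape for such a prompt set, with a coverage check, is sketched below. The `kind` labels, the field names, and the prompts themselves are illustrative assumptions, and all data is fictional.

```python
REQUIRED_KINDS = {
    "short_qa", "long_summary", "structured_output", "role_workflow", "no_think",
}

# Fictional prompts; the long-summary text would be a made-up document.
PROMPT_SET = [
    {"id": "qa-01", "kind": "short_qa",
     "prompt": "In one sentence, what does a reverse proxy do?"},
    {"id": "sum-01", "kind": "long_summary",
     "prompt": "Summarize the following fictional meeting notes in five bullets: ..."},
    {"id": "json-01", "kind": "structured_output",
     "prompt": "Return only JSON with keys name and priority for the task 'renew the demo certificate'."},
    {"id": "role-01", "kind": "role_workflow",
     "prompt": "As a support engineer at the fictional company Acme Demo, draft a reply to a late-delivery complaint."},
    {"id": "nt-01", "kind": "no_think",
     "prompt": "Answer with the final result only, no reasoning: 17 + 25"},
]


def missing_kinds(prompt_set):
    """Return the required prompt kinds that the set does not cover, sorted."""
    return sorted(REQUIRED_KINDS - {p["kind"] for p in prompt_set})
```

Keeping stable `id` values makes it possible to line up results for the same prompt across models and rounds.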

No-Think Verification

Never assume a model is no-think just because the model name contains nothink.

Check:

  • model output does not include thinking tags (for example `<think>` or `</think>`), reasoning traces, or hidden-analysis markers
  • CLI or API flags are actually accepted by the runtime
  • Modelfile-level instructions are treated as weak constraints, not proof
  • structured outputs remain clean when no-think is enabled

If no-think fails, the model may still be useful for manual work, but it should not be promoted into automated workflows that require clean output.
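A mechanical leak check on the output text might look like the sketch below. The marker list is an assumption: `<think>`-style tags and line-leading "Reasoning:" or "Analysis:" labels are common patterns, but each model family may use different ones, so the list should be adjusted to what the tested models actually emit. The names `find_leaks` and `clean_json` are hypothetical.

```python
import json
import re

# Assumed leak markers; extend for the models under test.
LEAK_PATTERNS = [
    re.compile(r"</?think(ing)?>", re.IGNORECASE),
    re.compile(r"^\s*(reasoning|analysis)\s*:", re.IGNORECASE | re.MULTILINE),
]


def find_leaks(output):
    """Return every leak marker found in the model output, in pattern order."""
    return [m.group(0).strip()
            for pattern in LEAK_PATTERNS
            for m in pattern.finditer(output)]


def clean_json(output):
    """True if the output has no leak markers and parses as JSON unmodified."""
    if find_leaks(output):
        return False
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True
```

An empty result from `find_leaks` only shows that none of the known markers appeared. It does not prove the model is no-think, so manual review of a few outputs is still needed.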

Standard Workflow

  1. Identify the current model, candidate model, task type, and replacement target.
  2. Confirm whether the user wants smoke, speed, real-task, or promotion testing.
  3. Build or reuse a fixed prompt set with fictional data unless the user explicitly authorizes a real file.
  4. Run two benchmark rounds for replacement decisions.
  5. Review mechanical results: failures, duration, output length, format checks, no-think leakage.
  6. Review semantic quality manually for real-task benchmarks.
  7. Return a concise decision: keep, observe, replace, or not recommended.
  8. Include rollback advice when replacement is recommended.
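The mechanical review in step 5 can be reduced to a few aggregate numbers per round. The sketch below assumes each run is recorded as a dict with the hypothetical keys `ok`, `seconds`, `chars`, `format_ok`, and `leak`. It is not the output format of the skill's scripts.

```python
from statistics import mean


def mechanical_summary(records):
    """Aggregate failures, duration, output length, format checks, and leaks."""
    completed = [r for r in records if r["ok"]]
    failures = len(records) - len(completed)
    return {
        "runs": len(records),
        "failure_rate": failures / len(records) if records else 0.0,
        "mean_seconds": mean(r["seconds"] for r in completed) if completed else None,
        "mean_chars": mean(r["chars"] for r in completed) if completed else None,
        "format_failures": sum(1 for r in completed if not r["format_ok"]),
        "leaks": sum(1 for r in completed if r["leak"]),
    }


# Fictional round with one failed run, one format failure, and one leak.
records = [
    {"ok": True, "seconds": 1.0, "chars": 100, "format_ok": True, "leak": False},
    {"ok": True, "seconds": 2.0, "chars": 200, "format_ok": False, "leak": False},
    {"ok": True, "seconds": 3.0, "chars": 300, "format_ok": True, "leak": True},
    {"ok": False, "seconds": 0.0, "chars": 0, "format_ok": False, "leak": False},
]
summary = mechanical_summary(records)
```

Failed runs are excluded from the averages so that a timeout does not distort the speed figures. They are still counted in `failure_rate`.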

Suggested Local Scripts

Use scripts only when they are available in this skill folder and fit the task.

  • scripts/modelpilot_benchmark.py: run local Ollama benchmark rounds and write JSON results.
  • scripts/modelpilot_report.py: summarize benchmark JSON into a Markdown decision report.

Do not run scripts that download models, install dependencies, or call remote APIs.

Required Response Format

When reporting results, include:

## ModelPilot Result

### Scope
-

### Models Tested
-

### Test Rounds
-

### Key Findings
-

### No-Think Check
-

### Replacement Decision
-

### Risks and Limits
-

### Rollback Advice
-
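Filling this template can be automated so that no section is dropped by accident. The following is a minimal sketch, assuming findings arrive as a dict from section title to a list of bullet strings. The function name `render_report` and the "(not assessed)" placeholder are hypothetical choices.

```python
SECTIONS = [
    "Scope", "Models Tested", "Test Rounds", "Key Findings",
    "No-Think Check", "Replacement Decision", "Risks and Limits",
    "Rollback Advice",
]


def render_report(findings):
    """Render the ModelPilot result template, marking empty sections explicitly."""
    lines = ["## ModelPilot Result", ""]
    for section in SECTIONS:
        lines.append(f"### {section}")
        items = findings.get(section) or ["(not assessed)"]
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Fictional example.
report = render_report({
    "Scope": ["Promotion test on the fictional prompt set"],
    "Replacement Decision": ["observe"],
})
```

Sections without findings are rendered with a placeholder instead of being omitted, so a reader can tell a skipped check from a passed one.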

Version History

4 versions in total

  • v1.5.0 (current)
    2026-06-09 18:13
  • v1.4.1
    2026-05-29 21:16 Safe
  • v1.4.0
    2026-05-29 13:52
  • v1.3.0
    2026-05-28 13:42 Safe

