
EvalScope

LLM evaluation & inference performance testing via the evalscope CLI. Translates natural language requests into evalscope commands for: (1) Model accuracy ev...
Author: yunnglin
Uncategorized · clawhub · v1.0.2 · 2 versions · 99804.7 · Key: not required

Overview

EvalScope

Read only the relevant reference file for the matched workflow — don't preload all of them.

Workflow | When | Reference
---------|------|----------
Eval (accuracy) | evaluate / benchmark / score | eval-reference.md
Perf (stress test) | throughput / latency / QPS / perf | perf-reference.md
RAG Evaluation | RAG / embedding / retrieval quality | rag-reference.md
Visualization | view results / compare / dashboard | (below)
Benchmark Discovery | list / find / what benchmarks | (below)
Troubleshooting | errors / failures / debug | troubleshooting.md
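
The routing table above can be read as a keyword lookup. The sketch below assumes case-insensitive whole-word matching where the first matching row wins. The `ROUTES` list, the `route` function, the row order, and the reduction of multi-word triggers to single words are choices made for this illustration and are not part of EvalScope.

```python
import re

# (workflow label, trigger words, reference file). None means the workflow is
# documented inline instead of in a separate reference file.
# Row order is an assumption: perf is checked before RAG so that
# "throughput of an embedding model" is treated as a perf request.
ROUTES = [
    ("troubleshooting", {"errors", "failures", "debug"}, "troubleshooting.md"),
    ("perf", {"throughput", "latency", "qps", "perf"}, "perf-reference.md"),
    ("rag", {"rag", "embedding", "retrieval"}, "rag-reference.md"),
    ("visualization", {"view", "compare", "dashboard"}, None),
    ("discovery", {"list", "find", "benchmarks"}, None),
    ("eval", {"evaluate", "benchmark", "score"}, "eval-reference.md"),
]


def route(request):
    """Return (workflow, reference file) for the first row whose triggers match."""
    words = set(re.findall(r"[a-z0-9]+", request.lower()))
    for workflow, triggers, reference in ROUTES:
        if words & triggers:
            return workflow, reference
    return None, None


print(route("What is the latency and QPS of my endpoint?"))
```

Only the returned reference file would then be read, which keeps the other references out of context.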

Prerequisites

evalscope --version            # verify installation
pip install evalscope           # basic
pip install 'evalscope[all]'   # all backends (perf, rag, service, aigc)
pip install 'evalscope[perf]'  # perf only
pip install 'evalscope[rag]'   # RAG only (RAGAS, MTEB, CLIP)
pip install 'evalscope[service]' # Web dashboard

Decision Tree

  • User wants accuracy evaluation (evaluate / benchmark / score / 评测)
      • Local checkpoint path or HuggingFace/ModelScope ID → --model PATH (auto llm_ckpt)
      • API endpoint → --model NAME --api-url URL (auto openai_api)
      • Anthropic → --eval-type anthropic_api
      • LiteLLM multi-provider → --eval-type litellm --model provider/name
      • OpenAI Responses API → --eval-type openai_responses_api
      • Pipeline test → --model mock --eval-type mock_llm
      • Image generation → --eval-type text2image
      • TTS → --eval-type text2speech
      • Image editing → --eval-type image_editing
  • User wants performance test (throughput / latency / QPS / 压测)
      • evalscope perf workflow
      • API types: openai (default), local, local_vllm, dashscope, embedding, rerank, custom
  • User wants RAG evaluation (RAG / embedding quality / retrieval)
      • evalscope eval --eval-backend RAGEval with tool config
  • User wants visualization (view / compare / dashboard)
      • evalscope service
  • User wants benchmark info (list / find / what benchmarks / 有哪些评测集)
      • evalscope benchmark-info
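
The accuracy-evaluation branch of the tree can be sketched as a small argument builder. The flags and `--eval-type` values come from the list above. The branch labels (`"checkpoint"`, `"api"`, `"anthropic"`, and so on) and the `model_args` helper are hypothetical names introduced for this example.

```python
# Hypothetical branch label -> value passed to --eval-type, as listed above.
EXPLICIT_EVAL_TYPES = {
    "anthropic": "anthropic_api",
    "litellm": "litellm",
    "openai_responses": "openai_responses_api",
    "mock": "mock_llm",
    "image_generation": "text2image",
    "tts": "text2speech",
    "image_editing": "image_editing",
}


def model_args(kind, model, api_url=None):
    """Return the model-related arguments for one branch of the decision tree."""
    if kind == "checkpoint":
        # Local path or hub ID: the eval type is left to auto-detection.
        return ["--model", model]
    if kind == "api":
        if not api_url:
            raise ValueError("an API endpoint needs --api-url")
        # Setting --api-url is what triggers the automatic openai_api type.
        return ["--model", model, "--api-url", api_url]
    if kind not in EXPLICIT_EVAL_TYPES:
        raise ValueError(f"unknown target kind: {kind!r}")
    return ["--model", model, "--eval-type", EXPLICIT_EVAL_TYPES[kind]]


# Pipeline smoke test with the mock model.
cmd = ["evalscope", "eval", *model_args("mock", "mock"),
       "--datasets", "gsm8k", "--limit", "5"]
print(" ".join(cmd))
```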

Workflow 1: Eval (Accuracy)

Core command pattern:

# Local checkpoint
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 10

# API endpoint (auto-detects openai_api when --api-url is set)
evalscope eval --model qwen-plus --datasets gsm8k arc \
  --api-url http://localhost:8000/v1/chat/completions --api-key sk-xxx --limit 10

# Anthropic
evalscope eval --model claude-3-5-sonnet --eval-type anthropic_api --datasets mmlu --api-key sk-ant-xxx

Key parameters: --datasets, --limit, --generation-config, --dataset-args, --eval-backend, --judge-strategy. For full parameter list → eval-reference.md.

Output: outputs//reports/*.json (scores), report.html (summary).
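
To inspect the score files after a run, one option is to collect every JSON file that sits under a `reports` directory. This is a sketch only: `load_reports` is a hypothetical helper, it makes no assumption about the directory level between `outputs` and `reports`, and it does not assume any schema for the report contents.

```python
import json
from pathlib import Path


def load_reports(outputs_root="outputs"):
    """Map each report path (relative to outputs_root) to its parsed JSON."""
    root = Path(outputs_root)
    reports = {}
    for path in sorted(root.rglob("*.json")):
        rel = path.relative_to(root)
        # Keep only files located somewhere below a directory named "reports".
        if "reports" in rel.parts[:-1]:
            with path.open(encoding="utf-8") as handle:
                reports[rel.as_posix()] = json.load(handle)
    return reports


for name, report in load_reports().items():
    print(name, type(report).__name__)
```

If no `outputs` directory exists yet, the loop simply prints nothing.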

Workflow 2: Perf (Stress Test)

Core command pattern:

# Basic throughput test
evalscope perf --model qwen-plus \
  --url http://localhost:8000/v1/chat/completions --api openai \
  --dataset openqa --parallel 5 --number 200 --stream

# Concurrency gradient (--parallel and --number must pair)
evalscope perf --model qwen-plus --url http://localhost:8000/v1/chat/completions \
  --api openai --parallel 1 5 10 20 --number 50 250 500 1000 --stream

# Embedding model
evalscope perf --model text-embedding-v3 --url http://localhost:8000/v1/embeddings \
  --api embedding --parallel 10 --number 500

# Rerank model
evalscope perf --model bge-reranker --url http://localhost:8000/v1/rerank \
  --api rerank --parallel 5 --number 200

Key parameters: --parallel, --number, --dataset, --max-tokens, --sla-auto-tune. For full parameter list → perf-reference.md.

Output: console table (TTFT/TPOT/throughput p50-p99) + HTML report.
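
The concurrency gradient example uses 50 requests per concurrent worker at every step (50/1, 250/5, 500/10, 1000/20). The sketch below generates such pairs, taking "must pair" to mean equal-length lists matched by position, which is an interpretation rather than documented behavior. `gradient_args` and `requests_per_worker` are hypothetical names.

```python
def gradient_args(parallel, requests_per_worker=50):
    """Build paired --parallel / --number arguments for a concurrency gradient."""
    if not parallel:
        raise ValueError("at least one concurrency level is required")
    if any(p <= 0 for p in parallel):
        raise ValueError("concurrency levels must be positive")
    # One request count per concurrency level, in the same position.
    number = [p * requests_per_worker for p in parallel]
    return ["--parallel", *map(str, parallel), "--number", *map(str, number)]


cmd = [
    "evalscope", "perf", "--model", "qwen-plus",
    "--url", "http://localhost:8000/v1/chat/completions",
    "--api", "openai",
    *gradient_args([1, 5, 10, 20]),
    "--stream",
]
print(" ".join(cmd))
```

Generating `--number` from `--parallel` keeps the two lists the same length by construction.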

Workflow 3: RAG Evaluation

Uses --eval-backend RAGEval with a Python dict/YAML config. Three tools: RAGAS, MTEB, clip_benchmark.

from evalscope import run_task
run_task({
    'eval_backend': 'RAGEval',
    'eval_config': {
        'tool': 'MTEB',  # or 'RAGAS' or 'clip_benchmark'
        ...
    }
})

For config schemas and examples → rag-reference.md.

Workflow 4: Visualization

evalscope service --host 0.0.0.0 --port 9000 --outputs ./outputs

Options: --host (default 0.0.0.0), --port (default 9000), --outputs PATH (scan dir), --debug.

Requires: pip install 'evalscope[service]'.

Workflow 5: Benchmark Discovery

evalscope benchmark-info --list                    # all benchmarks
evalscope benchmark-info --list --tag Math Coding  # filter by tags (OR, case-insensitive)
evalscope benchmark-info gsm8k                     # text summary
evalscope benchmark-info gsm8k --format json       # structured JSON
evalscope benchmark-info gsm8k --format markdown   # full docs
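
For programmatic use, the `--format json` output can be parsed from Python. The sketch below assumes the command prints a single JSON document to stdout and nothing else. It makes no assumption about the fields inside, and both helper names are hypothetical.

```python
import json
import subprocess


def benchmark_info_cmd(name, fmt="json"):
    """Build the benchmark-info command shown above for one benchmark."""
    return ["evalscope", "benchmark-info", name, "--format", fmt]


def benchmark_info(name):
    """Run the command and parse its stdout as JSON (assumes pure JSON output)."""
    result = subprocess.run(
        benchmark_info_cmd(name), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


print(" ".join(benchmark_info_cmd("gsm8k")))
```

Calling `benchmark_info("gsm8k")` requires evalscope to be installed and raises `subprocess.CalledProcessError` if the command exits with an error.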

Workflow 6: Sandbox Evaluation

For code-execution benchmarks (HumanEval, MBPP, etc.) with Docker isolation:

evalscope eval --model qwen-plus --datasets humaneval \
  --api-url http://localhost:8000/v1/chat/completions \
  --sandbox '{"enabled": true, "type": "docker"}'

Requires Docker daemon running. See evalscope eval --help for --sandbox schema.
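
When the command is assembled from Python, serializing the `--sandbox` value with `json.dumps` avoids hand-written quoting mistakes. This sketch reproduces only the two keys shown in the example above. `sandbox_args` is a hypothetical helper, and any further keys should be checked against `evalscope eval --help`.

```python
import json
import shlex


def sandbox_args(enabled=True, sandbox_type="docker"):
    """Return the --sandbox flag with its JSON value as one argument."""
    return ["--sandbox", json.dumps({"enabled": enabled, "type": sandbox_type})]


cmd = [
    "evalscope", "eval", "--model", "qwen-plus", "--datasets", "humaneval",
    "--api-url", "http://localhost:8000/v1/chat/completions",
    *sandbox_args(),
]
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

Passing `cmd` as a list to `subprocess.run` needs no shell quoting at all, since the JSON travels as a single argument.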

Quick Lookup Table

For up-to-date results: evalscope benchmark-info --list --tag

User Need | Tags | Typical Benchmarks
----------|------|-------------------
Math / reasoning | Math, Reasoning | gsm8k, math_500, aime24, competition_math
Coding | Coding | humaneval, mbpp, live_code_bench
General knowledge | Knowledge, MCQ | mmlu, ceval, cmmlu, mmlu_pro
Chinese | Chinese | ceval, cmmlu, chinese_simpleqa
Multimodal / vision | MultiModal | mmmu, mm_bench, math_vista
Instruction following | InstructionFollowing | ifeval, multi_if
Function calling | FunctionCalling | bfcl_v3, bfcl_v4
Long context | LongContext | needle_haystack, longbench_v2
Agent | Agent | tau_bench

Common suites:

  • General LLM: mmlu gsm8k bbh humaneval ifeval
  • Chinese: ceval cmmlu chinese_simpleqa
  • Multimodal: mmmu mm_bench math_vista mm_star
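
The suites above map directly onto the `--datasets` argument. The sketch below builds a first-run command for a suite. The dictionary keys and the `suite_command` helper are hypothetical, and the default limit of 5 is chosen here as a small first-run sample.

```python
# Suite contents copied from the list above. The keys are illustrative labels.
SUITES = {
    "general": ["mmlu", "gsm8k", "bbh", "humaneval", "ifeval"],
    "chinese": ["ceval", "cmmlu", "chinese_simpleqa"],
    "multimodal": ["mmmu", "mm_bench", "math_vista", "mm_star"],
}


def suite_command(suite, model, limit=5):
    """Build an evalscope eval command for a whole suite."""
    cmd = ["evalscope", "eval", "--model", model, "--datasets", *SUITES[suite]]
    if limit is not None:
        # A small limit keeps the first validation run short.
        cmd += ["--limit", str(limit)]
    return cmd


print(" ".join(suite_command("chinese", "Qwen/Qwen2.5-0.5B-Instruct")))
```

Passing `limit=None` drops the flag for a full run once the small run has succeeded.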

General Notes

  • Always pass --limit 5 for first-run validation
  • Default output: ./outputs//
  • Long runs: run in the background and follow progress with tail -f outputs//logs/eval_log.log
  • Resume interrupted runs: --use-cache outputs/
  • Full parameter help: evalscope eval --help / evalscope perf --help
  • On errors → troubleshooting.md

Version history

2 versions in total

  • v1.0.2 (current)
    2026-06-07 06:06 · Safe · Safe
  • v1.0.1
    2026-03-31 03:54 · Safe · Safe

Security checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
