概述

EvalScope

Read only the relevant reference file for the matched workflow — don't preload all of them.

Workflow	When	Reference
----------	------	-----------
Eval (accuracy)	evaluate / benchmark / score	eval-reference.md
Perf (stress test)	throughput / latency / QPS / perf	perf-reference.md
RAG Evaluation	RAG / embedding / retrieval quality	rag-reference.md
Visualization	view results / compare / dashboard	(below)
Benchmark Discovery	list / find / what benchmarks	(below)
Troubleshooting	errors / failures / debug	troubleshooting.md

Prerequisites

evalscope --version            # verify installation
pip install evalscope           # basic
pip install 'evalscope[all]'   # all backends (perf, rag, service, aigc)
pip install 'evalscope[perf]'  # perf only
pip install 'evalscope[rag]'   # RAG only (RAGAS, MTEB, CLIP)
pip install 'evalscope[service]' # Web dashboard

Decision Tree

User wants accuracy evaluation (evaluate / benchmark / score / 评测)
Local checkpoint path or HuggingFace/ModelScope ID → --model PATH (auto llm_ckpt)
API endpoint → --model NAME --api-url URL (auto openai_api)
Anthropic → --eval-type anthropic_api
LiteLLM multi-provider → --eval-type litellm --model provider/name
OpenAI Responses API → --eval-type openai_responses_api
Pipeline test → --model mock --eval-type mock_llm
Image generation → --eval-type text2image
TTS → --eval-type text2speech
Image editing → --eval-type image_editing
User wants performance test (throughput / latency / QPS / 压测)
→ evalscope perf workflow
API types: openai (default), local, local_vllm, dashscope, embedding, rerank, custom
User wants RAG evaluation (RAG / embedding quality / retrieval)
→ evalscope eval --eval-backend RAGEval with tool config
User wants visualization (view / compare / dashboard)
→ evalscope service
User wants benchmark info (list / find / what benchmarks / 有哪些评测集)
→ evalscope benchmark-info

Workflow 1: Eval (Accuracy)

Core command pattern:

# Local checkpoint
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 10

# API endpoint (auto-detects openai_api when --api-url is set)
evalscope eval --model qwen-plus --datasets gsm8k arc \
  --api-url http://localhost:8000/v1/chat/completions --api-key sk-xxx --limit 10

# Anthropic
evalscope eval --model claude-3-5-sonnet --eval-type anthropic_api --datasets mmlu --api-key sk-ant-xxx

Key parameters: --datasets, --limit, --generation-config, --dataset-args, --eval-backend, --judge-strategy. For full parameter list → eval-reference.md.

Output: outputs//reports/*.json (scores), report.html (summary).

Workflow 2: Perf (Stress Test)

Core command pattern:

# Basic throughput test
evalscope perf --model qwen-plus \
  --url http://localhost:8000/v1/chat/completions --api openai \
  --dataset openqa --parallel 5 --number 200 --stream

# Concurrency gradient (--parallel and --number must pair)
evalscope perf --model qwen-plus --url http://localhost:8000/v1/chat/completions \
  --api openai --parallel 1 5 10 20 --number 50 250 500 1000 --stream

# Embedding model
evalscope perf --model text-embedding-v3 --url http://localhost:8000/v1/embeddings \
  --api embedding --parallel 10 --number 500

# Rerank model
evalscope perf --model bge-reranker --url http://localhost:8000/v1/rerank \
  --api rerank --parallel 5 --number 200

Key parameters: --parallel, --number, --dataset, --max-tokens, --sla-auto-tune. For full parameter list → perf-reference.md.

Output: console table (TTFT/TPOT/throughput p50-p99) + HTML report.

Workflow 3: RAG Evaluation

Uses --eval-backend RAGEval with a Python dict/YAML config. Three tools: RAGAS, MTEB, clip_benchmark.

from evalscope import run_task
run_task({
    'eval_backend': 'RAGEval',
    'eval_config': {
        'tool': 'MTEB',  # or 'RAGAS' or 'clip_benchmark'
        ...
    }
})

For config schemas and examples → rag-reference.md.

Workflow 4: Visualization

evalscope service --host 0.0.0.0 --port 9000 --outputs ./outputs

Options: --host (default 0.0.0.0), --port (default 9000), --outputs PATH (scan dir), --debug.

Requires: pip install 'evalscope[service]'.

Workflow 5: Benchmark Discovery

evalscope benchmark-info --list                    # all benchmarks
evalscope benchmark-info --list --tag Math Coding  # filter by tags (OR, case-insensitive)
evalscope benchmark-info gsm8k                     # text summary
evalscope benchmark-info gsm8k --format json       # structured JSON
evalscope benchmark-info gsm8k --format markdown   # full docs

Workflow 6: Sandbox Evaluation

For code-execution benchmarks (HumanEval, MBPP, etc.) with Docker isolation:

evalscope eval --model qwen-plus --datasets humaneval \
  --api-url http://localhost:8000/v1/chat/completions \
  --sandbox '{"enabled": true, "type": "docker"}'

Requires Docker daemon running. See evalscope eval --help for --sandbox schema.

Quick Lookup Table

For up-to-date results: evalscope benchmark-info --list --tag

User Need	Tags	Typical Benchmarks
-----------	------	--------------------
Math / reasoning	Math, Reasoning	gsm8k, math_500, aime24, competition_math
Coding	Coding	humaneval, mbpp, live_code_bench
General knowledge	Knowledge, MCQ	mmlu, ceval, cmmlu, mmlu_pro
Chinese	Chinese	ceval, cmmlu, chinese_simpleqa
Multimodal / vision	MultiModal	mmmu, mm_bench, math_vista
Instruction following	InstructionFollowing	ifeval, multi_if
Function calling	FunctionCalling	bfcl_v3, bfcl_v4
Long context	LongContext	needle_haystack, longbench_v2
Agent	Agent	tau_bench

Common suites:

General LLM: mmlu gsm8k bbh humaneval ifeval
Chinese: ceval cmmlu chinese_simpleqa
Multimodal: mmmu mm_bench math_vista mm_star

General Notes

Always --limit 5 for first-run validation
Default output: ./outputs//
Long runs: background + tail -f outputs//logs/eval_log.log
Resume interrupted runs: --use-cache outputs/
Full parameter help: evalscope eval --help / evalscope perf --help
On errors → troubleshooting.md

版本历史

共 2 个版本

v1.0.2 当前

2026-06-07 06:06 安全安全
v1.0.1

2026-03-31 03:54 安全安全

安全检测

腾讯云安全 (Keen)

安全，无风险

查看报告

腾讯云安全 (Sanbu)

安全，无风险

查看报告

EvalScope

概述

EvalScope

Prerequisites

Decision Tree

Workflow 1: Eval (Accuracy)

Workflow 2: Perf (Stress Test)

Workflow 3: RAG Evaluation

Workflow 4: Visualization

Workflow 5: Benchmark Discovery

Workflow 6: Sandbox Evaluation

Quick Lookup Table

General Notes

版本历史

安全检测

腾讯云安全 (Keen)

腾讯云安全 (Sanbu)

🔗 相关推荐

Agent Browser

Find Skills

self-improving agent