
Auto Arena

Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description and collects responses.
helloml0326
AI Intelligence · clawhub · v1.0.0 · API key required

Overview

Auto Arena Skill

End-to-end automated model comparison using the OpenJudge AutoArenaPipeline:

  1. Generate queries — LLM creates diverse test queries from task description
  2. Collect responses — query all target endpoints concurrently
  3. Generate rubrics — LLM produces evaluation criteria from task + sample queries
  4. Pairwise evaluation — judge model compares every model pair (with position-bias swap)
  5. Analyze & rank — compute win rates, win matrix, and rankings
  6. Report & charts — Markdown report + win-rate bar chart + optional matrix heatmap
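
Step 4 compares every model pair in both presentation orders. The sketch below only illustrates how such a schedule could be enumerated; `pairwise_schedule` is a hypothetical helper, not part of OpenJudge, and the real pipeline may organize its comparisons differently.

```python
from itertools import combinations


def pairwise_schedule(models):
    """List (first, second) orderings so every unordered pair appears twice."""
    schedule = []
    for a, b in combinations(models, 2):
        schedule.append((a, b))  # original order
        schedule.append((b, a))  # swapped order, to counter position bias
    return schedule


models = ["model_v1", "model_v2", "model_v3"]
schedule = pairwise_schedule(models)
for first, second in schedule:
    print(f"{first} vs {second}")
```

With n models this gives n * (n - 1) judge calls per query, so the cost of adding a model grows with the number of models already in the comparison.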

Prerequisites

# Install OpenJudge
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib

Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. gpt-4, qwen-max) |
| API keys | Yes | Env vars: OPENAI_API_KEY, DASHSCOPE_API_KEY, etc. |
| Number of queries | No | Default: 20 |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: ./evaluation_results |
| Report language | No | "zh" (default) or "en" |
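
The required items above can be checked before spending any API calls. This is a minimal sketch, assuming the YAML config has already been parsed into a plain dict with the `task`, `target_endpoints`, and `judge_endpoint` keys used in this document; `preflight_problems` is a hypothetical helper and not an OpenJudge function.

```python
REQUIRED_ENDPOINT_FIELDS = ("base_url", "api_key", "model")


def preflight_problems(config):
    """Return a list of problems; an empty list means the required info is present."""
    problems = []
    if not config.get("task", {}).get("description"):
        problems.append("task.description is missing")

    targets = config.get("target_endpoints", {})
    if len(targets) < 2:
        problems.append("need at least 2 target endpoints")

    # Check the judge endpoint with the same field rules as the targets.
    endpoints = {f"target_endpoints.{name}": ep for name, ep in targets.items()}
    judge = config.get("judge_endpoint")
    if judge is None:
        problems.append("judge_endpoint is missing")
    else:
        endpoints["judge_endpoint"] = judge

    for name, endpoint in endpoints.items():
        for field in REQUIRED_ENDPOINT_FIELDS:
            if not endpoint.get(field):
                problems.append(f"{name}.{field} is missing")
    return problems


example = {
    "task": {"description": "Customer service chatbot for e-commerce"},
    "target_endpoints": {
        "gpt4": {"base_url": "https://api.openai.com/v1", "api_key": "sk-...", "model": "gpt-4"},
    },
    "judge_endpoint": {"base_url": "https://api.openai.com/v1", "api_key": "sk-..."},
}
for problem in preflight_problems(example):
    print(problem)
```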

Quick start

CLI

# Run evaluation
python -m cookbooks.auto_arena --config config.yaml --save

# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml \
  --queries_file queries.json --save

# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save

# Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save

Python API

import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline

async def main():
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    print(f"Best model: {result.best_pipeline}")
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")

asyncio.run(main())

Minimal Python API (no config file)

import asyncio
from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline
from cookbooks.auto_arena.schema import OpenAIEndpoint

async def main():
    pipeline = AutoArenaPipeline(
        task_description="Customer service chatbot for e-commerce",
        target_endpoints={
            "gpt4": OpenAIEndpoint(
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                model="gpt-4",
            ),
            "qwen": OpenAIEndpoint(
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="sk-...",
                model="qwen-max",
            ),
        },
        judge_endpoint=OpenAIEndpoint(
            base_url="https://api.openai.com/v1",
            api_key="sk-...",
            model="gpt-4",
        ),
        num_queries=20,
    )
    result = await pipeline.evaluate()
    print(f"Best: {result.best_pipeline}")

asyncio.run(main())

CLI options

| Flag | Default | Description |
|------|---------|-------------|
| --config | | Path to YAML configuration file (required) |
| --output_dir | config value | Override output directory |
| --queries_file | | Path to pre-generated queries JSON (skip generation) |
| --save | False | Save results to file |
| --fresh | False | Start fresh, ignore checkpoint |
| --rerun-judge | False | Re-run pairwise evaluation only (keep queries/responses/rubrics) |

Minimal config file

task:
  description: "Academic GPT assistant for research and writing tasks"

target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"

judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"

Full config reference

task

| Field | Required | Description |
|-------|----------|-------------|
| description | Yes | Clear description of the task models will be tested on |
| scenario | No | Usage scenario for additional context |

target_endpoints.<name>

| Field | Default | Description |
|-------|---------|-------------|
| base_url | | API base URL (required) |
| api_key | | API key, supports ${ENV_VAR} (required) |
| model | | Model name (required) |
| system_prompt | | System prompt for this endpoint |
| extra_params | | Extra API params (e.g. temperature, max_tokens) |
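
The `${ENV_VAR}` form accepted by `api_key` keeps secrets out of the config file. How OpenJudge resolves it internally is not shown here; the sketch below is one plausible way to do the substitution, with `expand_env` as a hypothetical helper that fails loudly when a variable is unset.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace each ${NAME} in value with env[NAME]; raise if NAME is unset."""
    env = os.environ if env is None else env

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_PATTERN.sub(replace, value)


# A fake environment is passed in so the example does not depend on real keys.
fake_env = {"OPENAI_API_KEY": "sk-example"}
resolved = expand_env("${OPENAI_API_KEY}", fake_env)
print(resolved)
```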

judge_endpoint

Same fields as target_endpoints.<name>. Use a strong model (e.g. gpt-4, qwen-max) with a low temperature (~0.1) for consistent judgments.

query_generation

| Field | Default | Description |
|-------|---------|-------------|
| num_queries | 20 | Total number of queries to generate |
| seed_queries | | Example queries to guide generation |
| categories | | Query categories with weights for stratified generation |
| endpoint | judge endpoint | Custom endpoint for query generation |
| queries_per_call | 10 | Queries generated per API call (1–50) |
| num_parallel_batches | 3 | Parallel generation batches |
| temperature | 0.9 | Sampling temperature (0.0–2.0) |
| top_p | 0.95 | Top-p sampling (0.0–1.0) |
| max_similarity | 0.85 | Dedup similarity threshold (0.0–1.0) |
| enable_evolution | false | Enable Evol-Instruct complexity evolution |
| evolution_rounds | 1 | Evolution rounds (0–3) |
| complexity_levels | ["constraints", "reasoning", "edge_cases"] | Evolution strategies |
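
To make the `max_similarity` threshold concrete, here is a small sketch of threshold-based deduplication. The similarity measure is an assumption: it uses `difflib.SequenceMatcher` from the standard library, which may well differ from whatever measure the pipeline actually applies. `dedup_queries` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def dedup_queries(queries, max_similarity=0.85):
    """Keep a query only if it is not too similar to any query already kept."""
    kept = []
    for query in queries:
        is_duplicate = any(
            SequenceMatcher(None, query.lower(), other.lower()).ratio() > max_similarity
            for other in kept
        )
        if not is_duplicate:
            kept.append(query)
    return kept


queries = [
    "How do I return an item?",
    "How do I return an item??",
    "What is your shipping policy?",
]
unique_queries = dedup_queries(queries)
print(unique_queries)
```

Lowering the threshold removes more near-duplicates at the risk of discarding queries that are merely on the same topic.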

evaluation

| Field | Default | Description |
|-------|---------|-------------|
| max_concurrency | 10 | Max concurrent API requests |
| timeout | 60 | Request timeout in seconds |
| retry_times | 3 | Retry attempts for failed requests |
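
The three evaluation settings interact: a concurrency cap, a per-request timeout, and a retry budget. The sketch below shows one common way to combine them with `asyncio`; it is not the pipeline's own code. `call_with_limits` is a hypothetical helper, and it assumes `retry_times` counts retries after the first attempt, which the table does not specify.

```python
import asyncio


async def call_with_limits(make_call, semaphore, timeout=60.0, retry_times=3):
    """Run make_call() under a concurrency cap, with a timeout and retries."""
    last_error = None
    for _ in range(1 + retry_times):  # first attempt plus retries (assumption)
        async with semaphore:  # at most max_concurrency calls run at once
            try:
                return await asyncio.wait_for(make_call(), timeout=timeout)
            except Exception as exc:  # includes timeouts raised by wait_for
                last_error = exc
    raise last_error


async def demo():
    semaphore = asyncio.Semaphore(10)  # mirrors max_concurrency: 10
    attempts = {"count": 0}

    async def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("transient failure")
        return "ok"

    result = await call_with_limits(flaky, semaphore, timeout=1.0, retry_times=3)
    return result, attempts["count"]


demo_result = asyncio.run(demo())
print(demo_result)
```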

output

| Field | Default | Description |
|-------|---------|-------------|
| output_dir | ./evaluation_results | Output directory |
| save_queries | true | Save generated queries |
| save_responses | true | Save model responses |
| save_details | true | Save detailed results |
report

| Field | Default | Description |
|-------|---------|-------------|
| enabled | false | Enable Markdown report generation |
| language | "zh" | Report language: "zh" or "en" |
| include_examples | 3 | Examples per section (1–10) |
| chart.enabled | true | Generate win-rate chart |
| chart.orientation | "horizontal" | "horizontal" or "vertical" |
| chart.show_values | true | Show values on bars |
| chart.highlight_best | true | Highlight best model |
| chart.matrix_enabled | false | Generate win-rate matrix heatmap |
| chart.format | "png" | Chart format: "png", "svg", or "pdf" |

Interpreting results

Win rate: the percentage of pairwise comparisons a model wins. Each pair is judged in both orders (original and swapped) to counter the judge's position bias.

Rankings example:

  1. gpt4_baseline       [################----] 80.0%
  2. qwen_candidate      [############--------] 60.0%
  3. llama_finetuned     [##########----------] 50.0%
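
A text rendering like the one above can be produced from a list of `(model, win_rate)` pairs, the shape that `result.rankings` is iterated as in the Python API example. The sketch below is illustrative only; `render_rankings` and the 20-character bar width are assumptions, and the column spacing will not match the sample exactly.

```python
def render_rankings(rankings, bar_width=20):
    """Format (model, win_rate) pairs as ranked lines with a text bar."""
    name_width = max(len(name) for name, _ in rankings)
    lines = []
    for rank, (name, win_rate) in enumerate(rankings, 1):
        filled = round(win_rate * bar_width)
        bar = "#" * filled + "-" * (bar_width - filled)
        lines.append(f"  {rank}. {name:<{name_width}}  [{bar}] {win_rate:.1%}")
    return lines


rankings = [
    ("gpt4_baseline", 0.80),
    ("qwen_candidate", 0.60),
    ("llama_finetuned", 0.50),
]
lines = render_rankings(rankings)
print("\n".join(lines))
```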

Win matrix: win_matrix[A][B] = how often model A beats model B across all queries.
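
As a worked example of these two definitions, the sketch below derives a win matrix, win rates, and a ranking from a list of judged comparisons. The input layout `(first, second, winner)` is hypothetical, `win_matrix[A][B]` is taken to be a fraction rather than a count, and ties are not modelled; the actual files written by the pipeline may differ on all three points.

```python
from collections import defaultdict


def summarize(comparisons):
    """Compute win matrix, win rates, and rankings from (first, second, winner)."""
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for first, second, winner in comparisons:
        loser = second if winner == first else first
        wins[winner][loser] += 1
        totals[first][second] += 1
        totals[second][first] += 1

    # win_matrix[a][b]: fraction of a-vs-b comparisons that a won.
    win_matrix = {
        a: {b: wins[a][b] / n for b, n in row.items()}
        for a, row in totals.items()
    }
    # Win rate: wins over all comparisons the model took part in.
    win_rates = {
        a: sum(wins[a].values()) / sum(row.values())
        for a, row in totals.items()
    }
    rankings = sorted(win_rates.items(), key=lambda item: item[1], reverse=True)
    return win_matrix, win_rates, rankings


# One query, three models, each pair judged in both orders.
comparisons = [
    ("A", "B", "A"), ("B", "A", "A"),
    ("A", "C", "A"), ("C", "A", "C"),
    ("B", "C", "B"), ("C", "B", "B"),
]
win_matrix, win_rates, rankings = summarize(comparisons)
print(rankings)
```

In this example A wins 3 of its 4 comparisons, B wins 2 of 4, and C wins 1 of 4, so the ranking is A, B, C.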

Checkpoint & resume

The pipeline saves progress after each step. Interrupted runs resume automatically:

  • --fresh — ignore checkpoint, start from scratch
  • --rerun-judge — re-run only the pairwise evaluation step (useful when switching judge models); keeps queries, responses, and rubrics intact
  • Adding new endpoints to config triggers incremental response collection; existing responses are preserved
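
Incremental response collection amounts to working out which (endpoint, query) pairs have no saved response yet. The sketch below illustrates that idea only; the dict layout for saved responses is hypothetical and does not describe the real format of checkpoint.json or responses.json.

```python
def missing_work(configured_endpoints, queries, saved_responses):
    """Return (endpoint, query) pairs that still need a response."""
    todo = []
    for endpoint in configured_endpoints:
        answered = saved_responses.get(endpoint, {})
        for query in queries:
            if query not in answered:
                todo.append((endpoint, query))
    return todo


queries = ["q1", "q2"]
saved_responses = {
    "model_v1": {"q1": "answer 1", "q2": "answer 2"},  # complete, left untouched
    "model_v2": {"q1": "answer 1"},                    # interrupted mid-run
}
# model_v3 was added to the config after the first run.
todo = missing_work(["model_v1", "model_v2", "model_v3"], queries, saved_responses)
print(todo)
```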

Output files

evaluation_results/
├── evaluation_results.json     # Rankings, win rates, win matrix
├── evaluation_report.md        # Detailed Markdown report (if enabled)
├── win_rate_chart.png          # Win-rate bar chart (if enabled)
├── win_rate_matrix.png         # Matrix heatmap (if matrix_enabled)
├── queries.json                # Generated test queries
├── responses.json              # All model responses
├── rubrics.json                # Generated evaluation rubrics
├── comparison_details.json     # Pairwise comparison details
└── checkpoint.json             # Pipeline checkpoint

API key by model

| Model prefix | Environment variable |
|--------------|----------------------|
| gpt-*, o1-*, o3-* | OPENAI_API_KEY |
| claude-* | ANTHROPIC_API_KEY |
| qwen-*, dashscope/* | DASHSCOPE_API_KEY |
| deepseek-* | DEEPSEEK_API_KEY |
| Custom endpoint | set api_key + base_url in config |
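
The prefix table can be expressed as a small lookup. This is a sketch that restates the table, not a function provided by OpenJudge; `env_var_for_model` is a hypothetical name, and returning `None` stands for the custom-endpoint case where `api_key` and `base_url` are set in the config.

```python
# Prefixes and variable names copied from the table above.
PREFIX_TO_ENV_VAR = [
    (("gpt-", "o1-", "o3-"), "OPENAI_API_KEY"),
    (("claude-",), "ANTHROPIC_API_KEY"),
    (("qwen-", "dashscope/"), "DASHSCOPE_API_KEY"),
    (("deepseek-",), "DEEPSEEK_API_KEY"),
]


def env_var_for_model(model):
    """Return the environment variable for a model name, or None if custom."""
    for prefixes, env_var in PREFIX_TO_ENV_VAR:
        if model.startswith(prefixes):  # str.startswith accepts a tuple
            return env_var
    return None


for model in ["gpt-4", "qwen-max", "my-local-model"]:
    print(model, env_var_for_model(model))
```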


Version history

1 version in total

  • v1.0.0 (current)
    2026-03-20 06:51, security scans: safe

Security scan

Tencent Cloud Security (Keen)

Safe, no risks found

Tencent Cloud Security (Sanbu)

Safe, no risks found
