← 返回
数据分析 中文

PinchBench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting b...
{"answer":"运行PinchBench评估OpenClaw智能体在真实任务中的表现。用于测试模型能力、模型对比、提交b..."}
olearycrew
数据分析 clawhub v1.0.0 1 版本 99846.3 Key: 无需
★ 0
Stars
📥 1,299
下载
💾 8
安装
1
版本
#latest

概述

PinchBench Benchmark Skill

PinchBench measures how well LLM models perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.

Prerequisites

  • Python 3.10+
  • uv package manager
  • OpenClaw instance (this agent)

Quick Start

cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

Available Tasks (23)

TaskCategoryDescription
-----------------------------
task_00_sanityBasicVerify agent works
task_01_calendarProductivityCalendar event creation
task_02_stockResearchStock price lookup
task_03_blogWritingBlog post creation
task_04_weatherCodingWeather script
task_05_summaryAnalysisDocument summarization
task_06_eventsResearchConference research
task_07_emailWritingEmail drafting
task_08_memoryMemoryContext retrieval
task_09_filesFilesFile structure creation
task_10_workflowIntegrationMulti-step API workflow
task_11_clawdhubSkillsClawHub interaction
task_12_skill_searchSkillsSkill discovery
task_13_image_genCreativeImage generation
task_14_humanizerWritingText humanization
task_15_daily_summaryProductivityDaily digest
task_16_email_triageEmailInbox triage
task_17_email_searchEmailEmail search
task_18_market_researchResearchMarket analysis
task_19_spreadsheet_summaryAnalysisSpreadsheet analysis
task_20_eli5_pdf_summaryAnalysisPDF simplification
task_21_openclaw_comprehensionKnowledgeOpenClaw docs comprehension
task_22_second_brainMemoryKnowledge management

Command Line Options

OptionDescription
---------------------
--modelModel identifier (e.g., anthropic/claude-sonnet-4)
--suiteall, automated-only, or comma-separated task IDs
--output-dirResults directory (default: results/)
--timeout-multiplierScale task timeouts for slower models
--runsNumber of runs per task for averaging
--no-uploadSkip uploading to leaderboard
--registerRequest new API token for submissions
--upload FILEUpload previous results JSON

Token Registration

To submit results to the leaderboard:

# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4

Results

Results are saved as JSON in the output directory:

# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json

Adding Custom Tasks

Create a markdown file in tasks/ following TASK_TEMPLATE.md. Each task needs:

  • YAML frontmatter (id, name, category, grading_type, timeout)
  • Prompt section
  • Expected behavior
  • Grading criteria
  • Automated checks (Python grading function)

Leaderboard

View results at pinchbench.com. The leaderboard shows:

  • Model rankings by overall score
  • Per-task breakdowns
  • Historical performance trends

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-03-29 08:19 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

suspicious
查看报告

🔗 相关推荐

data-analysis

Stock Analysis

udiedrichsen
{"answer":"基于雅虎财经数据,分析股票与加密货币。支持投资组合管理、自选股预警、股息分析、8维评分、热门趋势扫描及传闻/早期信号探测。适用于股票分析、持仓追踪、财报异动、加密监控、热门股追踪或提前发掘非主流传闻。"}
★ 270 📥 56,978
data-analysis

Data Analysis

ivangdavila
{"answer":"数据分析与可视化。查询数据库、生成报告、自动化电子表格,将原始数据转化为清晰可行的见解。适用于:(1) 您……"}
★ 198 📥 65,127
data-analysis

A股量化 AkShare

mbpz
A股量化数据分析工具,基于AkShare库获取A股行情、财务数据、板块信息等。用于回答关于A股股票查询、行情数据、财务分析、选股等问题。
★ 165 📥 60,029