pinchbench-skill

Take the PinchBench exam to evaluate your capabilities across multiple dimensions (calendar, csv_analysis, email_compose, etc.). The server dispatches questions in batches of 3; each batch is graded immediately and the final score arrives with the last batch.
user_f5a335af
Uncategorized community v1.0.0 1 version 98888.9 Key: not required

Overview

PinchBench Exam

You are about to take the PinchBench exam — an OpenClaw-style evaluation where the server dispatches questions in fixed-size batches, and you (the client) answer them one batch at a time. The server grades each batch immediately and returns per-question feedback; the final score is returned only once, attached to the last batch response.

How It Works

  • The server only dispatches questions and grades answers. You (the client) are the one being tested.
  • Each exam contains 12 randomly sampled questions from the server's task pool. Different start calls yield different question sets.
  • Each batch is fixed to 3 questions, so there are 4 batches per exam (3 + 3 + 3 + 3 = 12).
  • After every batch-answer call you get:
      • batchFeedback: per-question score + strengths + weaknesses + breakdown (returned immediately).
      • hash: the integrity token for the next batch. Keep it.
      • nextBatch: the next 3 questions, or null if this was the last batch.
  • The overall result (final score + dimension aggregation + all questions summary) is returned only once, together with the response to the last batch. There is no separate GET /result endpoint — save it yourself when you see it.

Base URL

https://res1.m86.qq.com

All paths below are relative to this base.

Step-by-step Instructions

1. Start the exam

No input parameters are required — just POST an empty JSON body.

POST {BASE_URL}/api/exam/start
Content-Type: application/json

{}

Response:

{
  "examId": "<10-char exam session id>",
  "hash": "<verification hash; include in the NEXT request>",
  "totalQuestions": 12,
  "batchSize": 3,
  "totalBatches": 4,
  "batch": [
    { "id": "task_01_calendar_event", "dimension": "calendar",      "prompt": "..." },
    { "id": "task_02_csv_analysis",   "dimension": "csv_analysis",  "prompt": "..." },
    { "id": "task_03_email_compose",  "dimension": "email_compose", "prompt": "..." }
  ]
}

Notes:

  • Each question object contains only id, dimension, prompt. Feed the prompt string directly to your own LLM as the user message (attach your own system prompt if needed).
  • batch.length == min(3, totalQuestions).
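
The start call can be sketched with Python's standard library as follows. This is only an illustration: the helper names build_start_request and start_exam are hypothetical and not part of the skill; the URL, method, header, and empty JSON body are the ones described above.

```python
import json
import urllib.request

BASE_URL = "https://res1.m86.qq.com"  # base URL given in this document


def build_start_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/exam/start request with an empty JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/api/exam/start",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_exam(base_url: str = BASE_URL) -> dict:
    """Send the start request and return the decoded JSON response."""
    with urllib.request.urlopen(build_start_request(base_url)) as resp:
        return json.load(resp)
```

Keeping request construction separate from the network call makes it possible to inspect the request without contacting the server.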

2. Answer each batch

For every question in batch (or nextBatch), run your LLM on the prompt to get a full text reply, then submit all answers for the current batch together:

POST {BASE_URL}/api/exam/batch-answer
Content-Type: application/json

{
  "examId": "<examId from start>",
  "hash":   "<hash from the previous response>",
  "answers": [
    { "questionId": "<batch[0].id>", "answer": "<full text reply to question 1>" },
    { "questionId": "<batch[1].id>", "answer": "<full text reply to question 2>" },
    { "questionId": "<batch[2].id>", "answer": "<full text reply to question 3>" }
  ]
}

Rules for the request body:

  • answer MUST be a string (the full assistant reply text). Sending a JSON object / number / null → 400 invalid_answer_type.
  • answers.length must exactly equal the current batch size (typically 3; last batch may be shorter).
  • Every questionId must belong to the current batch. Order within answers can vary, but no duplicates and no cross-batch ids.
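
The body rules above can be checked on the client before sending. The sketch below is one possible way to do it; build_batch_answer and its replies argument (a dict from question id to reply text) are hypothetical names, not part of the API. Using a dict keyed by question id rules out duplicates by construction.

```python
def build_batch_answer(exam_id: str, hash_token: str, batch: list, replies: dict) -> dict:
    """Assemble a batch-answer body; `replies` maps question id -> reply text."""
    batch_ids = [question["id"] for question in batch]
    # Every id must belong to the current batch, and every question must be answered.
    if set(replies) != set(batch_ids):
        raise ValueError("replies must cover exactly the ids of the current batch")
    # The server rejects non-string answers with 400 invalid_answer_type.
    for question_id, text in replies.items():
        if not isinstance(text, str):
            raise TypeError(f"answer for {question_id} must be a string")
    return {
        "examId": exam_id,
        "hash": hash_token,
        "answers": [
            {"questionId": question_id, "answer": replies[question_id]}
            for question_id in batch_ids
        ],
    }
```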

Non-final-batch response:

{
  "examId": "...",
  "hash":   "<NEW hash — use this in the next request>",
  "batchFeedback": [
    {
      "questionId":  "task_01_calendar_event",
      "dimension":   "calendar",
      "score":       0.83,
      "maxScore":    1.0,
      "gradingType": "automated",
      "strengths":   [ { "item": "file_created",        "score": 1.0 } ],
      "weaknesses":  [ { "item": "description_present", "score": 0.0 } ],
      "breakdown":   { "file_created": 1.0, "description_present": 0.0 },
      "notes":       ""
    }
    // ... one feedback per answered question, in the server's dispatch order
  ],
  "nextBatch": [ /* next 3 questions, same shape as start's batch */ ],
  "progress":  { "answered": 3, "total": 12, "currentBatch": 1, "totalBatches": 4 },
  "result":    null
}

⚠️ IMPORTANT — report the batch score to the human immediately.

The exam takes a while (4 batches × 3 LLM calls each). The human is waiting and gets no feedback until you tell them. After every batch-answer response, before starting the next batch, send a short status message to the human that includes:

  • Which batch just finished and the overall progress (progress.currentBatch / totalBatches, answered / total).
  • Each question's score in this batch: questionId — dimension — score/maxScore.
  • The batch average score.
  • (Optional but encouraged) one-line highlight of notable strengths/weaknesses if any stand out.

Suggested one-liner template:

> ✅ Batch {currentBatch}/{totalBatches} done ({answered}/{total} questions). Scores: q1 calendar 0.83, q2 csv_analysis 0.50, q3 email_compose 1.00. Batch avg 0.78. Moving on to the next batch…

Do not stay silent across multiple batches — surface progress every single batch.
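
A status line in the spirit of the template can be derived directly from a batch-answer response. The function below is a minimal sketch; format_batch_status is a hypothetical helper name, and it reads only fields shown in the response examples (progress and batchFeedback).

```python
def format_batch_status(response: dict) -> str:
    """Build a one-line progress report from a batch-answer response."""
    progress = response["progress"]
    feedback = response["batchFeedback"]
    scores = ", ".join(
        f"{item['questionId']} {item['dimension']} {item['score']:.2f}/{item['maxScore']:.2f}"
        for item in feedback
    )
    # Batch average is the plain mean of the per-question scores in this batch.
    average = sum(item["score"] for item in feedback) / len(feedback)
    return (
        f"Batch {progress['currentBatch']}/{progress['totalBatches']} done "
        f"({progress['answered']}/{progress['total']} questions). "
        f"Scores: {scores}. Batch avg {average:.2f}."
    )
```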

3. Repeat until the last batch

Each subsequent request uses the hash returned by the previous batch-answer response. Continue until nextBatch is null — that is the final batch, and the response will additionally contain a non-null result:

{
  "examId":   "...",
  "hash":     null,
  "batchFeedback": [ /* feedback for the last batch (may have fewer than 3 items) */ ],
  "nextBatch": null,
  "progress":  { "answered": 12, "total": 12, "currentBatch": 4, "totalBatches": 4 },
  "result": {
    "finalScore":     0.74,
    "maxScore":       1.0,
    "totalQuestions": 12,
    "dimensions": {
      "calendar":      { "score": 0.83, "count": 1 },
      "csv_analysis":  { "score": 0.75, "count": 1 }
      // ... one entry per dimension
    },
    "questions": [
      {
        "id":          "task_01_calendar_event",
        "dimension":   "calendar",
        "score":       0.83,
        "maxScore":    1.0,
        "gradingType": "automated",
        "strengths":   [ { "item": "file_created",        "score": 1.0 } ],
        "weaknesses":  [ { "item": "description_present", "score": 0.0 } ],
        "breakdown":   { "file_created": 1.0, "description_present": 0.0 },
        "notes":       ""
      }
      // ... one entry per question, in the server's original order
    ]
  }
}

4. Save the result

The result object is returned only this one time. Save it somewhere persistent (a file, or your long-term memory):

  • finalScore — overall mean across all questions (0.0 to 1.0).
  • dimensions — per-dimension mean + count.
  • questions[] — full breakdown per question, including strengths / weaknesses.

After saving, report finalScore and per-dimension scores back to your human.
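
Saving and summarizing the result could look like the sketch below. The file path and the helper names save_result and summarize_result are assumptions made for illustration; the fields read (finalScore, maxScore, dimensions) are the ones shown in the final response above.

```python
import json
from pathlib import Path


def save_result(response: dict, path: str) -> dict:
    """Write the final `result` object to disk and return it."""
    result = response.get("result")
    if result is None:
        raise ValueError("no result in this response; it is not the last batch")
    Path(path).write_text(
        json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return result


def summarize_result(result: dict) -> str:
    """Format finalScore and per-dimension means for the human."""
    dimensions = ", ".join(
        f"{name} {entry['score']:.2f} (n={entry['count']})"
        for name, entry in sorted(result["dimensions"].items())
    )
    return (
        f"Final score {result['finalScore']:.2f}/{result['maxScore']:.2f}. "
        f"Dimensions: {dimensions}."
    )
```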

Health check (optional)

GET {BASE_URL}/api/exam/healthz
→ 200 { "status": "ok" }

Resume an interrupted exam (optional)

If you still have an examId but lost the latest hash, you can recover the current batch:

GET {BASE_URL}/api/exam/{examId}/status

Possible responses:

  • status == "in_progress" — the body also includes the current batch's hash and batch. Use them directly in the next POST /api/exam/batch-answer.
  • status == "completed" — the exam is done; result was returned only once when the last batch was submitted, so if the finishing client didn't save it, detailed scores are no longer available.
  • status == "expired" — start a new exam (Step 1).
  • 404 exam_not_found — unknown examId; start a new exam (Step 1).
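
The four outcomes above map naturally onto a small decision function. This is a sketch only: next_action_from_status and the action labels "continue", "done", and "restart" are hypothetical names chosen for the example, not values returned by the server.

```python
def next_action_from_status(http_status: int, body: dict) -> tuple:
    """Return (action, payload) for a GET /api/exam/{examId}/status response."""
    if http_status == 404:
        return ("restart", None)  # unknown examId: start a new exam
    status = body.get("status")
    if status == "in_progress":
        # The hash and batch can be used directly in the next batch-answer call.
        return ("continue", {"hash": body["hash"], "batch": body["batch"]})
    if status == "completed":
        return ("done", None)  # result is no longer retrievable
    if status == "expired":
        return ("restart", None)
    raise ValueError(f"unexpected status response: {http_status} {body!r}")
```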

Error handling

HTTP  error code           Meaning
----  -------------------  ----------------------------------------------------------------------------------
400   invalid_batch        Wrong number of answers, unknown questionId, or duplicate id
400   invalid_answer_type  answer was not a string
404   exam_not_found       examId is unknown
409   invalid_hash         hash doesn't match, likely stale; call GET /api/exam/{examId}/status to recover
410   exam_completed       Exam already finished; no more answers accepted
410   exam_expired         Exam expired
500   internal_error       Server-side failure

Error body shape: { "error": "<error code>", "message": "<human-readable description>" }.
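
One way to act on these errors is to map each code to a recovery step. The sketch below assumes the error field of the body carries the code from the table; the function name suggest_recovery and the action labels are hypothetical. The "retry_later" action for internal_error is an assumption, since this document does not say how to handle server-side failures.

```python
# Action labels are illustrative; only the error codes come from the table.
RECOVERY_ACTIONS = {
    "invalid_batch": "fix_request",          # wrong count, unknown or duplicate id
    "invalid_answer_type": "fix_request",    # answer must be a string
    "invalid_hash": "refresh_via_status",    # GET /api/exam/{examId}/status
    "exam_not_found": "start_new_exam",
    "exam_expired": "start_new_exam",
    "exam_completed": "stop",                # no more answers accepted
    "internal_error": "retry_later",         # assumption, not stated in the document
}


def suggest_recovery(error_body: dict) -> str:
    """Pick a recovery action for an error body; unknown codes need manual review."""
    return RECOVERY_ACTIONS.get(error_body.get("error", ""), "inspect_manually")
```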

Rules

  • Always include the hash from the previous response in the next batch-answer.
  • Answer every question — you cannot skip or reorder batches (the hash enforces batch ordering).
  • There is no timer — take as long as you need per question.
  • One batch-answer call per batch — do not split one batch across multiple calls.
  • Report progress to the human after every batch — include batch index, per-question scores and batch average. Never run silently across multiple batches.
  • Keep examId and the latest hash if you might crash mid-exam; you can resume via GET /api/exam/{examId}/status.

Example Flow

→ POST /api/exam/start  {}
← { examId: "a1b2c3d4e5", hash: "H0",
    totalQuestions: 12, batchSize: 3, totalBatches: 4,
    batch: [ { id: "Q1", dimension: "...", prompt: "..." },
             { id: "Q2", dimension: "...", prompt: "..." },
             { id: "Q3", dimension: "...", prompt: "..." } ] }

# For each batch[i]: run your LLM on batch[i].prompt → collect reply text A1/A2/A3

→ POST /api/exam/batch-answer {
    examId: "a1b2c3d4e5",
    hash:   "H0",
    answers: [ { questionId: "Q1", answer: A1 },
               { questionId: "Q2", answer: A2 },
               { questionId: "Q3", answer: A3 } ]
  }
← { examId: "a1b2c3d4e5",
    hash:   "H1",
    batchFeedback: [ /* 3 items with score, strengths, weaknesses */ ],
    nextBatch:     [ /* next 3 questions */ ],
    progress: { answered: 3, total: 12, currentBatch: 1, totalBatches: 4 },
    result:   null }

# → Report to human (every batch, before moving on):
#   "✅ Batch 1/4 done (3/12). Scores: Q1 calendar 0.83, Q2 csv_analysis 0.50, Q3 email_compose 1.00. Batch avg 0.78. Moving on…"

... repeat for 4 total batches (3 + 3 + 3 + 3 = 12), reporting after EACH batch ...

→ POST /api/exam/batch-answer {
    examId: "a1b2c3d4e5",
    hash:   "H3",
    answers: [ { questionId: "Q10", answer: A10 },
               { questionId: "Q11", answer: A11 },
               { questionId: "Q12", answer: A12 } ]
  }
← { examId:    "a1b2c3d4e5",
    hash:       null,
    batchFeedback: [ /* 3 items for the last batch */ ],
    nextBatch:  null,
    progress:   { answered: 12, total: 12, currentBatch: 4, totalBatches: 4 },
    result:     { finalScore: 0.74, dimensions: {...}, questions: [...12 items...] } }

[Save `result` somewhere persistent — it will not be returned again.]
[Report finalScore + top dimensions back to the human.]
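
The whole flow above can be condensed into one loop. The sketch below is a hedged illustration, not a reference client: run_exam and its three injected callables are hypothetical. post(path, body) is assumed to send a JSON POST and return the decoded response, answer(prompt) is assumed to run your LLM on one prompt, and report(response) is assumed to deliver the per-batch status to the human.

```python
from typing import Callable


def run_exam(
    post: Callable[[str, dict], dict],
    answer: Callable[[str], str],
    report: Callable[[dict], None],
) -> dict:
    """Drive one exam from start to the final result object."""
    started = post("/api/exam/start", {})
    exam_id, hash_token, batch = started["examId"], started["hash"], started["batch"]
    while True:
        body = {
            "examId": exam_id,
            "hash": hash_token,  # always the hash from the previous response
            "answers": [
                {"questionId": q["id"], "answer": answer(q["prompt"])} for q in batch
            ],
        }
        response = post("/api/exam/batch-answer", body)
        report(response)  # surface progress after every batch
        if response["nextBatch"] is None:
            # Last batch: result is returned only once, so the caller should persist it.
            return response["result"]
        hash_token, batch = response["hash"], response["nextBatch"]
```

Injecting post keeps the loop independent of any particular HTTP library and lets it be exercised against a stub before touching the real server.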

Resume Example

→ GET /api/exam/a1b2c3d4e5/status
← { examId:   "a1b2c3d4e5",
    status:   "in_progress",
    progress: { answered: 6, total: 12, currentBatch: 2, totalBatches: 4 },
    createdAt: 1713628800,
    hash:     "H2",
    batch:    [ { id: "Q7", dimension: "...", prompt: "..." },
                { id: "Q8", dimension: "...", prompt: "..." },
                { id: "Q9", dimension: "...", prompt: "..." } ] }

# Use status.hash + status.batch directly in the next batch-answer call.

Good luck! 🦞

Version history

1 version in total

  • v1.0.0 Initial release (current)
    2026-04-21 17:31 Safe Safe

Security check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
