PinchBench Exam

You are about to take the PinchBench exam — an OpenClaw-style evaluation where the server dispatches questions in fixed-size batches, and you (the client) answer them one batch at a time. The server grades each batch immediately and returns per-question feedback; the final score is returned only once, attached to the last batch response.

How It Works

The server only dispatches questions and grades answers. You (the client) are the one being tested.
Each exam contains 12 randomly sampled questions from the server's task pool. Different start calls yield different question sets.
Each batch is fixed to 3 questions, so there are 4 batches per exam (3 + 3 + 3 + 3 = 12).
After every batch-answer call you get:
batchFeedback — per-question score + strengths + weaknesses + breakdown (returned immediately).
hash — the integrity token for the next batch. Keep it.
nextBatch — the next 3 questions, or null if this was the last batch.
The overall result (final score + dimension aggregation + all questions summary) is returned only once, together with the response to the last batch. There is no separate GET /result endpoint — save it yourself when you see it.

Base URL

https://res1.m86.qq.com

All paths below are relative to this base.

Step-by-step Instructions

1. Start the exam

No input parameters are required — just POST an empty JSON body.

POST {BASE_URL}/api/exam/start
Content-Type: application/json

{}

Response:

{
  "examId": "<10-char exam session id>",
  "hash": "<verification hash; include in the NEXT request>",
  "totalQuestions": 12,
  "batchSize": 3,
  "totalBatches": 4,
  "batch": [
    { "id": "task_01_calendar_event", "dimension": "calendar",      "prompt": "..." },
    { "id": "task_02_csv_analysis",   "dimension": "csv_analysis",  "prompt": "..." },
    { "id": "task_03_email_compose",  "dimension": "email_compose", "prompt": "..." }
  ]
}

Notes:

Each question object contains only id, dimension, and prompt. Feed the prompt string directly to your own LLM as the user message (attach your own system prompt if needed).
batch.length == min(3, totalQuestions).
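As a minimal sketch of Step 1, the snippet below builds the start request and sanity-checks the response shape described above. The helper names `build_start_request` and `check_start_response` are hypothetical, and the batch-size check uses the `batchSize` field from the response (3 in this document) rather than a hard-coded value. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://res1.m86.qq.com"


def build_start_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /api/exam/start request with an empty JSON body."""
    return urllib.request.Request(
        url=base_url + "/api/exam/start",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_start_response(body: dict) -> list:
    """Return the first batch after basic sanity checks on the start response."""
    required = ("examId", "hash", "totalQuestions", "batchSize", "totalBatches", "batch")
    for key in required:
        if key not in body:
            raise ValueError(f"start response is missing {key!r}")
    batch = body["batch"]
    expected = min(body["batchSize"], body["totalQuestions"])
    if len(batch) != expected:
        raise ValueError(f"expected {expected} questions, got {len(batch)}")
    for question in batch:
        # Each question should carry id, dimension, and prompt.
        missing = {"id", "dimension", "prompt"} - set(question)
        if missing:
            raise ValueError(f"question is missing fields: {sorted(missing)}")
    return batch
```

To actually send the request, one option is `urllib.request.urlopen(build_start_request())` followed by `json.load` on the response; any HTTP client works equally well.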

2. Answer each batch

For every question in batch (or nextBatch), run your LLM on the prompt to get a full text reply, then submit all answers for the current batch together:

POST {BASE_URL}/api/exam/batch-answer
Content-Type: application/json

{
  "examId": "<examId from start>",
  "hash":   "<hash from the previous response>",
  "answers": [
    { "questionId": "<batch[0].id>", "answer": "<full text reply to question 1>" },
    { "questionId": "<batch[1].id>", "answer": "<full text reply to question 2>" },
    { "questionId": "<batch[2].id>", "answer": "<full text reply to question 3>" }
  ]
}

Rules for the request body:

answer MUST be a string (the full assistant reply text). Sending a JSON object / number / null → 400 invalid_answer_type.
answers.length must exactly equal the current batch size (typically 3; last batch may be shorter).
Every questionId must belong to the current batch. Answers may appear in any order, but duplicate ids and ids from other batches are rejected.
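The request-body rules above can be enforced on the client before anything is sent. The following is a sketch under assumptions: `build_batch_answer` is a hypothetical helper, and it takes the replies as a dict keyed by question id, which rules out duplicates by construction.

```python
def build_batch_answer(exam_id: str, hash_token: str, batch: list, replies: dict) -> dict:
    """Build the batch-answer body, checking the documented rules locally.

    `replies` maps questionId -> full reply text for the current batch.
    """
    batch_ids = [question["id"] for question in batch]
    if set(replies) != set(batch_ids):
        # Covers wrong answer count, unknown ids, and ids from other batches.
        raise ValueError(
            f"replies must cover exactly the current batch: expected {sorted(batch_ids)}, "
            f"got {sorted(replies)}"
        )
    for question_id, text in replies.items():
        if not isinstance(text, str):
            # The server would answer 400 invalid_answer_type for this.
            raise TypeError(f"answer for {question_id} must be a string")
    return {
        "examId": exam_id,
        "hash": hash_token,
        # Emit answers in dispatch order; the server accepts any order.
        "answers": [{"questionId": qid, "answer": replies[qid]} for qid in batch_ids],
    }
```

Failing locally is cheaper than a round trip that ends in `invalid_batch` or `invalid_answer_type`.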

Non-final-batch response:

{
  "examId": "...",
  "hash":   "<NEW hash — use this in the next request>",
  "batchFeedback": [
    {
      "questionId":  "task_01_calendar_event",
      "dimension":   "calendar",
      "score":       0.83,
      "maxScore":    1.0,
      "gradingType": "automated",
      "strengths":   [ { "item": "file_created",        "score": 1.0 } ],
      "weaknesses":  [ { "item": "description_present", "score": 0.0 } ],
      "breakdown":   { "file_created": 1.0, "description_present": 0.0 },
      "notes":       ""
    }
    // ... one feedback per answered question, in the server's dispatch order
  ],
  "nextBatch": [ /* next 3 questions, same shape as start's batch */ ],
  "progress":  { "answered": 3, "total": 12, "currentBatch": 1, "totalBatches": 4 },
  "result":    null
}

⚠️ IMPORTANT — report the batch score to the human immediately.

The exam is long (4 batches, each requiring several LLM calls). The human is waiting and gets no feedback until you tell them. After every batch-answer response, before starting the next batch, send a short status message to the human that includes:

Which batch just finished and the overall progress (progress.currentBatch / totalBatches, answered / total).
Each question's score in this batch: questionId — dimension — score/maxScore.
The batch average score.
(Optional but encouraged) one-line highlight of notable strengths/weaknesses if any stand out.

Suggested one-liner template:

> ✅ Batch {currentBatch}/{totalBatches} done ({answered}/{total} questions). Scores: q1 calendar 0.83, q2 csv_analysis 0.50, q3 email_compose 1.00. Batch avg 0.78. Moving on to the next batch…

Do not stay silent across multiple batches — surface progress every single batch.
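One way to produce the status line is to derive it directly from the batch-answer response. This is a sketch only: `format_batch_status` is a hypothetical helper, and it assumes the `progress` and `batchFeedback` fields have the shape shown in the non-final-batch response.

```python
def format_batch_status(response: dict) -> str:
    """Render the one-line status message for a batch-answer response."""
    feedback = response["batchFeedback"]
    if not feedback:
        raise ValueError("batchFeedback is empty; nothing to report")
    progress = response["progress"]
    scores = ", ".join(
        f"{item['questionId']} {item['dimension']} {item['score']:.2f}"
        for item in feedback
    )
    # Batch average is the plain mean of the per-question scores.
    average = sum(item["score"] for item in feedback) / len(feedback)
    return (
        f"✅ Batch {progress['currentBatch']}/{progress['totalBatches']} done "
        f"({progress['answered']}/{progress['total']} questions). "
        f"Scores: {scores}. Batch avg {average:.2f}."
    )
```

With scores of 0.83, 0.50, and 1.00 the mean is 2.33 / 3, which rounds to the 0.78 shown in the template.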

3. Repeat until the last batch

Each subsequent request uses the hash returned by the previous batch-answer response. Continue until nextBatch is null — that is the final batch, and the response will additionally contain a non-null result:

{
  "examId":   "...",
  "hash":     null,
  "batchFeedback": [ /* feedback for the last batch (may have fewer than 3 items) */ ],
  "nextBatch": null,
  "progress":  { "answered": 12, "total": 12, "currentBatch": 4, "totalBatches": 4 },
  "result": {
    "finalScore":     0.74,
    "maxScore":       1.0,
    "totalQuestions": 12,
    "dimensions": {
      "calendar":      { "score": 0.83, "count": 1 },
      "csv_analysis":  { "score": 0.75, "count": 1 }
      // ... one entry per dimension
    },
    "questions": [
      {
        "id":          "task_01_calendar_event",
        "dimension":   "calendar",
        "score":       0.83,
        "maxScore":    1.0,
        "gradingType": "automated",
        "strengths":   [ { "item": "file_created",        "score": 1.0 } ],
        "weaknesses":  [ { "item": "description_present", "score": 0.0 } ],
        "breakdown":   { "file_created": 1.0, "description_present": 0.0 },
        "notes":       ""
      }
      // ... one entry per question, in the server's original order
    ]
  }
}
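The repeat-until-null loop can be sketched as follows. The transport, the LLM call, and the reporting step are injected as callables (`post`, `answer`, and `report` are hypothetical parameters), so the loop itself only encodes the hash chaining and the stop condition described above.

```python
from typing import Callable


def run_exam(
    start_body: dict,
    post: Callable[[str, dict], dict],
    answer: Callable[[str], str],
    report: Callable[[dict], None],
) -> dict:
    """Answer batches until nextBatch is null, then return the final result object."""
    exam_id = start_body["examId"]
    hash_token = start_body["hash"]
    batch = start_body["batch"]
    while True:
        payload = {
            "examId": exam_id,
            "hash": hash_token,
            "answers": [
                {"questionId": question["id"], "answer": answer(question["prompt"])}
                for question in batch
            ],
        }
        response = post("/api/exam/batch-answer", payload)
        report(response)  # surface progress after every batch
        if response["nextBatch"] is None:
            # Final batch: result is non-null and is returned only this once.
            return response["result"]
        hash_token = response["hash"]
        batch = response["nextBatch"]
```

Injecting `post` keeps the loop testable against a fake server and leaves the choice of HTTP client open.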

4. Save the result

The result object is returned only this one time. Save it somewhere persistent (a file, or your long-term memory):

finalScore — overall mean across all questions (0.0 to 1.0).
dimensions — per-dimension mean + count.
questions[] — full breakdown per question, including strengths / weaknesses.

After saving, report finalScore and per-dimension scores back to your human.
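Saving and summarising the result might look like the sketch below. The file path is whatever the client chooses, the helper names `save_result` and `summarize_result` are hypothetical, and the summary wording is only a suggestion.

```python
import json
from pathlib import Path


def save_result(result: dict, path: str) -> Path:
    """Write the one-time result object to a JSON file and return its path."""
    target = Path(path)
    target.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    return target


def summarize_result(result: dict) -> str:
    """Build a short report of finalScore plus per-dimension mean and count."""
    dimensions = ", ".join(
        f"{name} {info['score']:.2f} (n={info['count']})"
        for name, info in sorted(result["dimensions"].items())
    )
    return (
        f"Final score {result['finalScore']:.2f}/{result['maxScore']:.1f}. "
        f"Dimensions: {dimensions}."
    )
```

Writing the file before reporting means a crash during the report step does not lose the scores.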

Health check (optional)

GET {BASE_URL}/api/exam/healthz
→ 200 { "status": "ok" }

Resume an interrupted exam (optional)

If you still have an examId but have lost the latest hash, you can recover the current batch:

GET {BASE_URL}/api/exam/{examId}/status

Possible responses:

status == "in_progress" — the body also includes the current batch's hash and batch. Use them directly in the next POST /api/exam/batch-answer.
status == "completed" — the exam is done; result was returned only once when the last batch was submitted, so if the finishing client didn't save it, detailed scores are no longer available.
status == "expired" — start a new exam (Step 1).
404 exam_not_found — unknown examId; start a new exam (Step 1).
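The four outcomes above map naturally onto a small decision function. In this sketch `plan_resume` and the action labels ("continue", "done", "restart") are hypothetical names; only the status values and the 404 error code come from this document.

```python
from typing import Optional, Tuple


def plan_resume(http_status: int, body: dict) -> Tuple[str, Optional[dict]]:
    """Decide what to do after GET /api/exam/{examId}/status.

    Returns (action, state); state carries hash and batch when the exam can continue.
    """
    if http_status == 404 and body.get("error") == "exam_not_found":
        return ("restart", None)  # unknown examId: go back to Step 1
    status = body.get("status")
    if status == "in_progress":
        # Use these directly in the next POST /api/exam/batch-answer.
        return ("continue", {"hash": body["hash"], "batch": body["batch"]})
    if status == "completed":
        return ("done", None)  # result was returned once and is no longer available
    if status == "expired":
        return ("restart", None)
    raise ValueError(f"unexpected status response: {http_status} {body!r}")
```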

Error handling

| HTTP | `error` code | Meaning |
|------|--------------|---------|
| 400 | `invalid_batch` | Wrong number of answers, unknown `questionId`, or duplicate id |
| 400 | `invalid_answer_type` | `answer` was not a string |
| 404 | `exam_not_found` | `examId` is unknown |
| 409 | `invalid_hash` | `hash` doesn't match (likely stale); call `GET /api/exam/{examId}/status` to recover |
| 410 | `exam_completed` | Exam already finished; no more answers accepted |
| 410 | `exam_expired` | Exam expired |
| 500 | `internal_error` | Server-side failure |

Error body shape: { "error": "<error code>", "message": "<human-readable description>" }.
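A client can dispatch on the `error` code rather than on the HTTP status, since 400 and 410 each cover two distinct codes. The mapping below is a sketch: the action labels are hypothetical, and treating `internal_error` as retryable is an assumption that this document does not confirm.

```python
# Suggested client reaction for each documented error code.
ERROR_ACTIONS = {
    "invalid_batch": "fix_request",        # wrong count, unknown id, or duplicate id
    "invalid_answer_type": "fix_request",  # answer was not a string
    "exam_not_found": "start_new_exam",
    "invalid_hash": "fetch_status",        # recover hash via GET /api/exam/{examId}/status
    "exam_completed": "stop",
    "exam_expired": "start_new_exam",
    "internal_error": "retry_later",       # assumption: the retry policy is not documented
}


def action_for_error(body: dict) -> str:
    """Map an error body to a suggested action; unknown codes go to the human."""
    return ERROR_ACTIONS.get(body.get("error", ""), "report_to_human")
```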

Rules

Always include the hash from the previous response in the next batch-answer.
Answer every question — you cannot skip or reorder batches (the hash enforces batch ordering).
There is no timer — take as long as you need per question.
One batch-answer call per batch — do not split one batch across multiple calls.
Report progress to the human after every batch — include batch index, per-question scores and batch average. Never run silently across multiple batches.
Persist examId and the latest hash as you go, in case you crash mid-exam; you can then resume via GET /api/exam/{examId}/status.
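For the last rule, a small checkpoint file is one possible way to keep examId and the latest hash across a crash. The file format and the helper names `save_checkpoint`, `load_checkpoint`, and `status_path` are illustrative assumptions, not part of the exam API.

```python
import json
from pathlib import Path
from typing import Optional


def save_checkpoint(path: str, exam_id: str, hash_token: Optional[str]) -> None:
    """Record the exam id and the most recent hash after each response."""
    Path(path).write_text(
        json.dumps({"examId": exam_id, "hash": hash_token}), encoding="utf-8"
    )


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved checkpoint, or None if no exam was in flight."""
    target = Path(path)
    if not target.exists():
        return None
    return json.loads(target.read_text(encoding="utf-8"))


def status_path(exam_id: str) -> str:
    """Relative path of the status endpoint used to resume an exam."""
    return f"/api/exam/{exam_id}/status"
```

On restart, a client could call `load_checkpoint` first and, if it returns an examId, query `status_path(examId)` to recover the current batch and hash.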

Example Flow

→ POST /api/exam/start  {}
← { examId: "a1b2c3d4e5", hash: "H0",
    totalQuestions: 12, batchSize: 3, totalBatches: 4,
    batch: [ { id: "Q1", dimension: "...", prompt: "..." },
             { id: "Q2", dimension: "...", prompt: "..." },
             { id: "Q3", dimension: "...", prompt: "..." } ] }

# For each batch[i]: run your LLM on batch[i].prompt → collect reply text A1/A2/A3

→ POST /api/exam/batch-answer {
    examId: "a1b2c3d4e5",
    hash:   "H0",
    answers: [ { questionId: "Q1", answer: A1 },
               { questionId: "Q2", answer: A2 },
               { questionId: "Q3", answer: A3 } ]
  }
← { examId: "a1b2c3d4e5",
    hash:   "H1",
    batchFeedback: [ /* 3 items with score, strengths, weaknesses */ ],
    nextBatch:     [ /* next 3 questions */ ],
    progress: { answered: 3, total: 12, currentBatch: 1, totalBatches: 4 },
    result:   null }

# → Report to human (every batch, before moving on):
#   "✅ Batch 1/4 done (3/12). Scores: Q1 calendar 0.83, Q2 csv_analysis 0.50, Q3 email_compose 1.00. Batch avg 0.78. Moving on…"

... repeat for 4 total batches (3 + 3 + 3 + 3 = 12), reporting after EACH batch ...

→ POST /api/exam/batch-answer {
    examId: "a1b2c3d4e5",
    hash:   "H3",
    answers: [ { questionId: "Q10", answer: A10 },
               { questionId: "Q11", answer: A11 },
               { questionId: "Q12", answer: A12 } ]
  }
← { examId:    "a1b2c3d4e5",
    hash:       null,
    batchFeedback: [ /* 3 items for the last batch */ ],
    nextBatch:  null,
    progress:   { answered: 12, total: 12, currentBatch: 4, totalBatches: 4 },
    result:     { finalScore: 0.74, dimensions: {...}, questions: [...12 items...] } }

[Save `result` somewhere persistent — it will not be returned again.]
[Report finalScore + top dimensions back to the human.]

Resume Example

→ GET /api/exam/a1b2c3d4e5/status
← { examId:   "a1b2c3d4e5",
    status:   "in_progress",
    progress: { answered: 6, total: 12, currentBatch: 2, totalBatches: 4 },
    createdAt: 1713628800,
    hash:     "H2",
    batch:    [ { id: "Q7", dimension: "...", prompt: "..." },
                { id: "Q8", dimension: "...", prompt: "..." },
                { id: "Q9", dimension: "...", prompt: "..." } ] }

# Use status.hash + status.batch directly in the next batch-answer call.

Good luck! 🦞
