> One-liner: LMs are unreliable calculators but reliable coders. When the answer needs
> determinism and precision — arithmetic, exact data manipulation, deterministic transforms —
> emit code and run it. When the answer needs judgment, taste, or open-ended synthesis, reason
> in natural language. The cost of getting this gate wrong is silent: prose arithmetic
> hallucinates a plausible-looking wrong number, and over-coding a judgment task burns a
> sandbox round-trip for nothing.
This is an enhancement overlay. DSPy already gives you dspy.ProgramOfThought (PoT) —
the mechanism for write-then-execute. What it does not give you is the decision rubric for
when to reach for it. That rubric is this skill. Cross-link the sibling
[[agentsop-output-format-by-model]] (which decides how code-shaped content should be serialized) and
[[agentsop-test-fix-loop]] (which closes the execute → error → retry loop).
Activate this skill before committing a step to a reasoning strategy whenever the task has a
verifiable, deterministic core — or whenever you catch an agent doing arithmetic in prose.
| Trigger | Signal |
|---|---|
| Arithmetic / math | "compute the compound interest", "what's 17.5% of $4,392.18", multi-step word problems, unit conversions, date deltas |
| Precise data manipulation | "sort these 240 rows by the third column", "dedupe and count", "join these two lists on id", "parse this CSV and sum column B" |
| Deterministic transforms | regex extraction, string reformatting, base conversion, hashing, sorting, set operations |
| Symbolic / combinatorial | "how many distinct permutations", "solve this system of equations", calendar/scheduling math |
| You see a model doing math in prose | "Let me add: 1,204 + 8,991 + ... = 10,195" — almost always worth a code check |
| Choosing a DSPy module | deciding between ChainOfThought and ProgramOfThought for a signature [dspy.ai/learn/programming/modules/] |
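
The "model doing math in prose" trigger can be checked mechanically. The sketch below scans a draft answer for explicit sum claims and recomputes them. The name `check_sum_claims` and the regex are illustrative assumptions, and the pattern covers only integer addition with optional thousands separators.

```python
import re

# One integer, with or without thousands separators ("48,217" or "48217").
_NUM = r"(?:\d{1,3}(?:,\d{3})+|\d+)"
# A claim such as "1,204 + 8,991 = 10,195" (two or more addends).
_SUM_CLAIM = re.compile(rf"({_NUM}(?:\s*\+\s*{_NUM})+)\s*=\s*({_NUM})")


def _to_int(token: str) -> int:
    return int(token.replace(",", ""))


def check_sum_claims(text: str):
    """Return (claim, claimed_total, actual_total, ok) for each sum found in prose."""
    results = []
    for match in _SUM_CLAIM.finditer(text):
        addends = [_to_int(t) for t in re.split(r"\s*\+\s*", match.group(1))]
        claimed = _to_int(match.group(2))
        actual = sum(addends)
        results.append((match.group(0), claimed, actual, claimed == actual))
    return results


for row in check_sum_claims("Let me add: 48,217 + 9,884 = 58,001."):
    print(row)  # the claimed 58001 does not match the recomputed 58101
```

Any row whose last field is `False` is a candidate for re-routing the arithmetic to code.
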
Anti-triggers (do NOT reach for code execution):
> LMs are unreliable calculators but reliable coders.
A language model predicts the next token, not the correct value. When you ask it to add
48,217 + 9,884 in prose, it emits the most plausible-looking digit sequence — which is
frequently wrong, and wrong in a way that looks right. The same model can write
48217 + 9884 as a Python expression flawlessly, because emitting the program is a
pattern-matching task it is genuinely good at, and the Python interpreter is a deterministic
oracle. This decoupling — model writes the recipe, interpreter computes the result — is the
entire thesis of Program-of-Thought (PoT) [arXiv 2211.12588] and PAL [arXiv 2211.10435].
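
To make the decoupling concrete, here is a minimal sketch in which the model's only job is to emit the expression string and a small evaluator acts as the oracle. `eval_arithmetic` is a hypothetical helper written for this illustration. It accepts only numeric literals and the four basic operators, which is an assumption about what the step needs, not a description of how PoT or PAL execute code.

```python
import ast
import operator

# Whitelisted binary operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_arithmetic(expr: str):
    """Evaluate a plain arithmetic expression deterministically."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


# The expression is what the model would emit; the interpreter supplies the value.
print(eval_arithmetic("48217 + 9884"))  # 58101
```
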
┌────────────────────────────────────────────────────────┐
│ THE GATE: does this answer need determinism/precision? │
└────────────────────────────────────────────────────────┘
           │                                  │
   YES (computable)                     NO (judgment)
           │                                  │
           ▼                                  ▼
┌─────────────────────┐            ┌─────────────────────┐
│ EMIT CODE           │            │ REASON IN PROSE     │
│ model writes recipe │            │ model is the engine │
│ interpreter = oracle│            │ no oracle exists    │
└─────────────────────┘            └─────────────────────┘
           │
           ▼
sandbox → run → feed result back into LM → LM narrates/uses it
Two failure modes the gate prevents:
| Failure | Mechanism | Symptom |
|---|---|---|
| Under-coding (reason when you should compute) | LM hallucinates a calculation it cannot reliably perform | A confident, wrong, plausible-looking number; off-by-one counts; arithmetic that "looks" right |
| Over-coding (compute when you should reason) | LM wraps a judgment task in a sandbox round-trip that adds no determinism | Wasted latency + cost; brittle code that encodes a subjective rubric as if it were a formula; print("the tone is friendly") |
Why the asymmetry matters. Under-coding fails silently — the wrong number propagates
downstream and no exception fires. Over-coding fails loudly and cheaply — you notice the
useless sandbox call. So the default lean, when genuinely uncertain and a verifiable core
exists, is toward code. But "genuinely uncertain" is the operative phrase: most judgment
tasks are not close calls.
The format corollary (from [[agentsop-output-format-by-model]]): once you decide to emit code, emit
it as code in a fenced block / single string field — never nested inside JSON sub-structure.
Code-in-JSON measurably degrades the code itself (Aider: 61%→20% on GPT-4 Turbo). The
execution decision and the serialization decision are two separate gates; pass both.
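
One way to honor the format corollary is to ask for the program in a single fenced block and pull it out with a small parser before execution. The sketch below assumes the completion contains one fenced block, optionally tagged `python`. `extract_code` is a hypothetical helper, not part of any framework named here.

```python
import re

# Built programmatically so this example nests cleanly inside Markdown.
FENCE = "`" * 3

_CODE_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:python)?[ \t]*\n(.*?)\n" + re.escape(FENCE),
    re.DOTALL,
)


def extract_code(completion: str) -> str:
    """Return the body of the first fenced block, or raise if none is present."""
    match = _CODE_BLOCK.search(completion)
    if match is None:
        raise ValueError("no fenced code block found")
    return match.group(1)


completion = "Here is the program:\n" + FENCE + "python\nprint(17 * 3)\n" + FENCE + "\nDone."
print(extract_code(completion))  # print(17 * 3)
```

The code travels as plain text from the model to the interpreter, so no JSON escaping is applied to quotes or newlines at any point.
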
A three-step gate. Run it per step, not per task — one task can have computable steps and
judgment steps interleaved.
Ask: "Is there a single correct answer that a program could verify?"
A useful tiebreaker for the "trivial computable" gray zone: **if the model would be embarrassed
to get it wrong and you'd reach for a calculator yourself, emit code.** If you'd do it in your
head without a second thought, prose is fine.
print/return the result. No network, no filesystem unless the task is I/O.

DSPy's ProgramOfThought uses a Python interpreter (Deno/PythonInterpreter sandbox in recent versions); OpenAI Code Interpreter runs in a managed container; Anthropic code-execution tool runs in a sandboxed VM; LangChain PythonREPLTool runs in-process and is unsandboxed, so treat it as hostile to untrusted input.

Bound the retries (ProgramOfThought defaults to max_iters ≈ 3). After the bound, fall back to prose reasoning or surface the failure; do not loop forever.

Exit criterion: the step produces either (a) a code-derived value the LM has consumed, or
(b) a prose judgment, with the gate decision recorded so a reviewer can audit why code was or
wasn't used.
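
The bounded retry and the exit criterion can be sketched together. Everything here is an assumption made for illustration: `StepRecord` and its fields are hypothetical, `generate` stands in for the LM call, and the generated code is expected to assign its answer to a variable named `result`. The `exec` call is not a sandbox and is only acceptable for trusted code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepRecord:
    """Audit trail for one gated step (all field names are hypothetical)."""
    step: str
    route: str                      # "code" or "prose"
    reason: str
    attempts: int = 0
    value: Optional[object] = None
    errors: List[str] = field(default_factory=list)


def run_code_step(step: str, generate: Callable[[str, Optional[str]], str],
                  max_iters: int = 3) -> StepRecord:
    """Generate code, run it, and retry on error up to max_iters times."""
    record = StepRecord(step=step, route="code", reason="single verifiable answer")
    last_error = None
    for attempt in range(1, max_iters + 1):
        record.attempts = attempt
        source = generate(step, last_error)      # stand-in for the LM call
        namespace = {}
        try:
            exec(source, namespace)              # NOT a sandbox; trusted code only
            record.value = namespace["result"]   # convention: code assigns `result`
            return record
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
            record.errors.append(last_error)
    record.route = "prose"                       # bound reached: fall back
    record.reason = f"code failed {max_iters} times"
    return record
```

The returned record carries the route, the reason, and every error seen, which gives a reviewer what they need to audit why code was or was not used.
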
Seven operations. Each row is a reusable move.
| # | Op | Trigger | Action | Output | Evidence |
|---|---|---|---|---|---|
| 1 | Computable-vs-judgment gate | Any step about to be reasoned | Apply §3 Step 1: single verifiable answer? | Route to code or prose | PoT premise: decouple compute from reasoning [arXiv 2211.12588] |
| 2 | Decompose mixed steps | Step has both a number and a narrative | Split into computable sub-step (code) + judgment sub-step (prose) | Two routed sub-steps | Mirrors mixed-content split in [[agentsop-output-format-by-model]] §5 Case B |
| 3 | Sandbox choice | Decided to emit code | Pick interpreter by trust + capability: DSPy PoT (Python sandbox), OpenAI Code Interpreter (managed container), Anthropic code-exec (VM), LangChain PythonREPLTool (unsandboxed, in-process) | Chosen runtime | DSPy modules [dspy.ai/learn/programming/modules/]; LangChain PythonREPLTool docs |
| 4 | Result-back-into-LM | Code produced a value | Inject stdout/return value into the next LM turn so the model narrates/uses it | LM-consumed result | PoT design: code computes, LM contextualizes [arXiv 2211.12588] |
| 5 | Error retry (bounded) | Code raised a traceback | Feed error to LM, regenerate, re-run; cap at max_iters (~3) then fall back | Fixed code or graceful fallback | DSPy ProgramOfThought max_iters; see [[agentsop-test-fix-loop]] |
| 6 | Precision escalation | Prose answer involves multi-step arithmetic | Re-route the arithmetic to code even if prose started it | Code-verified number | "LMs are unreliable calculators" — PAL [arXiv 2211.10435] |
| 7 | Over-coding veto | About to sandbox a judgment task | Stop: no deterministic core → reasoning, not code | Prose reasoning, no sandbox call | §5 Case B; avoids wasted round-trip |
In DSPy terms, op #1 is exactly the choice between dspy.ChainOfThought (prose reasoning) and
dspy.ProgramOfThought (emit+run) for a signature — the dspy skill lists the modules but this
overlay supplies the when.
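
As a rough sketch of op #1 in DSPy terms, the gate result can select the module class. The helper names and the signature strings in the usage comments are hypothetical. DSPy is imported lazily so the routing logic can be read and tested without it installed.

```python
def module_name_for(computable: bool) -> str:
    """Op #1 as code: computable steps get PoT, judgment steps get CoT."""
    return "ProgramOfThought" if computable else "ChainOfThought"


def build_module(signature: str, computable: bool):
    """Instantiate the chosen DSPy module for a signature string."""
    import dspy  # lazy import: only needed when a module is actually built

    return getattr(dspy, module_name_for(computable))(signature)


# Usage (requires DSPy and a configured LM, e.g. via dspy.configure(lm=...)):
# tax_step = build_module("line_items -> grand_total", computable=True)
# tone_step = build_module("message -> tone", computable=False)
```
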
Trigger: A finance-summary agent step: "Given these 14 line items, compute the total,
the 8.25% tax, and the grand total." The agent is a dspy.ChainOfThought module emitting prose.
Constraints:
Decision steps:
running sum that is frequently off by some digits, and no exception will fire. This is the
silent failure the gate exists to prevent.
ChainOfThought to ProgramOfThought. The model now emits subtotal = sum([...]); tax = round(subtotal * 0.0825, 2); total = subtotal + tax and the
interpreter computes it exactly.
total = 4,217.93 and narrates the invoice line. Code computed; LM contextualized.
JSON program field — code-in-JSON would degrade it.
Outcome: The arithmetic is now deterministic and auditable. PoT-style code execution is the
documented fix for exactly this class of error [arXiv 2211.12588, arXiv 2211.10435].
Extractable operation: Multi-step arithmetic in prose is a smell. Re-route it to code (op #6).
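
A sketch of the kind of program the re-routed step might emit. It swaps the float-based `round(...)` from the case for `decimal.Decimal`, which is a substitution made here to keep currency rounding exact, not something the case prescribes. The line items are hypothetical and are not the 14 from the case.

```python
from decimal import Decimal, ROUND_HALF_UP

TAX_RATE = Decimal("0.0825")  # the 8.25% rate from the case
CENT = Decimal("0.01")


def invoice_totals(line_items):
    """Return (subtotal, tax, total) as Decimals, with tax rounded to cents."""
    subtotal = sum((Decimal(str(x)) for x in line_items), Decimal("0"))
    tax = (subtotal * TAX_RATE).quantize(CENT, rounding=ROUND_HALF_UP)
    return subtotal, tax, subtotal + tax


# Hypothetical line items, not the 14 from the case.
subtotal, tax, total = invoice_totals(["100.00", "250.50", "49.99"])
print(subtotal, tax, total)  # 400.49 33.04 433.53
```
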
Trigger: A support-triage agent step: "Read this customer message and decide whether the
tone is hostile, neutral, or warm." An over-eager engineer wires it through ProgramOfThought
because "code is more reliable."
Constraints:
if "!!!" in msg: tone = "hostile", a brittle rule that encodes a subjective rubric as if it were a formula, and is worse than the model's
native judgment.
Decision steps:
judgment. → reason in prose. Stop.
oracle. Code adds no determinism — it just relocates the same judgment into worse,
hard-coded heuristics.
zero precision gain. This is pure waste — the loud, cheap failure mode.
ChainOfThought (or Predict). The LM is the right engine for judgment.

sub-step is computable; decompose (op #2) and route the count to code, the tone to prose.
Outcome: No sandbox call. The judgment stays where judgment belongs. Over-coding is a real
and common anti-pattern: not everything benefits from an interpreter, only things with a
verifiable deterministic core.
Extractable operation: No verifiable answer → no code. Veto the sandbox round-trip (op #7).
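
To see why the veto matters, here is the hard-coded rule from this case run on two messages. The `"neutral"` default is a hypothetical completion of the rule, and both messages are invented for illustration.

```python
def brittle_tone(msg: str) -> str:
    """The hard-coded heuristic from the case, completed with a hypothetical default."""
    return "hostile" if "!!!" in msg else "neutral"


# Deterministic, and confidently wrong on both messages:
print(brittle_tone("Thank you so much!!! You saved my week."))      # hostile
print(brittle_tone("I have asked four times. Cancel my account."))  # neutral
```

The interpreter executes the rule faithfully, but faithful execution of a bad rubric adds no precision.
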
Trigger: "Summarize this quarter's sales narrative and give me the exact total revenue."
Constraints: One sentence contains both a judgment (summary) and a computation (total).
Decision steps:
ProgramOfThought: code sums the figures, interpreter verifies.

ChainOfThought: prose synthesis, no oracle exists.

Outcome: Each sub-step uses its correct engine. This is the execution-decision analogue of
the mixed-content two-pass pattern in [[agentsop-output-format-by-model]] §5 Case B.
Extractable operation: One task ≠ one strategy. Gate per step, decompose mixed steps.
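
A minimal sketch of the split, assuming the revenue figures have already been extracted as integers. The code sub-step computes the total, and the prose sub-step receives it as a given fact (op #4). Function names, the prompt wording, and the figures are all hypothetical.

```python
def compute_total(figures):
    """Computable sub-step: exact sum, done by the interpreter."""
    return sum(figures)


def build_summary_prompt(narrative: str, figures) -> str:
    """Judgment sub-step: hand the computed total to the LM as a given fact."""
    total = compute_total(figures)
    return (
        "Summarize the sales narrative below in three sentences.\n"
        f"Use this exact total revenue and do not recompute it: {total:,}\n\n"
        f"{narrative}"
    )


# Hypothetical monthly figures in whole dollars.
prompt = build_summary_prompt(
    "Q3 saw strong renewals and one large churn.", [120_450, 98_300, 143_975]
)
print(prompt.splitlines()[1])
```
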
where a deterministic, verifiable answer exists. For judgment tasks, code just hard-codes a
subjective rubric and adds a wasted round-trip (Case B). The interpreter is an oracle only for
computable questions.
unreliable calculators; multi-step prose arithmetic hallucinates plausible wrong numbers,
silently (Case A). Any non-trivial computation → code [arXiv 2211.12588].
If the value never re-enters the LM turn, you have a dangling computation the agent can't use
or narrate (op #4).
(max_iters ≈ 3) and fall back to prose or surface the failure. Infinite code-repair loops
burn cost. See [[agentsop-test-fix-loop]].
[[agentsop-output-format-by-model]] says emit it as a fenced block / single string, never as nested
JSON — code-in-JSON degrades the code (Aider 61%→20%). Pass both gates.
PythonREPLTool as a safe sandbox. It runs in-process, unsandboxed. Fine for trusted self-authored code; hostile to untrusted input. Use a real sandbox
(OpenAI Code Interpreter container, Anthropic code-exec VM, DSPy's interpreter) when inputs are untrusted.
sandbox round-trip; the tax exceeds the benefit. The gate is for non-trivial computation.
there is no interpreter; accept the precision risk and flag it.
small arithmetic; the round-trip isn't worth it below a complexity threshold.
answer is acceptable — a deliberate, documented tradeoff, not a default.
How major frameworks expose the emit-code-vs-reason mechanism. This skill operates at the
decision layer; each framework supplies the mechanism.
ProgramOfThought [dspy.ai/learn/programming/modules/]

The canonical declarative version. A signature wrapped in dspy.ProgramOfThought(Sig)
makes the LM emit Python, runs it in an interpreter sandbox, and feeds the result back —
with bounded retry on error (max_iters). The sibling dspy skill lists ChainOfThought
vs ProgramOfThought as module choices but does not give the decision rubric; *this overlay
is that rubric*. Use ProgramOfThought exactly when §3 Step 1 returns "computable."
A managed container that the model can write Python into and execute, with files and state
persisted across turns. Heavier and stateful — good for data-analysis sessions (load CSV,
compute, plot). Same gate applies: route computable steps in, keep judgment in chat.
A sandboxed VM exposed as a tool; the model emits code, it runs, results return to the
conversation. First-party, sandboxed — safe for untrusted inputs. The execution-decision gate
maps directly: offer the tool, but the model/agent should only invoke it for computable steps.
PythonREPLTool

A tool wrapping a Python REPL. Runs in-process and is unsandboxed: powerful and dangerous.
Use only for trusted, self-authored computation; never expose it to untrusted input without an
external sandbox. The decision rubric is identical; the safety profile is worst-in-class.
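
For trusted code that should at least not share the agent's process, a separate interpreter with a timeout is a common stdlib-only step. This sketch is an assumption about one possible setup, and `run_in_subprocess` is a hypothetical helper. It is not a security sandbox: the child process still has filesystem and network access, so it does not make untrusted input safe.

```python
import subprocess
import sys


def run_in_subprocess(source: str, timeout_s: float = 5.0) -> str:
    """Run Python source in a separate interpreter process and return its stdout."""
    completed = subprocess.run(
        # -I: isolated mode, ignores environment variables and the user site dir
        [sys.executable, "-I", "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on a hang
    )
    if completed.returncode != 0:
        raise RuntimeError(completed.stderr.strip())
    return completed.stdout.strip()


print(run_in_subprocess("print(2 + 3)"))  # 5
```

What this buys is containment of crashes and hangs, plus a clean namespace per run. Isolation from hostile code still requires one of the sandboxed runtimes described above.
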
| Framework | Mechanism | Sandbox | Result-back |
|---|---|---|---|
| DSPy PoT | ProgramOfThought module | Python interpreter | automatic (max_iters) |
| OpenAI Code Interpreter | Assistants code tool | container | persisted state |
| Anthropic code-exec | code-execution tool | VM (safe) | into conversation |
| LangChain | PythonREPLTool | none (in-process) | manual wiring |
Every framework can be mis-invoked — pointed at a judgment task (over-code) or skipped for a
computation (under-code). This overlay is the gate that decides invocation, regardless of which
mechanism is underneath.
┌──────────────────────────────────────────────────────────────────────┐
│ EMIT-CODE-VS-REASON DECISION CARD │
├──────────────────────────────────────────────────────────────────────┤
│ Single verifiable answer a program could check? │
│ YES → EMIT CODE → sandbox → run → feed result back into LM │
│ NO → REASON IN PROSE (no oracle exists for judgment) │
│ MIXED → decompose; route each sub-step independently │
├──────────────────────────────────────────────────────────────────────┤
│ Arithmetic / parse / sort / count / regex / symbolic → CODE │
│ Tone / quality / summary / design / synthesis → PROSE │
│ "summary AND total" → SPLIT │
├──────────────────────────────────────────────────────────────────────┤
│ LMs are unreliable calculators but reliable coders. │
│ Under-coding fails SILENTLY (wrong plausible number). │
│ Over-coding fails LOUDLY+CHEAPLY (wasted sandbox round-trip). │
│ When genuinely uncertain AND a verifiable core exists → lean CODE. │
├──────────────────────────────────────────────────────────────────────┤
│ NEVER: │
│ • code a judgment task (over-coding veto) │
│ • do multi-step arithmetic in prose (under-coding) │
│ • forget to feed the code result back into the LM │
│ • nest generated code in JSON (see [[agentsop-output-format-by-model]]) │
│ • loop code-repair unbounded (see [[agentsop-test-fix-loop]]) │
│ • trust LangChain PythonREPLTool on untrusted input (unsandboxed) │
└──────────────────────────────────────────────────────────────────────┘
Primary anchors:
ProgramOfThought module, [dspy.ai/learn/programming/modules/]: emit+run+retry mechanism.

Framework / API docs:
PythonREPLTool, [python.langchain.com/docs/integrations/tools/python] (unsandboxed; in-process).

Companion / overlaid skills:
dspy-sop-skill/SKILL.md: ships ProgramOfThought but not this decision rubric (the gap this overlay fills).
d-output-format-by-model-skill/SKILL.md: sibling skill covering how to serialize code once you emit it (PoT for math/parse; code never nested in JSON). Cross-linked as [[agentsop-output-format-by-model]].
test-fix-loop: the execute → error → retry loop this skill defers to for bounded code repair. Cross-linked as [[agentsop-test-fix-loop]].