概述

taku-debug — 4-Phase Root Cause Investigation

Random fixes create new bugs. Systematic investigation finds root causes. This skill merges the best of Superpowers' systematic debugging with gstack's investigate workflow into a unified 4-phase process.

Rule labels: [IRON LAW] means a non-negotiable correctness constraint. [GUIDANCE] means a strong default that may adapt when context justifies it.

When to Use This Skill

Use taku-debug when:

the VERIFY phase fails and the failure path is not yet understood
the user reports broken behavior, a regression, or an intermittent issue
you need to explain why something is failing before proposing the fix

Do not use taku-debug just to run routine verification. Run the repo's normal checks first. Enter this skill when those checks fail or when production behavior is already broken.

The Iron Law

[IRON LAW] NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST.

Every fix that doesn't address root cause makes the next bug harder to find. If you haven't completed Phase 1 (INVESTIGATE), you cannot propose fixes. Period.

Violating the letter of this process is violating the spirit of debugging.

Mandatory sequence enforcement: Before writing ANY code change, you MUST:

Complete Phase 1 (INVESTIGATE) — gather evidence, reproduce, trace data flow
Complete Phase 2 (PATTERN) — match against known patterns
Complete Phase 3 (HYPOTHESIS) — rank and test hypotheses

Only after all three phases may you proceed to Phase 4 (IMPLEMENT). If you find yourself writing a fix before producing a DEBUG REPORT, stop and go back to Phase 1.

The 4 Phases

Phase 1: INVESTIGATE

Gather context before forming any hypothesis. This is evidence collection, not problem-solving.

Why evidence-first: Jumping to solutions before understanding the problem is the single most common debugging failure. You fix what you think is wrong, not what's actually wrong. Evidence-first prevents this bias.

Read the error. Stack traces, error messages, log output. Read them completely. Line numbers, file paths, error codes — all of it. The error message usually contains the answer.

Reproduce consistently. What are the exact steps? Does it happen every time? If not reproducible, gather more data. Don't guess at intermittent bugs.

Check recent changes.

```bash

git log --oneline -20 --

git diff HEAD~1 --

```

Was this working before? What changed? A regression means the root cause is in the diff.

Trace data flow backward. Where does the bad value originate? What called this function with the bad value? Keep tracing up until you find the source. Fix at source, not at symptom.

Gather evidence in multi-component systems. When the system has layers (API → service → database), add diagnostic logging at each boundary. Run once. See WHERE it breaks. Then investigate THAT layer.

Budget check: If INVESTIGATE exceeds 10 minutes or produces >20 evidence items without convergence toward a root cause hypothesis, pause and present findings so far. This is not failure — it's scope-aware investigation. Complex distributed systems sometimes need a narrower scope before continuing.

Output: Root cause hypothesis — a specific, testable claim about what is wrong and why.

Phase 2: PATTERN

Match the bug against known patterns before proposing solutions.

Why pattern-match first: Most bugs are instances of well-known failure modes. Recognizing the pattern tells you where to look and what to check, cutting investigation time from hours to minutes. Novel bugs are rare; your bug has almost certainly happened before.

| Pattern | Signature | Where to look |

|---------|-----------|---------------|

| Race condition | Intermittent, timing-dependent | Shared state, concurrent access |

| Null propagation | TypeError, NoMethodError | Missing guards on optional values |

| State corruption | Inconsistent data, partial updates | Transactions, callbacks, hooks |

| Integration failure | Timeout, unexpected response | API boundaries, service calls |

| Config drift | Works locally, fails elsewhere | Env vars, feature flags, secrets |

| Stale cache | Shows old data | Redis, CDN, browser cache |

Also check:

Working examples in the same codebase — what does the same thing correctly?
TODOS.md for related known issues
Git log for prior fixes in the same area. Recurring bugs in the same files are an architectural smell, not a coincidence.
.taku/learnings/{project-slug}.jsonl — similar bug patterns, prior root causes, known pitfalls from previous debug sessions

If the pattern isn't obvious, search the web for "{framework} {error type}" (sanitize first: strip hostnames, IPs, file paths, customer data).

Common failure patterns worth checking early:

Config drift: docs, defaults, env names, or runtime lookup disagree.
Boundary mismatch: caller and callee disagree about nullability, auth, or ownership.
State race: async ordering, caching, or lifecycle cleanup changes what later code sees.
Platform path/permission drift: Windows, symlinks, executable bits, or temp paths behave differently.

Phase 3: HYPOTHESIS

Form hypotheses ranked by likelihood. Test them one at a time.

Why rank and isolate: Ranking forces honest probability assessment — you test the most likely cause first, not the most interesting one. Testing one at a time means each test produces clear signal: confirmed or denied. Multiple simultaneous changes produce ambiguous results.

Rank hypotheses:

```

H1 (80%): Token expires between redirect and callback

H2 (15%): CORS header missing on callback endpoint

H3 (5%): Browser caches stale auth response

```

Ranking forces you to think about probability, not just possibility. Test the most likely hypothesis first.

Test minimally. One variable at a time. The smallest change that confirms or denies the hypothesis. Add a log statement, an assertion, or a debug output. Run the reproduction.

If confirmed: Proceed to Phase 4 (IMPLEMENT).

If denied: Move to the next hypothesis. Before forming a new one, return to Phase 1 and gather more evidence. Don't guess.

3-strike rule: If 3 hypotheses fail, STOP. This is not a failed hypothesis — this is a wrong architecture or a fundamentally misunderstood system. Present options:

Continue investigating with a new theory
Escalate for human review
Add instrumentation and wait for the next occurrence

Red flags — if you catch yourself thinking any of these, STOP:

"Quick fix for now, investigate later"
"Just try changing X and see what happens"
"Add multiple changes at once, run tests"
Proposing solutions before tracing data flow
"One more fix attempt" after 2+ failures

Phase 4: IMPLEMENT

Fix the root cause, not the symptom.

Why root-cause-only: Symptom fixes suppress the error at the crash site, but the underlying problem persists and will surface elsewhere — usually in a harder-to-diagnose form. A null check at the crash site doesn't fix the auth middleware that produced the null.

Write a failing test first. The simplest reproduction of the bug as an automated test. If no test framework exists, write a one-off script. The test MUST fail before the fix.

Implement the minimal fix. One change. One root cause addressed. No "while I'm here" refactoring. No bundled improvements.

Verify the fix. Test passes? Full suite passes? Original bug scenario resolved?

Blast radius check. If the fix touches more than 5 files, pause and justify each one. If the justification is sound (e.g., a cross-service API change with matching test files), proceed. If changes span more than 3 subsystems, consider splitting into multiple focused fix PRs.

Why the 5-file guideline: A genuine root cause fix is usually localized — it changes one thing in one place. If you're touching 5+ files, you're likely fixing symptoms across multiple layers rather than the single source. But some fixes are genuinely cross-cutting (e.g., renaming a shared interface). The guideline catches shallow fixes without blocking legitimate broad fixes.

Run the full test suite. Paste the output. No regressions.

Known Pitfalls

Fixing the symptom instead of root cause. The error was a NullPointerException on user.getEmail(). The fix added a null check: user?.getEmail() ?? "". Two days later, a different code path crashed because user was null — the real bug was in the auth middleware that failed to populate the session.

What went wrong: The null pointer was a symptom. The root cause was upstream. The fix suppressed the error at the crash site instead of fixing where the null originated.

Prevention: Phase 1 Step 4 (trace data flow backward) exists specifically for this. Never fix at the crash site without tracing where the bad value comes from. Fix at the source.

Testing multiple hypotheses simultaneously. Three hypotheses were on the list. To save time, all three fixes were applied at once. The bug disappeared. Now none of the three fixes can be individually verified — which one actually fixed it? Is the fix correct, or did two wrong fixes cancel each other out?

What went wrong: Parallel hypothesis testing saves time in the moment but destroys diagnostic value. You learn nothing from a bulk fix.

Prevention: One hypothesis at a time. One change. One test. If the first fix works, you're done. If it doesn't, you still have a clean system for the next hypothesis. Phase 3 Step 2 says "one variable at a time" for a reason.

Skipping Phase 1 because the cause "seems obvious." A 500 error appeared right after a deploy. The obvious fix was to revert the deploy. After reverting, the error persisted — it was actually caused by a database migration that ran during deploy and wasn't included in the revert.

What went wrong: "Obvious" causes are usually the first thing that comes to mind, not the actual cause. Skipping investigation means missing the real problem.

Prevention: Complete Phase 1 (INVESTIGATE) every time, even when the cause seems obvious. The Iron Law exists because "obvious" causes are wrong often enough to justify the discipline.

Not writing a regression test for the fix. A race condition was found and fixed with a mutex. Six weeks later, a refactoring removed the mutex as "unnecessary overhead." The race condition returned in production.

What went wrong: The fix wasn't protected by a regression test. Without a test, future developers can't know why the code exists.

Prevention: Phase 4 Step 1 requires a failing test first. This test becomes the regression guard. A bug without a regression test is a bug waiting to happen again.

Output: Debug Report

Keep the report short and diagnostic. Name the concrete file/function/condition

that caused the bug. Put long investigation notes in a brief appendix only when

they are needed to make the fix reviewable.

DEBUG REPORT
═════════════════════════════════════════
Symptom:         [what the user observed]
Root cause:      [file/function/condition that was actually wrong]
Hypotheses:      [confirmed H1, denied H2/H3 if relevant]
Fix:             [minimal change, file:line references]
Evidence:        [red test, green test, reproduction showing fix works]
Regression test: [file:line of new test]
Status:          DONE | DONE_WITH_CONCERNS | BLOCKED
═════════════════════════════════════════

Status

DONE — Root cause found, fix applied, regression test written, all tests pass.
DONE_WITH_CONCERNS — Fixed but can't fully verify (intermittent bug, requires staging).
BLOCKED — Root cause unclear after 3 hypotheses. Escalated.

版本历史

共 1 个版本

v1.0.0 当前

2026-05-21 15:40 安全安全

安全检测

腾讯云安全 (Keen)

安全，无风险

查看报告

腾讯云安全 (Sanbu)