Evidence Over Claims

In one line: "It should work now" means "I have not verified it." Nothing is complete until fresh output proves it.

Do this: Before marking any task done, paste the actual test output, coverage figure, and clean linter run from the current session.

What: No implementation is complete until fresh verification output (test results, coverage, runtime behavior) confirms the claim.

Why: LLMs produce plausible, confident output regardless of correctness — a structural property, not a bug. Code with correct syntax, type annotations, and coherent docstrings looks right to a reviewer even when it hides a subtle logic error, an off-by-one, or wrong API usage.

The dangerous failure mode is not obviously broken code (caught immediately) but code that passes visual review and fails at runtime: an async function called without await, a query using the wrong column that happens to match another column's type, an exception silently swallowed where it should propagate.
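As an illustration of the first case, the sketch below shows an async function called without await. The names fetch_count, is_empty_buggy, and is_empty_fixed are hypothetical and exist only for this example. The buggy version reads naturally in review, yet it always returns False because a coroutine object is truthy.

```python
import asyncio


async def fetch_count() -> int:
    """Hypothetical async call standing in for a database or API request."""
    await asyncio.sleep(0)
    return 0


async def is_empty_buggy() -> bool:
    count = fetch_count()  # BUG: missing await, so count is a coroutine object
    result = not count     # a coroutine object is always truthy, so this is False
    count.close()          # closed here only to avoid a "never awaited" warning in the demo
    return result


async def is_empty_fixed() -> bool:
    count = await fetch_count()  # count is the integer 0
    return not count             # True, because 0 is falsy
```

A single test asserting that is_empty_buggy() returns True for an empty source would fail on the first run, which is the kind of evidence a visual review does not produce.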

Human review is a probabilistic defense — reviewers see what they look for, and AI error categories don't always match a human's habitual checks. Automated verification (tests, linters, type checkers) is deterministic: it catches errors whether or not anyone thought to look.

The methodology therefore mandates verification at multiple levels:

  1. Tests must run and pass. Not "the tests exist." Not "the tests should pass." The actual test output must be present in the session before a task is marked complete.
  2. Coverage must meet thresholds. 90% for the workflow state machine and activities (the system's correctness-critical core), 70% for all other layers. Tooling enforces these thresholds; they do not depend on discipline.
  3. Linting must pass. Post-edit hooks run linters automatically. Code that introduces lint errors is flagged before it reaches review.
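The two-tier coverage rule can be sketched as a small check. This is a minimal illustration, not the project's actual tooling: the path prefixes in CORE_PREFIXES and the per-path coverage input are assumptions, and a real setup may aggregate coverage per layer rather than per file.

```python
CORE_THRESHOLD = 90.0     # workflow state machine and activities
DEFAULT_THRESHOLD = 70.0  # all other layers

# Hypothetical path prefixes marking the correctness-critical core.
CORE_PREFIXES = ("workflow/", "activities/")


def required_threshold(path: str) -> float:
    """Return the coverage percentage required for the given path."""
    return CORE_THRESHOLD if path.startswith(CORE_PREFIXES) else DEFAULT_THRESHOLD


def coverage_failures(coverage_by_path: dict[str, float]) -> list[str]:
    """List every path whose coverage falls below its required threshold."""
    failures = []
    for path, pct in sorted(coverage_by_path.items()):
        need = required_threshold(path)
        if pct < need:
            failures.append(f"{path}: {pct:.1f}% < {need:.0f}%")
    return failures
```

An empty list means the thresholds are met; any entry is a reason to block, whatever the code looks like in review.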

Evidence: Enforced by templates/hooks/pre-push-gate.sh and the Stop hook. The Stop hook asks for proof before a task completes; the pre-push gate requires passing tests and coverage before a push lands. To confirm, run the gate with a failing test and watch it block the push.

The three core gates (§7.1) implement this principle at the tooling level:

  • Layer 1 (post-edit): Linters run automatically after file edits. Immediate feedback.
  • Layer 2 (stop verification): When the AI signals task completion, a verification prompt fires asking for proof of completion. This catches the "it should work now" failure mode directly.
  • Layer 3 (pre-push): Blocking gate before code reaches the remote. Tests must pass, coverage must meet thresholds.
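The blocking behavior of Layer 3 can be sketched as a runner that stops at the first failing step. This is an assumption-laden illustration in Python, not the contents of templates/hooks/pre-push-gate.sh; the run_gate function and the commands in EXAMPLE_STEPS are hypothetical and depend on the project's toolchain.

```python
import subprocess
import sys


def run_gate(steps: list[tuple[str, list[str]]]) -> int:
    """Run each step in order; return the first non-zero exit code, else 0."""
    for name, cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"gate blocked: {name} failed (exit {result.returncode})", file=sys.stderr)
            return result.returncode
        print(f"gate ok: {name}")
    return 0


# Hypothetical steps, assuming pytest with the pytest-cov plugin is installed.
EXAMPLE_STEPS = [
    ("tests and coverage", ["pytest", "--cov", "--cov-fail-under=70"]),
]
```

A git pre-push hook blocks the push when it exits non-zero, so a wrapper could call sys.exit(run_gate(EXAMPLE_STEPS)) to turn a failing test into a rejected push.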

How: The Superpowers verification-before-completion skill governs the completion gate. Before marking any task complete, the skill requires:

  1. Fresh test output (from the current session, not a previous run; timestamps must fall within the session)
  2. Coverage report showing threshold compliance
  3. Linter output showing zero new warnings
  4. For UI changes: screenshot or description of visual verification
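The freshness requirement in the list above could be checked mechanically along these lines. This is a sketch under stated assumptions: the Evidence record, REQUIRED_KINDS, and the session_start cutoff are hypothetical names, and the UI verification item is left out because it is not machine-checkable in the same way.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Evidence:
    kind: str              # e.g. "tests", "coverage", "lint"
    captured_at: datetime  # when the output was produced
    passed: bool           # whether the run succeeded


REQUIRED_KINDS = ("tests", "coverage", "lint")


def missing_evidence(items: list[Evidence], session_start: datetime) -> list[str]:
    """Return required kinds with no passing evidence captured at or after session_start."""
    fresh = {e.kind for e in items if e.passed and e.captured_at >= session_start}
    return [kind for kind in REQUIRED_KINDS if kind not in fresh]
```

Stale output and failing output are treated the same way here: neither counts as proof, so both leave the task incomplete.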

The Stop hook (configured in .claude/settings.json) reinforces this at the infrastructure level. When Claude attempts to conclude a response with implementation claims, the hook prompt asks: "Have you shown the test output? Have you verified the change works?" This is a structural safeguard against the AI's tendency to declare victory prematurely.

For the full testing standard, see appendix-a-testing.md. For quality gate architecture, see appendix-e-hooks.md.