Evidence Over Claims

In one line: "It should work now" means "I have not verified it." Nothing is complete until fresh output proves it.

Do this: Before marking any task done, paste the actual test output, coverage figure, and clean linter run from the current session.

What: No implementation is complete until fresh verification output (test results, coverage, runtime behavior) confirms the claim.

Why: LLMs produce plausible, confident output regardless of correctness — a structural property, not a bug. Code with correct syntax, type annotations, and coherent docstrings looks right to a reviewer even when it hides a subtle logic error, an off-by-one, or wrong API usage.

The dangerous failure mode is not obviously broken code (caught immediately) but code that passes visual review and fails at runtime: an async function called without await, a query using the wrong column that happens to match another column's type, an exception silently swallowed where it should propagate.

Human review is a probabilistic defense — reviewers see what they look for, and AI error categories don't always match a human's habitual checks. Automated verification (tests, linters, type checkers) is deterministic: it catches errors whether or not anyone thought to look.

The methodology therefore mandates verification at multiple levels:

Tests must run and pass. Not "the tests exist." Not "the tests should pass." The actual test output must be present in the session before a task is marked complete.
Coverage must meet thresholds. 90% for workflow state machine and activities (the system's correctness-critical core), 70% for all other layers. These thresholds are enforced by tooling, not by discipline.
Linting must pass. Post-edit hooks run linters automatically. Code that introduces lint errors is flagged before it reaches review.

Evidence: templates/hooks/pre-push-gate.sh enforces lint, format and tests, and fails closed on both directions of not-knowing: it exits 2 when it cannot find its source root, and exits 2 when S4U_TEST_CMD is unset rather than guessing a runner and finding nothing. Verify it yourself — S4U_TEST_CMD='sh -c "exit 1"' against the hook returns 2. Coverage (section 7) remains genuinely optional and is labelled so. Until 2026-08-02 the test block shipped commented out under an "(Optional)" heading while this line claimed enforcement; the claim is now mechanized rather than withdrawn.

The three core gates (§7.1) implement this principle at the tooling level:

Layer 1 (post-edit): Linters run automatically after file edits. Immediate feedback.
Layer 2 (stop verification): When the AI signals task completion, a verification prompt fires asking for proof of completion. This catches the "it should work now" failure mode directly.
Layer 3 (pre-push): Blocking gate before code reaches the remote. Tests must pass, coverage must meet thresholds. It sees only a push the agent itself runs — a human push or a web-UI merge bypasses it.
Layer 4 (merge): Required status checks on the default branch. The only layer that covers every actor; pair it with a post-hoc merge observer wherever an admin bypass exists.

How: The Superpowers verification-before-completion skill governs the completion gate. Before marking any task complete, the skill requires:

Fresh test output (not from a previous run — timestamps must be recent)
Coverage report showing threshold compliance
Linter output showing zero new warnings
For UI changes: screenshot or description of visual verification

The Stop hook (configured in .claude/settings.json) reinforces this at the infrastructure level. When Claude attempts to conclude a response with implementation claims, the hook prompt asks: "Have you shown the test output? Have you verified the change works?" This is a structural safeguard against the AI's tendency to declare victory prematurely.

For the full testing standard, see appendix-a-testing.md. For quality gate architecture, see appendix-e-hooks.md.