The Incident-Response Cycle
In one line: When something is broken in production, stop the bleeding with the fastest reverse lever first, then find the cause — and pin the fix with a failing test and a postmortem so the incident class cannot silently recur.
Do this: Mitigate before you diagnose. Flip the feature flag OFF or revert the deploy to restore a known-good state, then investigate against the live system — never debug a production fault by leaving users exposed while you read code.
The cycle has six steps, in order:
- Detect. An alert, a probe failure, or a user report surfaces the symptom. Capture the evidence (logs, the triggering request, the conversation) before it rotates out — durability of the signal is part of the response.
- Mitigate (the fast reverse lever). Restore service with the cheapest-to-reverse control available: flip the offending feature flag OFF, or revert the deploy to the last green SHA. This is why §15.4's blast-radius posture and a flag-gated, single-config-flip rollout path are load-bearing — the reversibility you designed in at the Decision-Cost Rubric (§2.7) is exactly what you spend here. Mitigation buys time; it is not the fix.
- Root-cause. Run
/systematic-debuggingagainst the now-stable system: reproduce, find the actual cause, distinguish verified from hypothesized (the investigative discipline of §2.6). Do not skip to a fix on a guessed cause. - Write the failing test (regression pin). Before touching the fix, write a test that reproduces the incident and fails — the Bug Fix Cycle (§3.2) applies unchanged. The test is the permanent pin that proves the class is closed and stays closed.
- Postmortem. Record the incident in a short ADR or memory entry that names the incident class — what failed, why mitigation worked, what the regression pin now guards. A fault that produced no error where one was due is a silent-failure incident: the postmortem must call that out, so the next occurrence is loud, not invisible.
- Update memory + STATE.md. Fold the lesson into project memory (one-line index entry, durable-first) and regenerate STATE.md in the same pass, so a fresh session inherits both the fix and the warning. This is memory discipline applied to incidents: the entry point never rots back to the pre-incident state.
Why: Conflating mitigation with diagnosis is the dominant failure mode under pressure — engineers reach for the root cause while production stays broken. Separating "stop the bleeding" from "find the cause" makes the fast lever (flag/deploy) the reflex and the slow work (root-cause, test, postmortem) the follow-through. The regression pin and the named-class postmortem are what convert a one-off scramble into a permanently-closed incident class rather than a recurring fire.