The incident — fifteen changes, three broken features
Over the course of a single working session, more than fifteen changes were implemented on a production AI tutoring platform. Each change followed the process: Change Request written, reviewed by spawned agents, implemented, audited, committed, and deployed.
Three working features broke. The storytelling feature lost all 37 narrator voices. Session personalization stopped working. Adaptive difficulty adjustments disappeared.
The review agents had scored each change at 8/10 or higher confidence. The audit agents confirmed each implementation matched its spec. The process ran perfectly — and still produced failures.
The root cause wasn't in the code. It was in the plans. The Change Requests contained assumptions that were treated as facts but never verified. Components were classified by their names rather than their actual behavior. Reviews checked whether code matched the spec, but nobody checked whether the spec was correct.
Seven specific disciplines would have prevented every one of these failures. They are not replacements for the CR workflow — they are additions that catch the failure modes the workflow alone cannot.
Failure mode 1: unverified assumptions
Every Change Request contains assumptions about how the system works. The most dangerous assumptions are the ones that feel obviously true.
- "This tool is dead code" — assumption: nothing depends on it.
- "The backend already handles this" — assumption: there's existing logic for this case.
- "This dictionary is session-scoped" — assumption: the data doesn't persist across sessions.
In the incident, a CR described an in-memory dictionary as session-scoped. It wasn't — it was a module-level global that persisted across all sessions and all requests. Every session added entries. Nothing removed them. The CR proposed adding more data to this dictionary, which accelerated a memory leak that took days to manifest.
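The pattern is easy to reproduce. A minimal sketch, with illustrative names (`session_state`, `handle_request` are not from the incident), of a module-level dictionary that looks session-scoped but persists for the life of the process:

```python
# Minimal sketch of the failure pattern: a module-level dict that looks
# session-scoped but is actually shared by all sessions and all requests.

session_state: dict[str, dict] = {}  # module-level: lives as long as the process

def handle_request(session_id: str, payload: dict) -> None:
    # Every session adds an entry; nothing ever removes it.
    session_state.setdefault(session_id, {}).update(payload)

# Simulate many short-lived sessions: entries accumulate forever.
for i in range(1000):
    handle_request(f"session-{i}", {"last_payload": i})

print(len(session_state))  # 1000 entries still live after every session has ended
```

A one-minute check (read the file, confirm the dict is defined at module level and never cleared) would have flagged the "session-scoped" description as false.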
Two review agents evaluated the CR. Both confirmed the proposed changes were correct given the described behavior. Neither questioned whether the described behavior was accurate.
The discipline: list and verify
Before finalizing any CR, write down every assumption it makes. Then verify each one against actual evidence — code, database schema, rendered output, or production data. Not by inferring from names or comments. By checking.
If an assumption can't be verified, the CR is not ready.
The cost of verifying is measured in seconds. Ask your AI to read one file, check one schema, or show one rendered prompt. The cost of not verifying is a deployed change built on false premises that two review agents will enthusiastically confirm.
Verify an assumption
> Before we finalize this CR — list every assumption it makes about the system. Then verify each one. Show me the evidence.
This single prompt catches the most expensive class of errors in AI-driven development: plans built on false premises. The AI searches the actual code and data rather than relying on descriptions.
Failure mode 2: reviews that check code, not product
Code correctness asks: "Does the implementation match the specification?"
Product correctness asks: "Will the user still have a good experience after this change?"
Most review processes only check the first question. When the specification itself is wrong — because it was built on wrong assumptions, or because it misidentified what a component does — code correctness confirms the mistake rather than catching it.
In the incident, a CR proposed removing twelve tools classified as unnecessary. Two independent review agents confirmed all twelve removals were clean: no import errors, no broken references, consistent patterns, 9/10 confidence.
One of those twelve tools — switch_voice — powered the entire storytelling feature. Another triggered adaptive learning adjustments. A third stored personalization data.
The reviews were technically perfect. The code matched the spec. The spec was the problem.
The discipline: ask both questions
Every review prompt should explicitly include two questions:
- Does this code correctly implement the CR? (code correctness)
- After this change, will any working feature stop working or degrade? (product correctness)
Review agents follow instructions literally. If you only ask question 1, that's the only question they'll answer. Adding question 2 costs nothing but catches the entire category of "the code is correct but the plan is wrong" failures.
Add product correctness to reviews
> Review this CR. Check both: (1) Does the code correctly implement the spec? (2) After this change, will any working feature stop working or degrade for the user?
The second question forces the review agent to think about user impact, not just code correctness. This catches plan-level errors that code-only reviews miss entirely.
Failure mode 3: names lie
A function called switch_voice could be a debugging utility, a settings toggle, or the engine behind a storytelling feature with 37 narrator personas. The name tells you what someone intended when they wrote it, not what it evolved into.
In the incident, twelve tools were classified based on their names:
- switch_voice → "backend-driven persona rotation"
- log_session_event → "observability logging"
- set_memory → "debugging utility"
None of these classifications were correct. switch_voice powered storytelling. log_session_event triggered adaptive difficulty. set_memory stored user personalization.
The discipline: four-step verification
Before removing, renaming, or modifying any component:
- Read the implementation description — what do the docs, CRs, or skill files say it does?
- Check the rendered output — what does the user actually experience? For AI prompts, see the rendered prompt from a real session. For tools, check what happens when the AI calls them.
- Grep every file for references — search the entire project for every mention. One missed reference means one runtime failure after deploy.
- Identify the product feature — not "what does the code do?" but "what product experience disappears if I remove this?"
Step 4 is the one most people skip — and the one that prevents the most damage. A builder who asks "what will the user lose?" before removing anything would have caught every tool misclassification in the incident.
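Step 3 can be partially automated. A minimal sketch of a reference finder, assuming a plain substring search over Python files is enough for a first pass (`find_references` is a hypothetical helper, not a tool from the incident):

```python
# Hypothetical helper for step 3: search every project file for a symbol
# so a removal CR can prove nothing still depends on it. This is a plain
# substring search, not a parser; treat hits as leads for manual review.
from pathlib import Path

def find_references(root: str, symbol: str, exts: tuple = (".py",)) -> list:
    """Return (file, line number, line text) for every mention of symbol."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if symbol in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Zero hits is the evidence a removal CR needs; every hit is a dependency the CR must explain, e.g. `find_references("src", "switch_voice")` before proposing to delete that tool.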
Failure mode 4: shallow audits miss structural damage
Standard post-implementation audits check three things: the code matches the spec, there are no bugs, and the code follows existing patterns. These three checks catch most issues.
But the incident revealed three additional failure modes that standard audits missed:
Defensive hacks — code that "works" by adding unnecessary protections instead of fixing the root cause. Wrapping an immutable data structure in a mutable one because something else expects mutability. Adding redundant null checks because a value "might" be null. Using try/except to swallow errors silently. These hacks mask the real problem and make future debugging harder.
Structural conflicts — changes that create naming collisions or state management issues. A new Python package directory that shadows an existing module name (causing import failures). Session data stored in a module-level dictionary that persists across all sessions (causing memory leaks). Initialization code that depends on services being registered in a specific order (causing test failures).
Band-aid fixes — addressing symptoms instead of root causes. A function crashes on the wrong data type? A band-aid adds a type conversion at the call site. A root-cause fix ensures the type is correct at the source.
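The contrast can be made concrete. A contrived sketch, assuming a score that should always be an int but arrives as a string from one caller (`adjust_difficulty` and `parse_score` are illustrative names, not code from the incident):

```python
# Band-aid vs. root-cause fix for a type error, on illustrative names.

def adjust_difficulty(score: int) -> int:
    return min(score + 1, 10)

# Band-aid: convert at one call site. Every other caller stays at risk,
# and invalid values like "-3" still slip through silently.
raw = "7"
level = adjust_difficulty(int(raw))

# Root-cause fix: one place where untrusted input becomes a validated int,
# so every caller downstream can rely on the type and the range.
def parse_score(raw_value: str) -> int:
    score = int(raw_value)
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score

level = adjust_difficulty(parse_score(raw))
```

The band-aid makes one crash disappear; the root-cause fix makes the whole class of crashes impossible at the source.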
The discipline: expanded audit checklist
Add three questions to your audit prompts:
- Is this a clean, root-cause fix — or a defensive hack that masks the symptom?
- Does this change create any structural conflicts (naming collisions, shared state, initialization order dependencies)?
- Are there any band-aid fixes that address symptoms rather than causes?
Failure mode 5: scattered config with no justification trail
In the incident, tool availability was controlled by thirteen separate Python sets and dictionaries scattered across two files. Each list served a different purpose: which tools are visible, which are silent, which advance the session, which are media-related. There was no documentation explaining why each tool appeared in each list.
Removing a tool from one list but not the other twelve caused runtime errors. Removing a tool entirely broke features because nobody could check why it was there. Adding a new tool required updating thirteen places, and missing one caused inconsistent behavior.
The discipline: config registries with justifications
Any configuration that controls system behavior should have:
- A single source of truth — one file or registry where all entries are defined
- A justification field for each entry explaining WHY it exists (not just what it does)
- Derived sets computed from the registry, not manually maintained duplicates
- Validation that catches mismatches between the registry and the actual code
The justification field is the key innovation. When someone wants to remove an entry, they read the justification first. "Powers the storytelling feature with 37 narrator personas" is harder to casually delete than an entry in an uncommented list.
Derived sets eliminate the scattered-lists problem entirely. The "silent tools" set is computed by filtering the registry for tools marked as silent. There is no separate list to maintain — or forget to update.
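A minimal sketch of such a registry in Python; the dataclass shape, field names, and entries are illustrative, not the incident's actual configuration:

```python
# Sketch of a tool registry with per-entry justifications, derived sets,
# and validation against the actual code. Entries are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolEntry:
    name: str
    justification: str            # WHY the tool exists, not just what it does
    silent: bool = False          # produces no user-visible output
    advances_session: bool = False

REGISTRY: dict[str, ToolEntry] = {
    "switch_voice": ToolEntry(
        "switch_voice",
        "Powers the storytelling feature with 37 narrator personas.",
    ),
    "log_session_event": ToolEntry(
        "log_session_event",
        "Triggers adaptive difficulty adjustments.",
        silent=True,
    ),
}

# Derived set: computed from the registry, so there is no second list to forget.
SILENT_TOOLS = {t.name for t in REGISTRY.values() if t.silent}

def validate_registry(registered_handlers: set) -> None:
    """Fail fast when the registry and the actual code disagree."""
    missing = set(REGISTRY) - registered_handlers
    extra = registered_handlers - set(REGISTRY)
    if missing or extra:
        raise RuntimeError(f"registry mismatch: missing={missing}, extra={extra}")
```

At startup, `validate_registry` is called with the set of handler names the code actually registers; a tool removed from one place but not the other fails loudly before it ships, instead of at runtime.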
The seven rules — summary
These seven disciplines address the five failure modes above plus two foundational rules from the CR workflow:
| # | Rule | What it prevents |
|---|------|------------------|
| 1 | No code without a reviewed plan | Quick fixes with untested edge cases |
| 2 | Verify every assumption before writing code | Plans built on false premises |
| 3 | Understand the feature, not just the code | Removing components that power working features |
| 4 | Review product correctness, not just code correctness | Spec is wrong but code matches it |
| 5 | Expanded audit checklist | Defensive hacks, structural conflicts, band-aids |
| 6 | Centralized config with justifications | Scattered config with no documentation trail |
| 7 | Guide focus, don't constrain flow | Rigid phase gates in naturally fluid interactions |
Rules 1 and 7 are covered in detail in AI Development Change Requests and AI Development Quality Gates. Rules 2 through 6 are the additions from this article.
Note on Rule 7: this applies specifically to projects where an AI agent acts on behalf of a user (tutoring, conversation, creative tools). For sequential workflows (onboarding forms, checkout flows), hard phase gates work fine.
The template — paste into your CLAUDE.md
These rules are designed to be self-contained. Copy the block below into your project's CLAUDE.md file. Customize Rule 6 (guide focus) based on whether your project includes an AI agent — if it doesn't, remove it.
The AI Builder Starter Kit includes this template alongside the CR workflow, changelog format, and project structure.
```markdown
## Process Discipline

1. No code without a reviewed plan
   No changes until: a written plan exists, it was reviewed by at least one agent, and the reviewer confirmed both code correctness AND product correctness.
2. Verify every assumption before writing code
   - List assumptions when writing the plan
   - Verify each against actual code, data, or output
   - No implementation until all are verified
3. Understand the feature, not just the code
   Before modifying anything: read the implementation, check rendered output, grep all references, identify the product feature it supports.
4. Audit checklist — beyond "does it match the spec"
   Also check: clean root-cause fix (no defensive hacks), no structural conflicts, no band-aids masking symptoms.
5. Config with documented justifications
   Single source of truth, a justification field per entry, derived sets, validation that catches mismatches.
6. Guide focus, don't constrain flow (AI agent projects)
   Sequential flows: hard phase gates. Fluid flows: focus attention without blocking the agent from doing the right thing.
```
Frequently asked questions
- What is process discipline in AI development?
- Process discipline is a set of rules that prevent the failure modes that standard CR workflows and quality gates miss — specifically, plans built on unverified assumptions, reviews that only check code correctness, and modifications based on component names rather than actual behavior.
- Why do AI code reviews miss plan-level errors?
- Review agents follow instructions literally. If you ask 'does the code match the spec,' they check exactly that. They don't question whether the spec itself is correct unless you explicitly ask them to also check product correctness — whether the user experience is maintained after the change.
- How do I verify assumptions in a Change Request?
- List every assumption the CR makes about the system ('this is unused,' 'the backend handles this,' 'this is session-scoped'). Then ask your AI to verify each one by reading actual code, checking database schema, or examining rendered output. Mark each as VERIFIED or FAILED with evidence before finalizing the CR.
- What is the difference between code correctness and product correctness?
- Code correctness checks whether the implementation matches the specification. Product correctness checks whether the user will still have a good experience after the change. When the spec is wrong, code correctness confirms the mistake — only product correctness catches it.