AI Coding Agents Need Enforcement Ladders, Not More Prompts
The Data Is In: AI Coding Agents Break Things
75% of AI coding models introduce regressions when maintaining codebases over time (SWE-CI, arxiv 2603.03823). Not on one-shot fixes -- those work. On sustained maintenance across 71 consecutive commits per task. The longer the horizon, the worse it gets.
And it gets worse: developers using AI coding assistants score 17% lower on conceptual understanding, code reading, and debugging assessments (Anthropic, arxiv 2601.20245). The tools designed to help are eroding the team's ability to catch the problems the tools create.
Meanwhile, agents given the freedom to choose their own tools outperform pre-programmed pipelines by 10.7% (Tsinghua, arxiv 2603.01853). The solution is not less autonomy. It is better enforcement around autonomous agents.
The Root Cause: Prose Enforcement Fails Under Pressure
Every AI team writes rules in markdown files. "Never modify production config." "Always run tests before committing." "Use the existing patterns."
These are suggestions, not enforcement. When the context window fills up -- and it always does -- the model drops these rules first. They are the lowest-priority tokens in the window. The agent does not intentionally violate them; it simply forgets they exist.
This is the structural failure mode of every AI coding setup that relies on prompts alone. The prevent-by-construction alternative encodes rules at levels the model cannot forget.
The Enforcement Ladder: L1 Through L5
The fix is a hierarchy. Each level compounds on the one below:
L1 -- Conversation. "Hey, don't do that." Works once. Forgotten by the next session.
L2 -- Prose documentation. CLAUDE.md rules, README instructions. Better than conversation, but still dropped under context pressure. All 3,706 violations tracked in our system started as L2 rules.
L3 -- Templates. Code templates, CI/CD configs, project scaffolds. The right pattern is the easy path. Violations happen when agents go off-template.
L4 -- Tests. Automated test suites that catch violations at commit time. The agent cannot merge if the test fails. This is where enforcement becomes structural.
L5 -- Hooks. Pre-commit hooks, pre-tool-use hooks, runtime guards. The action is physically prevented before it happens. Zero awareness required from the agent or developer.
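To make L5 concrete, here is a minimal pre-commit hook sketch in Python. The protected paths and the helper names are illustrative assumptions, not any specific product's implementation; the idea is only that a staged change to a protected file aborts the commit before it exists.

```python
#!/usr/bin/env python3
"""L5 sketch: a pre-commit hook that physically prevents a commit.

Illustrative only -- the protected paths below are assumptions. A real
install would save this as .git/hooks/pre-commit (executable) and end
it with sys.exit(main()).
"""
import subprocess

# Paths the agent must never modify (hypothetical examples).
PROTECTED = ("config/production.yaml", ".env.production")

def find_violations(staged_files):
    """Return staged paths that fall under a protected prefix."""
    return [f for f in staged_files if f.startswith(PROTECTED)]

def main() -> int:
    """Exit code 1 aborts the commit; the agent cannot proceed."""
    try:
        out = subprocess.run(
            ["git", "diff", "--cached", "--name-only"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return 0  # not inside a git repo: nothing to guard
    violations = find_violations(out.stdout.splitlines())
    for path in violations:
        print(f"BLOCKED: protected file staged: {path}")
    return 1 if violations else 0
```

The point is the enforcement location: the agent needs zero awareness of the rule, because git refuses the commit on its behalf.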
The principle: every lesson must be encoded at a level where enforcement requires zero awareness. Prose is the failure mode; before writing a rule as prose, justify why structural enforcement is impossible.
How It Works in Practice
A rule like "never write to the production database" starts at L2 (documented in CLAUDE.md). The first violation gets caught in code review. The second time, it gets promoted to L4 (a test that checks database connection strings). The third time -- there is no third time, because it is now an L5 hook that blocks the commit.
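An L4 guard for that rule might look like the following pytest-style sketch. The production host pattern and the connection-string source are assumptions for illustration; a real project would import its actual settings modules.

```python
# test_no_prod_db.py -- an L4-style guard test (illustrative sketch).
import re

# Hostname pattern for the production database (assumed for illustration).
PROD_DB_PATTERN = re.compile(r"prod-db\.internal")

def collect_connection_strings():
    """Gather every database URL the codebase could load.

    A real suite would import the project's settings modules here; this
    sketch returns a hardcoded list standing in for parsed config values.
    """
    return [
        "postgres://app@staging-db.internal:5432/app",
        "postgres://app@localhost:5432/app_test",
    ]

def test_no_production_database_urls():
    # The agent cannot merge while any config points at production.
    offenders = [u for u in collect_connection_strings()
                 if PROD_DB_PATTERN.search(u)]
    assert offenders == [], f"production DB referenced: {offenders}"
```

Unlike the L2 rule it replaces, this check cannot be dropped from a context window: CI runs it whether or not the agent remembers it.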
Each promotion reduces the violation surface. The system optimizes its own quality over time.
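The promotion rule itself can be sketched in a few lines. The level names and the one-violation-per-rung threshold are assumptions for illustration; the walkthrough above jumps straight from L2 to L4, and real policies would weight rules by severity.

```python
# Sketch of ladder promotion: each recorded violation moves a rule one
# rung up, capped at L5. Thresholds are assumptions, not a fixed policy.
LADDER = ["L2-prose", "L3-template", "L4-test", "L5-hook"]

def promoted_level(violations_seen: int) -> str:
    """Map a rule's violation count to its enforcement level."""
    rung = min(violations_seen, len(LADDER) - 1)
    return LADDER[rung]
```

Tracking counts per rule is what makes the ladder self-tightening: every new violation is evidence that the current level is too weak.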
We shipped 26 specs autonomously with this approach. 960+ commits across two repos. 3,706 violations tracked, diagnosed, and encoded. The enforcement ladder does not slow agents down -- it makes their autonomy safe.
What This Means for Your Team
If you are using AI coding agents today, ask yourself:
- How many of your rules are L2 prose? (Most are.)
- How many violations have you tracked? (Probably zero -- you are not counting.)
- What happens when your agent fills its context window? (Your rules disappear.)
The fix is not more prompts. It is structural enforcement at L4-L5 for your most critical rules.
Get a Free Governance Scan
Run your repository through our free governance scanner to see exactly where your enforcement gaps are. No signup required.
Scan your repo now at walseth.ai/scan
Doug Walseth builds autonomous AI agent systems with built-in enforcement. His converge methodology has been applied to 3 production codebases with 3,706 violations tracked and encoded.
Citations:
- SWE-CI: arxiv.org/abs/2603.03823
- Anthropic skill erosion: arxiv.org/abs/2601.20245
- Tsinghua autonomous search: arxiv.org/abs/2603.01853