AI Coding Agents Need Enforcement Ladders, Not More Prompts
The Data Is In: AI Coding Agents Break Things
75% of AI coding models introduce regressions when maintaining codebases over time (SWE-CI, arxiv 2603.03823). Not on one-shot fixes -- those work. On sustained maintenance across 71 consecutive commits per task. The longer the horizon, the worse it gets.
And it gets worse: developers using AI coding assistants score 17% lower on conceptual understanding, code reading, and debugging assessments (Anthropic, arxiv 2601.20245). The tools designed to help are eroding the team's ability to catch the problems the tools create.
Meanwhile, agents given the freedom to choose their own tools outperform pre-programmed pipelines by 10.7% (Tsinghua, arxiv 2603.01853). The solution is not less autonomy. It is better enforcement around autonomous agents.
The Root Cause: Prose Enforcement Fails Under Pressure
Every AI team writes rules in markdown files. "Never modify production config." "Always run tests before committing." "Use the existing patterns."
These are suggestions, not enforcement. When the context window fills up -- and it always does -- the model drops these rules first. They are the lowest-priority tokens in the window. The agent does not intentionally violate them; it simply forgets they exist.
This is the structural failure mode of every AI coding setup that relies on prompts alone. The prevent-by-construction alternative encodes rules at levels the model cannot forget.
The Enforcement Ladder: L1 Through L5
The fix is a hierarchy. Each level compounds on the one below:
L1 -- Conversation. "Hey, don't do that." Works once. Forgotten by the next session.
L2 -- Prose documentation. CLAUDE.md rules, README instructions. Better than conversation, but still dropped under context pressure. All 3,706 violations tracked in our system started as L2 rules.
L3 -- Templates. Code templates, CI/CD configs, project scaffolds. The right pattern is the easy path. Violations happen when agents go off-template.
L4 -- Tests. Automated test suites that catch violations at commit time. The agent cannot merge if the test fails. This is where enforcement becomes structural.
L5 -- Hooks. Pre-commit hooks, pre-tool-use hooks, runtime guards. The action is physically prevented before it happens. Zero awareness required from the agent or developer.
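To make L5 concrete, here is a minimal pre-commit hook sketch in Python. The protected paths and the helper names are illustrative assumptions, not any specific product's implementation; the idea is only that a staged change to a protected file aborts the commit before it exists.

```python
#!/usr/bin/env python3
"""L5 sketch: a pre-commit hook that physically prevents a commit.

Illustrative only -- the protected paths below are assumptions. A real
install would save this as .git/hooks/pre-commit (executable) and end
it with sys.exit(main()).
"""
import subprocess

# Paths the agent must never modify (hypothetical examples).
PROTECTED = ("config/production.yaml", ".env.production")

def find_violations(staged_files):
    """Return staged paths that fall under a protected prefix."""
    return [f for f in staged_files if f.startswith(PROTECTED)]

def main() -> int:
    """Exit code 1 aborts the commit; the agent cannot proceed."""
    try:
        out = subprocess.run(
            ["git", "diff", "--cached", "--name-only"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return 0  # not inside a git repo: nothing to guard
    violations = find_violations(out.stdout.splitlines())
    for path in violations:
        print(f"BLOCKED: protected file staged: {path}")
    return 1 if violations else 0
```

The point is the enforcement location: the agent needs zero awareness of the rule, because git refuses the commit on its behalf.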
The principle: every lesson must be encoded at a level where enforcement requires zero awareness. Prose is the failure mode; before writing a rule as prose, justify why structural enforcement is impossible.
How It Works in Practice
A rule like "never write to the production database" starts at L2 (documented in CLAUDE.md). The first violation gets caught in code review. The second time, it gets promoted to L4 (a test that checks database connection strings). The third time -- there is no third time, because it is now an L5 hook that blocks the commit.
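An L4 guard for that rule might look like the following pytest-style sketch. The production host pattern and the connection-string source are assumptions for illustration; a real project would import its actual settings modules.

```python
# test_no_prod_db.py -- an L4-style guard test (illustrative sketch).
import re

# Hostname pattern for the production database (assumed for illustration).
PROD_DB_PATTERN = re.compile(r"prod-db\.internal")

def collect_connection_strings():
    """Gather every database URL the codebase could load.

    A real suite would import the project's settings modules here; this
    sketch returns a hardcoded list standing in for parsed config values.
    """
    return [
        "postgres://app@staging-db.internal:5432/app",
        "postgres://app@localhost:5432/app_test",
    ]

def test_no_production_database_urls():
    # The agent cannot merge while any config points at production.
    offenders = [u for u in collect_connection_strings()
                 if PROD_DB_PATTERN.search(u)]
    assert offenders == [], f"production DB referenced: {offenders}"
```

Unlike the L2 rule it replaces, this check cannot be dropped from a context window: CI runs it whether or not the agent remembers it.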
Each promotion reduces the violation surface. The system optimizes its own quality over time.
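The promotion rule itself can be sketched in a few lines. The level names and the one-violation-per-rung threshold are assumptions for illustration; the walkthrough above jumps straight from L2 to L4, and real policies would weight rules by severity.

```python
# Sketch of ladder promotion: each recorded violation moves a rule one
# rung up, capped at L5. Thresholds are assumptions, not a fixed policy.
LADDER = ["L2-prose", "L3-template", "L4-test", "L5-hook"]

def promoted_level(violations_seen: int) -> str:
    """Map a rule's violation count to its enforcement level."""
    rung = min(violations_seen, len(LADDER) - 1)
    return LADDER[rung]
```

Tracking counts per rule is what makes the ladder self-tightening: every new violation is evidence that the current level is too weak.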
We shipped 26 specs autonomously with this approach. 960+ commits across two repos. 3,706 violations tracked, diagnosed, and encoded. The enforcement ladder does not slow agents down -- it makes their autonomy safe.
What This Means for Your Team
If you are using AI coding agents today, ask yourself:
- How many of your rules are L2 prose? (Most are.)
- How many violations have you tracked? (Probably zero -- you are not counting.)
- What happens when your agent fills its context window? (Your rules disappear.)
The fix is not more prompts. It is structural enforcement at L4-L5 for your most critical rules.
Get a Free Governance Scan
Run your repository through our free governance scanner to see exactly where your enforcement gaps are. No signup required.
Scan your repo now at walseth.ai/scan
Doug Walseth builds autonomous AI agent systems with built-in enforcement. His converge methodology has been applied to 3 production codebases with 3,706 violations tracked and encoded.
Citations:
- SWE-CI: arxiv.org/abs/2603.03823
- Anthropic skill erosion: arxiv.org/abs/2601.20245
- Tsinghua autonomous search: arxiv.org/abs/2603.01853