LLM Reasoning Is Fake — That's Why You Need Enforcement
Pedro Domingos, author of The Master Algorithm and one of the most cited researchers in machine learning, said it plainly: "Almost all LLM reasoning is fake."
He is right. And that single fact is the strongest argument for structural enforcement I have ever encountered.
The Skeptic Has a Point
Let me be clear about what "fake reasoning" means. When an LLM produces a chain-of-thought, it is not deducing from first principles. It is pattern-matching against statistical regularities in its training data. Recent work by Shi Weiyan and collaborators found that 97%+ of LLM "thinking" steps are decorative -- they do not actually contribute to the final answer. The model arrives at its conclusion through learned associations, then backfills plausible-looking reasoning to connect the dots.
This is not controversial anymore. It is the consensus view among researchers who study these systems closely. The chain-of-thought is scaffolding that was never load-bearing.
So the skeptics who say "LLMs cannot reason" are correct. But they draw exactly the wrong conclusion from that fact. They say: therefore AI coding tools are unreliable, therefore we should not trust them, therefore we should go back to doing everything manually.
That conclusion ignores a $150K salary gap between engineers who use AI tools effectively and those who do not. It ignores the reality that every major software company is deploying these systems right now. The question is not whether to use AI agents. The question is how to use them without getting burned.
Better Prompts Are Not the Answer
The instinct when confronted with unreliable reasoning is to write better prompts. Add more context. Use few-shot examples. Tell the model to "think step by step." This is the prose layer of enforcement -- Level 2 on the ladder -- and it is the weakest form of control you can exercise.
Why? Because prompts are suggestions. They are processed by the same pattern-matching engine that produces the unreliable reasoning in the first place. You are asking the system to be more careful using the same mechanism that makes it careless. ReVeal, a framework published as arXiv preprint 2506.11442, demonstrates that self-verification is what actually drives self-improvement in language models. But self-verification only works when there is a structural layer that catches failures and feeds them back into the loop. Without that layer, the model simply generates confident-sounding nonsense and moves on.
This is the core insight: you do not fix unreliable reasoning by improving the reasoning. You fix it by building verification structures around the reasoning.
The Enforcement Ladder
Consider what happens when an AI coding agent introduces a regression. According to SWE-CI, 75% of AI coding models introduce regressions when fixing bugs. Three out of four times, the "fix" breaks something else.
Without structural enforcement, here is what happens: the agent reasons its way through a bug fix, produces code that looks correct, writes a commit message explaining why it is correct, and pushes it. The reasoning was plausible. The code was wrong. You find out in production.
With structural enforcement, here is what happens: the agent produces the same plausible-but-wrong code. A pre-commit hook runs the test suite. The tests fail. The agent receives the failure, re-examines its work, and iterates. No human had to review anything. The structural layer caught what the reasoning layer missed.
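That feedback loop is simple enough to sketch. Here is a minimal version, assuming nothing about any particular framework: `apply_patch` and `verify` are hypothetical stand-ins for your agent call and your test runner, not real APIs.

```python
from typing import Callable, Tuple

def fix_with_enforcement(
    apply_patch: Callable[[str], None],      # the agent: reasons, possibly wrongly
    verify: Callable[[], Tuple[bool, str]],  # the harness: tests/linters, reliable
    bug_report: str,
    max_attempts: int = 3,
) -> bool:
    """Feed structural failures back into the agent until the checks pass.

    No human reviews anything inside the loop; the verification layer
    catches what the reasoning layer misses.
    """
    feedback = bug_report
    for _ in range(max_attempts):
        apply_patch(feedback)        # plausible-but-maybe-wrong change
        passed, output = verify()    # structural check, independent of reasoning
        if passed:
            return True
        # The failure itself becomes the next input -- enforcement, not trust.
        feedback = f"{bug_report}\n\nChecks failed:\n{output}"
    return False                     # still failing: escalate to a human
```

In practice `verify` would shell out to something like `pytest -q` or your pre-commit hooks; the point is that the loop closes on a structural check, not on the model's own confidence.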
This is not hypothetical. In a named-agents deployment running ACE, we tracked 3,706 violations caught by automated enforcement hooks. Each one was a case where an agent's reasoning produced output that violated a structural constraint -- and the system caught it before it reached production.
Anthropic's own research found 17% skill erosion among developers using AI tools without guardrails. The tools make you faster at producing code and simultaneously worse at evaluating it. That is a dangerous combination without a verification layer.
The Three-Part Architecture
The architecture that makes unreliable reasoning safe has three components, and all three are necessary.
The model reasons. It does so imperfectly, through pattern matching and statistical association. That is fine. Human reasoning is also imperfect, biased, and prone to error. We do not demand perfection from human engineers. We demand that their work passes review.
The harness enforces. Pre-commit hooks, test suites, linters, type checkers, schema validators -- these are structural enforcement layers that operate independently of the model's reasoning. They do not care why the code was written. They care whether it meets the specification. This is Level 5 enforcement: hooks that require zero awareness from the agent to function.
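A Level 5 hook can be a dozen lines. Here is a minimal sketch of a git pre-commit hook written in Python; the specific commands in `CHECKS` are assumptions, so substitute your own test and lint runners.

```python
#!/usr/bin/env python3
"""Minimal git pre-commit hook: refuse the commit unless every check passes.
Install as .git/hooks/pre-commit and mark it executable. The agent needs
zero awareness of this file for it to fire."""
import subprocess
import sys

# Assumed commands -- swap in your project's test suite and linter.
CHECKS = [
    ["pytest", "-q"],
    ["ruff", "check", "."],
]

def run_checks(checks) -> int:
    """Run each check; return the first non-zero exit code, else 0."""
    for cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            sys.stderr.write(proc.stdout + proc.stderr)
            return proc.returncode  # non-zero exit makes git abort the commit
    return 0

# Only fire when git actually invokes this file as the hook.
if __name__ == "__main__" and sys.argv[0].endswith("pre-commit"):
    sys.exit(run_checks(CHECKS))
```

Git runs the hook before every commit and aborts on a non-zero exit, so the enforcement is unconditional: it does not matter whether the commit came from a human or an agent.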
The context remembers. Persistent memory across sessions means that lessons learned from past failures survive context window limits. When an agent hits a violation, that knowledge is encoded structurally -- not as a prompt instruction that might be ignored, but as a test or hook that will fire automatically next time. The system gets more reliable with every use, compounding returns without requiring anyone to remember anything.
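What "encoded structurally" can look like in miniature -- a sketch under assumptions: the JSONL store and the `Violation` shape here are illustrative, not any particular product's schema.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

STORE = Path("violations.jsonl")  # on disk, so it survives context-window limits

@dataclass
class Violation:
    rule: str     # e.g. "no-print-in-prod"
    pattern: str  # a string whose presence reproduces the violation
    message: str  # what to tell the agent when it fires again

def record(v: Violation) -> None:
    """Persist a caught violation across sessions."""
    with STORE.open("a") as f:
        f.write(json.dumps(asdict(v)) + "\n")

def check(text: str) -> list:
    """Every past violation fires automatically on new output --
    a hook that needs no prompt instruction and cannot be ignored."""
    if not STORE.exists():
        return []
    hits = []
    with STORE.open() as f:
        for line in f:
            v = Violation(**json.loads(line))
            if v.pattern in text:
                hits.append(v.message)
    return hits
```

Each recorded violation becomes a check that fires on every future output. That is what makes the returns compound: the store only grows, and nothing in it depends on anyone remembering anything.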
LangChain demonstrated this architecture in practice. They moved from Top 30 to Top 5 on Terminal Bench 2.0 with zero model changes. Same LLM. Same weights. Same "fake reasoning." Different harness. Different enforcement. Different results.
The Reasoning Is Fake. The Enforcement Is Real.
Without enforcement, you have a powerful pattern-matching engine generating plausible output with no structural verification. You are the guardrail. Your attention is the only thing between the agent's confident mistakes and your production environment. You will miss things. You will get tired. You will trust output that looks right.
With enforcement, the model's unreliable reasoning is wrapped in layers of structural verification that catch violations automatically. The enforcement ladder -- from hooks to tests to templates to configuration -- creates a system that improves with every failure. The 3,706 violations we tracked were not 3,706 disasters. They were 3,706 corrections that happened silently, without human intervention, before anything broke.
The reasoning is fake. The enforcement is real. That is the point.
Your AI agents are reasoning right now. The question is whether anything is checking their work. Run our free governance scanner and find out what structural enforcement would catch in your pipeline.
Attribution: Pedro Domingos (@pmddomingos). Shi Weiyan -- research on decorative thinking steps. ReVeal: arXiv 2506.11442. SWE-CI: arXiv 2603.03823. Anthropic: arXiv 2601.20245.
We offer free AI governance audits for companies deploying AI in regulated industries. The audit runs our enforcement engine against your systems and produces a compliance gap report. No cost, no commitment. Just data.