LLM Reasoning Is Fake — That's Why You Need Enforcement
Pedro Domingos, author of The Master Algorithm and one of the most cited researchers in machine learning, said it plainly: "Almost all LLM reasoning is fake."
He is right. And that single fact is the strongest argument for structural enforcement I have ever encountered.
The Skeptic Has a Point
Let me be clear about what "fake reasoning" means. When an LLM produces a chain-of-thought, it is not deducing from first principles. It is pattern-matching against statistical regularities in its training data. Recent work by Shi Weiyan and collaborators found that 97%+ of LLM "thinking" steps are decorative -- they do not actually contribute to the final answer. The model arrives at its conclusion through learned associations, then backfills plausible-looking reasoning to connect the dots.
This is not controversial anymore. It is the consensus view among researchers who study these systems closely. The chain-of-thought is scaffolding that was never load-bearing.
So the skeptics who say "LLMs cannot reason" are correct. But they draw exactly the wrong conclusion from that fact. They say: therefore AI coding tools are unreliable, therefore we should not trust them, therefore we should go back to doing everything manually.
That conclusion ignores a $150K salary gap between engineers who use AI tools effectively and those who do not. It ignores the reality that every major software company is deploying these systems right now. The question is not whether to use AI agents. The question is how to use them without getting burned.
Better Prompts Are Not the Answer
The instinct when confronted with unreliable reasoning is to write better prompts. Add more context. Use few-shot examples. Tell the model to "think step by step." This is the prose layer of enforcement -- Level 2 on the ladder -- and it is the weakest form of control you can exercise.
Why? Because prompts are suggestions. They are processed by the same pattern-matching engine that produces the unreliable reasoning in the first place. You are asking the system to be more careful using the same mechanism that makes it careless. ReVeal, a framework published as arXiv preprint 2506.11442, demonstrates that self-verification is what actually drives self-improvement in language models. But self-verification only works when there is a structural layer that catches failures and feeds them back into the loop. Without that layer, the model simply generates confident-sounding nonsense and moves on.
This is the core insight: you do not fix unreliable reasoning by improving the reasoning. You fix it by building verification structures around the reasoning.
The Enforcement Ladder
Consider what happens when an AI coding agent introduces a regression. According to SWE-CI, 75% of AI coding models introduce regressions when fixing bugs. Three out of four times, the "fix" breaks something else.
Without structural enforcement, here is what happens: the agent reasons its way through a bug fix, produces code that looks correct, writes a commit message explaining why it is correct, and pushes it. The reasoning was plausible. The code was wrong. You find out in production.
With structural enforcement, here is what happens: the agent produces the same plausible-but-wrong code. A pre-commit hook runs the test suite. The tests fail. The agent receives the failure, re-examines its work, and iterates. No human had to review anything. The structural layer caught what the reasoning layer missed.
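That feedback loop is simple enough to sketch. Here is a minimal version, assuming nothing about any particular framework: `apply_patch` and `verify` are hypothetical stand-ins for your agent call and your test runner, not real APIs.

```python
from typing import Callable, Tuple

def fix_with_enforcement(
    apply_patch: Callable[[str], None],      # the agent: reasons, possibly wrongly
    verify: Callable[[], Tuple[bool, str]],  # the harness: tests/linters, reliable
    bug_report: str,
    max_attempts: int = 3,
) -> bool:
    """Feed structural failures back into the agent until the checks pass.

    No human reviews anything inside the loop; the verification layer
    catches what the reasoning layer misses.
    """
    feedback = bug_report
    for _ in range(max_attempts):
        apply_patch(feedback)        # plausible-but-maybe-wrong change
        passed, output = verify()    # structural check, independent of reasoning
        if passed:
            return True
        # The failure itself becomes the next input -- enforcement, not trust.
        feedback = f"{bug_report}\n\nChecks failed:\n{output}"
    return False                     # still failing: escalate to a human
```

In practice `verify` would shell out to something like `pytest -q` or your pre-commit hooks; the point is that the loop closes on a structural check, not on the model's own confidence.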
This is not hypothetical. In a named-agents deployment running ACE, we tracked 3,706 violations caught by automated enforcement hooks. Each one was a case where an agent's reasoning produced output that violated a structural constraint -- and the system caught it before it reached production.
Anthropic's own research found 17% skill erosion among developers using AI tools without guardrails. The tools make you faster at producing code and simultaneously worse at evaluating it. That is a dangerous combination without a verification layer.
The Three-Part Architecture
The architecture that makes unreliable reasoning safe has three components, and all three are necessary.
The model reasons. It does so imperfectly, through pattern matching and statistical association. That is fine. Human reasoning is also imperfect, biased, and prone to error. We do not demand perfection from human engineers. We demand that their work passes review.
The harness enforces. Pre-commit hooks, test suites, linters, type checkers, schema validators -- these are structural enforcement layers that operate independently of the model's reasoning. They do not care why the code was written. They care whether it meets the specification. This is Level 5 enforcement: hooks that require zero awareness from the agent to function.
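A Level 5 hook can be a dozen lines. Here is a minimal sketch of a git pre-commit hook written in Python; the specific commands in `CHECKS` are assumptions, so substitute your own test and lint runners.

```python
#!/usr/bin/env python3
"""Minimal git pre-commit hook: refuse the commit unless every check passes.
Install as .git/hooks/pre-commit and mark it executable. The agent needs
zero awareness of this file for it to fire."""
import subprocess
import sys

# Assumed commands -- swap in your project's test suite and linter.
CHECKS = [
    ["pytest", "-q"],
    ["ruff", "check", "."],
]

def run_checks(checks) -> int:
    """Run each check; return the first non-zero exit code, else 0."""
    for cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            sys.stderr.write(proc.stdout + proc.stderr)
            return proc.returncode  # non-zero exit makes git abort the commit
    return 0

# Only fire when git actually invokes this file as the hook.
if __name__ == "__main__" and sys.argv[0].endswith("pre-commit"):
    sys.exit(run_checks(CHECKS))
```

Git runs the hook before every commit and aborts on a non-zero exit, so the enforcement is unconditional: it does not matter whether the commit came from a human or an agent.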
The context remembers. Persistent memory across sessions means that lessons learned from past failures survive context window limits. When an agent hits a violation, that knowledge is encoded structurally -- not as a prompt instruction that might be ignored, but as a test or hook that will fire automatically next time. The system gets more reliable with every use, compounding returns without requiring anyone to remember anything.
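What "encoded structurally" can look like in miniature -- a sketch under assumptions: the JSONL store and the `Violation` shape here are illustrative, not any particular product's schema.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

STORE = Path("violations.jsonl")  # on disk, so it survives context-window limits

@dataclass
class Violation:
    rule: str     # e.g. "no-print-in-prod"
    pattern: str  # a string whose presence reproduces the violation
    message: str  # what to tell the agent when it fires again

def record(v: Violation) -> None:
    """Persist a caught violation across sessions."""
    with STORE.open("a") as f:
        f.write(json.dumps(asdict(v)) + "\n")

def check(text: str) -> list:
    """Every past violation fires automatically on new output --
    a hook that needs no prompt instruction and cannot be ignored."""
    if not STORE.exists():
        return []
    hits = []
    with STORE.open() as f:
        for line in f:
            v = Violation(**json.loads(line))
            if v.pattern in text:
                hits.append(v.message)
    return hits
```

Each recorded violation becomes a check that fires on every future output. That is what makes the returns compound: the store only grows, and nothing in it depends on anyone remembering anything.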
LangChain demonstrated this architecture in practice. They moved from Top 30 to Top 5 on Terminal Bench 2.0 with zero model changes. Same LLM. Same weights. Same "fake reasoning." Different harness. Different enforcement. Different results.
The Reasoning Is Fake. The Enforcement Is Real.
Without enforcement, you have a powerful pattern-matching engine generating plausible output with no structural verification. You are the guardrail. Your attention is the only thing between the agent's confident mistakes and your production environment. You will miss things. You will get tired. You will trust output that looks right.
With enforcement, the model's unreliable reasoning is wrapped in layers of structural verification that catch violations automatically. The enforcement ladder -- from hooks to tests to templates to configuration -- creates a system that improves with every failure. The 3,706 violations we tracked were not 3,706 disasters. They were 3,706 corrections that happened silently, without human intervention, before anything broke.
The reasoning is fake. The enforcement is real. That is the point.
Your AI agents are reasoning right now. The question is whether anything is checking their work. Run our free governance scanner and find out what structural enforcement would catch in your pipeline.
Attribution: Pedro Domingos (@pmddomingos). Shi Weiyan -- research on decorative thinking steps. ReVeal: arXiv 2506.11442. SWE-CI: arXiv 2603.03823. Anthropic: arXiv 2601.20245.
We offer free AI governance audits for companies deploying AI in regulated industries. The audit runs our enforcement engine against your systems and produces a compliance gap report. No cost, no commitment. Just data.