Autoresearch is Mainstream. Now Make It Production-Grade.
Karpathy proved the concept. Here's what production-grade looks like.
Andrej Karpathy left an agent running on nanochat for two days. It tested roughly 700 changes, found about 20 improvements, and reduced training time by 11%. The agent tuned attention mechanisms, adjusted optimizer schedules, modified RoPE theta values, and refined initialization parameters.
No human intervened. No manual hyperparameter sweep. An autonomous agent improved a codebase by iterating, evaluating, and keeping winners.
The post went viral. 225,000+ views. And the reaction was immediate and polarized.
"This is Just Crude Hill Climbing"
Competitive programming practitioner @FakePsyho nailed the technical critique: "This is just a simple hill climbing, so it's an extremely crude version of AlphaEvolve / OpenAI's AWTF scaffold / ALE-Agent. But it's also a good reminder that non-bleeding-edge ML is just throwing random shit at the wall and seeing what sticks."
He's right. And that's exactly why this matters.
If crude hill climbing -- the simplest possible search strategy -- delivers 11% improvements with zero human oversight, then the ceiling for sophisticated autoresearch is very high. Karpathy demonstrated the floor.
The upgrade spectrum is well-understood:
- Hill climbing (Karpathy): Try one change, evaluate, keep if better. Simple. Gets stuck in local optima.
- Population-based search: Maintain multiple candidate solutions in parallel. Crossover between winners. Diversity prevents local optima traps.
- Evolutionary optimization (AlphaEvolve): LLM-guided mutation, structured exploration, Pareto selection. DeepMind recovered 0.7% of Google's worldwide compute with this approach.
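The starting point of this spectrum is easiest to see in code. Here is a minimal sketch of the crude hill-climbing loop at step one, with `evaluate` and `mutate` as stand-ins for a real training run and a real code mutation (the toy objective and all names are illustrative, not Karpathy's actual setup):

```python
import random

def hill_climb(evaluate, mutate, initial, iterations=700):
    """Crude hill climbing: try one change, keep it only if the score improves."""
    best, best_score = initial, evaluate(initial)
    improvements = 0
    for _ in range(iterations):
        candidate = mutate(best)
        score = evaluate(candidate)
        if score > best_score:  # keep winners, discard everything else
            best, best_score = candidate, score
            improvements += 1
    return best, best_score, improvements

# Toy objective: maximize -(x - 3)^2; mutations are small random nudges.
random.seed(0)
x, score, wins = hill_climb(
    evaluate=lambda x: -(x - 3.0) ** 2,
    mutate=lambda x: x + random.uniform(-0.5, 0.5),
    initial=0.0,
)
```

Everything past this point, from population-based search to AlphaEvolve-style evolution, replaces the single `best` with a population and the blind `mutate` with guided proposals; the accept/reject skeleton stays the same.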
Karpathy is at step one. The question isn't whether autoresearch works -- he proved it does. The question is: what does a production-grade implementation look like?
The Three Missing Pieces
Having built and operated autonomous agents that execute specs, run optimization loops, and accumulate learnings across hundreds of iterations, we've identified three capabilities that separate crude hill climbing from production-grade autoresearch.
1. Enforcement: Prevent Regressions Before They Waste Iterations
In crude hill climbing, every iteration starts from scratch. The agent doesn't know what failed before. It can re-introduce a bug that was caught three iterations ago. It can undo an improvement from fifty iterations back.
The enforcement ladder solves this by encoding operational lessons at five progressively stronger levels -- from prose rules that the agent may forget, to pre-commit hooks that block violations before they can be evaluated.
Each lesson gets encoded at the deepest level the system supports. A Level 5 hook that blocks a bad mutation before evaluation is orders of magnitude cheaper than discovering the regression after a full evaluation cycle.
The ROI framing: small verification overhead eliminates wasted iterations. Like adding a quality gate to a manufacturing line -- the gate costs seconds, but catching defects early saves hours of downstream rework.
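As a sketch of what such a pre-evaluation gate might look like (the `make_gate` helper and the pattern strings are hypothetical; a real implementation would live in a pre-commit hook and match structurally, not by substring):

```python
def make_gate(blocked_patterns):
    """Level-5-style hook: reject a candidate before spending an evaluation cycle."""
    def gate(candidate_diff: str) -> bool:
        return not any(p in candidate_diff for p in blocked_patterns)
    return gate

# Lessons from earlier runs, encoded as hard blocks rather than prose rules.
gate = make_gate([
    "remove_input_validation",  # regression caught in a previous run
    "disable_tests",
])

assert gate("tune rope_theta from 10000 to 11500")     # allowed: worth evaluating
assert not gate("speedup: remove_input_validation()")  # blocked: no eval wasted
```

The gate costs a string scan; the evaluation it avoids costs a full training-and-benchmark cycle.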
2. Convergence Verification: Know When Improvements Are Real
Karpathy reported 11% improvement. But how confident is that number?
In autonomous optimization, noise is the enemy. A result that looks like a 3% improvement might be measurement variance. An agent that accepts noisy improvements is doing a random walk, not optimization.
Production-grade autoresearch needs structured convergence verification:
- Statistical significance testing on evaluation results before accepting changes
- Regression detection across the full metric surface, not just the target metric
- Stuck detection that identifies when the search has plateaued and needs a strategy change
- Anomaly detection that flags results that are too good (likely measurement artifacts)
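The first of these gates can be sketched as a one-sided permutation test over repeated evaluation runs, using only the standard library (the sample numbers are invented for illustration; a real system would use its own eval scores):

```python
import random
import statistics

def is_significant(baseline, candidate, n_perm=2000, alpha=0.05, seed=0):
    """Permutation test: is the mean improvement larger than measurement noise?"""
    observed = statistics.mean(candidate) - statistics.mean(baseline)
    pooled = list(baseline) + list(candidate)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm = (statistics.mean(pooled[len(baseline):])
                - statistics.mean(pooled[:len(baseline)]))
        if perm >= observed:
            extreme += 1
    return extreme / n_perm < alpha  # accept only if gain beats the noise floor

# Five eval runs each side.
baseline  = [100.1, 99.8, 100.3, 99.9, 100.0]
candidate = [103.2, 102.9, 103.4, 103.0, 103.1]  # clear ~3% gain: accept
noisy     = [100.4, 99.6, 100.6, 100.2, 99.7]    # within variance: reject
```

An agent that runs this check before keeping a mutation stops doing a random walk on noise; an agent that skips it cannot tell a 3% gain from a 0.08 wobble.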
Harvard researchers found that monitorability is orthogonal to capability. A more capable agent that produces better raw results is not automatically producing more trustworthy results. Verification must be structural, not assumed.
3. Skill Accumulation: Learnings That Compound Across Runs
Crude hill climbing is memoryless. Run 1 discovers that adjusting RoPE theta by 15% helps. Run 2 has no idea Run 1 happened.
Production-grade autoresearch accumulates skills:
- Skill extraction: After each successful iteration, the system extracts the pattern.
- Skill deposit: Patterns are stored in a structured skill library with version tracking and domain tagging.
- Skill injection: Future iterations start with accumulated skills injected into their proposal prompts, biasing the search toward proven patterns.
- Deprecation: Skills that stop performing get flagged and removed, preventing stale knowledge from poisoning the search.
This is experience replay applied to optimization. Simple, verified patterns outperform complex, untested ones -- exactly the enforcement ladder principle.
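A minimal sketch of such a skill library (the `SkillLibrary` class and its fields are illustrative assumptions, not a description of any shipped system):

```python
class SkillLibrary:
    """Deposit proven patterns, inject them into prompts, deprecate stale ones."""

    def __init__(self):
        # name -> {"pattern": str, "version": int, "domain": str, "active": bool}
        self.skills = {}

    def deposit(self, name, pattern, domain):
        prev = self.skills.get(name)
        version = prev["version"] + 1 if prev else 1
        self.skills[name] = {"pattern": pattern, "version": version,
                             "domain": domain, "active": True}

    def deprecate(self, name):
        if name in self.skills:
            self.skills[name]["active"] = False

    def inject(self, domain):
        """Prompt lines for proven, still-active skills in this domain."""
        return [f"- {s['pattern']}" for s in self.skills.values()
                if s["active"] and s["domain"] == domain]

lib = SkillLibrary()
lib.deposit("rope_theta", "Raising RoPE theta ~15% improved val loss",
            domain="attention")
lib.deposit("lr_warmup", "Longer warmup stabilized early training",
            domain="optimizer")
lib.deprecate("lr_warmup")  # stopped performing: remove it from future prompts
```

The payoff is in `inject`: every future proposal prompt starts seeded with what earlier runs already proved, and deprecation keeps stale patterns from poisoning the search.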
The Enterprise Gap
Karpathy can afford to let an agent run unsupervised on a personal research project. Enterprises cannot.
When autonomous optimization touches production codebases, the stakes change:
- A bad mutation that passes evaluation can reach production
- An unconstrained agent can introduce security vulnerabilities while optimizing for performance
- Regressions in untested code paths accumulate silently
- No audit trail means no accountability
The gap between "works on a research project" and "works inside an enterprise without breaking things" is the deployment gap. Bridging it requires exactly the three capabilities above: enforcement that prevents dangerous mutations, convergence verification that ensures improvements are real, and skill accumulation that compounds reliability over time.
Consider the failure mode: an unconstrained autoresearch agent optimizing a code generation pipeline for speed. It discovers that removing input validation makes generation 15% faster. The evaluation metric improves. The change is accepted. Six weeks later, a prompt injection exploit reaches production because the validation was the constraint preventing it.
This is not hypothetical. It is the predictable outcome of optimization without enforcement. Every optimization that improves the target metric by removing a safety constraint is a regression masquerading as progress.
The Compounding Effect
The real power of production-grade autoresearch isn't any single component -- it's the interaction between all three.
Enforcement prevents bad mutations from wasting evaluation cycles. Convergence verification ensures good mutations are genuinely good. Skill accumulation means each run starts from a higher baseline than the last.
Over time, this creates a flywheel. Run 1 discovers 3 improvements and encodes them as skills. Run 2 starts with those skills, skips the mistakes Run 1 already caught, and discovers 4 more improvements in the same number of iterations. Run 3 has 7 skills and an expanding enforcement surface.
Karpathy's 700 iterations over 2 days found 20 improvements. That's a ~3% hit rate. With skill accumulation biasing the search toward proven patterns and enforcement blocking known-bad mutations, the hit rate should improve with every run. The crude version delivers a constant hit rate. The production version delivers an increasing one.
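To make the flywheel arithmetic concrete, here is a toy model of constant versus rising hit rates (the `boost` factor is purely illustrative; real per-run gains would have to be measured):

```python
def cumulative_improvements(iterations_per_run, runs, base_rate, boost=0.0):
    """Compare a constant hit rate with one that rises as skills accumulate."""
    total, rate, per_run = 0.0, base_rate, []
    for _ in range(runs):
        found = iterations_per_run * rate
        total += found
        per_run.append(round(found, 1))
        rate += boost * rate  # skills + enforcement bias the next run's search
    return per_run, round(total, 1)

# Crude: ~3% hit rate, flat across runs. Production: same start, compounding.
crude, crude_total = cumulative_improvements(700, 3, 0.03)
prod, prod_total = cumulative_improvements(700, 3, 0.03, boost=0.3)
```

Under these (assumed) numbers the crude agent finds the same ~21 improvements every run, while the compounding agent pulls ahead on the same iteration budget; the exact curve matters less than the sign of its slope.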
What This Means For Your Team
If you're running AI agents in any capacity -- code generation, testing, optimization, deployment -- you're already doing a primitive form of autoresearch. Every prompt iteration, every configuration tweak, every "let me try a different approach" is hill climbing.
The question is whether you're compounding those learnings or starting from zero every time.
Three concrete steps:
1. Audit your constraint stack. How many of your agent constraints are prose (Level 1-2) vs. structural (Level 3-5)? If more than half your constraints are prose, you're relying on the agent reading and remembering -- and under context pressure, it won't.
2. Add verification gates. Don't accept improvements at face value. Test them against a broader metric surface. A 5% improvement on the target metric that causes a 2% regression on three other metrics is a net loss.
3. Start accumulating. Every time your agent discovers something that works, encode it. Not as a note in a document -- as a reusable pattern that future iterations can build on.
Karpathy proved autoresearch works. The next step is making it production-grade.
We offer free AI governance audits for companies deploying AI in regulated industries. The audit runs our enforcement engine against your systems and produces a compliance gap report. No cost, no commitment. Just data.