Autoresearch is Mainstream. Now Make It Production-Grade.
Karpathy proved the concept. Here's what production-grade looks like.
Andrej Karpathy left an agent running on nanochat for two days. It tested roughly 700 changes, found about 20 improvements, and reduced training time by 11% -- all without human intervention. The agent tuned attention mechanisms, adjusted optimizer schedules, modified RoPE theta values, and refined initialization parameters.
No manual hyperparameter sweep, no babysitting. An autonomous agent improved a codebase by iterating, evaluating, and keeping winners.
The post went viral. 225,000+ views. And the reaction was immediate and polarized.
"This is Just Crude Hill Climbing"
Competitive programming practitioner @FakePsyho nailed the technical critique: "This is just a simple hill climbing, so it's an extremely crude version of AlphaEvolve / OpenAI's AWTF scaffold / ALE-Agent. But it's also a good reminder that non-bleeding-edge ML is just throwing random shit at the wall and seeing what sticks."
He's right. And that's exactly why this matters.
If crude hill climbing -- the simplest possible search strategy -- delivers 11% improvements with zero human oversight, then the ceiling for sophisticated autoresearch is very high. Karpathy demonstrated the floor.
The upgrade spectrum is well-understood:
- Hill climbing (Karpathy): Try one change, evaluate, keep if better. Simple. Gets stuck in local optima.
- Population-based search: Maintain multiple candidate solutions in parallel. Crossover between winners. Diversity prevents local optima traps.
- Evolutionary optimization (AlphaEvolve): LLM-guided mutation, structured exploration, Pareto selection. DeepMind recovered 0.7% of Google's worldwide compute with this approach.
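The bottom rung of that spectrum fits in a few lines. Here is a minimal sketch of crude hill climbing, the strategy Karpathy's agent used; the toy objective and function names are illustrative, not taken from his setup:

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

def hill_climb(candidate, propose, evaluate, iterations=500):
    """Crude hill climbing: try one mutation at a time, keep it only if
    the score improves. No memory, no population, no statistics."""
    best_score = evaluate(candidate)
    improvements = 0
    for _ in range(iterations):
        mutated = propose(candidate)   # e.g. tweak one hyperparameter
        score = evaluate(mutated)
        if score > best_score:         # keep winners, discard the rest
            candidate, best_score = mutated, score
            improvements += 1
    return candidate, best_score, improvements

# Toy usage: maximize -(x - 3)^2 by random local steps.
propose = lambda x: x + random.uniform(-0.5, 0.5)
evaluate = lambda x: -(x - 3) ** 2
best, score, n_improvements = hill_climb(0.0, propose, evaluate)
```

Everything the rest of this article adds, enforcement, verification, memory, wraps around this loop rather than replacing it.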
Karpathy is at step one. The question isn't whether autoresearch works -- he proved it does. The question is: what does a production-grade implementation look like?
The Three Missing Pieces
Having built and operated autonomous agents that execute specs, run optimization loops, and accumulate learnings across hundreds of iterations, we've identified three capabilities that separate crude hill climbing from production-grade autoresearch.
1. Enforcement: Prevent Regressions Before They Waste Iterations
In crude hill climbing, every iteration starts from scratch. The agent doesn't know what failed before. It can re-introduce a bug that was caught three iterations ago. It can undo an improvement from fifty iterations back.
The enforcement ladder solves this by encoding operational lessons at five progressively stronger levels -- from prose rules that the agent may forget, to pre-commit hooks that block violations before they can be evaluated.
Each lesson gets encoded at the deepest level the system supports. A Level 5 hook that blocks a bad mutation before evaluation is far cheaper than discovering the regression after a full evaluation cycle.
The ROI framing: small verification overhead eliminates wasted iterations. Like adding a quality gate to a manufacturing line -- the gate costs seconds, but catching defects early saves hours of downstream rework.
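A pre-evaluation gate can be as simple as a pattern check over a proposed diff. The rules and strings below are hypothetical examples, not from any specific system; the point is that each rule encodes a past failure and runs in milliseconds, before any evaluation cycle is spent:

```python
# Hypothetical enforcement rules: each one encodes an operational lesson
# learned from a past failed iteration.
FORBIDDEN_PATTERNS = [
    "validation_enabled = False",  # lesson: a past security regression
    "eval_steps = 0",              # lesson: skipping eval hides regressions
]

def enforcement_gate(diff: str) -> tuple[bool, str]:
    """Block known-bad mutations before they reach evaluation."""
    for pattern in FORBIDDEN_PATTERNS:
        if pattern in diff:
            return False, f"blocked by enforcement rule: {pattern!r}"
    return True, "ok"

# A mutation that re-introduces a known-bad change is rejected instantly,
# without spending a full evaluation cycle on it.
ok, reason = enforcement_gate("lr = 3e-4\nvalidation_enabled = False\n")
```

In a real pipeline this check would live in a pre-commit hook (the Level 5 rung), so a violating mutation cannot even enter the evaluation queue.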
2. Convergence Verification: Know When Improvements Are Real
Karpathy reported 11% improvement. But how confident is that number?
In autonomous optimization, noise is the enemy. A result that looks like a 3% improvement might be measurement variance. An agent that accepts noisy improvements is doing a random walk, not optimization.
Production-grade autoresearch needs structured convergence verification:
- Statistical significance testing on evaluation results before accepting changes
- Regression detection across the full metric surface, not just the target metric
- Stuck detection that identifies when the search has plateaued and needs a strategy change
- Anomaly detection that flags results that are too good (likely measurement artifacts)
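The first item, significance testing, can be sketched with a standard permutation test: rerun the evaluation a few times for both baseline and candidate, and accept the change only if the observed gap is unlikely to be noise. This is a generic statistical sketch, not a description of Karpathy's or anyone's actual pipeline:

```python
import random
import statistics

def is_significant(baseline: list[float], candidate: list[float],
                   n_perm: int = 2000, alpha: float = 0.05) -> bool:
    """Permutation test: accept an 'improvement' only if the observed mean
    gap is unlikely to arise from evaluation noise alone."""
    observed = statistics.mean(candidate) - statistics.mean(baseline)
    pooled = baseline + candidate
    rng = random.Random(0)  # fixed seed for reproducibility
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_gap = (statistics.mean(pooled[len(baseline):])
                    - statistics.mean(pooled[:len(baseline)]))
        if perm_gap >= observed:
            count += 1
    return count / n_perm < alpha

# Five evaluation runs each: a gap well outside run-to-run variance passes.
baseline_runs  = [10.2, 10.1, 10.3, 10.2, 10.1]
candidate_runs = [10.9, 11.0, 10.8, 11.1, 10.9]
```

A candidate whose scores merely shuffle the baseline's values would fail this test: the agent keeps searching instead of accepting a random walk step.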
Harvard researchers found that monitorability is orthogonal to capability. A more capable agent that produces better raw results is not automatically producing more trustworthy results. Verification must be structural, not assumed.
3. Skill Accumulation: Learnings That Compound Across Runs
Crude hill climbing is memoryless. Run 1 discovers that adjusting RoPE theta by 15% helps. Run 2 has no idea Run 1 happened.
Production-grade autoresearch accumulates skills:
- Skill extraction: After each successful iteration, the system extracts the pattern.
- Skill deposit: Patterns are stored in a structured skill library with version tracking and domain tagging.
- Skill injection: Future iterations start with accumulated skills injected into their proposal prompts, biasing the search toward proven patterns.
- Deprecation: Skills that stop performing get flagged and removed, preventing stale knowledge from poisoning the search.
This is experience replay applied to optimization. Simple, verified patterns outperform complex, untested ones -- exactly the enforcement ladder principle.
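The deposit/inject/deprecate cycle above can be sketched as a small skill store. The class and field names here are illustrative assumptions, not an existing library:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """A proven optimization pattern extracted from a successful run."""
    name: str
    prompt_hint: str   # text injected into future proposal prompts
    domain: str        # e.g. "attention", "optimizer", "init"
    wins: int = 1
    deprecated: bool = False

class SkillLibrary:
    def __init__(self):
        self.skills: dict[str, Skill] = {}

    def deposit(self, skill: Skill) -> None:
        """Store a new pattern, or reinforce one that won again."""
        if skill.name in self.skills:
            self.skills[skill.name].wins += 1
        else:
            self.skills[skill.name] = skill

    def inject(self, domain: str) -> list[str]:
        """Hints for the next run's proposal prompt, winners first."""
        live = [s for s in self.skills.values()
                if s.domain == domain and not s.deprecated]
        return [s.prompt_hint for s in sorted(live, key=lambda s: -s.wins)]

    def deprecate(self, name: str) -> None:
        """Flag a stale skill so it stops poisoning the search."""
        if name in self.skills:
            self.skills[name].deprecated = True
```

Run 2 would call `inject("attention")` before proposing mutations, so the search starts biased toward what Run 1 already proved.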
The Enterprise Gap
Karpathy can afford to let an agent run unsupervised on a personal research project. Enterprises cannot.
When autonomous optimization touches production codebases, the stakes change:
- A bad mutation that passes evaluation can reach production
- An unconstrained agent can introduce security vulnerabilities while optimizing for performance
- Regressions in untested code paths accumulate silently
- No audit trail means no accountability
The gap between "works on a research project" and "works inside an enterprise without breaking things" is the deployment gap. Bridging it requires exactly the three capabilities above: enforcement that prevents dangerous mutations, convergence verification that ensures improvements are real, and skill accumulation that compounds reliability over time.
Consider the failure mode: an unconstrained autoresearch agent optimizing a code generation pipeline for speed. It discovers that removing input validation makes generation 15% faster. The evaluation metric improves. The change is accepted. Six weeks later, a prompt injection exploit reaches production because the validation was the constraint preventing it.
This is not hypothetical. It is the predictable outcome of optimization without enforcement. Every optimization that improves the target metric by removing a safety constraint is a regression masquerading as progress.
The Compounding Effect
The real power of production-grade autoresearch isn't any single component -- it's the interaction between all three.
Enforcement prevents bad mutations from wasting evaluation cycles. Convergence verification ensures good mutations are genuinely good. Skill accumulation means each run starts from a higher baseline than the last.
Over time, this creates a flywheel. Run 1 discovers 3 improvements and encodes them as skills. Run 2 starts with those skills, skips the mistakes Run 1 already caught, and discovers 4 more improvements in the same number of iterations. Run 3 has 7 skills and an expanding enforcement surface.
Karpathy's 700 iterations over 2 days found 20 improvements. That's a ~3% hit rate. With skill accumulation biasing the search toward proven patterns and enforcement blocking known-bad mutations, the hit rate should improve with every run. The crude version delivers a constant hit rate. The production version delivers an increasing one.
What This Means For Your Team
If you're running AI agents in any capacity -- code generation, testing, optimization, deployment -- you're already doing a primitive form of autoresearch. Every prompt iteration, every configuration tweak, every "let me try a different approach" is hill climbing.
The question is whether you're compounding those learnings or starting from zero every time.
Three concrete steps:
- Audit your constraint stack. How many of your agent constraints are prose (Level 1-2) vs. structural (Level 3-5)? If more than half your constraints are prose, you're relying on the agent reading and remembering -- and under context pressure, it won't.
- Add verification gates. Don't accept improvements at face value. Test them against a broader metric surface. A 5% improvement on the target metric that causes a 2% regression on three other metrics is a net loss.
- Start accumulating. Every time your agent discovers something that works, encode it. Not as a note in a document -- as a reusable pattern that future iterations can build on.
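The second step's arithmetic is worth making explicit. With equal weights across metrics (an illustrative assumption; real pipelines weight metrics by business impact), the "+5% on target, -2% on three others" case nets out negative:

```python
def net_effect(deltas: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-metric changes: a gain on the target metric can
    still be a net loss once regressions elsewhere are counted."""
    return sum(weights[m] * deltas[m] for m in deltas)

# The example from the steps above: +5% on target, -2% on three other metrics.
deltas  = {"target": 0.05, "latency": -0.02, "memory": -0.02, "quality": -0.02}
weights = {"target": 1.0, "latency": 1.0, "memory": 1.0, "quality": 1.0}
result = net_effect(deltas, weights)  # approximately -0.01: a net loss
```

A verification gate would reject this mutation even though its headline metric improved.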
Karpathy proved autoresearch works. The next step is making it production-grade.