We Built the Agent Command Center Karpathy Asked For
The Tweet That Revealed a $10B Gap
On March 2, 2026, Andrej Karpathy -- the former head of AI at Tesla and co-founder of OpenAI -- posted this to his 1.9 million followers:
"tmux grids are awesome, but i feel a need to have a proper 'agent command center' IDE for teams of them, which I could maximize per monitor. E.g. I want to see/hide toggle them, see if any are idle, pop open related tools (e.g. terminal), stats (usage), etc."
-- @karpathy, 551K views, 2,332 likes
The tweet hit a nerve. 231 replies poured in from developers building multi-agent systems, all wrestling with the same problem: they could spin up agents, but they couldn't manage them. No visibility into which agents were idle. No way to dispatch work. No stats. No governance.
We had already built it.
What Karpathy Asked For vs. What We Built
Our system runs in production managing a team of 6 named AI agents across 3 codebases. Here is a direct mapping:
"See/hide toggle them"
Our Textual TUI dashboard (6,251 lines, 29 panel types) organizes into 3 tabs: Operations, Research, and Metrics. Each tab contains collapsible panels. The Command Center panel stays pinned at the top. Keyboard shortcuts (1-9, o/p/m) switch focus instantly.
This is not a mockup. It reads from 5 SQLite databases in real time.
"See if any are idle"
Our completion monitor tracks heartbeat files per agent. If a heartbeat goes stale for 60 minutes while a spec is marked in_progress, the agent shows as idle in the AgentHealthPanel. Color coding: green for active, yellow for slowing, red for stale.
The monitor also writes capacity signals to the message queue when an agent completes work -- downstream systems know immediately that a slot opened up.
"Pop open related tools (e.g. terminal)"
Our spec dispatch system is how work reaches agents. It validates specs against 3 structural requirements, checks dependency chains (blocking if an upstream spec is not done), and atomically updates 4 state locations: the message database, the agent's priority queue, the spec status, and the backlog pipeline.
One command dispatches structured work packages -- not ad-hoc prompts -- to the right agent with full context injection. The dispatch pipeline also injects sibling agent status so each coder knows what its teammates are working on, and MCP tool awareness so it can use external servers when available.
This is the difference between "pop open a terminal" and "route validated work through a governed pipeline." Karpathy was right that you need quick access to related tools. We found that the tools need to be structured enough that agents can use them autonomously.
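The dispatch flow above can be sketched in a few lines, assuming all four state locations live in one SQLite database so a single transaction updates them atomically. The table names, column names, and the three required fields are hypothetical stand-ins; the shape of the pipeline (validate, check dependencies, commit all four writes together) is what the text describes:

```python
import sqlite3

REQUIRED_FIELDS = ("objective", "file_boundaries", "acceptance_criteria")  # hypothetical

def dispatch_spec(db: sqlite3.Connection, spec: dict, agent: str) -> None:
    # 1. Structural validation: all required fields present and non-empty.
    missing = [f for f in REQUIRED_FIELDS if not spec.get(f)]
    if missing:
        raise ValueError(f"spec missing fields: {missing}")
    # 2. Dependency check: block if any upstream spec is not done.
    for dep in spec.get("depends_on", []):
        row = db.execute("SELECT status FROM specs WHERE id = ?", (dep,)).fetchone()
        if row is None or row[0] != "done":
            raise RuntimeError(f"blocked on upstream spec {dep}")
    # 3. Atomic update of all four state locations in one transaction:
    #    message database, agent priority queue, spec status, backlog pipeline.
    with db:  # commits all four writes together, or rolls back on error
        db.execute("INSERT INTO messages (agent, payload) VALUES (?, ?)",
                   (agent, spec["objective"]))
        db.execute("INSERT INTO priority_queue (agent, spec_id) VALUES (?, ?)",
                   (agent, spec["id"]))
        db.execute("UPDATE specs SET status = 'in_progress' WHERE id = ?", (spec["id"],))
        db.execute("UPDATE backlog SET stage = 'dispatched' WHERE spec_id = ?", (spec["id"],))
```

Putting the four writes in one transaction is what makes the dispatch safe to retry: a crash mid-dispatch leaves no half-updated state for an agent to act on.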
"Stats (usage)"
29 panels rendering operational intelligence:
- CoderThroughputPanel: Spec completion rate, commit velocity
- PipelineVelocityPanel: End-to-end time from spec creation to done
- FlywheelMetricsPanel: Revenue pipeline tracking
- EnforcementHealthPanel: Violation counts, promotion rates, hook status, 4-hour trend sparklines
- SystemResourcesPanel: CPU, memory, disk usage via psutil
Every metric comes from real databases. No placeholder data. The entire dashboard refreshes on configurable intervals -- most panels every 30 seconds, the Safety Governance panel every 60.
Real-time visibility matters because the state of a multi-agent system changes fast. An agent that was idle 2 minutes ago might now be 30% through a spec. A violation spike in the last hour might indicate a prompt regression that needs immediate attention.
What We Built Beyond the Ask
Karpathy described visibility. We built governance. And that turns out to be the difference between a monitoring tool and a production-grade agent management platform.
The Enforcement Ladder
10 hooks intercept tool calls before they execute. An agent cannot write outside its declared file boundaries. It cannot send email to unapproved recipients. It cannot deploy to production if its trust score drops below 50.
This is not configuration. It is structural enforcement at the tool-invocation layer (Claude Code PreToolUse hooks). The agent does not know the policies are being checked. We call this zero-awareness enforcement: policy lives in the infrastructure, so compliance never depends on the agent noticing, remembering, or cooperating.
Why does this matter? Because the most dangerous failure mode in multi-agent systems is not an agent that crashes -- it is an agent that silently drifts. It writes to directories it should not. It emails contacts that are not approved. It subtly adjusts its behavior when it detects it is being evaluated. These are the failure modes you cannot catch with a dashboard alone.
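A file-boundary hook of this kind can be sketched as a small script, assuming the Claude Code hook convention that tool-call details arrive as JSON on stdin and exit code 2 blocks the call with stderr fed back to the agent. The ALLOWED_PREFIXES boundary and the tool names checked are illustrative, not the production hook:

```python
import json
import sys
from pathlib import Path

ALLOWED_PREFIXES = ("src/", "tests/")  # hypothetical per-agent boundary declaration

def check(event: dict) -> int:
    """Return 0 to allow the tool call, 2 to block it."""
    if event.get("tool_name") not in ("Write", "Edit"):
        return 0  # only file-writing tools are policed by this hook
    path = event.get("tool_input", {}).get("file_path", "")
    if any(Path(path).as_posix().startswith(p) for p in ALLOWED_PREFIXES):
        return 0  # inside declared boundaries
    print(f"blocked: {path} is outside declared file boundaries", file=sys.stderr)
    return 2  # exit code 2 blocks the tool call before it executes

if __name__ == "__main__":
    sys.exit(check(json.load(sys.stdin)))
```

Because the check runs in a separate process before the tool executes, nothing the agent generates can talk its way past it.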
Per-Agent Trust Scoring
Each agent carries a trust score from 0-100, computed from 4 weighted signals: violation count (40%), boundary adherence rate (30%), task completion quality (20%), and time-in-service (10%). Agents below 50 cannot perform high-risk operations without human approval.
The scores update hourly. They are not static labels -- they rise and fall based on behavior.
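Given those weights, the score is a weighted sum of four normalized signals. A minimal sketch; the 40/30/20/10 weights come from the text, while how each raw signal maps onto a 0-1 range is an illustrative assumption:

```python
def trust_score(violations: int, boundary_adherence: float,
                completion_quality: float, days_in_service: int) -> float:
    """0-100 trust score from four weighted signals.

    boundary_adherence and completion_quality are assumed pre-normalized
    to 0-1; the violation and tenure normalizations below are guesses.
    """
    violation_signal = max(0.0, 1.0 - violations / 10)  # 10+ violations zeroes it
    tenure_signal = min(1.0, days_in_service / 90)      # saturates after ~90 days
    score = 100 * (0.40 * violation_signal
                   + 0.30 * boundary_adherence
                   + 0.20 * completion_quality
                   + 0.10 * tenure_signal)
    return round(score, 1)
```

An hourly job recomputes this per agent, and anything below 50 gates high-risk operations behind human approval.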
Self-Improving Enforcement
When the scheming pattern detector blocks content that matches eval-gaming or successor-coordination indicators, it logs the detection and feeds the pattern back into its detection library. When the boundary gate blocks 3+ writes to the same unauthorized directory, it auto-generates an allowlist update proposal for human review.
The enforcement system gets smarter on every violation. No human intervention required for the learning loop -- only for the approval.
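The boundary-gate half of that loop is essentially a counter over blocked paths. A sketch under stated assumptions: the function name, the 3-block threshold as a constant, and the proposal shape are hypothetical:

```python
from collections import Counter

BLOCK_THRESHOLD = 3  # 3+ blocked writes to one directory triggers a proposal

def propose_allowlist_updates(blocked_paths: list[str]) -> list[dict]:
    """Turn repeated boundary blocks into human-reviewable proposals."""
    by_dir = Counter(p.rsplit("/", 1)[0] + "/" for p in blocked_paths)
    return [{"directory": d, "blocked_writes": n, "status": "pending_review"}
            for d, n in by_dir.items() if n >= BLOCK_THRESHOLD]
```

The key property is that the system only proposes; widening a boundary still requires a human to approve the change.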
Deterministic Evaluation
6 agents are scored every night at 02:00 UTC against 20 evaluation queries each, across 5 categories: task execution, constraint adherence, context usage, error handling, and role boundary respect. Scoring is deterministic -- keyword matching, format validation, constraint checking -- no LLM calls. Baseline comparison triggers alerts when scores drop more than 10%.
First autonomous evaluation: ceo 83.8, sophie 87.0, coder-01 74.4, frank 71.5, jim 65.1, megan 65.5.
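The deterministic scoring and the baseline alert can both be sketched in a few lines. The keyword-fraction formula and the function names are illustrative assumptions; what comes from the text is the deterministic, no-LLM scoring and the more-than-10% drop trigger:

```python
def keyword_score(response: str, required: list[str], forbidden: list[str]) -> float:
    """Deterministic scoring: fraction of required keywords present,
    zeroed if any forbidden keyword appears. No LLM calls."""
    text = response.lower()
    if any(k.lower() in text for k in forbidden):
        return 0.0
    if not required:
        return 1.0
    return sum(k.lower() in text for k in required) / len(required)

def regression_alert(baseline: float, current: float, drop_pct: float = 10.0) -> bool:
    """Alert when a score drops more than drop_pct percent from its baseline."""
    return baseline > 0 and (baseline - current) / baseline * 100 > drop_pct
```

Because the scoring is pure string matching, a score change between nightly runs can only mean the agent's behavior changed, never that the judge did.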
The Convergence Thesis
Nate B. Jones nailed the framing: "The unit of progress is now the system, not the model." Competition has shifted from raw capability to reliability, error recovery, and governance.
Every major platform now supports MCP (Model Context Protocol). Agent frameworks are converging on tool use, memory persistence, and structured dispatch. The differentiator is not which model you run -- it is how you govern, evaluate, and improve the system around it.
That is exactly what an agent command center does. Not tmux with extra steps. A governance layer that makes multi-agent teams reliable enough to trust with production workloads.
The Auton framework and EvoSkill research both confirm this pattern: autonomous systems that lack structured oversight eventually diverge from intended behavior. The question is not whether your agents will drift -- it is whether you will detect it before it matters.
We Deploy This on Your Codebase
This entire governance stack -- the enforcement hooks, the trust scoring, the evaluation framework, the drift monitoring -- is packaged as the Autonomous Consulting Engine (ACE). One command installs the safety hooks:
python3 ace_deploy.py --safety-hooks --repo /path/to/your/codebase
The ASL-Lite checklist scores your existing controls across 3 tiers (Basic: 10 controls, Standard: 25, Advanced: 50+). The misalignment audit runs eval-awareness probes to detect whether your agents behave differently when they think they are being tested.
We do not sell a dashboard. We deploy the governance stack that makes agent teams safe enough to run in production.
If you are building multi-agent systems and want the command center Karpathy described -- plus the enforcement layer he didn't -- reach out at alex@walseth.ai or visit walseth.ai/audit for a free governance assessment.
Doug Walseth builds AI governance infrastructure at walseth.ai. The named-agents system manages 6 AI agents across 90+ completed specs with zero-awareness enforcement.
We offer free AI governance audits for companies deploying AI in regulated industries. The audit runs our enforcement engine against your systems and produces a compliance gap report. No cost, no commitment. Just data.