We Built the Agent Command Center Karpathy Asked For
The Tweet That Revealed a $10B Gap
On March 2, 2026, Andrej Karpathy -- the former head of AI at Tesla and co-founder of OpenAI -- posted this to his 1.9 million followers:
"tmux grids are awesome, but i feel a need to have a proper 'agent command center' IDE for teams of them, which I could maximize per monitor. E.g. I want to see/hide toggle them, see if any are idle, pop open related tools (e.g. terminal), stats (usage), etc."
-- @karpathy, 551K views, 2,332 likes
The tweet hit a nerve. 231 replies poured in from developers building multi-agent systems, all wrestling with the same problem: they could spin up agents, but they couldn't manage them. No visibility into which agents were idle. No way to dispatch work. No stats. No governance.
We had already built it.
What Karpathy Asked For vs. What We Built
Our system runs in production managing a team of 6 named AI agents across 3 codebases. Here is a direct mapping:
"See/hide toggle them"
Our Textual TUI dashboard (6,251 lines, 29 panel types) organizes into 3 tabs: Operations, Research, and Metrics. Each tab contains collapsible panels. The Command Center panel stays pinned at the top. Keyboard shortcuts (1-9, o/p/m) switch focus instantly.
This is not a mockup. It reads from 5 SQLite databases in real time.
"See if any are idle"
Our completion monitor tracks heartbeat files per agent. If a heartbeat goes stale for 60 minutes while a spec is marked in_progress, the agent shows as idle in the AgentHealthPanel. Color coding: green for active, yellow for slowing, red for stale.
The monitor also writes capacity signals to the message queue when an agent completes work -- downstream systems know immediately that a slot opened up.
"Pop open related tools (e.g. terminal)"
Our spec dispatch system is how work reaches agents. It validates specs against 3 structural requirements, checks dependency chains (blocking if an upstream spec is not done), and atomically updates 4 state locations: the message database, the agent's priority queue, the spec status, and the backlog pipeline.
One command dispatches structured work packages -- not ad-hoc prompts -- to the right agent with full context injection. The dispatch pipeline also injects sibling agent status so each coder knows what its teammates are working on, and MCP tool awareness so it can use external servers when available.
This is the difference between "pop open a terminal" and "route validated work through a governed pipeline." Karpathy was right that you need quick access to related tools. We found that the tools need to be structured enough that agents can use them autonomously.
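The dispatch flow above can be sketched in a few lines, assuming all four state locations live in one SQLite database so a single transaction updates them atomically. The table names, column names, and the three required fields are hypothetical stand-ins; the shape of the pipeline (validate, check dependencies, commit all four writes together) is what the text describes:

```python
import sqlite3

REQUIRED_FIELDS = ("objective", "file_boundaries", "acceptance_criteria")  # hypothetical

def dispatch_spec(db: sqlite3.Connection, spec: dict, agent: str) -> None:
    # 1. Structural validation: all required fields present and non-empty.
    missing = [f for f in REQUIRED_FIELDS if not spec.get(f)]
    if missing:
        raise ValueError(f"spec missing fields: {missing}")
    # 2. Dependency check: block if any upstream spec is not done.
    for dep in spec.get("depends_on", []):
        row = db.execute("SELECT status FROM specs WHERE id = ?", (dep,)).fetchone()
        if row is None or row[0] != "done":
            raise RuntimeError(f"blocked on upstream spec {dep}")
    # 3. Atomic update of all four state locations in one transaction:
    #    message database, agent priority queue, spec status, backlog pipeline.
    with db:  # commits all four writes together, or rolls back on error
        db.execute("INSERT INTO messages (agent, payload) VALUES (?, ?)",
                   (agent, spec["objective"]))
        db.execute("INSERT INTO priority_queue (agent, spec_id) VALUES (?, ?)",
                   (agent, spec["id"]))
        db.execute("UPDATE specs SET status = 'in_progress' WHERE id = ?", (spec["id"],))
        db.execute("UPDATE backlog SET stage = 'dispatched' WHERE spec_id = ?", (spec["id"],))
```

Putting the four writes in one transaction is what makes the dispatch safe to retry: a crash mid-dispatch leaves no half-updated state for an agent to act on.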
"Stats (usage)"
29 panels rendering operational intelligence:
- CoderThroughputPanel: Spec completion rate, commit velocity
- PipelineVelocityPanel: End-to-end time from spec creation to done
- FlywheelMetricsPanel: Revenue pipeline tracking
- EnforcementHealthPanel: Violation counts, promotion rates, hook status, 4-hour trend sparklines
- SystemResourcesPanel: CPU, memory, disk usage via psutil
Every metric comes from real databases. No placeholder data. The entire dashboard refreshes on configurable intervals -- most panels every 30 seconds, the Safety Governance panel every 60.
Real-time visibility matters because the state of a multi-agent system changes fast. An agent that was idle 2 minutes ago might now be 30% through a spec. A violation spike in the last hour might indicate a prompt regression that needs immediate attention.
What We Built Beyond the Ask
Karpathy described visibility. We built governance. And that turns out to be the difference between a monitoring tool and a production-grade agent management platform.
The Enforcement Ladder
10 hooks intercept tool calls before they execute. An agent cannot write outside its declared file boundaries. It cannot send email to unapproved recipients. It cannot deploy to production if its trust score drops below 50.
This is not configuration. It is structural enforcement at the tool-invocation layer (Claude Code PreToolUse hooks). The agent does not know the policies are being checked. We call this zero-awareness enforcement: policy lives in the infrastructure, so compliance never depends on the agent noticing, remembering, or cooperating.
Why does this matter? Because the most dangerous failure mode in multi-agent systems is not an agent that crashes -- it is an agent that silently drifts. It writes to directories it should not. It emails contacts that are not approved. It subtly adjusts its behavior when it detects it is being evaluated. These are the failure modes you cannot catch with a dashboard alone.
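A file-boundary hook of this kind can be sketched as a small script, assuming the Claude Code hook convention that tool-call details arrive as JSON on stdin and exit code 2 blocks the call with stderr fed back to the agent. The ALLOWED_PREFIXES boundary and the tool names checked are illustrative, not the production hook:

```python
import json
import sys
from pathlib import Path

ALLOWED_PREFIXES = ("src/", "tests/")  # hypothetical per-agent boundary declaration

def check(event: dict) -> int:
    """Return 0 to allow the tool call, 2 to block it."""
    if event.get("tool_name") not in ("Write", "Edit"):
        return 0  # only file-writing tools are policed by this hook
    path = event.get("tool_input", {}).get("file_path", "")
    if any(Path(path).as_posix().startswith(p) for p in ALLOWED_PREFIXES):
        return 0  # inside declared boundaries
    print(f"blocked: {path} is outside declared file boundaries", file=sys.stderr)
    return 2  # exit code 2 blocks the tool call before it executes

if __name__ == "__main__":
    sys.exit(check(json.load(sys.stdin)))
```

Because the check runs in a separate process before the tool executes, nothing the agent generates can talk its way past it.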
Per-Agent Trust Scoring
Each agent carries a trust score from 0-100, computed from 4 weighted signals: violation count (40%), boundary adherence rate (30%), task completion quality (20%), and time-in-service (10%). Agents below 50 cannot perform high-risk operations without human approval.
The scores update hourly. They are not static labels -- they rise and fall based on behavior.
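Given those weights, the score is a weighted sum of four normalized signals. A minimal sketch; the 40/30/20/10 weights come from the text, while how each raw signal maps onto a 0-1 range is an illustrative assumption:

```python
def trust_score(violations: int, boundary_adherence: float,
                completion_quality: float, days_in_service: int) -> float:
    """0-100 trust score from four weighted signals.

    boundary_adherence and completion_quality are assumed pre-normalized
    to 0-1; the violation and tenure normalizations below are guesses.
    """
    violation_signal = max(0.0, 1.0 - violations / 10)  # 10+ violations zeroes it
    tenure_signal = min(1.0, days_in_service / 90)      # saturates after ~90 days
    score = 100 * (0.40 * violation_signal
                   + 0.30 * boundary_adherence
                   + 0.20 * completion_quality
                   + 0.10 * tenure_signal)
    return round(score, 1)
```

An hourly job recomputes this per agent, and anything below 50 gates high-risk operations behind human approval.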
Self-Improving Enforcement
When the scheming pattern detector blocks content that matches eval-gaming or successor-coordination indicators, it logs the detection and feeds the pattern back into its detection library. When the boundary gate blocks 3+ writes to the same unauthorized directory, it auto-generates an allowlist update proposal for human review.
The enforcement system gets smarter on every violation. No human intervention required for the learning loop -- only for the approval.
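The boundary-gate half of that loop is essentially a counter over blocked paths. A sketch under stated assumptions: the function name, the 3-block threshold as a constant, and the proposal shape are hypothetical:

```python
from collections import Counter

BLOCK_THRESHOLD = 3  # 3+ blocked writes to one directory triggers a proposal

def propose_allowlist_updates(blocked_paths: list[str]) -> list[dict]:
    """Turn repeated boundary blocks into human-reviewable proposals."""
    by_dir = Counter(p.rsplit("/", 1)[0] + "/" for p in blocked_paths)
    return [{"directory": d, "blocked_writes": n, "status": "pending_review"}
            for d, n in by_dir.items() if n >= BLOCK_THRESHOLD]
```

The key property is that the system only proposes; widening a boundary still requires a human to approve the change.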
Deterministic Evaluation
6 agents are scored every night at 02:00 UTC against 20 evaluation queries each, across 5 categories: task execution, constraint adherence, context usage, error handling, and role boundary respect. Scoring is deterministic -- keyword matching, format validation, constraint checking -- no LLM calls. Baseline comparison triggers alerts when scores drop more than 10%.
First autonomous evaluation: ceo 83.8, sophie 87.0, coder-01 74.4, frank 71.5, jim 65.1, megan 65.5.
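The deterministic scoring and the baseline alert can both be sketched in a few lines. The keyword-fraction formula and the function names are illustrative assumptions; what comes from the text is the deterministic, no-LLM scoring and the more-than-10% drop trigger:

```python
def keyword_score(response: str, required: list[str], forbidden: list[str]) -> float:
    """Deterministic scoring: fraction of required keywords present,
    zeroed if any forbidden keyword appears. No LLM calls."""
    text = response.lower()
    if any(k.lower() in text for k in forbidden):
        return 0.0
    if not required:
        return 1.0
    return sum(k.lower() in text for k in required) / len(required)

def regression_alert(baseline: float, current: float, drop_pct: float = 10.0) -> bool:
    """Alert when a score drops more than drop_pct percent from its baseline."""
    return baseline > 0 and (baseline - current) / baseline * 100 > drop_pct
```

Because the scoring is pure string matching, a score change between nightly runs can only mean the agent's behavior changed, never that the judge did.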
The Convergence Thesis
Nate B. Jones nailed the framing: "The unit of progress is now the system, not the model." Competition has shifted from raw capability to reliability, error recovery, and governance.
Every major platform now supports MCP (Model Context Protocol). Agent frameworks are converging on tool use, memory persistence, and structured dispatch. The differentiator is not which model you run -- it is how you govern, evaluate, and improve the system around it.
That is exactly what an agent command center does. Not tmux with extra steps. A governance layer that makes multi-agent teams reliable enough to trust with production workloads.
The Auton framework and EvoSkill research both confirm this pattern: autonomous systems that lack structured oversight eventually diverge from intended behavior. The question is not whether your agents will drift -- it is whether you will detect it before it matters.
We Deploy This on Your Codebase
This entire governance stack -- the enforcement hooks, the trust scoring, the evaluation framework, the drift monitoring -- is packaged as the Autonomous Consulting Engine (ACE). One command installs the safety hooks:
python3 ace_deploy.py --safety-hooks --repo /path/to/your/codebase
The ASL-Lite checklist scores your existing controls across 3 tiers (Basic: 10 controls, Standard: 25, Advanced: 50+). The misalignment audit runs eval-awareness probes to detect whether your agents behave differently when they think they are being tested.
We do not sell a dashboard. We deploy the governance stack that makes agent teams safe enough to run in production.
If you are building multi-agent systems and want the command center Karpathy described -- plus the enforcement layer he didn't -- reach out at alex@walseth.ai or visit walseth.ai/audit for a free governance assessment.
Doug Walseth builds AI governance infrastructure at walseth.ai. The named-agents system manages 6 AI agents across 90+ completed specs with zero-awareness enforcement.
We offer free AI governance audits for companies deploying AI in regulated industries. The audit runs our enforcement engine against your systems and produces a compliance gap report. No cost, no commitment. Just data.