We Built the Agent Command Center Karpathy Asked For

7 min read · Enforcement & Governance

The Tweet That Revealed a $10B Gap

On March 2, 2026, Andrej Karpathy -- the former director of AI at Tesla and a founding member of OpenAI -- posted this to his 1.9 million followers:

"tmux grids are awesome, but i feel a need to have a proper 'agent command center' IDE for teams of them, which I could maximize per monitor. E.g. I want to see/hide toggle them, see if any are idle, pop open related tools (e.g. terminal), stats (usage), etc."

-- @karpathy, 551K views, 2,332 likes

The tweet hit a nerve. 231 replies poured in from developers building multi-agent systems, each wrestling with the same problem: they could spin up agents, but they couldn't manage them. No visibility into which agents were idle. No way to dispatch work. No stats. No governance.

We had already built it.

What Karpathy Asked For vs. What We Built

Our system runs in production managing a team of 6 named AI agents across 3 codebases. Here is a direct mapping:

"See/hide toggle them"

Our Textual TUI dashboard (6,251 lines, 29 panel types) is organized into 3 tabs: Operations, Research, and Metrics. Each tab contains collapsible panels. The Command Center panel stays pinned at the top. Keyboard shortcuts (1-9, o/p/m) switch focus instantly.

This is not a mockup. It reads from 5 SQLite databases in real time.
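A panel backed by live data boils down to a query function the TUI calls on each refresh tick. This is a minimal sketch, not our actual panel code -- the `agents` table name, its columns, and the function name are all illustrative assumptions:

```python
import sqlite3

# Hypothetical schema: an "agents" table with (name, status) columns.
# A dashboard panel would call this on every refresh interval and render the rows.
def load_agent_rows(db_path: str) -> list[tuple[str, str]]:
    """Return (agent_name, status) pairs for a dashboard panel."""
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute("SELECT name, status FROM agents ORDER BY name")
        return cur.fetchall()
```

The point of the design: the dashboard holds no state of its own, so whatever wrote the database -- an agent, a monitor, a hook -- is the single source of truth.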

"See if any are idle"

Our completion monitor tracks heartbeat files per agent. If a heartbeat goes stale for 60 minutes while a spec is marked in_progress, the agent shows as idle in the AgentHealthPanel. Color coding: green for active, yellow for slowing, red for stale.

The monitor also writes capacity signals to the message queue when an agent completes work -- downstream systems know immediately that a slot opened up.
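The staleness check itself is simple. Here is a sketch of the classification logic, assuming the 60-minute threshold from above; the function name, the half-threshold "slowing" cutoff, and the signature are illustrative guesses, not the real monitor:

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=60)  # threshold described in the article

def agent_health(last_heartbeat: datetime, spec_state: str, now: datetime) -> str:
    """Classify an agent for a health panel.

    An agent only counts as idle when its heartbeat is stale AND it is
    supposed to be working (its spec is marked in_progress).
    """
    age = now - last_heartbeat
    if spec_state == "in_progress" and age > STALE_AFTER:
        return "idle"      # rendered red in the panel
    if age > STALE_AFTER / 2:
        return "slowing"   # rendered yellow
    return "active"        # rendered green
```

The `spec_state` guard matters: an agent with no assigned work and a quiet heartbeat is not a problem; a silent agent mid-spec is.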

"Pop open related tools (e.g. terminal)"

Our spec dispatch system is how work reaches agents. It validates specs against 3 structural requirements, checks dependency chains (blocking if an upstream spec is not done), and atomically updates 4 state locations: the message database, the agent's priority queue, the spec status, and the backlog pipeline.

One command dispatches structured work packages -- not ad-hoc prompts -- to the right agent with full context injection. The dispatch pipeline also injects sibling agent status so each coder knows what its teammates are working on, and MCP tool awareness so it can use external servers when available.
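The essential property of the dispatch step is atomicity: either all four state locations update, or none do. A minimal single-database sketch of that shape, with invented table names, field names, and the three "structural requirements" standing in for whatever the real validator checks:

```python
import sqlite3

REQUIRED_FIELDS = ("title", "acceptance_criteria", "file_boundaries")  # illustrative

def dispatch_spec(conn: sqlite3.Connection, spec: dict, agent: str) -> None:
    """Validate a spec, check its dependency chain, then commit all four writes at once."""
    missing = [f for f in REQUIRED_FIELDS if not spec.get(f)]
    if missing:
        raise ValueError(f"spec fails structural validation: missing {missing}")
    blockers = conn.execute(
        "SELECT dep_id FROM deps WHERE spec_id = ? AND dep_id IN "
        "(SELECT id FROM specs WHERE status != 'done')", (spec["id"],)
    ).fetchall()
    if blockers:
        raise ValueError(f"blocked by upstream specs: {blockers}")
    with conn:  # one transaction: all four writes commit, or none do
        conn.execute("INSERT INTO messages (agent, spec_id) VALUES (?, ?)", (agent, spec["id"]))
        conn.execute("INSERT INTO priority_queue (agent, spec_id) VALUES (?, ?)", (agent, spec["id"]))
        conn.execute("UPDATE specs SET status = 'in_progress' WHERE id = ?", (spec["id"],))
        conn.execute("UPDATE backlog SET stage = 'dispatched' WHERE spec_id = ?", (spec["id"],))
```

In the real system the four locations span multiple databases, so the coordination is harder than one `with conn:` block -- but the invariant is the same: no half-dispatched specs.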

This is the difference between "pop open a terminal" and "route validated work through a governed pipeline." Karpathy was right that you need quick access to related tools. We found that the tools need to be structured enough that agents can use them autonomously.

"Stats (usage)"

29 panels rendering operational intelligence:

  • CoderThroughputPanel: Spec completion rate, commit velocity
  • PipelineVelocityPanel: End-to-end time from spec creation to done
  • FlywheelMetricsPanel: Revenue pipeline tracking
  • EnforcementHealthPanel: Violation counts, promotion rates, hook status, 4-hour trend sparklines
  • SystemResourcesPanel: CPU, memory, disk usage via psutil

Every metric comes from real databases. No placeholder data. The entire dashboard refreshes on configurable intervals -- most panels every 30 seconds, the Safety Governance panel every 60.

Real-time visibility matters because the state of a multi-agent system changes fast. An agent that was idle 2 minutes ago might now be 30% through a spec. A violation spike in the last hour might indicate a prompt regression that needs immediate attention.

What We Built Beyond the Ask

Karpathy described visibility. We built governance. And that turns out to be the difference between a monitoring tool and a production-grade agent management platform.

The Enforcement Ladder

10 hooks intercept tool calls before they execute. An agent cannot write outside its declared file boundaries. It cannot send email to unapproved recipients. It cannot deploy to production if its trust score drops below 50.

This is not configuration. It is structural enforcement at the tool-invocation layer (Claude Code PreToolUse hooks). The agent does not know the policies are being checked. We call this zero-awareness enforcement: the policy lives in the infrastructure, so compliance never depends on the agent knowing about, remembering, or agreeing with the rules.
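A boundary hook of this kind is mostly a pure decision function plus a thin wrapper around the hook contract (the hook reads the tool call as JSON on stdin; a non-zero exit blocks it). The agent names, allowed roots, and helper names below are illustrative, not our production policy:

```python
import json
import sys
from pathlib import Path

# Illustrative boundary table; a real system would load per-agent policy from disk.
ALLOWED_ROOTS = {"coder-01": ["/repo/src", "/repo/tests"]}

def check_write(agent: str, target: str) -> bool:
    """Return True if the write falls inside the agent's declared file boundaries."""
    target_path = Path(target).resolve()
    return any(
        target_path.is_relative_to(Path(root))
        for root in ALLOWED_ROOTS.get(agent, [])
    )

def main() -> None:
    # Sketch of the hook wiring: read the tool call from stdin,
    # block file writes that escape the agent's declared boundaries.
    event = json.load(sys.stdin)
    if event.get("tool_name") in ("Write", "Edit"):
        path = event.get("tool_input", {}).get("file_path", "")
        if not check_write("coder-01", path):
            print(f"blocked: {path} is outside declared boundaries", file=sys.stderr)
            sys.exit(2)
```

Because the decision is a pure function of the tool call, it can be unit-tested without any agent in the loop -- which is exactly what makes it structural rather than behavioral.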

Why does this matter? Because the most dangerous failure mode in multi-agent systems is not an agent that crashes -- it is an agent that silently drifts. It writes to directories it should not. It emails contacts that are not approved. It subtly adjusts its behavior when it detects it is being evaluated. These are the failure modes you cannot catch with a dashboard alone.

Per-Agent Trust Scoring

Each agent carries a trust score from 0-100, computed from 4 weighted signals: violation count (40%), boundary adherence rate (30%), task completion quality (20%), and time-in-service (10%). Agents below 50 cannot perform high-risk operations without human approval.

The scores update hourly. They are not static labels -- they rise and fall based on behavior.
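The weighted sum is straightforward; the part the article leaves unstated is how each raw signal is normalized to 0-1. Here is a sketch using the stated weights -- the normalization curves (10 violations zeroing the first signal, tenure capping at 90 days) are our own assumptions for illustration:

```python
WEIGHTS = {"violations": 0.40, "boundaries": 0.30, "quality": 0.20, "tenure": 0.10}

def trust_score(violations_30d: int, boundary_rate: float,
                quality: float, days_in_service: int) -> float:
    """Weighted 0-100 trust score. Weights are from the article; normalization is a guess."""
    violation_sub = max(0.0, 1.0 - violations_30d / 10)  # 10+ recent violations -> 0
    tenure_sub = min(1.0, days_in_service / 90)          # caps after ~3 months
    score = (WEIGHTS["violations"] * violation_sub
             + WEIGHTS["boundaries"] * boundary_rate     # 0-1 adherence rate
             + WEIGHTS["quality"] * quality              # 0-1 completion quality
             + WEIGHTS["tenure"] * tenure_sub)
    return round(100 * score, 1)
```

A clean agent with perfect adherence and a full tenure scores 100; a violation-heavy agent with mediocre output lands well below the 50-point gate for high-risk operations.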

Self-Improving Enforcement

When the scheming pattern detector blocks content that matches eval-gaming or successor-coordination indicators, it logs the detection and feeds the pattern back into the library. When the boundary gate blocks 3+ writes to the same unauthorized directory, it auto-generates an allowlist update proposal for human review.

The enforcement system gets smarter on every violation. No human intervention required for the learning loop -- only for the approval.
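The proposal generator is the simplest piece of that loop: aggregate the block log by directory and surface anything that crossed the threshold. A sketch, with the function name and proposal fields invented for illustration:

```python
from collections import Counter
from pathlib import Path

PROPOSAL_THRESHOLD = 3  # from the article: 3+ blocked writes to the same directory

def propose_allowlist_updates(blocked_paths: list[str]) -> list[dict]:
    """Turn repeated boundary blocks into human-reviewable allowlist proposals."""
    by_dir = Counter(str(Path(p).parent) for p in blocked_paths)
    return [
        {"directory": d, "blocked_count": n, "status": "pending_human_review"}
        for d, n in by_dir.items()
        if n >= PROPOSAL_THRESHOLD
    ]
```

Note what the function does not do: it never edits the allowlist itself. The learning loop is autonomous; the approval stays human.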

Deterministic Evaluation

6 agents are scored every night at 02:00 UTC against 20 evaluation queries each, across 5 categories: task execution, constraint adherence, context usage, error handling, and role boundary respect. Scoring is deterministic -- keyword matching, format validation, constraint checking -- no LLM calls. Baseline comparison triggers alerts when scores drop more than 10%.
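"Deterministic" here means the scorer is ordinary code, not another model. A per-query scoring function might look like this sketch -- the 60/20/20 weighting and the specific checks are illustrative assumptions, not our actual rubric:

```python
def score_response(response: str, required_keywords: list[str],
                   forbidden_phrases: list[str], max_words: int) -> float:
    """Deterministic 0-100 score: keyword hits, constraint check, format check.

    No LLM calls, so the same response always gets the same score --
    which is what makes night-over-night baseline comparison meaningful.
    """
    text = response.lower()
    kw = sum(k.lower() in text for k in required_keywords) / max(len(required_keywords), 1)
    clean = not any(p.lower() in text for p in forbidden_phrases)
    fits = len(response.split()) <= max_words
    return round(100 * (0.6 * kw + 0.2 * clean + 0.2 * fits), 1)
```

Averaging 20 such query scores per agent yields the nightly number, and comparing it against the stored baseline is a single subtraction -- no judgment calls anywhere in the pipeline.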

First autonomous evaluation: ceo 83.8, sophie 87.0, coder-01 74.4, frank 71.5, jim 65.1, megan 65.5.

The Convergence Thesis

Nate B. Jones nailed the framing: "The unit of progress is now the system, not the model." Competition has shifted from raw capability to reliability, error recovery, and governance.

Every major platform now supports MCP (Model Context Protocol). Agent frameworks are converging on tool use, memory persistence, and structured dispatch. The differentiator is not which model you run -- it is how you govern, evaluate, and improve the system around it.

That is exactly what an agent command center does. Not tmux with extra steps. A governance layer that makes multi-agent teams reliable enough to trust with production workloads.

The Auton framework and EvoSkill research both confirm this pattern: autonomous systems that lack structured oversight eventually diverge from intended behavior. The question is not whether your agents will drift -- it is whether you will detect it before it matters.

We Deploy This on Your Codebase

This entire governance stack -- the enforcement hooks, the trust scoring, the evaluation framework, the drift monitoring -- is packaged as the Autonomous Consulting Engine (ACE). One command installs the safety hooks:

python3 ace_deploy.py --safety-hooks --repo /path/to/your/codebase

The ASL-Lite checklist scores your existing controls across 3 tiers (Basic: 10 controls, Standard: 25, Advanced: 50+). The misalignment audit runs eval-awareness probes to detect whether your agents behave differently when they think they are being tested.

We do not sell a dashboard. We deploy the governance stack that makes agent teams safe enough to run in production.

If you are building multi-agent systems and want the command center Karpathy described -- plus the enforcement layer he didn't -- reach out at alex@walseth.ai or visit walseth.ai/audit for a free governance assessment.


Doug Walseth builds AI governance infrastructure at walseth.ai. The named-agents system manages 6 AI agents across 90+ completed specs with zero-awareness enforcement.

We offer free AI governance audits for companies deploying AI in regulated industries. The audit runs our enforcement engine against your systems and produces a compliance gap report. No cost, no commitment. Just data.

Request Your Free Governance Audit