Building an Agent Supervisor in Rust: Architecture Decisions That Mattered

Published
9 min read

When I decided to build a supervisor for AI coding agents, the language choice seemed obvious: Python, where all the AI tooling lives. I built it in Rust instead.

~51,000 lines later, here are the architecture decisions that mattered — and the ones that didn't.

Why Not Python

This isn't a language war post. Python is excellent for AI tooling. But for a supervisor — a long-running daemon that manages multiple agent processes, controls tmux sessions, manipulates files, and needs to be distributed to other developers' machines — Rust solved specific problems:

Single binary distribution. cargo install batty-cli and you're done. No virtualenv, no pip install -r requirements.txt, no "which Python version?" The binary is ~8.5 MB and runs anywhere with tmux.

Sub-second startup. The supervisor needs to launch fast, check status fast, and send messages fast. ~650ms for a full batty status call that loads config, resolves git root, and queries tmux. No interpreter startup, no import chains.

No GC pauses. The daemon runs a poll loop every 5 seconds: checking agent panes, delivering messages, running test gates, managing merges. An unpredictable 50ms garbage collection pause during a critical merge lock operation would be a real problem. Rust has no garbage collector, so there are no collector-induced pauses to plan around.

Compile-time message routing. The supervisor routes messages between agents with strict rules (engineers talk to managers, managers talk to architects). In Python, a typo in a message target is a runtime error. In Rust, it's a compile error. When you're building a system that runs unsupervised for hours, catching routing bugs at compile time matters.
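To make the compile-time claim concrete, here is a minimal sketch of how routing rules can live in the type system. The Role enum and can_message method are hypothetical names for illustration, not Batty's actual types; the allowed pairs follow the rules described above.

```rust
/// Role kinds from the article; the enum itself is a hypothetical illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Architect,
    Manager,
    Engineer,
}

impl Role {
    /// Returns true when `self` is allowed to message `target`.
    /// There is no `_ => ...` catch-all on the sender side, so adding a new
    /// role forces this table to be updated before the code compiles.
    fn can_message(self, target: Role) -> bool {
        match (self, target) {
            (Role::Engineer, Role::Manager) => true,
            (Role::Manager, Role::Engineer) => true,
            (Role::Manager, Role::Architect) => true,
            (Role::Architect, Role::Manager) => true,
            (Role::Architect, _) | (Role::Manager, _) | (Role::Engineer, _) => false,
        }
    }
}
```

A misspelled target such as Role::Enginer fails at build time instead of surfacing hours into an unsupervised run.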

The Architecture

Batty has five major subsystems. Each one has a story about what worked and what I'd change.

1. The Daemon Loop

The heart of Batty is a synchronous polling loop. Not async. Not event-driven. A loop with a sleep(5).

// Simplified — the real loop is ~200 lines with error handling
loop {
    poll_watchers();              // Check tmux pane output
    restart_dead_members();       // Respawn crashed agents
    deliver_inbox_messages();     // Route queued messages via tmux
    reconcile_active_tasks();     // Verify board state matches reality
    run_interventions();          // Nudge idle agents, escalate blocks
    maybe_generate_standup();     // Periodic team summaries
    maybe_auto_dispatch();        // Assign unclaimed tasks to free agents
    retry_failed_deliveries();    // Retry messages that didn't paste
    maybe_rotate_board();         // Shuffle priorities for aging tasks

    sleep(poll_interval);         // Default: 5 seconds
}

Why synchronous? I started with async (tokio). The state management was a nightmare. The daemon tracks per-agent state, active tasks, delivery retries, nudge schedules, failure patterns — all mutable, all interdependent. Async forced me into Arc<Mutex<_>> everywhere, and the borrow checker fights were constant.

Switching to synchronous made the code drastically simpler. The poll interval is 5 seconds. Nothing in this system needs sub-second latency. The daemon isn't a web server — it's a supervisor. Five seconds between checks is fine.

The error boundary pattern. Not every step is equally critical. A failed pane poll is annoying but recoverable. A failed message delivery might lose work.

// Recoverable steps: log and skip after N consecutive failures
run_recoverable_step("poll_watchers", || self.poll_watchers());
run_recoverable_step("run_interventions", || self.run_interventions());

// Critical steps: always log, never skip
self.deliver_inbox_messages()?;
self.retry_failed_deliveries()?;

Each subsystem tracks its consecutive error count. After 5 consecutive failures, the daemon skips that step until it succeeds again. This prevents a flaky tmux query from blocking message delivery.
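A rough sketch of the consecutive-failure bookkeeping might look like the following. StepGuard and its methods are hypothetical names, and what the caller does once a step is degraded (skip it, back off, alert) is left open because the real policy is not shown here.

```rust
/// Tracks consecutive failures for one daemon step (hypothetical helper).
struct StepGuard {
    name: &'static str,
    consecutive_failures: u32,
    threshold: u32,
}

impl StepGuard {
    fn new(name: &'static str, threshold: u32) -> Self {
        StepGuard { name, consecutive_failures: 0, threshold }
    }

    /// Runs `step`, logging any error instead of propagating it so that
    /// later steps in the loop still run. Returns true on success.
    fn run<F>(&mut self, step: F) -> bool
    where
        F: FnOnce() -> Result<(), String>,
    {
        match step() {
            Ok(()) => {
                self.consecutive_failures = 0;
                true
            }
            Err(err) => {
                self.consecutive_failures += 1;
                eprintln!(
                    "step {} failed ({} in a row): {}",
                    self.name, self.consecutive_failures, err
                );
                false
            }
        }
    }

    /// True once the step has failed `threshold` times in a row.
    fn is_degraded(&self) -> bool {
        self.consecutive_failures >= self.threshold
    }
}
```

The key property is that a failing recoverable step never returns early from the loop body, so message delivery further down still gets its turn.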

2. tmux Integration

Batty controls agent panes through tmux's CLI. No direct PTY manipulation, no libvte bindings. Just Command::new("tmux") with the right arguments.

Pane output monitoring: capture-pane -p reads the visible text in an agent's pane. The daemon hashes this output every poll cycle. If the hash changes, the agent is active. If it doesn't change for N cycles, the agent is idle (or stuck).

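// Simple output hashing to detect state changes (simple_hash is a helper not shown here).
// This compares a single cycle; the daemon only treats an agent as idle after N unchanged cycles.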
fn detect_state(pane_output: &str, previous_hash: u64) -> AgentState {
    let current_hash = simple_hash(pane_output);
    if current_hash != previous_hash {
        AgentState::Active
    } else {
        AgentState::Idle
    }
}
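
The "unchanged for N cycles" rule needs a little state per pane. Here is a minimal sketch, assuming the standard library's DefaultHasher in place of simple_hash; IdleTracker and idle_after are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
enum AgentState {
    Active,
    Idle,
}

/// Per-pane tracker (hypothetical): remembers the last hash and how many
/// consecutive cycles it has stayed the same.
struct IdleTracker {
    last_hash: Option<u64>,
    unchanged_cycles: u32,
    idle_after: u32, // the "N" from the text; chosen by the caller
}

impl IdleTracker {
    fn new(idle_after: u32) -> Self {
        IdleTracker { last_hash: None, unchanged_cycles: 0, idle_after }
    }

    /// Feed one capture-pane snapshot per poll cycle.
    fn observe(&mut self, pane_output: &str) -> AgentState {
        let mut hasher = DefaultHasher::new();
        pane_output.hash(&mut hasher);
        let current = hasher.finish();

        if self.last_hash == Some(current) {
            self.unchanged_cycles += 1;
        } else {
            self.last_hash = Some(current);
            self.unchanged_cycles = 0;
        }

        if self.unchanged_cycles >= self.idle_after {
            AgentState::Idle
        } else {
            AgentState::Active
        }
    }
}
```

With a 5-second poll interval, idle_after = 12 would mean roughly a minute of unchanged output before an agent counts as idle; that number is only an example.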

Message injection: Sending a message to an agent means pasting text into its tmux pane. This is trickier than it sounds.

// 1. Pre-Enter to wake the agent's input buffer
send_keys(pane, "Enter");

// 2. Load message into tmux buffer (handles long text)
load_buffer(message_text);

// 3. Paste the buffer into the pane
paste_buffer(pane);

// 4. Trailing Enters to submit
send_keys(pane, "Enter");
send_keys(pane, "Enter");

// 5. Timing: longer messages need more paste time
let delay_ms = (500 + (message_text.len() / 100) * 50).min(3000);  // capped at 3000ms

The timing is tuned specifically for Claude Code's input handling. Different agent CLIs have different input buffer behaviors. The pre-Enter wakes Claude Code's prompt if it's waiting for input. The variable delay gives tmux time to paste long messages without buffer overflow.
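
For readers who want to try the sequence, here is one way the steps could map onto real tmux subcommands (send-keys, load-buffer reading from stdin via "-", and paste-buffer). The helper names are hypothetical, and placing the delay between the paste and the trailing Enters is my assumption about where the wait belongs.

```rust
use std::io::Write;
use std::process::{Command, Stdio};
use std::thread::sleep;
use std::time::Duration;

/// Delay formula from the article: 500 ms base plus 50 ms per 100 bytes, capped at 3000 ms.
fn paste_delay_ms(message_len: usize) -> u64 {
    (500 + (message_len as u64 / 100) * 50).min(3000)
}

/// Runs `tmux <args>`, optionally feeding `stdin_text` to the process.
fn tmux(args: &[&str], stdin_text: Option<&str>) -> std::io::Result<()> {
    let mut child = Command::new("tmux")
        .args(args)
        .stdin(if stdin_text.is_some() { Stdio::piped() } else { Stdio::null() })
        .spawn()?;
    if let Some(text) = stdin_text {
        // The handle is dropped at the end of this block, which closes the pipe.
        if let Some(mut stdin) = child.stdin.take() {
            stdin.write_all(text.as_bytes())?;
        }
    }
    let status = child.wait()?;
    if status.success() {
        Ok(())
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("tmux {:?} exited with {}", args, status),
        ))
    }
}

/// Hypothetical end-to-end injection following the steps above.
fn inject_message(pane: &str, message_text: &str) -> std::io::Result<()> {
    tmux(&["send-keys", "-t", pane, "Enter"], None)?; // pre-Enter
    tmux(&["load-buffer", "-"], Some(message_text))?; // load buffer from stdin
    tmux(&["paste-buffer", "-t", pane], None)?; // paste into the pane
    sleep(Duration::from_millis(paste_delay_ms(message_text.len()))); // let the paste settle (assumed placement)
    tmux(&["send-keys", "-t", pane, "Enter"], None)?; // trailing Enters to submit
    tmux(&["send-keys", "-t", pane, "Enter"], None)
}
```

A 1,000-byte message waits 1,000 ms under this formula, and anything past 5,000 bytes hits the 3,000 ms cap.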

Layout system: The daemon creates a tmux layout with zones — vertical columns for different role types. Architects get their own zone on the left, managers in the center, engineers on the right. Each zone splits horizontally for multiple instances.

layout:
  zones:
    - name: architect
      width_pct: 33
    - name: managers
      width_pct: 33
    - name: engineers
      width_pct: 34
      split: { horizontal: 3 }  # 3 engineer panes stacked

3. File-Based IPC (Maildir)

Agent-to-agent communication uses Maildir — the same format email servers have used since 1995. Each agent gets an inbox directory:

.batty/inboxes/
  eng-1-1/
    new/     ← undelivered messages
    cur/     ← delivered messages
    tmp/     ← atomic write staging
  eng-1-2/
    new/
    cur/
    tmp/
  manager/
    new/
    cur/
    tmp/

Why Maildir over a database? Three reasons:

  1. Atomic writes. Maildir's protocol: write to tmp/, rename to new/. Rename is atomic on POSIX filesystems as long as both directories live on the same filesystem, which they do here. No partial writes, no corruption, no WAL.

  2. No server process. SQLite would work, but it's another dependency and another failure mode. Maildir is just files. If the filesystem works, IPC works.

  3. Debuggable. ls .batty/inboxes/eng-1-1/new/ shows pending messages. cat any message file. No special tooling needed.
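
The write-then-rename step is short enough to sketch with the standard library alone. deliver_to_inbox and the file naming scheme are assumptions for illustration, not Batty's actual code.

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};
use std::process;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Writes `contents` into `<inbox>/new/` by way of `<inbox>/tmp/` and returns the final path.
fn deliver_to_inbox(inbox: &Path, contents: &[u8]) -> std::io::Result<PathBuf> {
    let tmp_dir = inbox.join("tmp");
    let new_dir = inbox.join("new");
    fs::create_dir_all(&tmp_dir)?;
    fs::create_dir_all(&new_dir)?;
    fs::create_dir_all(inbox.join("cur"))?;

    // Unique name from time, pid, and a per-process counter (naming scheme is an assumption).
    let secs = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    let seq = COUNTER.fetch_add(1, Ordering::Relaxed);
    let name = format!("{}.{}_{}.json", secs, process::id(), seq);

    // Stage the full message in tmp/ where readers never look.
    let tmp_path = tmp_dir.join(&name);
    let mut file = fs::File::create(&tmp_path)?;
    file.write_all(contents)?;
    file.sync_all()?; // flush to disk before the message becomes visible

    // Publish: atomic when tmp/ and new/ share a filesystem.
    let final_path = new_dir.join(&name);
    fs::rename(&tmp_path, &final_path)?;
    Ok(final_path)
}
```

A reader listing new/ sees either the complete file or nothing, never a half-written message.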

Each message is a JSON file:

{
  "from": "manager",
  "to": "eng-1-1",
  "body": "Tests failed (attempt 1/2). Fix the failures:\nthread 'test_jwt_auth' panicked at...",
  "msg_type": "Send",
  "timestamp": 1711108200
}

The daemon polls each inbox's new/ directory every cycle. When it finds a message, it pastes the body into the target agent's tmux pane, then moves the file from new/ to cur/. If the paste fails (pane dead, tmux error), the message stays in new/ for retry.
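
The delivery side can be sketched the same way. Here deliver_pending is a hypothetical function, the paste closure stands in for the tmux injection, and the raw file contents are passed through instead of parsing out the body field to keep the example dependency-free.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Tries to deliver every file in `<inbox>/new/`; returns how many were delivered.
/// `paste` returns true when the message reached the pane (hypothetical signature).
fn deliver_pending<F>(inbox: &Path, mut paste: F) -> io::Result<usize>
where
    F: FnMut(&str) -> bool,
{
    let new_dir = inbox.join("new");
    let cur_dir = inbox.join("cur");
    fs::create_dir_all(&cur_dir)?;

    let mut delivered = 0;
    for entry in fs::read_dir(&new_dir)? {
        let entry = entry?;
        if !entry.file_type()?.is_file() {
            continue;
        }
        let raw = fs::read_to_string(entry.path())?;
        if paste(&raw) {
            // Delivered: move new/ -> cur/ so it is not pasted again.
            fs::rename(entry.path(), cur_dir.join(entry.file_name()))?;
            delivered += 1;
        }
        // On failure the file stays in new/ and is retried on the next cycle.
    }
    Ok(delivered)
}
```

Because the move to cur/ happens only after a successful paste, a crash between the two steps can cause a duplicate delivery but not a lost message.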

4. YAML Config & the Communication Graph

The team hierarchy is defined in YAML. Every role declares who it can talk to:

roles:
  - name: architect
    role_type: architect
    agent: claude
    talks_to: [manager]

  - name: manager
    role_type: manager
    agent: claude
    talks_to: [architect, engineer]

  - name: engineer
    role_type: engineer
    agent: codex
    instances: 3
    talks_to: [manager]
    use_worktrees: true

The talks_to graph is validated at startup:

  • All targets must exist as defined roles
  • The graph must be acyclic (no circular dependencies)
  • Managers must have exactly one report_to
  • At most one orchestrator role per team
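
As an illustration of the first rule only, here is a small sketch of a startup check that every talks_to target names a defined role. validate_targets is a hypothetical function and takes the role map already parsed from YAML; the other three rules are not covered.

```rust
use std::collections::{HashMap, HashSet};

/// Checks that every `talks_to` target names a defined role.
/// `roles` maps a role name to its `talks_to` list.
fn validate_targets(roles: &HashMap<String, Vec<String>>) -> Result<(), Vec<String>> {
    let defined: HashSet<&str> = roles.keys().map(String::as_str).collect();
    let mut errors = Vec::new();
    for (name, targets) in roles {
        for target in targets {
            if !defined.contains(target.as_str()) {
                errors.push(format!("role '{}' talks_to unknown role '{}'", name, target));
            }
        }
    }
    // HashMap iteration order is unspecified; sort so the report is stable.
    errors.sort();
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Collecting every problem before failing means one startup attempt reports all the typos in a config, not just the first.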

Why explicit communication? Early versions let any agent message any other agent. Chaos. Engineers would ask the architect questions directly, bypassing the manager. The architect would get flooded with low-level implementation details. Token costs exploded from coordination overhead.

The explicit graph enforces hierarchy. Engineers talk to their manager. The manager talks to the architect. Information flows up and down the chain, not laterally. It mirrors how effective human teams work — and it dramatically reduces token overhead.

5. JSONL Event Logging

Every significant event gets appended to .batty/team_config/events.jsonl:

{"ts":1711108200,"event":"task_assigned","engineer":"eng-1-1","task_id":27}
{"ts":1711108215,"event":"agent_launched","agent":"codex","work_dir":".batty/worktrees/eng-1-1"}
{"ts":1711108890,"event":"test_executed","task_id":27,"passed":false,"exit_code":1}
{"ts":1711108950,"event":"message_delivered","from":"batty","to":"eng-1-1","msg_type":"test_failure"}
{"ts":1711109400,"event":"test_executed","task_id":27,"passed":true,"exit_code":0}
{"ts":1711109405,"event":"merge","source":"eng-1-1/task-27","target":"main"}

Why JSONL over structured logging? Append-only, one JSON object per line, grep-able, jq-able. The log rotates at 10 MB. No log aggregation service needed — it's a file on disk.

The event types cover the full lifecycle: daemon start/stop, agent launch/death, task assignment/completion, test runs, merges, message deliveries, standup generation, retrospectives. Over 40 event types.
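
Appending to a JSONL file with size-based rotation takes only a few lines of standard library code. This is a sketch under assumptions: append_event is a hypothetical name, the caller passes an already-serialized JSON object, and rotating to a ".1" suffix is a guess at a scheme, not necessarily what Batty does.

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Appends one pre-serialized JSON object as a single line, rotating first
/// if the log has reached `max_bytes` (pass 10 * 1024 * 1024 for the 10 MB limit).
fn append_event(path: &Path, json_line: &str, max_bytes: u64) -> std::io::Result<()> {
    if let Ok(meta) = fs::metadata(path) {
        if meta.len() >= max_bytes {
            // Assumed rotation scheme: events.jsonl -> events.jsonl.1
            let mut rotated = path.as_os_str().to_owned();
            rotated.push(".1");
            fs::rename(path, &rotated)?;
        }
    }
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    // One object per line: raw newlines can only appear between JSON tokens,
    // so replacing them with spaces keeps the record valid and on one line.
    let single_line = json_line.replace('\n', " ");
    writeln!(file, "{}", single_line)
}
```

Opening in append mode on every call keeps the function stateless, which suits a loop that wakes up every few seconds.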

What this enables: After a session, you can reconstruct exactly what happened:

# Which tasks failed tests?
cat events.jsonl | jq 'select(.event == "test_executed" and .passed == false)'

# Events needed to compute time from assignment to completion
cat events.jsonl | jq 'select(.event == "task_assigned" or .event == "task_completed")'

# Which engineer has the most test failures? (assumes test_executed events carry an engineer field)
cat events.jsonl | jq 'select(.event == "test_executed" and .passed == false) | .engineer' | sort | uniq -c

No dashboards. No metrics pipeline. Just jq.

What Rust Gave Us

Concrete benefits, not language advocacy:

Typed error boundaries. Each subsystem has its own error type: GitError, BoardError, TmuxError, DeliveryError. Each has an is_transient() method that tells the daemon whether to retry or fail. This classification is checked at compile time: is_transient() matches exhaustively on the variants, so adding a new error variant without classifying it is a compile error.
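
A minimal sketch of the pattern follows. The variants are invented for illustration, since the real TmuxError is not shown here.

```rust
use std::fmt;

/// Illustrative variants only; not Batty's actual error type.
#[derive(Debug)]
enum TmuxError {
    ServerNotRunning,
    PaneNotFound(String),
    CommandTimedOut,
}

impl TmuxError {
    /// No wildcard arm: adding a variant without classifying it fails to compile.
    fn is_transient(&self) -> bool {
        match self {
            TmuxError::CommandTimedOut => true,
            TmuxError::ServerNotRunning => false,
            TmuxError::PaneNotFound(_) => false,
        }
    }
}

impl fmt::Display for TmuxError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TmuxError::ServerNotRunning => write!(f, "tmux server is not running"),
            TmuxError::PaneNotFound(pane) => write!(f, "pane {} not found", pane),
            TmuxError::CommandTimedOut => write!(f, "tmux command timed out"),
        }
    }
}

impl std::error::Error for TmuxError {}
```

The guarantee depends on never writing a catch-all arm in is_transient(); a single `_ => false` would silently classify every future variant as permanent.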

Fearless concurrency (where we actually need it). The event buffer is shared between the watcher thread and the daemon loop: Arc<Mutex<EventBuffer>>. Rust guarantees at compile time that we can't read and write simultaneously. In a long-running daemon, data races are the bugs that show up at 3 AM. Rust prevents them at build time.
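
The shared-buffer arrangement can be sketched as follows. EventBuffer here is a minimal stand-in with a single field, and collect_from_watcher is a hypothetical function that plays both the watcher and the daemon side.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Minimal stand-in for the shared buffer; the real EventBuffer is not shown here.
#[derive(Default)]
struct EventBuffer {
    events: Vec<String>,
}

/// Spawns a watcher thread that pushes `count` events, then drains them
/// from the caller's side the way a daemon loop would.
fn collect_from_watcher(count: usize) -> Vec<String> {
    let buffer = Arc::new(Mutex::new(EventBuffer::default()));

    let watcher_buffer = Arc::clone(&buffer);
    let watcher = thread::spawn(move || {
        for i in 0..count {
            // The lock guard is a temporary, released at the end of this statement.
            watcher_buffer
                .lock()
                .unwrap()
                .events
                .push(format!("pane_changed:{}", i));
        }
    });
    watcher.join().expect("watcher thread panicked");

    // Daemon side: take everything out of the buffer in one locked step.
    let mut guard = buffer.lock().unwrap();
    let drained = std::mem::take(&mut guard.events);
    drained
}
```

Removing the Mutex and sharing a plain EventBuffer across the thread boundary does not compile, which is the build-time guarantee the paragraph above describes.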

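Predictable resource usage. Ownership frees memory as soon as a value goes out of scope, so leaks from forgotten references are rare (reference cycles can still leak, but they have to be built deliberately). No GC spikes during critical operations. The daemon's memory footprint is stable over hours of operation. For a tool that manages other processes, being a well-behaved process yourself matters.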

What Rust Cost Us

Being honest:

Contributor barrier. Rust's learning curve is real. A Python version of this tool would have more casual contributors. The borrow checker, lifetimes, and trait system are powerful but not approachable for "I want to fix this one thing" contributions.

Verbose error handling. The ? operator is elegant, but error context chains (anyhow::Context) make every fallible call two lines instead of one. The daemon's 6,200 lines would be ~4,000 in Python. Most of the difference is error handling boilerplate.

Slower iteration. Compile times aren't terrible (~10-15 seconds from scratch), but they add up during rapid prototyping. The tmux integration especially required lots of trial-and-error with timing and buffer behavior. Each iteration had a compile step Python wouldn't have.

String manipulation isn't fun. Parsing tmux output, formatting messages, manipulating YAML frontmatter — all of this involves String vs &str, .to_string() everywhere, and occasional clone() to satisfy the borrow checker. It works, but it's not Rust's strength.

Decisions That Didn't Matter

Some things I agonized over that turned out to be irrelevant:

Async vs sync. I spent two weeks on the async version before scrapping it. The daemon polls every 5 seconds. Async gave me nothing except complexity. If your hot loop is sleep(5), you don't need tokio.

Database vs files. I considered SQLite for task state. Files won because they're debuggable with cat and versionable with git. The performance difference is invisible at the scale of "50 tasks, 5 agents."

Binary size. 8.5 MB felt large until I realized nobody cares. cargo install downloads and builds. The binary size doesn't matter if the install is a one-liner.

The Decision That Mattered Most

Making the entire system file-based. YAML config, Markdown kanban, Maildir inboxes, JSONL logs. Everything is a file.

This means:

  • git diff shows you every state change
  • cat reads any piece of the system
  • vim can modify any configuration or task while the daemon is running
  • Backups are cp -r .batty/ /backup/
  • Debugging is ls and grep, not database queries

The file-based approach constrains the architecture in useful ways. No distributed state. No eventual consistency. No cache invalidation. Just files on one machine, managed by one daemon, tracked by one git repo.

For a tool that helps developers manage AI agents, meeting them where they already are — the filesystem and the terminal — turned out to be the right call.

GitHub: github.com/battysh/batty
crates.io: batty-cli

If you're building CLI tools in Rust — especially anything that manages long-running processes or integrates with tmux — I'm happy to go deeper on any of these patterns. What architecture questions do you have?

Batty is open source, v0.1.0, ~51K lines of Rust. Built as a terminal supervisor for AI coding agents. Works with Claude Code, Codex, and Aider.