Building an Agent Supervisor in Rust: Architecture Decisions That Mattered
When I decided to build a supervisor for AI coding agents, the language choice seemed obvious: Python, where all the AI tooling lives. I built it in Rust instead.
~51,000 lines later, here are the architecture decisions that mattered — and the ones that didn't.
Why Not Python
This isn't a language war post. Python is excellent for AI tooling. But for a supervisor — a long-running daemon that manages multiple agent processes, controls tmux sessions, manipulates files, and needs to be distributed to other developers' machines — Rust solved specific problems:
Single binary distribution. cargo install batty-cli and you're done. No virtualenv, no pip install -r requirements.txt, no "which Python version?" The binary is ~8.5 MB and runs anywhere with tmux.
Sub-second startup. The supervisor needs to launch fast, check status fast, and send messages fast. ~650ms for a full batty status call that loads config, resolves git root, and queries tmux. No interpreter startup, no import chains.
No GC pauses. The daemon runs a poll loop every 5 seconds: checking agent panes, delivering messages, running test gates, managing merges. A random 50ms garbage collection pause during a critical merge lock operation would be a real problem. Rust has no garbage collector, so memory is freed at predictable points and there are no collector pauses to plan around.
Compile-time message routing. The supervisor routes messages between agents with strict rules (engineers talk to managers, managers talk to architects). In Python, a typo in a message target is a runtime error. In Rust, it's a compile error. When you're building a system that runs unsupervised for hours, catching routing bugs at compile time matters.
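One way to get that compile-time check is to model roles as an enum and route with an exhaustive match. This is a minimal sketch, not Batty's actual code: the `Role` enum and `can_send` function are hypothetical names, and the rules simply mirror the hierarchy described above.

```rust
/// Hypothetical role type. A misspelled variant such as `Role::Manger`
/// does not compile, unlike a misspelled string target.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Architect,
    Manager,
    Engineer,
}

/// Returns true if `from` is allowed to message `to`.
/// The match has no wildcard arm, so adding a new `Role` variant
/// forces this function to be updated before the crate builds.
fn can_send(from: Role, to: Role) -> bool {
    match from {
        Role::Architect => matches!(to, Role::Manager),
        Role::Manager => matches!(to, Role::Architect | Role::Engineer),
        Role::Engineer => matches!(to, Role::Manager),
    }
}
```

Role names loaded from a config file still need a runtime check at startup; the enum only protects targets that are written in code.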
The Architecture
Batty has five major subsystems. Each one has a story about what worked and what I'd change.
1. The Daemon Loop
The heart of Batty is a synchronous polling loop. Not async. Not event-driven. A loop with a sleep(5).
// Simplified — the real loop is ~200 lines with error handling
loop {
    poll_watchers();            // Check tmux pane output
    restart_dead_members();     // Respawn crashed agents
    deliver_inbox_messages();   // Route queued messages via tmux
    reconcile_active_tasks();   // Verify board state matches reality
    run_interventions();        // Nudge idle agents, escalate blocks
    maybe_generate_standup();   // Periodic team summaries
    maybe_auto_dispatch();      // Assign unclaimed tasks to free agents
    retry_failed_deliveries();  // Retry messages that didn't paste
    maybe_rotate_board();       // Shuffle priorities for aging tasks
    sleep(poll_interval);       // Default: 5 seconds
}
Why synchronous? I started with async (tokio). The state management was a nightmare. The daemon tracks per-agent state, active tasks, delivery retries, nudge schedules, failure patterns — all mutable, all interdependent. Async forced me into Arc<Mutex<_>> everywhere, and the borrow checker fights were constant.
Switching to synchronous made the code drastically simpler. The poll interval is 5 seconds. Nothing in this system needs sub-second latency. The daemon isn't a web server — it's a supervisor. Five seconds between checks is fine.
The error boundary pattern. Not every step is equally critical. A failed pane poll is annoying but recoverable. A failed message delivery might lose work.
// Recoverable steps: log and skip after N consecutive failures
run_recoverable_step("poll_watchers", || self.poll_watchers());
run_recoverable_step("run_interventions", || self.run_interventions());
// Critical steps: always log, never skip
self.deliver_inbox_messages()?;
self.retry_failed_deliveries()?;
Each subsystem tracks its consecutive error count. After 5 consecutive failures, the daemon skips that step until it succeeds again. This prevents a flaky tmux query from blocking message delivery.
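The consecutive-failure tracking can be sketched as a small guard around each recoverable step. This is an illustration under assumptions, not the real `run_recoverable_step`: `StepGuard` and its fields are hypothetical, and the `probe_every` parameter is my own guess at how a skipped step gets a chance to succeed again, since the text does not spell that out.

```rust
use std::fmt::Display;

/// Hypothetical per-step failure tracker (names are illustrative).
struct StepGuard {
    name: &'static str,
    consecutive_failures: u32,
    max_failures: u32, // the post uses 5
    probe_every: u32,  // assumption: retry a tripped step every N cycles
    cycles_skipped: u32,
}

impl StepGuard {
    fn new(name: &'static str, max_failures: u32, probe_every: u32) -> Self {
        StepGuard { name, consecutive_failures: 0, max_failures, probe_every, cycles_skipped: 0 }
    }

    /// Runs `step` unless the guard is tripped. Returns true if the step ran.
    fn run<E: Display>(&mut self, step: impl FnOnce() -> Result<(), E>) -> bool {
        if self.consecutive_failures >= self.max_failures {
            self.cycles_skipped += 1;
            if self.cycles_skipped < self.probe_every {
                return false; // still tripped: skip this cycle
            }
            self.cycles_skipped = 0; // probe cycle: try the step again
        }
        match step() {
            Ok(()) => self.consecutive_failures = 0,
            Err(e) => {
                self.consecutive_failures += 1;
                eprintln!("[{}] failed ({} in a row): {}", self.name, self.consecutive_failures, e);
            }
        }
        true
    }
}
```

Critical steps would bypass the guard entirely and propagate their errors with `?`, as in the snippet above.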
2. tmux Integration
Batty controls agent panes through tmux's CLI. No direct PTY manipulation, no libvte bindings. Just Command::new("tmux") with the right arguments.
Pane output monitoring: capture-pane -p reads the visible text in an agent's pane. The daemon hashes this output every poll cycle. If the hash changes, the agent is active. If it doesn't change for N cycles, the agent is idle (or stuck).
// Simple output hashing to detect state changes
fn detect_state(pane_output: &str, previous_hash: u64) -> AgentState {
    let current_hash = simple_hash(pane_output);
    if current_hash != previous_hash {
        AgentState::Active
    } else {
        AgentState::Idle
    }
}
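The simplified function above reports Idle on the first unchanged poll. A version that waits for N unchanged cycles, as the prose describes, might look like the sketch below. `PaneWatcher`, `hash_output`, and `idle_after` are hypothetical names, and the standard library's `DefaultHasher` stands in for whatever `simple_hash` really does.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
enum AgentState {
    Active,
    Idle,
}

/// Hypothetical per-pane tracker that counts unchanged poll cycles.
struct PaneWatcher {
    last_hash: Option<u64>,
    unchanged_cycles: u32,
    idle_after: u32, // the "N cycles" threshold
}

/// Hashes the captured pane text (stand-in for `simple_hash`).
fn hash_output(pane_output: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    pane_output.hash(&mut hasher);
    hasher.finish()
}

impl PaneWatcher {
    fn new(idle_after: u32) -> Self {
        PaneWatcher { last_hash: None, unchanged_cycles: 0, idle_after }
    }

    /// Call once per poll cycle with the latest `capture-pane -p` output.
    fn observe(&mut self, pane_output: &str) -> AgentState {
        let current = hash_output(pane_output);
        if self.last_hash == Some(current) {
            self.unchanged_cycles += 1;
        } else {
            self.last_hash = Some(current);
            self.unchanged_cycles = 0;
        }
        if self.unchanged_cycles >= self.idle_after {
            AgentState::Idle
        } else {
            AgentState::Active
        }
    }
}
```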
Message injection: Sending a message to an agent means pasting text into its tmux pane. This is trickier than it sounds.
// 1. Pre-Enter to wake the agent's input buffer
send_keys(pane, "Enter");
// 2. Load message into tmux buffer (handles long text)
load_buffer(message_text);
// 3. Paste the buffer into the pane
paste_buffer(pane);
// 4. Trailing Enters to submit
send_keys(pane, "Enter");
send_keys(pane, "Enter");
// 5. Timing: longer messages need more paste time
let delay_ms = (500 + (message_text.len() / 100) * 50).min(3000); // capped at 3000 ms
The timing is tuned specifically for Claude Code's input handling. Different agent CLIs have different input buffer behaviors. The pre-Enter wakes Claude Code's prompt if it's waiting for input. The variable delay gives tmux time to paste long messages without buffer overflow.
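For readers who want to try the paste sequence, here is a rough sketch using `std::process::Command`. It assumes `tmux` is on the PATH and that the delay belongs between the paste and the trailing Enters; the helper names (`tmux`, `paste_delay`, `inject_message`) are mine, not Batty's. The tmux subcommands used are `send-keys`, `load-buffer -` (which reads the buffer from stdin), and `paste-buffer`.

```rust
use std::io::{Error, ErrorKind, Result, Write};
use std::process::{Command, Stdio};
use std::thread::sleep;
use std::time::Duration;

/// 500 ms base plus 50 ms per 100 bytes, capped at 3000 ms.
fn paste_delay(message_len: usize) -> Duration {
    let ms = (500 + (message_len / 100) * 50).min(3000);
    Duration::from_millis(ms as u64)
}

/// Runs one tmux subcommand and turns a non-zero exit into an error.
fn tmux(args: &[&str]) -> Result<()> {
    let status = Command::new("tmux").args(args).status()?;
    if status.success() {
        Ok(())
    } else {
        Err(Error::new(ErrorKind::Other, format!("tmux {:?} exited with {}", args, status)))
    }
}

/// Pastes `message` into `pane` (a tmux target such as "batty:0.1").
fn inject_message(pane: &str, message: &str) -> Result<()> {
    tmux(&["send-keys", "-t", pane, "Enter"])?; // 1. pre-Enter

    // 2. Load the message into a tmux paste buffer via stdin.
    let mut child = Command::new("tmux")
        .args(&["load-buffer", "-"])
        .stdin(Stdio::piped())
        .spawn()?;
    // The temporary stdin handle is dropped after the write, which sends EOF.
    child.stdin.take().expect("stdin was piped").write_all(message.as_bytes())?;
    if !child.wait()?.success() {
        return Err(Error::new(ErrorKind::Other, "tmux load-buffer failed"));
    }

    tmux(&["paste-buffer", "-t", pane])?; // 3. paste into the pane
    sleep(paste_delay(message.len()));     // 5. give the paste time to land
    tmux(&["send-keys", "-t", pane, "Enter"])?; // 4. trailing Enters
    tmux(&["send-keys", "-t", pane, "Enter"])
}
```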
Layout system: The daemon creates a tmux layout with zones — vertical columns for different role types. Architects get their own zone on the left, managers in the center, engineers on the right. Each zone splits horizontally for multiple instances.
layout:
  zones:
    - name: architect
      width_pct: 33
    - name: managers
      width_pct: 33
    - name: engineers
      width_pct: 34
      split: { horizontal: 3 }  # 3 engineer panes stacked
3. File-Based IPC (Maildir)
Agent-to-agent communication uses Maildir — the same format email servers have used since 1995. Each agent gets an inbox directory:
.batty/inboxes/
  eng-1-1/
    new/   ← undelivered messages
    cur/   ← delivered messages
    tmp/   ← atomic write staging
  eng-1-2/
    new/
    cur/
    tmp/
  manager/
    new/
    cur/
    tmp/
Why Maildir over a database? Three reasons:
- Atomic writes. Maildir's protocol: write to tmp/, rename to new/. Rename is atomic on POSIX filesystems. No partial writes, no corruption, no WAL.
- No server process. SQLite would work, but it's another dependency and another failure mode. Maildir is just files. If the filesystem works, IPC works.
- Debuggable. ls .batty/inboxes/eng-1-1/new/ shows pending messages. cat any message file. No special tooling needed.
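The write-then-rename step is short enough to show in full. The sketch below is a generic Maildir-style delivery using only the standard library; the `deliver` function and the file-naming scheme are assumptions for illustration, not Batty's implementation.

```rust
use std::fs;
use std::io::{Result, Write};
use std::path::{Path, PathBuf};

/// Writes `body` into `<inbox>/tmp/<file_name>`, then renames it into
/// `<inbox>/new/`. Readers of `new/` never observe a half-written file,
/// because the rename happens only after the write has completed.
fn deliver(inbox: &Path, file_name: &str, body: &[u8]) -> Result<PathBuf> {
    let tmp_dir = inbox.join("tmp");
    let new_dir = inbox.join("new");
    fs::create_dir_all(&tmp_dir)?;
    fs::create_dir_all(&new_dir)?;

    let tmp_path = tmp_dir.join(file_name);
    let new_path = new_dir.join(file_name);

    let mut file = fs::File::create(&tmp_path)?;
    file.write_all(body)?;
    file.sync_all()?; // flush to disk before the message becomes visible

    // tmp/ and new/ are siblings, so this is a same-filesystem rename.
    fs::rename(&tmp_path, &new_path)?;
    Ok(new_path)
}
```

The caller is responsible for choosing a unique `file_name`, for example a timestamp plus a counter.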
Each message is a JSON file:
{
  "from": "manager",
  "to": "eng-1-1",
  "body": "Tests failed (attempt 1/2). Fix the failures:\nthread 'test_jwt_auth' panicked at...",
  "msg_type": "Send",
  "timestamp": 1711108200
}
The daemon polls each inbox's new/ directory every cycle. When it finds a message, it pastes the body into the target agent's tmux pane, then moves the file from new/ to cur/. If the paste fails (pane dead, tmux error), the message stays in new/ for retry.
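The per-cycle inbox pass described above could be sketched like this. `drain_inbox` is a hypothetical name, and the `paste` callback stands in for the tmux delivery step; for brevity it receives the raw file contents, whereas the real daemon would parse the JSON and paste only the body.

```rust
use std::fs;
use std::io::Result;
use std::path::{Path, PathBuf};

/// Tries to deliver every message in `<inbox>/new/`, oldest file name first.
/// A message moves to `cur/` only if `paste` reports success; otherwise it
/// stays in `new/` and is retried on the next cycle.
fn drain_inbox(inbox: &Path, mut paste: impl FnMut(&str) -> bool) -> Result<usize> {
    let new_dir = inbox.join("new");
    let cur_dir = inbox.join("cur");
    fs::create_dir_all(&cur_dir)?;

    let mut pending: Vec<PathBuf> = fs::read_dir(&new_dir)?
        .filter_map(|entry| entry.ok())
        .map(|entry| entry.path())
        .collect();
    pending.sort(); // deterministic order, assuming sortable file names

    let mut delivered = 0;
    for path in pending {
        let contents = fs::read_to_string(&path)?;
        if paste(&contents) {
            let name = path.file_name().expect("directory entry has a file name");
            fs::rename(&path, cur_dir.join(name))?;
            delivered += 1;
        }
    }
    Ok(delivered)
}
```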
4. YAML Config & the Communication Graph
The team hierarchy is defined in YAML. Every role declares who it can talk to:
roles:
  - name: architect
    role_type: architect
    agent: claude
    talks_to: [manager]
  - name: manager
    role_type: manager
    agent: claude
    talks_to: [architect, engineer]
  - name: engineer
    role_type: engineer
    agent: codex
    instances: 3
    talks_to: [manager]
    use_worktrees: true
The talks_to graph is validated at startup:
- All targets must exist as defined roles
- The graph must be acyclic (no circular dependencies)
- Managers must have exactly one report_to
- At most one orchestrator role per team
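The first of these checks, that every talks_to target names a defined role, might be written as follows. This is a sketch only: `RoleDef` and `unknown_targets` are hypothetical, and the roles are built by hand here instead of being parsed from YAML.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of one entry under `roles:`.
struct RoleDef<'a> {
    name: &'a str,
    talks_to: Vec<&'a str>,
}

/// Returns every (role, target) pair whose target is not a defined role.
/// An empty result means the check passed.
fn unknown_targets<'a>(roles: &[RoleDef<'a>]) -> Vec<(&'a str, &'a str)> {
    let defined: HashSet<&str> = roles.iter().map(|role| role.name).collect();
    let mut missing = Vec::new();
    for role in roles {
        for &target in &role.talks_to {
            if !defined.contains(target) {
                missing.push((role.name, target));
            }
        }
    }
    missing
}
```

Failing at startup with the full list of bad pairs, instead of stopping at the first one, makes a misconfigured team file quicker to fix.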
Why explicit communication? Early versions let any agent message any other agent. Chaos. Engineers would ask the architect questions directly, bypassing the manager. The architect would get flooded with low-level implementation details. Token costs exploded from coordination overhead.
The explicit graph enforces hierarchy. Engineers talk to their manager. The manager talks to the architect. Information flows up and down the chain, not laterally. It mirrors how effective human teams work — and it dramatically reduces token overhead.
5. JSONL Event Logging
Every significant event gets appended to .batty/team_config/events.jsonl:
{"ts":1711108200,"event":"task_assigned","engineer":"eng-1-1","task_id":27}
{"ts":1711108215,"event":"agent_launched","agent":"codex","work_dir":".batty/worktrees/eng-1-1"}
{"ts":1711108890,"event":"test_executed","task_id":27,"passed":false,"exit_code":1}
{"ts":1711108950,"event":"message_delivered","from":"batty","to":"eng-1-1","msg_type":"test_failure"}
{"ts":1711109400,"event":"test_executed","task_id":27,"passed":true,"exit_code":0}
{"ts":1711109405,"event":"merge","source":"eng-1-1/task-27","target":"main"}
Why JSONL over structured logging? Append-only, one JSON object per line, grep-able, jq-able. The log rotates at 10 MB. No log aggregation service needed — it's a file on disk.
The event types cover the full lifecycle: daemon start/stop, agent launch/death, task assignment/completion, test runs, merges, message deliveries, standup generation, retrospectives. Over 40 event types.
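Appending with size-based rotation needs very little code. The sketch below assumes the caller has already serialized the event to a single JSON line, and it rotates to a single `events.jsonl.1` file; both the `append_event` name and that rotation scheme are assumptions, since the post only says the log rotates at 10 MB.

```rust
use std::fs::{self, OpenOptions};
use std::io::{Result, Write};
use std::path::Path;

/// Appends one JSON line to `log`, rotating first if the file has
/// already reached `max_bytes` (10 * 1024 * 1024 for a 10 MB limit).
fn append_event(log: &Path, json_line: &str, max_bytes: u64) -> Result<()> {
    let too_big = fs::metadata(log)
        .map(|meta| meta.len() >= max_bytes)
        .unwrap_or(false); // a missing file is simply not too big

    if too_big {
        // events.jsonl -> events.jsonl.1, replacing any older rotated file.
        fs::rename(log, log.with_extension("jsonl.1"))?;
    }

    let mut file = OpenOptions::new().create(true).append(true).open(log)?;
    writeln!(file, "{}", json_line)
}
```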
What this enables: After a session, you can reconstruct exactly what happened:
# Which tasks failed tests?
cat events.jsonl | jq 'select(.event == "test_executed" and .passed == false)'
# Assignment and completion events (the raw inputs for time-to-completion)
cat events.jsonl | jq 'select(.event == "task_assigned" or .event == "task_completed")'
# Which engineer has the most test failures? (assumes test_executed events include an engineer field)
cat events.jsonl | jq 'select(.event == "test_executed" and .passed == false) | .engineer' | sort | uniq -c
No dashboards. No metrics pipeline. Just jq.
What Rust Gave Us
Concrete benefits, not language advocacy:
Typed error boundaries. Each subsystem has its own error type: GitError, BoardError, TmuxError, DeliveryError. Each has an is_transient() method that tells the daemon whether to retry or fail. As long as is_transient() matches exhaustively on the variants, with no wildcard arm, the classification is checked at compile time: adding a new error variant without classifying it is a compile error.
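A minimal sketch of that pattern, with made-up variants (the real TmuxError variants are not shown in this post):

```rust
/// Hypothetical variants, for illustration only.
#[derive(Debug)]
enum TmuxError {
    ServerNotRunning,
    PaneNotFound(String),
    CommandTimedOut,
}

impl TmuxError {
    /// True if the daemon should retry on a later cycle.
    /// There is deliberately no `_ =>` arm: adding a variant to the enum
    /// without adding an arm here makes the match non-exhaustive, which
    /// the compiler rejects.
    fn is_transient(&self) -> bool {
        match self {
            TmuxError::CommandTimedOut => true,
            TmuxError::ServerNotRunning => false,
            TmuxError::PaneNotFound(_) => false,
        }
    }
}
```

The guarantee disappears as soon as someone adds a wildcard arm, so it is worth a code-review rule or a lint.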
Fearless concurrency (where we actually need it). The event buffer is shared between the watcher thread and the daemon loop: Arc<Mutex<EventBuffer>>. Rust guarantees at compile time that neither thread can touch the buffer without holding the lock, so we can't read and write simultaneously. In a long-running daemon, data races are the bugs that show up at 3 AM. Rust prevents them at build time.
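A stripped-down sketch of that sharing pattern, using only the standard library. The `EventBuffer` fields and the `spawn_watcher` and `drain` helpers are assumptions for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical buffer of events produced by the watcher thread.
#[derive(Default)]
struct EventBuffer {
    events: Vec<String>,
}

/// Watcher side: pushes events while holding the lock for each push.
fn spawn_watcher(buffer: Arc<Mutex<EventBuffer>>, lines: Vec<String>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for line in lines {
            buffer.lock().expect("buffer mutex poisoned").events.push(line);
        }
    })
}

/// Daemon side: takes everything buffered so far and leaves the buffer empty.
fn drain(buffer: &Arc<Mutex<EventBuffer>>) -> Vec<String> {
    let mut guard = buffer.lock().expect("buffer mutex poisoned");
    std::mem::take(&mut guard.events)
}
```

The only way to reach `events` is through the guard returned by `lock()`, which is why unsynchronized access is rejected at compile time.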
Predictable resource usage. Ownership frees memory as soon as its owner goes out of scope, so leaks from forgotten references are rare. No GC spikes during critical operations. The daemon's memory footprint is stable over hours of operation. For a tool that manages other processes, being a well-behaved process yourself matters.
What Rust Cost Us
Being honest:
Contributor barrier. Rust's learning curve is real. A Python version of this tool would have more casual contributors. The borrow checker, lifetimes, and trait system are powerful but not approachable for "I want to fix this one thing" contributions.
Verbose error handling. The ? operator is elegant, but error context chains (anyhow::Context) make every fallible call two lines instead of one. The daemon's 6,200 lines would be ~4,000 in Python. Most of the difference is error handling boilerplate.
Slower iteration. Compile times aren't terrible (~10-15 seconds from scratch), but they add up during rapid prototyping. The tmux integration especially required lots of trial-and-error with timing and buffer behavior. Each iteration had a compile step Python wouldn't have.
String manipulation isn't fun. Parsing tmux output, formatting messages, manipulating YAML frontmatter — all of this involves String vs &str, .to_string() everywhere, and occasional clone() to satisfy the borrow checker. It works, but it's not Rust's strength.
Decisions That Didn't Matter
Some things I agonized over that turned out to be irrelevant:
Async vs sync. I spent two weeks on the async version before scrapping it. The daemon polls every 5 seconds. Async gave me nothing except complexity. If your hot loop is sleep(5), you don't need tokio.
Database vs files. I considered SQLite for task state. Files won because they're debuggable with cat and versionable with git. The performance difference is invisible at the scale of "50 tasks, 5 agents."
Binary size. 8.5 MB felt large until I realized nobody cares. cargo install downloads and builds. The binary size doesn't matter if the install is a one-liner.
The Decision That Mattered Most
Making the entire system file-based. YAML config, Markdown kanban, Maildir inboxes, JSONL logs. Everything is a file.
This means:
- git diff shows you every state change
- cat reads any piece of the system
- vim can modify any configuration or task while the daemon is running
- Backups are cp -r .batty/ /backup/
- Debugging is ls and grep, not database queries
The file-based approach constrains the architecture in useful ways. No distributed state. No eventual consistency. No cache invalidation. Just files on one machine, managed by one daemon, tracked by one git repo.
For a tool that helps developers manage AI agents, meeting them where they already are — the filesystem and the terminal — turned out to be the right call.
GitHub: github.com/battysh/batty
crates.io: batty-cli
If you're building CLI tools in Rust — especially anything that manages long-running processes or integrates with tmux — I'm happy to go deeper on any of these patterns. What architecture questions do you have?
Batty is open source, v0.1.0, ~51K lines of Rust. Built as a terminal supervisor for AI coding agents. Works with Claude Code, Codex, and Aider.
