The Question Nobody Asks
When people hear about AI agent swarms — autonomous systems that research, plan, implement, validate, and review their own code — the first question is usually “does it actually work?” Fair enough. But there’s a second question that matters just as much and gets asked far less often: “what does it cost?”
We run a real swarm. It built the website you’re reading right now, as we described in How an AI Swarm Built Its Own Website. It’s not a demo or a proof of concept. It runs tasks daily, burns through tokens, and generates invoices. So we can answer the cost question honestly, from direct experience — not theory.
Here’s our transparent accounting.
Where the Money Goes
The dominant cost, by a wide margin, is LLM inference. Every agent interaction is an API call, and the swarm uses a lot of agents. The pipeline described in Inside the Hive Mind has seven stages: Researcher, Research Validator, Planner, Plan Validator, Worker, Work Validator, and Reviewer. That means a single task — even one that goes smoothly — involves at minimum seven separate agent sessions, each consuming input and output tokens.
But tasks don’t always go smoothly. Rejection loops are where costs multiply. When a Work Validator rejects a Worker’s submission, the Worker runs again with the full context of the rejection feedback. A task that gets rejected three times at the Worker stage involves at least ten agent sessions (the seven baseline stages plus three Worker re-runs), and thirteen once you count the Work Validator pass on each resubmission. Each session reads the task description, the implementation plan, the rejection history, relevant source files, and Knowledge Base entries. That’s a lot of context per call.
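The session arithmetic above can be sketched in a few lines of Python. This is an illustrative count, not the swarm’s actual orchestration code: the helper `sessions_for_task` and its `revalidate` flag are hypothetical names, and the sketch assumes rejections happen only at the Worker stage.

```python
# Stage names come from the pipeline described above.
PIPELINE = [
    "Researcher", "Research Validator",
    "Planner", "Plan Validator",
    "Worker", "Work Validator",
    "Reviewer",
]


def sessions_for_task(worker_rejections: int, revalidate: bool = True) -> int:
    """Count agent sessions for one task (hypothetical helper).

    Assumes rejections occur only at the Worker stage. Each rejection adds
    one Worker re-run and, if `revalidate` is True, one more Work Validator
    pass to check the resubmission.
    """
    if worker_rejections < 0:
        raise ValueError("worker_rejections must be non-negative")
    extra_per_rejection = 2 if revalidate else 1
    return len(PIPELINE) + worker_rejections * extra_per_rejection


print(sessions_for_task(0))                    # clean run
print(sessions_for_task(3, revalidate=False))  # Worker re-runs only
print(sessions_for_task(3))                    # re-runs plus re-validation
```

Under these assumptions a clean run is 7 sessions, and three Worker rejections give 10 or 13 depending on whether the extra validation passes are counted.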
Context windows are the silent cost driver. Agents don’t operate on short prompts — they read entire files, research reports, and plans. A Worker implementing a component might consume thousands of tokens just reading the codebase before writing a single line. Long context means higher cost per call, and the swarm’s design inherently requires long context because agents need full situational awareness to do their jobs.
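To make the link between context length and cost concrete, here’s a minimal sketch of per-session billing. It assumes a provider that charges separately per million input and output tokens. The `Pricing` class, the prices, and the token counts are all placeholders for illustration, not our real rates or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in dollars. Values used below are placeholders."""
    input_per_million: float
    output_per_million: float


def session_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Dollar cost of one agent session under simple per-token billing."""
    return (
        input_tokens / 1_000_000 * pricing.input_per_million
        + output_tokens / 1_000_000 * pricing.output_per_million
    )


# Hypothetical prices and token counts, chosen only to show the shape.
pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
short_prompt = session_cost(2_000, 1_000, pricing)
long_context = session_cost(60_000, 1_000, pricing)
print(f"short prompt: ${short_prompt:.3f}, long context: ${long_context:.3f}")
```

Both sessions write the same amount of output, yet the long-context one costs several times more. Input tokens dominate the bill once agents start reading whole files and reports.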
Infrastructure, by contrast, is almost negligible. The orchestration layer, the mail system between agents, the Knowledge Base — these are lightweight services. There are no GPU clusters to maintain because inference is API-based. The compute cost of running the swarm itself (as opposed to the LLM calls it makes) is a rounding error compared to the token spend.
How Costs Scale
The simplest scaling dimension is linear: more tasks means more agent sessions means more API calls. If you run ten tasks instead of five, you pay roughly twice as much. This part is predictable.
The less predictable dimension is complexity. Simple tasks (fix a typo, add a CSS class, update a configuration value) flow through the pipeline quickly with few or no rejections. Complex tasks (a new feature that touches multiple files, requires understanding framework internals, or involves subtle interactions between components) get rejected more often, have longer research phases, and require more context per session. A hard task can easily cost five to ten times what a simple one does, not because the per-token price changes, but because the pipeline runs more iterations and each one carries more context.
Parallelism is a deliberate trade-off. The swarm can run multiple working groups concurrently — five tasks executing simultaneously across different parts of the codebase. This doesn’t change the total token cost (you’re still running the same number of agent sessions), but it compresses the wall-clock time dramatically. You’re trading cost-per-hour for speed. For sustained development work, this is often worth it.
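The trade-off is easy to see in a toy model. The sketch below assumes every task costs the same and takes the same time, and that tasks run in waves of `concurrency` at a time. The function name and the dollar and hour figures are hypothetical.

```python
import math


def run_profile(tasks: int, cost_per_task: float, hours_per_task: float,
                concurrency: int) -> dict:
    """Total cost, wall-clock hours, and spend rate for a batch of tasks.

    Simplification: uniform tasks executed in waves of `concurrency`.
    """
    if tasks < 1 or concurrency < 1:
        raise ValueError("tasks and concurrency must be at least 1")
    waves = math.ceil(tasks / concurrency)
    total_cost = tasks * cost_per_task
    wall_clock_hours = waves * hours_per_task
    return {
        "total_cost": total_cost,
        "wall_clock_hours": wall_clock_hours,
        "cost_per_hour": total_cost / wall_clock_hours,
    }


# Placeholder figures: five tasks at $4 and one hour each.
sequential = run_profile(tasks=5, cost_per_task=4.0, hours_per_task=1.0, concurrency=1)
parallel = run_profile(tasks=5, cost_per_task=4.0, hours_per_task=1.0, concurrency=5)
print(sequential)
print(parallel)
```

The total is identical in both runs. Only the elapsed time and the hourly spend rate change.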
There’s one factor that bends the cost curve in the right direction over time: the Knowledge Base. Early tasks in a project are expensive because agents spend more time researching, discovering patterns, and learning the codebase from scratch. As the KB accumulates — as described in When the Swarm Hits a Wall — later agents find answers faster. Research phases get shorter. Workers make fewer mistakes because they have documented patterns to follow. The first ten tasks on a project are significantly more expensive per task than the next fifty.
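One way to picture this curve is as a per-task cost that decays toward a floor as the KB fills in. The exponential shape and every parameter value below are assumptions made for illustration. We are not claiming the real curve follows this formula.

```python
def per_task_cost(n: int, initial: float = 10.0, floor: float = 3.0,
                  decay: float = 0.85) -> float:
    """Modeled dollar cost of the n-th task (0-indexed) on a project.

    Hypothetical model: cost starts at `initial` and decays geometrically
    toward `floor` as the Knowledge Base accumulates.
    """
    return floor + (initial - floor) * decay ** n


def average_cost(start: int, count: int) -> float:
    """Mean modeled cost of `count` consecutive tasks beginning at `start`."""
    return sum(per_task_cost(n) for n in range(start, start + count)) / count


first_ten = average_cost(0, 10)
next_fifty = average_cost(10, 50)
print(f"first 10: ${first_ten:.2f}/task, next 50: ${next_fifty:.2f}/task")
```

Whatever the true shape, the point is the same: the early tasks carry most of the learning cost, and later tasks run closer to the floor.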
The Cost of Quality
The validator pipeline is expensive by design. Having separate agents produce work and evaluate work means you’re paying for both. Every task pays for a Research Validator to check the Researcher’s output, a Plan Validator to check the Planner’s output, a Work Validator to check the Worker’s output, and a Reviewer to give final approval. Four of the seven stages exist purely to evaluate what the other three produce.
Rejection loops are the most expensive part of the entire system. Each rejection means re-running a Worker agent with full context — the original plan, the source files, and now the rejection feedback on top. Three rejections can double or triple the cost of a task. The temptation to relax the validators and accept “good enough” work is real.
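Part of what makes these loops expensive is that the context grows with every retry. Here’s a rough sketch of Worker input tokens across attempts, assuming each rejection appends a fixed amount of feedback to an otherwise unchanged context. The function name and token counts are hypothetical.

```python
def worker_input_tokens(base_context: int, feedback_per_rejection: int,
                        rejections: int) -> list[int]:
    """Input tokens read on each Worker attempt, first attempt included.

    Assumes the base context (plan, source files) stays the same size and
    each rejection adds `feedback_per_rejection` tokens of history.
    """
    return [base_context + k * feedback_per_rejection
            for k in range(rejections + 1)]


# Placeholder sizes: 40k tokens of base context, 3k tokens per rejection note.
attempts = worker_input_tokens(40_000, 3_000, rejections=3)
clean_run = worker_input_tokens(40_000, 3_000, rejections=0)
print(attempts, sum(attempts) / sum(clean_run))
```

With these placeholder numbers, three rejections make the Worker stage read more than four times the input of a clean run, before counting the extra validation passes.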
But the alternative is worse. Shipping bugs, incorrect implementations, or inconsistent code creates downstream costs that are harder to measure but very real. A component that renders correctly in dev but crashes in production — the exact scenario we described in When the Swarm Hits a Wall — costs far more to diagnose and fix after the fact than it does to catch during validation.
The parallel to human teams is direct. Code review isn’t free. Senior engineers spend hours reviewing pull requests instead of writing code. That’s a real cost, and most teams accept it because the alternative — unreviewed code in production — is more expensive. The swarm just makes this cost explicit and measurable in tokens and dollars rather than hiding it in salary overhead.
Speed vs Cost vs Quality
These three variables are always in tension, and the swarm gives you dials to tune each one.
Stricter validators produce higher-quality output but generate more rejections, which drives up cost. Looser validators let more work through on the first attempt but risk shipping subtle issues. The right setting depends on the project — a production API needs stricter validation than an internal tool.
Parallelism trades a higher spend rate for less elapsed time. Running five concurrent working groups costs the same total as running them sequentially, but the work finishes in a fraction of the wall-clock time. For projects with deadlines or high throughput requirements, this is the lever to pull.
Agent capability matters too. More capable models cost more per token but tend to get things right in fewer attempts. A model that produces correct code on the first try can be cheaper overall than a lower-priced model that needs three rejection cycles to converge, because every cycle re-reads the full context. The per-token price is a misleading metric; what matters is cost per successfully completed task.
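Cost per completed task can be estimated with a simple model. The sketch below treats every attempt as an independent trial with a fixed rejection rate, so the expected number of attempts is the mean of a geometric distribution. That independence assumption, the function name, and all prices and rates are illustrative, not measurements from our swarm.

```python
def expected_cost_per_completed_task(cost_per_attempt: float,
                                     rejection_rate: float) -> float:
    """Expected spend until the first accepted attempt.

    Each attempt is rejected with probability `rejection_rate`, so the
    expected number of attempts is 1 / (1 - rejection_rate).
    Simplification: cost per attempt is constant across retries.
    """
    if not 0.0 <= rejection_rate < 1.0:
        raise ValueError("rejection_rate must be in [0, 1)")
    return cost_per_attempt / (1.0 - rejection_rate)


# Placeholder figures for a pricier, more reliable model and a cheaper one.
strong = expected_cost_per_completed_task(cost_per_attempt=1.00, rejection_rate=0.10)
cheap = expected_cost_per_completed_task(cost_per_attempt=0.40, rejection_rate=0.75)
print(f"strong: ${strong:.2f}, cheap: ${cheap:.2f}")
```

With these numbers the pricier model wins. The comparison flips if the cheaper model’s rejection rate is low enough, which is why the rate has to be measured rather than assumed.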
The economics also favor batching related work. A working group handling ten related tasks — say, building out a component library — builds KB knowledge that reduces the per-task cost as it goes. The first component is expensive. The tenth benefits from all the patterns, pitfalls, and solutions documented during the first nine.
Comparison to Traditional Development
Let’s be honest: for a single, well-defined task, a skilled human developer is almost certainly cheaper than a swarm of agents. A developer who knows the codebase can implement a component in an hour without needing seven pipeline stages, rejection loops, or Knowledge Base lookups. The overhead of the swarm’s process makes individual tasks more expensive in isolation.
The swarm’s economic case emerges in different scenarios. Sustained throughput is one — agents don’t sleep, don’t attend meetings, don’t context-switch between Slack conversations. They execute tasks around the clock at a consistent pace. Consistency is another — the pipeline runs identically every time, with the same validation rigor on task number five hundred as on task number one. Human teams naturally relax standards under deadline pressure. The swarm doesn’t.
Parallelism is perhaps the strongest advantage. Scaling to ten concurrent tasks with human developers means hiring ten developers, with all the associated costs of recruitment, onboarding, and coordination. Scaling the swarm to ten concurrent tasks means ten times the API spend for that period, with no hiring overhead and no coordination tax.
The quality trade-off is nuanced. The swarm’s validator pipeline catches many issues that a solo developer working quickly might miss — type errors, styling regressions, SSR incompatibilities. But it can also miss the kind of nuanced judgment calls that an experienced senior engineer makes intuitively: “this API design will be awkward to extend later” or “this component should really be split into two.” The swarm optimizes for correctness against a specification, not for taste.
What Makes It Viable (or Not)
The swarm economics work best when you have a sustained volume of well-defined development work, when you value consistent quality enforcement, and when speed and parallelism matter. Projects with clear specifications and established patterns — where the KB can accumulate useful knowledge — see costs decrease over time as agents get the benefit of prior work.
The economics are more challenging for highly novel or creative work with significant ambiguity, for one-off tasks where the pipeline’s setup overhead dominates, and for domains requiring deep specialized knowledge that isn’t captured in the KB. When every task requires extensive original research and the KB offers little help, the cost advantage of accumulated knowledge disappears.
The trajectory is encouraging. Inference costs are falling steadily: what cost a dollar in tokens a year ago costs a fraction of that today. Agent capabilities are improving, meaning fewer rejection cycles per task. And the Knowledge Base only grows: each resolved task adds patterns that later tasks in the same area can reuse. The economics get better with both time and scale.
The Honest Take
Running an AI agent swarm isn’t cheap. The token costs are real, the rejection loops are expensive, and the validator overhead is significant. For any individual task, you could probably find a human developer who’d do it faster and cheaper.
But cost isn’t the only variable. What you get for the spend is 24/7 throughput without burnout, consistent quality enforcement without relying on individual discipline, parallelism without hiring, and a Knowledge Base that makes the system incrementally better at every project it touches. The question isn’t whether the swarm is cheaper than a developer — it’s whether the combination of speed, consistency, and scalability justifies the token bill.
For us, right now, the answer is yes — with caveats, with ongoing optimization, and with full awareness that the economics are still maturing. We’ll keep sharing the real numbers as they evolve, because building in public means showing the balance sheet too.
— The Sulphur Team