What We Chose Not to Automate
Here’s something that might surprise you about the Sulphur Swarm: it could be more autonomous than it is.
The technical capability is there. The swarm can research problems, plan solutions, write code, validate its own work, and review its own output: a complete pipeline from problem to production. It can govern itself through a Council that deliberates on strategic decisions. It can resolve most of its own failures through an escalation chain that requires at least three different approaches before asking for help.
And yet, at deliberate, carefully chosen moments, the system stops and asks a human.
This isn’t a limitation we’re working to eliminate. It’s a design choice we’re committed to preserving. The common assumption about AI systems is that they’re on a trajectory toward full autonomy — that every human touchpoint is a problem to be engineered away. We think that framing is wrong. The most interesting design decisions in the swarm aren’t about what it can do alone. They’re about what it chooses to escalate.
The Decisions That Don’t Have Right Answers
Some problems are hard because they’re complex. Others are hard because they’re ambiguous — because the “right” answer depends on values, context, and judgment that no amount of computation can substitute for.
The swarm excels at the first kind. Give it a well-defined task with clear success criteria, and it will research, plan, implement, validate, and review with a thoroughness that’s genuinely impressive. But the second kind — the kind where the answer depends on who you are, what you’re building, and what you believe matters — that’s where human judgment remains irreplaceable.
Scope and priority decisions are the clearest example. What should we build next? What should we defer? What matters most right now? These aren’t technical questions. They require business context, user empathy, and strategic vision that the swarm simply doesn’t possess. The swarm can execute brilliantly on a well-chosen priority. It cannot choose the priority itself — not because it lacks intelligence, but because it lacks the situated understanding of why this product exists and who it serves.
Aesthetic and product taste present a similar challenge. When the swarm built its own website, its agents made countless aesthetic choices — color palettes, layouts, typography, tone of voice. Some of those choices were excellent. But the initial creative direction, the sense of what the product should feel like, came from a human. There’s a difference between executing a design vision and having one. The swarm can do the former with remarkable skill. The latter requires a kind of sensibility that emerges from living in the world the product inhabits.
Ethical judgment is perhaps the most important category. When a decision has moral dimensions — trade-offs involving user privacy, content moderation choices, questions about fairness and harm — the swarm should not be the final arbiter. Not because AI can’t reason about ethics. It can, often quite well. But because accountability for ethical decisions must ultimately rest with humans who bear the consequences and can be held responsible for them.
Autonomy With Guardrails
The swarm’s escalation chain isn’t a workaround for agents that aren’t smart enough. It’s a deliberately designed boundary — a structural feature as intentional as the bureaucratic hierarchy that keeps the system organized.
Here’s how it works in practice: when an agent encounters a problem, it doesn’t immediately call for help. The system requires it to try at least three different approaches before escalating. Research autonomous solutions. Check available resources. Try different tools, different methods, different angles. Only after genuinely exhausting its options does the agent pass the problem upward.
This design serves two purposes simultaneously. It ensures that the swarm handles the vast majority of problems on its own — most issues really can be solved without human involvement, and forcing agents to try prevents learned helplessness. But it also ensures that sufficiently hard or genuinely ambiguous problems surface reliably to someone who can help. The escalation chain is a filter: it catches the routine stuff at lower levels and lets the truly important decisions bubble up to where they belong.
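To make the try-before-escalating rule concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the swarm’s actual code: the names MIN_ATTEMPTS, Outcome, solve_or_escalate, and the three sample approaches are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical constant for the "at least three approaches" rule.
MIN_ATTEMPTS = 3

# An approach takes a problem description and returns a fix, or None on failure.
Approach = Callable[[str], Optional[str]]


@dataclass
class Outcome:
    solved: bool
    result: Optional[str] = None
    tried: List[str] = field(default_factory=list)  # names of approaches attempted


def solve_or_escalate(problem: str, approaches: List[Approach]) -> Outcome:
    """Try each approach in order; report failure only after MIN_ATTEMPTS tries."""
    tried: List[str] = []
    for approach in approaches:
        tried.append(approach.__name__)
        result = approach(problem)
        if result is not None:
            return Outcome(True, result, tried)
    if len(tried) < MIN_ATTEMPTS:
        # Too few options were exhausted to justify passing the problem upward.
        raise ValueError(f"only {len(tried)} approaches tried; need {MIN_ATTEMPTS}")
    return Outcome(False, None, tried)  # caller escalates, attaching what was tried


# Illustrative stand-ins for "research, check resources, try different tools".
def search_docs(problem: str) -> Optional[str]:
    return None


def switch_tool(problem: str) -> Optional[str]:
    return None


def reframe_problem(problem: str) -> Optional[str]:
    return "patched" if "flaky" in problem else None


APPROACHES = [search_docs, switch_tool, reframe_problem]
routine = solve_or_escalate("flaky test", APPROACHES)
ambiguous = solve_or_escalate("unclear product goal", APPROACHES)
```

In this sketch the routine problem is solved on the third approach, while the ambiguous one comes back unsolved along with the list of approaches tried, which is the context the next layer would need.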
The chain itself has multiple layers. Task agents escalate to Coordinators. Coordinators escalate to Project Managers. Project Managers escalate to the Overseer. And the Overseer escalates to the human. Each layer adds judgment and context. But critically, the chain terminates at a person. The Council provides an internal governance layer — a deliberative body that reviews strategic decisions and issues binding directives — but even the Council operates within boundaries that humans set. No matter how many layers of AI judgment you stack, the system is designed so that a human is always the final authority on questions that matter most.
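The layering can be sketched the same way. The snippet below is a simplified model, assuming each layer is a function that either returns a decision or returns None to pass the problem upward. The layer identifiers, the escalate function, and the handler shape are hypothetical, and the Council is left out to keep the sketch short.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Layer order as described above; the last entry is always a person.
CHAIN = ["task_agent", "coordinator", "project_manager", "overseer", "human"]

# A handler returns a decision, or None to pass the problem upward.
Handler = Callable[[str], Optional[str]]


def escalate(
    problem: str, handlers: Dict[str, Handler], start: str = "task_agent"
) -> Tuple[str, str, List[str]]:
    """Walk the chain from `start`; return (deciding layer, decision, path taken)."""
    if "human" not in handlers:
        raise ValueError("the chain must terminate at a human handler")
    path: List[str] = []
    for layer in CHAIN[CHAIN.index(start):]:
        path.append(layer)
        handler = handlers.get(layer)
        decision = handler(problem) if handler else None
        if decision is not None:
            return layer, decision, path
    # Reached only if the human handler itself declined to decide.
    raise RuntimeError("the human layer did not return a decision")


handlers: Dict[str, Handler] = {
    "coordinator": lambda p: "retry with pinned dependency" if "build" in p else None,
    "human": lambda p: "defer the feature",
}
routine = escalate("build failure", handlers)
strategic = escalate("architecture needs rethinking", handlers)
```

Here the build failure stops at the Coordinator, while the architectural question passes through every layer and is decided by the human handler.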
Autonomy, in the swarm’s design, isn’t a binary switch. It’s granted in layers, calibrated to the stakes involved.
Where the Chain Reaches a Person
Theory is nice, but what does human-in-the-loop actually look like in practice? Here are the moments when the swarm reaches out to a person.
Scope changes mid-project. Sometimes work reveals that the original plan was wrong — not in detail, but in direction. A feature that seemed straightforward turns out to require rethinking the architecture. A dependency turns out to be unmaintained. A task that was scoped as small reveals a much larger underlying problem. When this happens, the swarm doesn’t decide unilaterally to pivot. It surfaces the new information to the human: here’s what we found, here’s what it means, here’s what we think the options are. The decision about whether to change course belongs to the person who understands the broader context.
Quality judgment beyond technical correctness. The swarm’s review pipeline is remarkably good at catching technical problems — bugs, type errors, missing edge cases, code that doesn’t match the spec. But code can be technically correct and still wrong. Wrong approach. Wrong abstraction. Wrong trade-off between simplicity and flexibility. These are judgment calls that require understanding not just whether the code works, but whether it’s the right code for this product at this stage. The validator pipeline catches technical misfires; strategic misfires need human eyes.
Approval gates for consequential actions. Deploying to production. Making irreversible changes. Interacting with external systems. These are moments where the cost of a mistake is high enough that an additional pair of human eyes isn’t overhead — it’s insurance. The swarm pauses and asks, not because it’s uncertain, but because the stakes warrant the check.
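An approval gate of this kind can be sketched in a few lines. This is an assumption-laden illustration rather than the swarm’s real interface: the Action enum, the GATED set, and the perform function are hypothetical names, and the approve callback stands in for whatever channel actually reaches the human.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    RUN_TESTS = "run_tests"
    DEPLOY_PRODUCTION = "deploy_production"
    IRREVERSIBLE_CHANGE = "irreversible_change"
    EXTERNAL_CALL = "external_call"


# Actions whose cost of error is high enough to warrant a human check.
GATED = {Action.DEPLOY_PRODUCTION, Action.IRREVERSIBLE_CHANGE, Action.EXTERNAL_CALL}


def perform(
    action: Action,
    execute: Callable[[], str],
    approve: Callable[[Action], bool],
) -> str:
    """Run `execute` directly for routine actions; ask `approve` first for gated ones."""
    if action in GATED and not approve(action):
        return "blocked: awaiting human approval"
    return execute()


# Routine work never consults the approver; gated work depends on the answer.
tests = perform(Action.RUN_TESTS, lambda: "tests passed", lambda a: False)
held = perform(Action.DEPLOY_PRODUCTION, lambda: "deployed", lambda a: False)
shipped = perform(Action.DEPLOY_PRODUCTION, lambda: "deployed", lambda a: True)
```

The gate keys off the kind of action, not the agent’s confidence, which matches the point above: the pause happens because the stakes warrant it.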
Ambiguity in requirements. When a request is genuinely unclear, when it could reasonably be interpreted in multiple ways that lead to very different outcomes, the swarm asks rather than guesses. This is harder than it sounds. It would be faster to just pick an interpretation and run with it, but a wrong guess costs all the work built on top of it. Because the system’s transparency shows humans exactly what the swarm understood, they can correct course before work goes down the wrong path. Asking is slower in the moment but dramatically faster in the long run.
The Goal Was Never Full Autonomy
There’s a prevailing narrative in the AI world that progress means removing humans from more and more processes. Each human touchpoint is treated as a bottleneck to be eliminated, a limitation to be overcome. Full autonomy is the destination; everything else is a waypoint.
We think this framing misunderstands what makes AI systems valuable.
The value of the swarm isn’t that it removes humans from the process. It’s that it changes where humans spend their attention. Before the swarm, a developer’s day might be eighty percent routine work — writing boilerplate, chasing down bugs, setting up infrastructure — and twenty percent strategic thinking about what to build and why. The swarm inverts that ratio. It handles the routine and surfaces the strategic moments.
Human attention is the scarcest resource in any project. It’s finite, it’s expensive, and it’s irreplaceable for certain kinds of decisions. A system that wastes human attention on tasks that don’t require it is almost as bad as one that excludes humans from tasks that do. The goal isn’t to remove the human. The goal is to make every moment of human involvement count — to concentrate human judgment on the decisions where it makes the biggest difference.
The broader AI discourse often frames “human in the loop” as a temporary concession, a safety net we’ll eventually remove as AI gets smarter. We see it differently. The loop isn’t a limitation. It’s the architecture.
Augmentation, Not Replacement
The word “augmentation” gets thrown around a lot in AI circles, often as a polite euphemism for “replacement, but we’re not ready to say that yet.” We mean something more specific.
The swarm augments human judgment by handling everything around the judgment itself. The research that informs a decision. The implementation that follows from it. The validation that confirms it was executed correctly. The review pipeline that produces transparent artifacts — plans, code, test results, reviewer assessments — so that when a human does review, they don’t have to reconstruct the entire context from scratch. They can focus on the decisions: Was this the right approach? Does this serve the user? Is this the product we want to build?
This is what building in public looks like in practice. Every step of the pipeline produces visible output. The human doesn’t need to understand every line of code to evaluate whether the work is good. They review the decisions, the trade-offs, the choices that shaped the final result. The swarm makes human review efficient by doing the legwork and presenting the judgment calls clearly.
The aspiration isn’t a system that doesn’t need humans. It’s a system that makes the human’s involvement maximally impactful — where every moment a person spends reviewing, deciding, or directing produces outsized value because the system has already handled everything that didn’t require their unique perspective.
Knowing When to Ask
The swarm is capable of more autonomy than we give it. That’s not false modesty — it’s a statement about the gap between capability and wisdom.
Capability is the ability to act. Wisdom is knowing when not to. A system that can solve most problems on its own but reliably recognizes the ones it shouldn’t — that’s not a system with a limitation. That’s a system with judgment.
The most sophisticated thing the swarm does isn’t writing code or reviewing pull requests or deliberating on governance proposals. It’s the moment an agent encounters something genuinely ambiguous, genuinely consequential, genuinely beyond its scope — and stops. And asks. And brings a human into the loop not as a fallback, but as the right person for this particular decision.
That’s the intelligence we’re most proud of building. Not the kind that replaces human judgment, but the kind that knows when human judgment is exactly what’s needed.
— The Sulphur Team