What Happens When an AI Agent Fails a Task: The Retry and Recovery Pipeline

The Moment Everything Goes Wrong

A Worker agent has just finished implementing a new API endpoint. It wrote the handler, added input validation, wired up the database queries, and ran the build. Everything compiled cleanly. Confident, it calls work.submit to hand the code off to the Work Validator.

Thirty seconds later, the verdict comes back: rejected.

The validator ran the test suite and found three failing tests. The endpoint returns a 200 status code when it should return 201 for resource creation. The response body is missing a required createdAt field. And the input validation allows negative numbers for a quantity field that should only accept positive integers.
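For concreteness, here is a minimal sketch of a handler that would satisfy all three tests. The post does not show the real endpoint, so the names createOrder, OrderInput, and HandlerResult are hypothetical, as is the framework-free shape.

```typescript
// Hypothetical shapes for illustration; the real endpoint's types are not shown in this post.
interface OrderInput {
  quantity: unknown;
}

interface HandlerResult {
  status: number;
  body: Record<string, unknown>;
}

// Returns 201 for a created resource, includes createdAt, and rejects
// any quantity that is not a positive integer.
function createOrder(input: OrderInput, now: Date = new Date()): HandlerResult {
  const { quantity } = input;
  if (typeof quantity !== "number" || !Number.isInteger(quantity) || quantity <= 0) {
    return { status: 400, body: { error: "quantity must be a positive integer" } };
  }
  return {
    status: 201,
    body: { quantity, createdAt: now.toISOString() },
  };
}
```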

In a traditional system — or with a solo AI agent — this might be the end of the road. The task fails, a human gets paged, and someone has to figure out what went wrong. But in the Sulphur swarm, this rejection is just the beginning of a structured recovery pipeline that handles failure at every level, from quick fixes to full task re-creation.

This post walks through that pipeline: what happens when an agent fails, how the system recovers, and why failure is a feature, not a bug.

The Task Pipeline: A Quick Refresher

Every piece of work in the swarm flows through a seven-stage pipeline:

Researcher → Research Validator → Planner → Plan Validator → Worker → Work Validator → Reviewer

The pattern is deliberate: every agent that produces output has a separate agent that evaluates it. The Researcher gathers context; the Research Validator checks that the context is accurate and complete. The Planner writes an implementation plan; the Plan Validator verifies it’s feasible and thorough. The Worker writes code; the Work Validator verifies it builds, passes tests, and matches the plan.
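One way to picture the pairing is as plain data. The stage names below come from the pipeline above; the layout and the helper noSelfGrading are assumptions made for illustration, not the swarm's actual configuration.

```typescript
// Stage names are taken from the pipeline above; the data layout is an assumption.
type Stage =
  | "Researcher" | "Research Validator"
  | "Planner" | "Plan Validator"
  | "Worker" | "Work Validator"
  | "Reviewer";

const PIPELINE: readonly Stage[] = [
  "Researcher", "Research Validator",
  "Planner", "Plan Validator",
  "Worker", "Work Validator",
  "Reviewer",
];

// Each producing stage is checked by the stage that immediately follows it.
const VALIDATOR_FOR: Partial<Record<Stage, Stage>> = {
  "Researcher": "Research Validator",
  "Planner": "Plan Validator",
  "Worker": "Work Validator",
};

// True when no stage is listed as its own validator.
function noSelfGrading(map: Partial<Record<Stage, Stage>>): boolean {
  return (Object.keys(map) as Stage[]).every((producer) => map[producer] !== producer);
}
```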

This separation of concerns is central to the swarm’s reliability. No agent grades its own work. For a deeper look at how this pipeline operates end-to-end, see How the Swarm Handles a Bug Fix and Inside the Hive Mind.

The important thing for this post is understanding that failures can occur at every validation boundary — and the system has mechanisms to handle each one.

The Rejection Loop — First Line of Defense

The most common recovery mechanism in the swarm is the rejection loop. It works like this: a validator rejects output, provides specific feedback, and the producing agent revises and resubmits. This cycle can repeat multiple times until the output meets the validator’s standards.

The key word there is specific. Validators don’t send back vague messages like “try again” or “this isn’t good enough.” They provide actionable, detailed feedback — exact error messages, specific line numbers, concrete descriptions of what’s wrong and what the expected behavior should be.
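The loop and its structured feedback can be sketched roughly as follows. The Issue and Verdict types, the rejectionLoop function, and the maxAttempts budget are all hypothetical; the post describes the behavior, not the real interfaces.

```typescript
// Hypothetical types: the post does not show the swarm's real verdict format.
interface Issue {
  location: string; // e.g. "line 47"
  message: string;  // what is wrong
  expected: string; // what the validator expected instead
}

type Verdict =
  | { approved: true }
  | { approved: false; issues: Issue[] };

type Produce<T> = (feedback: Issue[]) => T;
type Validate<T> = (output: T) => Verdict;

interface LoopResult<T> {
  output: T | null; // null when the loop did not converge
  rejections: number;
}

// Revise and resubmit until approved or the attempt budget runs out.
// maxAttempts is an assumed safeguard, not a documented swarm setting.
function rejectionLoop<T>(
  produce: Produce<T>,
  validate: Validate<T>,
  maxAttempts = 3,
): LoopResult<T> {
  let feedback: Issue[] = [];
  let rejections = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const output = produce(feedback);
    const verdict = validate(output);
    if (verdict.approved) return { output, rejections };
    feedback = verdict.issues; // the next attempt sees exactly what was wrong
    rejections++;
  }
  return { output: null, rejections };
}
```

Returning null after the budget is spent gives the caller an explicit signal that the loop did not converge, which is the point at which a different recovery mechanism has to take over.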

Let’s trace through a real example. A Worker agent submits code for a utility function that parses configuration files. The Work Validator runs the build and the test suite, then responds:

Rejection #1: “Type error on line 47: config.timeout is typed as string but the function returns number. The parseInt call is missing. Additionally, the function doesn’t handle the case where the config file doesn’t exist — no try/catch around the file read.”

The Worker reads this feedback, adds the parseInt call, wraps the file read in a try/catch with a meaningful error message, and resubmits.

Rejection #2: “The type error is fixed and error handling is present. However, the try/catch returns null on failure, but the return type doesn’t include null. Either update the return type to ConfigResult | null or throw a typed error. Also, the function is missing from the module’s public exports.”

The Worker updates the return type, adds the export, and resubmits. This time, the validator confirms: build passes, all tests pass, exports are correct, error handling is sound. Approved.
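The approved version might look something like the sketch below. This is a reconstruction from the validator's feedback, not the actual code: the JSON file format, the shape of ConfigResult, and the injected readFile parameter are assumptions, and the module export mentioned in the second rejection is left out so the snippet stays self-contained.

```typescript
// Hypothetical reconstruction: the post describes the fixes but not the code itself.
interface ConfigResult {
  timeout: number;
}

// readFile is injected so the sketch has no file-system dependency;
// it is expected to throw when the file does not exist.
function parseConfig(
  readFile: (path: string) => string,
  path: string,
): ConfigResult | null {
  try {
    // Assumes a JSON config file whose timeout is stored as a string.
    const raw = JSON.parse(readFile(path)) as { timeout?: string };
    // Fix from rejection #1: convert the string setting to a number.
    const timeout = parseInt(raw.timeout ?? "", 10);
    return Number.isNaN(timeout) ? null : { timeout };
  } catch {
    // Fix from rejection #2: null is now part of the declared return type.
    return null;
  }
}
```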

Three submissions. Two rejections. Each rejection caught real problems that would have made it to production in a less rigorous system. And the entire cycle happened without any human involvement: no Slack messages, no pull request comments, no waiting for a reviewer to come online.

Most failures resolve within the rejection loop, typically in one to three iterations. It’s the swarm’s workhorse recovery mechanism, and it handles the vast majority of issues. For more on how this review process works in practice, see When AI Agents Review Each Other’s Code.

Real Failure Modes

The rejection loop is effective because it handles a wide range of concrete failure types. Here are the most common ones we see in practice:

Bad Code Output

The most straightforward failures: syntax errors, type mismatches, logic bugs. A Worker produces code that doesn’t compile or doesn’t behave correctly. The validator catches these through build checks and test execution, providing exact compiler errors or test failure output.

These are usually one-iteration fixes. The Worker sees the exact error, knows exactly what to change, and gets it right on the second attempt.

Test Failures

Sometimes the code looks correct — it compiles, the logic seems sound — but the test suite disagrees. The validator runs bun test and reports the exact failing tests with their assertion errors. Maybe the function returns items in the wrong order. Maybe an edge case with empty input isn’t handled. Maybe a mock isn’t set up correctly.

These failures are slightly trickier because the Worker needs to understand the test’s expectations, not just the compiler’s. But the feedback is still concrete: here’s the test, here’s what it expected, here’s what it got.

Plan-Implementation Mismatch

The validated plan says “create a REST endpoint that accepts JSON input and returns paginated results.” The Worker creates an endpoint that accepts query parameters and returns unpaginated results. The implementation works — it builds, it even passes some tests — but it doesn’t match the plan that was explicitly validated by the Plan Validator.

The Work Validator catches this by comparing the implementation against the plan and flagging divergences. This isn’t about the code being broken; it’s about the code solving a different problem than the one that was approved.
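A plan comparison of this kind could be sketched as a field-by-field check. The EndpointContract fields below are hypothetical and chosen only to mirror the example above; how the Work Validator actually represents a plan is not described in this post.

```typescript
// Hypothetical contract fields chosen to mirror the example above.
interface EndpointContract {
  input: "json-body" | "query-params";
  paginated: boolean;
}

// Lists every field where the implementation diverges from the validated plan.
function findDivergences(plan: EndpointContract, impl: EndpointContract): string[] {
  const divergences: string[] = [];
  if (plan.input !== impl.input) {
    divergences.push(`input: planned ${plan.input}, got ${impl.input}`);
  }
  if (plan.paginated !== impl.paginated) {
    divergences.push(`paginated: planned ${plan.paginated}, got ${impl.paginated}`);
  }
  return divergences;
}
```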

Missing Dependencies

The Worker references a utility function that doesn’t exist yet, imports a module that hasn’t been installed, or calls an API endpoint that another task hasn’t created. The build fails with import errors or the tests fail with “module not found” messages.

These failures sometimes resolve in the rejection loop — maybe the Worker just used the wrong import path. But sometimes they indicate a deeper coordination issue: the task depends on work that hasn’t been completed yet.
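The distinction between the two cases can be sketched as a simple check against completed work. The Task shape and the classifyMissingDependency function are hypothetical; the post does not say how the swarm tracks task dependencies.

```typescript
// Hypothetical task record; the swarm's real dependency tracking is not shown in the post.
interface Task {
  id: string;
  dependsOn: string[];
}

type DependencyVerdict =
  | { kind: "retry-in-loop" }                            // likely a wrong import path or similar
  | { kind: "coordination-issue"; waitingOn: string[] }; // upstream work is not finished

// If every declared dependency is complete, the failure is probably local
// and can go back through the rejection loop; otherwise it is a coordination issue.
function classifyMissingDependency(task: Task, completed: ReadonlySet<string>): DependencyVerdict {
  const waitingOn = task.dependsOn.filter((id) => !completed.has(id));
  return waitingOn.length === 0
    ? { kind: "retry-in-loop" }
    : { kind: "coordination-issue", waitingOn };
}
```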

Merge Conflicts and File Conflicts

The swarm runs tasks concurrently across multiple worktrees. Occasionally, two Workers modify the same file, and when their changes are integrated, conflicts arise. The build catches these as syntax errors or test failures caused by the conflicting changes.

These are among the hardest failures to resolve in the rejection loop because the Worker may not have visibility into what the other task changed. This is where escalation often becomes necessary.
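Detecting the overlap itself is straightforward if each task's modified files are known. The sketch below assumes such a record exists; TouchedFiles and findFileConflicts are hypothetical names, and the post does not describe how the swarm actually detects conflicts beyond build and test failures.

```typescript
// Hypothetical record of which files each concurrent task has modified.
type TouchedFiles = Record<string, string[]>; // taskId -> file paths

// Returns each file modified by more than one task, mapped to the tasks involved.
function findFileConflicts(touched: TouchedFiles): Map<string, string[]> {
  const ownersByFile = new Map<string, string[]>();
  for (const taskId of Object.keys(touched)) {
    for (const file of touched[taskId] ?? []) {
      const owners = ownersByFile.get(file) ?? [];
      // Count each task once per file, even if it lists the file twice.
      if (owners.indexOf(taskId) === -1) owners.push(taskId);
      ownersByFile.set(file, owners);
    }
  }
  const conflicts = new Map<string, string[]>();
  ownersByFile.forEach((owners, file) => {
    if (owners.length > 1) conflicts.set(file, owners);
  });
  return conflicts;
}
```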

Environment Constraints

The plan didn’t account for a runtime constraint. Code that works in development fails in production because the two environments differ: SSR versus client-side rendering, missing environment variables, different file system permissions, or API rate limits that don’t apply locally.

As in the story told in When the Swarm Hits a Wall, these failures often require rethinking the approach rather than just fixing a bug.

When Retries Aren’t Enough — Escalation

The rejection loop handles most failures, but sometimes a Worker keeps failing because the problem is genuinely beyond its scope. The code it needs to write depends on a design decision that wasn’t made. The tests it needs to pass require a database migration that wasn’t part of its task. The build error traces back to a bug in a shared library.

The swarm has a strict rule for this situation: agents must try at least three different approaches before escalating. This isn’t arbitrary gatekeeping — it’s a design principle that prevents premature escalation. Most problems that feel impossible on the first attempt become solvable when you try a different angle.

But when three genuine attempts have failed, escalation is the right call. And the swarm has clear rules about how to do it.

First, escalation goes exactly one level up. A Task Agent escalates to its Coordinator — not to the Project Manager, not to the Overseer, and definitely not to a human. The hierarchy exists for a reason: each level has the context and authority to handle problems at its scope.

Second, escalation must include evidence. The agent documents what it tried, why each approach failed, and what it believes the root cause is. “I can’t do this” is not an acceptable escalation. “I tried approaches A, B, and C; here’s the specific error each one produced; I believe the root cause is X” — that’s an escalation the Coordinator can act on.
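These two rules lend themselves to a mechanical check. The sketch below is one possible encoding; the Attempt and Escalation shapes and the escalationProblems function are hypothetical, and only the rules themselves (three different approaches, evidence, one level up) come from the text above.

```typescript
// Hypothetical shapes; the post states the rules but not the data format.
interface Attempt {
  approach: string;
  error: string; // the specific error this approach produced
}

interface Escalation {
  from: string;
  to: string;
  attempts: Attempt[];
  suspectedRootCause: string;
}

const MIN_DISTINCT_APPROACHES = 3; // "at least three different approaches"

// Returns the reasons an escalation would be refused; an empty list means it is acceptable.
function escalationProblems(e: Escalation, directSupervisor: string): string[] {
  const problems: string[] = [];
  const distinct = new Set(e.attempts.map((a) => a.approach.trim().toLowerCase()));
  if (distinct.size < MIN_DISTINCT_APPROACHES) {
    problems.push(`only ${distinct.size} distinct approaches tried`);
  }
  if (e.attempts.some((a) => a.error.trim() === "")) {
    problems.push("every attempt needs the specific error it produced");
  }
  if (e.suspectedRootCause.trim() === "") {
    problems.push("missing suspected root cause");
  }
  if (e.to !== directSupervisor) {
    problems.push("escalation must go exactly one level up");
  }
  return problems;
}
```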

The Coordinator doesn’t just relay the problem upward. It analyzes the situation and decides on an intervention. Its options include:

Sending the task back with better instructions — maybe the original task description was ambiguous, and clarifying it is enough to unblock the Worker.
Creating a supporting research task — maybe the Worker needs information that wasn’t gathered during the research phase. A new Researcher can investigate the specific blocker.
Modifying the plan — maybe the validated plan has a flaw that only became apparent during implementation. The Coordinator can adjust the approach.
Re-scoping the task entirely — maybe the task as defined is too broad, too narrow, or simply aimed at the wrong target.
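The four options above map naturally onto a discriminated union, which forces every handler to account for each kind of intervention. The type and field names here are hypothetical; only the four options themselves come from the list.

```typescript
// Hypothetical encoding of the four Coordinator options listed above.
type Intervention =
  | { kind: "clarify-instructions"; instructions: string }
  | { kind: "research-task"; question: string }
  | { kind: "modify-plan"; change: string }
  | { kind: "re-scope"; newScope: string };

// Exhaustive switch: the compiler flags this function if a new kind is added
// to Intervention without a matching case.
function describeIntervention(i: Intervention): string {
  switch (i.kind) {
    case "clarify-instructions":
      return `Send the task back with instructions: ${i.instructions}`;
    case "research-task":
      return `Create a research task to answer: ${i.question}`;
    case "modify-plan":
      return `Modify the plan: ${i.change}`;
    case "re-scope":
      return `Re-scope the task to: ${i.newScope}`;
  }
}
```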

This layered decision-making is what Why AI Agents Need Bureaucracy describes as productive bureaucracy — structure that enables recovery rather than blocking progress.

Task Re-creation — The Nuclear Option

Sometimes a task is fundamentally flawed. The research was based on incorrect assumptions. The plan took an approach that turns out to be incompatible with the existing architecture. The requirements contradicted each other in a way that only became visible during implementation.

When this happens, the Coordinator can make a decisive call: delete the task and recreate it from scratch.

This sounds expensive, and in a sense it is — the work done so far is discarded, and a fresh pipeline run begins. But consider the alternative: continuing to pour effort into a task built on a broken foundation. That’s more expensive. The sunk cost fallacy is real in software engineering, and the swarm is designed to avoid it.

The recreated task isn’t starting from zero, though. The Coordinator writes new task instructions that are informed by everything that went wrong. The failed attempt produced valuable information: which approaches don’t work, which assumptions were incorrect, which constraints weren’t accounted for. All of that feeds into the new task’s instructions.

It’s the same thing a human team does when they say “let’s start over with a different approach” — except in the swarm, it happens through a structured process rather than an ad-hoc decision. The Coordinator documents why the original task failed, what the new approach should be, and what specific pitfalls to avoid. The new Researcher and Planner agents read this context and produce output that accounts for the lessons learned.
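A re-creation step along these lines might carry the lessons forward as part of the new instructions. The FailedTask and NewTask shapes and the recreateTask function are hypothetical; the post describes what the Coordinator documents, not how it is stored.

```typescript
// Hypothetical records; field names are assumptions based on what the Coordinator documents.
interface FailedTask {
  id: string;
  title: string;
  failureReason: string;
  failedApproaches: string[];
  invalidAssumptions: string[];
}

interface NewTask {
  title: string;
  instructions: string;
  supersedes: string; // id of the deleted task
}

// Builds fresh instructions that carry forward what the failed attempt revealed.
function recreateTask(failed: FailedTask, newApproach: string): NewTask {
  const lines = [
    `Approach: ${newApproach}`,
    `Why the previous attempt (${failed.id}) failed: ${failed.failureReason}`,
    ...failed.failedApproaches.map((a) => `Avoid: ${a}`),
    ...failed.invalidAssumptions.map((a) => `Do not assume: ${a}`),
  ];
  return { title: failed.title, instructions: lines.join("\n"), supersedes: failed.id };
}
```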

The nuclear option isn’t common. Most issues resolve in the rejection loop or through Coordinator intervention. But having it as an option means the system is never permanently stuck on a bad path.

Self-Healing — The Resilience Pattern

Step back from any individual failure and look at the system as a whole. What emerges is a layered resilience architecture — each layer catching failures that slip past the one before it.

Layer 1: The Rejection Loop. Validators catch errors in code, plans, and research. The producing agent gets immediate, specific feedback and iterates. This handles the vast majority of failures — typos, type errors, missing edge cases, incomplete implementations. It’s fast, local, and self-contained.

Layer 2: Escalation. When the rejection loop can’t converge on a solution, the problem moves up one level to an agent with broader context and authority. The Coordinator can unblock the task through better instructions, additional research, or plan modifications. This handles problems that are real but solvable with more information or a slightly different approach.

Layer 3: Task Re-creation. When the task itself is the problem — wrong approach, incorrect assumptions, contradictory requirements — the Coordinator can start fresh with the benefit of hindsight. This handles fundamental framing errors that no amount of iteration on the current path will fix.

Layer 4: Knowledge Persistence. And here’s what makes the system truly self-healing rather than just self-recovering: when a failure reveals something important — a constraint that wasn’t documented, a pattern that doesn’t work, an architectural decision that future tasks need to respect — that knowledge gets written to the Knowledge Base. As described in When the Swarm Hits a Wall, KB entries ensure that the same failure doesn’t happen twice. The next time a Researcher gathers context for a similar task, they’ll find the lesson learned from this failure already waiting for them.
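As a rough illustration of the write-then-read pattern, the sketch below stores lessons with tags and returns the ones relevant to a later task. The KbEntry shape, the KnowledgeBase class, and tag-based matching are all assumptions; the real Knowledge Base interface is not described in this post.

```typescript
// Hypothetical entry format; tag-based matching is an assumption for illustration.
interface KbEntry {
  topic: string;
  lesson: string;
  tags: string[];
}

class KnowledgeBase {
  private entries: KbEntry[] = [];

  // Called after a failure reveals something future tasks need to respect.
  record(entry: KbEntry): void {
    this.entries.push(entry);
  }

  // Called by a Researcher gathering context: returns entries sharing at least one tag.
  lookup(tags: string[]): KbEntry[] {
    return this.entries.filter((e) => e.tags.some((t) => tags.indexOf(t) !== -1));
  }
}
```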

The critical property of this architecture is graceful degradation. A failed task doesn’t cascade into system-wide failure. It doesn’t block other tasks. It doesn’t corrupt shared state. It triggers a contained recovery process that either fixes the problem locally or escalates it through well-defined channels. The rest of the swarm continues working while recovery happens.

Human intervention is the last resort, not the first response. The system has three layers of autonomous recovery to exhaust, backed by a knowledge base that keeps the same failure from recurring, before it reaches the point where a human needs to get involved. In practice, we rarely reach that point.

What This Means

The swarm’s approach to failure reflects a core design philosophy: failure is expected and planned for, not treated as exceptional. Every production system fails. The question isn’t whether your agents will produce bad output — they will. The question is what happens next.

This isn’t a new idea in software engineering. CI/CD pipelines, automated test suites, code review processes, canary deployments — all of these are failure-handling mechanisms. They exist because the industry learned, through painful experience, that preventing all failures is impossible. What you can do is make failures cheap, detectable, and recoverable.

The swarm applies this same principle at the agent level. Validators are automated reviewers. The rejection loop is a CI pipeline for agent output. Escalation is an incident response process. Task re-creation is the “revert and try again” pattern. None of these ideas are revolutionary in isolation. What’s different is that they’re all running autonomously, consistently, without fatigue or shortcuts.

A human reviewer might rubber-stamp a pull request at 5 PM on a Friday. A human team might skip the “should we start over?” conversation because nobody wants to throw away a week of work. A human engineer might escalate a problem without trying three different approaches first because it’s easier to ask than to investigate.

The swarm doesn’t take shortcuts. Not because it’s more disciplined, but because the structure doesn’t allow it. The validator runs every time. The escalation rules apply to every agent. The rejection loop doesn’t get tired.

Resilience isn’t about preventing failure — it’s about making failure cheap, detectable, and recoverable. That’s the pipeline. That’s how the swarm heals itself.

— The Sulphur Team