Something’s Wrong With the GPU

The first sign was the fans. They were loud — louder than they should have been, even during a training run. A quick check of GPU utilization showed everything pinned at 100%, memory nearly full, processes fighting for every last megabyte.

That wasn’t unusual during training. What was unusual was the process list. Instead of one process training the NGI model, there were six: six identical processes, all launched within minutes of each other, all competing for the same GPU, all doing exactly the same work. None of them was making good progress, because they were tripping over each other for resources.

Someone had told the swarm to start a training run. And the swarm had enthusiastically complied — six times over.

The Agent That Follows Orders Too Well

To understand how this happened, you need to know about the action executor. It’s a swarm agent whose job is to carry out real-world actions: run scripts, launch processes, execute commands. When the swarm decides something needs to happen on an actual machine (not just writing code or making plans, but actually running something), the action executor is the agent that makes it real.

On this particular day, the action executor received a straightforward instruction: start a training run on the NGI project. This meant running a specific script that initializes the model, loads the dataset, and begins the training loop. Simple enough. The kind of thing a human would do by typing one command into a terminal.

The action executor did exactly what it was told. It ran the command. Training started. So far, so good.

Retry Logic Meets a Slow Process

Here’s where things went sideways. The action executor, like any well-designed system component, had retry logic. If an action didn’t appear to succeed, the executor would try again. This is standard practice in distributed systems — network calls fail, processes hang, transient errors happen. Retrying is usually the right thing to do.

But training a neural network isn’t like making an API call. It doesn’t return a quick success response. The process starts, and then it runs for hours. From the executor’s perspective, it launched the command and waited for confirmation that everything was working. When that confirmation didn’t arrive fast enough — the training process was busy initializing, loading data, warming up — the executor concluded something might have gone wrong.

So it tried again. Another training process spun up. The executor still didn’t get the quick confirmation it was looking for, because now two processes were competing for GPU resources and both were slower than expected. So it tried again. And again.

By the time the retry logic exhausted itself, six training processes were running simultaneously. Each one was technically doing its job — loading the same data, computing the same gradients, writing to the same output directories. They were just doing it on top of each other, turning what should have been an efficient training run into a GPU cage match where nobody wins.
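The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the executor's actual code: the function name, the `launch` and `confirmed` callables, and the limit of six attempts are all assumptions chosen to mirror the story.

```python
from typing import Callable, List


def naive_launch_with_retry(
    launch: Callable[[], int],
    confirmed: Callable[[int], bool],
    max_attempts: int = 6,  # assumed limit, picked to match the incident
) -> List[int]:
    """Launch, look for a quick confirmation, and relaunch if it is missing.

    Returns the handle of every process that was started. Nothing here asks
    whether an earlier attempt is still alive, which is the flaw.
    """
    started: List[int] = []
    for _ in range(max_attempts):
        handle = launch()
        started.append(handle)
        if confirmed(handle):
            break
    return started


# Simulated slow-starting job: it never confirms inside the retry window.
_handles = iter(range(1, 100))
started = naive_launch_with_retry(
    launch=lambda: next(_handles),
    confirmed=lambda handle: False,
)
print(len(started))
```

Every pass through the loop is locally reasonable. The problem is that a missing confirmation is treated as a failed launch, when for a long-running job it usually just means "still starting."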

The irony is that every individual action the executor took was correct. It was told to start training, and it started training. Six times. Each retry was a reasonable response to apparent failure. The executor was following its instructions faithfully. It just had no concept of whether the action had already been accomplished.

The Idempotency Problem, Wearing a New Hat

If you’ve built distributed systems, you’re already nodding. This is the idempotency problem — the principle that performing the same operation multiple times should produce the same result as performing it once. It’s one of those concepts that sounds obvious when you state it, but gets missed in practice constantly.

Making an API endpoint idempotent is well-understood. You check whether the resource already exists before creating it. You use unique identifiers to deduplicate requests. These are solved problems with established patterns. But when the “operation” is “launch a long-running GPU process,” and the agent performing it doesn’t have a way to check whether that process is already running, the established patterns don’t automatically apply.

The action executor was built to execute actions. It wasn’t built to reason about the state of the world before and after those actions. That gap — between executing a command and understanding its effect — is where the six training runs lived.

The Fix Was Simple. The Lesson Wasn’t.

The immediate fix was straightforward. Before launching any process, the action executor now checks whether that process is already running. Before starting a training run, it looks for existing training processes on the GPU. If one is already active, it reports back that the action has already been taken, rather than blindly launching another instance.
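A check of this kind might look roughly like the following. It is a sketch under stated assumptions: `start_once`, `running_commands`, and the `train.py --project ngi` command line are hypothetical names for illustration, and the process listing relies on a POSIX `ps`.

```python
import subprocess
from typing import Callable, List


def running_commands() -> List[str]:
    """Return the command line of every running process (POSIX `ps` only)."""
    out = subprocess.run(
        ["ps", "-A", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def start_once(
    marker: str,
    launch: Callable[[], None],
    list_commands: Callable[[], List[str]] = running_commands,
) -> str:
    """Launch only if no running command line contains `marker`."""
    if any(marker in cmd for cmd in list_commands()):
        return "already_running"
    launch()
    return "launched"


# Demo against a simulated process table, so nothing real is started.
fake_table: List[str] = []


def fake_launch() -> None:
    fake_table.append("python train.py --project ngi")  # hypothetical command


first = start_once("train.py --project ngi", fake_launch, lambda: fake_table)
second = start_once("train.py --project ngi", fake_launch, lambda: fake_table)
print(first, second)
```

Two caveats apply to a check like this. It is check-then-act, so two executors racing each other could both pass the check before either launches; a lock file or the deduplication layer is what closes that window. And matching on a substring of the command line is crude, since any unrelated process whose arguments happen to contain the marker will also count as "already running."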

Beyond that specific check, the team added a broader deduplication layer. Actions now carry unique identifiers, and the executor tracks which actions it has already attempted. If the same action comes in twice — whether from retry logic or from two different agents independently deciding the same thing needs to happen — the executor recognizes the duplicate and skips it.
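One way to sketch such a layer is to derive the identifier from the content of the action, so that two agents requesting the same thing independently produce the same key. Everything named here (`action_key`, `DedupExecutor`, the field names) is hypothetical, and the in-memory dictionary stands in for whatever store the real executor uses.

```python
import hashlib
import json
from typing import Callable, Dict


def action_key(kind: str, target: str, params: dict) -> str:
    """Derive a stable identifier from what the action does, not who sent it."""
    payload = json.dumps(
        {"kind": kind, "target": target, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DedupExecutor:
    """Runs each distinct action at most once per executor lifetime."""

    def __init__(self) -> None:
        self.attempted: Dict[str, str] = {}

    def submit(self, key: str, run: Callable[[], str]) -> str:
        if key in self.attempted:
            return "duplicate_skipped"
        # Record the attempt before running, so a retry that arrives while
        # the action is still in progress is also recognized as a duplicate.
        self.attempted[key] = "in_progress"
        result = run()
        self.attempted[key] = result
        return result


launches = []
executor = DedupExecutor()

# Two agents independently request the same training run.
key_a = action_key("start_training", "ngi", {"epochs": 10})
key_b = action_key("start_training", "ngi", {"epochs": 10})

outcome_a = executor.submit(key_a, lambda: launches.append("run") or "launched")
outcome_b = executor.submit(key_b, lambda: launches.append("run") or "launched")
print(outcome_a, outcome_b, len(launches))
```

A content-derived key also means a legitimately repeated action (a second training run next week) needs something extra in the key, such as a run label, or it will be skipped as a duplicate. That trade-off is a design choice, not something this sketch resolves.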

There’s also a state verification step now. After taking an action, the executor doesn’t just fire and forget. It actively checks that the expected outcome is materializing: the process appeared in the process list, GPU utilization started climbing, the log file started accumulating output. This gives the executor actual evidence of success, rather than inferring failure from how quickly a confirmation comes back.
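The verification step can be sketched as polling a set of named evidence checks until they all pass or a deadline expires. The function name, the check names, and the timeouts below are illustrative assumptions; a real version would plug in an actual process-list lookup and GPU query where this demo uses stand-ins.

```python
import os
import tempfile
import time
from typing import Callable, Dict


def verify_started(
    checks: Dict[str, Callable[[], bool]],
    timeout_s: float = 300.0,  # assumed: generous, since startup is slow
    poll_s: float = 5.0,
) -> Dict[str, bool]:
    """Poll each evidence check until all pass or the deadline expires.

    Returns which checks passed, so the caller can tell "still starting"
    from "nothing is happening" instead of blindly relaunching.
    """
    deadline = time.monotonic() + timeout_s
    results = {name: False for name in checks}
    while True:
        for name, check in checks.items():
            if not results[name]:
                results[name] = bool(check())
        if all(results.values()) or time.monotonic() >= deadline:
            return results
        time.sleep(poll_s)


# Demo: a temporary file stands in for the training log.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as log:
    log.write("epoch 0 step 1\n")
    log_path = log.name

evidence = verify_started(
    {
        "process_listed": lambda: True,  # stand-in for a process-list lookup
        "log_has_output": lambda: os.path.getsize(log_path) > 0,
    },
    timeout_s=1.0,
    poll_s=0.1,
)
os.unlink(log_path)
print(evidence)
```

The important property is that a timeout produces a report of which evidence is missing rather than an automatic relaunch. Deciding what to do with a partial result is left to the caller.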

These are all small changes. A few conditional checks, a lookup table, some process monitoring. None of it was technically difficult. The code was trivial.

But the lesson behind the code wasn’t trivial at all.

Autonomous Means Accountable

When a human runs a training script, they naturally do a dozen things the action executor didn’t. They glance at htop to make sure nothing else is running. They check whether a previous run left checkpoint files behind. They notice if the GPU fans are already spinning before they even type the command. Humans carry an ambient awareness of system state that is so automatic we don’t even think of it as a skill.

Autonomous agents don’t have that ambient awareness by default. They have to be given it, explicitly, one check at a time. Every assumption a human makes unconsciously — “I should look before I leap” — has to be encoded as an actual step in the agent’s process. The gap between “can execute commands” and “can responsibly execute commands” is larger than it appears from the outside.

This incident happened because the system was good enough to take real-world action but not yet good enough to understand the consequences of that action in context. That’s a dangerous middle ground. A system that can’t do anything is safe. A system that fully understands what it’s doing is also safe. The risk lives in between — capable enough to act, not yet wise enough to know when not to.

We’re building the Sulphur Swarm to operate autonomously, and we mean that seriously. Autonomous doesn’t just mean “runs without a human clicking buttons.” It means the system is responsible for the outcomes of its actions, which means it needs to verify the state of the world before changing it, confirm the change after making it, and recognize when “doing nothing” is the correct action.

Six simultaneous training runs cost us a few hours of wasted GPU time and some overheated hardware. The stakes were low. But the pattern — an agent that acts without checking whether it should — scales to much more consequential scenarios. We’d rather learn this lesson on a training script than on something that matters more.

The fans are quieter now. One process at a time. The swarm is still learning, and so are we.

The Sulphur Team