Why DTP?
We had a world model. An RSSM-based architecture that maintained a latent state of the world, updated it with new observations, and — in theory — could learn to predict what comes next. We had the predictive coding hierarchy wired up above it, driving perception and representation learning through local prediction error minimization. The pieces were on the board.
But there was a gap. A fundamental one. How do you train the world model itself in a biologically plausible way?
Backpropagation was off the table. We’d already committed to that principle across the entire architecture — no global loss, no omniscient optimizer threading gradients backward through every layer. The predictive coding hierarchy learned locally. The world model needed to learn locally too.
Enter Difference Target Propagation. DTP is a top-down learning algorithm in which each layer receives a target from the layer above: not a gradient, but a concrete state it should move toward. The key insight is the “difference” part. Pushing raw targets down through learned inverse mappings lets each mapping’s reconstruction error accumulate layer by layer, so DTP instead computes a corrected target: the layer’s current state plus the difference between the inverse mapping applied to the upper layer’s target and the same mapping applied to the upper layer’s current activation. Each layer learns to reach its target, and the targets cascade downward through the network. No backward pass. No global loss. Just layers adjusting toward locally computed goals.
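To make the “difference” correction concrete, here is a minimal sketch on scalar layers. The names f_up and g_down and their coefficients are hypothetical stand-ins for a learned forward mapping and its imperfect learned inverse; this is not our training code, only the target rule in isolation.

```python
# Minimal sketch of the DTP target rule on scalar "layers".
# f_up / g_down are illustrative assumptions, not functions from the real system.

def f_up(h):
    """Forward mapping from a layer to the layer above."""
    return 2.0 * h + 1.0

def g_down(h_above):
    """Approximate inverse of f_up, deliberately imperfect
    (the exact inverse would be (h_above - 1) / 2)."""
    return 0.45 * h_above - 0.3

def raw_target(t_above):
    """Plain target propagation: push the upper target through the inverse."""
    return g_down(t_above)

def difference_target(h, h_above, t_above):
    """DTP: current state plus the difference of the two inverse evaluations."""
    return h + (g_down(t_above) - g_down(h_above))

h = 1.0
h_above = f_up(h)
t_above = h_above  # the upper layer is already where it wants to be

# The raw target drifts away from h purely because g_down is imperfect.
print("raw target:", raw_target(t_above))
# The difference target leaves h untouched when nothing above needs to change.
print("difference target:", difference_target(h, h_above, t_above))
```

The property to notice is the second print: when the upper layer’s target equals its current activation, the inverse’s error cancels and the lower layer is asked to stay put.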
It was exactly what we needed. In theory.
The Debugging Marathon
Theory and implementation are separated by a canyon, and we spent weeks building the bridge.
Bug 1: Vanishing Targets
The first sign of trouble was subtle. Training was running. No crashes, no NaNs, no obvious errors. But the world model wasn’t learning. At all. The metrics were flatlines — perfectly horizontal, step after step, as if the model had decided that ignorance was bliss.
We started instrumenting. Layer by layer, we traced the target propagation chain from the top of the network downward. The top layers looked fine — targets were well-formed, carrying meaningful correction signals. Layer four, reasonable. Layer three, getting smaller. Layer two, barely nonzero. By the time the targets reached the world model’s core state layers, they were effectively zero. The learning signal was evaporating on its way down.
The root cause was a compounding contraction in the inverse mapping functions that DTP uses to project targets between layers. Each layer’s inverse function shrank the correction slightly, and the shrinkage multiplies along the chain: with five layers, a contraction of 0.8 per layer becomes 0.8⁵ ≈ 0.33, and a stronger per-layer contraction decays far faster. Meaningful signal at the top became noise at the bottom. We restructured the inverse mappings with residual connections and careful normalization, ensuring that target magnitude was preserved through the full propagation chain. The flatlines started moving.
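The arithmetic is easy to check. In the sketch below each inverse mapping is reduced to a single scalar gain, which is an assumption made purely for illustration; propagate_normalized is a hypothetical stand-in for the idea of preserving magnitude, not the residual-plus-normalization code we actually shipped.

```python
# Illustrative only: each inverse mapping is modelled as one scalar gain.

def propagate(correction, gains):
    """Pass a target correction down through a chain of contractive mappings."""
    for gain in gains:
        correction *= gain
    return correction

def propagate_normalized(correction, gains):
    """Same chain, but rescale after every layer so the magnitude
    that came in is the magnitude that goes out."""
    for gain in gains:
        incoming = abs(correction)
        correction *= gain
        if correction != 0.0:
            correction *= incoming / abs(correction)
    return correction

gains = [0.8] * 5
shrunk = propagate(1.0, gains)
kept = propagate_normalized(1.0, gains)
print("after five contractive layers:", shrunk)
print("with per-layer rescaling:", kept)
```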
Bug 2: The Gradient Routing Bug
This one cost us the most time, and it was the most maddening because the system appeared to be working.
After fixing the vanishing targets, we saw learning. Metrics were moving. Entropy was dropping. We celebrated prematurely.
Then we looked more closely at what the world model was learning, and it was exclusively language patterns. The world model maintains a multi-slot state — 512 slots representing different aspects of its understanding: visual features, spatial reasoning, temporal patterns, linguistic structure, abstract relationships. But the learned representations were clustered entirely in the language-associated slots. The vision slots were static. The reasoning slots were untouched. The memory slots might as well not have existed.
We spent three days convinced this was a data problem. Maybe the training data was language-heavy. Maybe the tokenization was biased. We tried different data mixes, different preprocessing. Nothing changed.
The actual bug was in the target computation routing. When constructing DTP targets for the world model’s state, the code was indexing into the state tensor using a slice that corresponded to the language modality slots — and only those slots. The other 480-odd slots were receiving zero targets. The world model could only learn from one narrow channel because we were only asking it to learn from one narrow channel.
The fix was a single indexing correction. The targets now addressed all 512 slots. The effect was immediate and dramatic: the world model began developing rich, multi-modal representations across its entire state space. Learning wasn’t just faster — it was qualitatively different. The model was finally using its full capacity.
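The shape of the bug looks roughly like this. The slot layout (language in the first 32 slots) and the function names are assumptions for illustration; the real slot ranges and target code differ.

```python
NUM_SLOTS = 512
LANGUAGE_SLOTS = slice(0, 32)  # hypothetical layout, chosen only for this sketch

def build_targets_buggy(state, corrections):
    """Only the language slice gets a real target; every other slot stays at zero."""
    targets = [0.0] * len(state)
    targets[LANGUAGE_SLOTS] = [
        s + c for s, c in zip(state[LANGUAGE_SLOTS], corrections[LANGUAGE_SLOTS])
    ]
    return targets

def build_targets_fixed(state, corrections):
    """Every slot receives its corrected target."""
    return [s + c for s, c in zip(state, corrections)]

state = [1.0] * NUM_SLOTS
corrections = [0.1] * NUM_SLOTS

buggy = build_targets_buggy(state, corrections)
fixed = build_targets_fixed(state, corrections)
print("slots with a target (buggy):", sum(1 for t in buggy if t != 0.0))
print("slots with a target (fixed):", sum(1 for t in fixed if t != 0.0))
```

Nothing in the buggy version crashes or warns, which is why it survived so long: the output has the right shape and the language slots really do learn.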
Bug 3: HPP Instability at 15K Steps
The third bug was the sneakiest. It was a time bomb.
Training would proceed beautifully for fourteen thousand steps. Smooth loss curves. Stable settling dynamics. Precision weights converging to sensible values. Everything by the book. Then, somewhere between step 14,500 and step 15,500, the Hierarchical Predictive Processing system would destabilize. Prediction errors would begin oscillating. Precision weights would diverge — some spiking toward infinity, others collapsing to zero. Within a few hundred steps, the entire hierarchy was in chaos, and the world model’s learning signal was corrupted beyond recovery.
Debugging a system that works perfectly for fifteen thousand steps and then explodes is its own special kind of misery. You can’t reproduce it quickly — every test run requires hours of training just to reach the failure point. Print statements at step 100 tell you nothing. You need instrumentation that captures the slow drift toward instability, the gradual accumulation of whatever pathological dynamic triggers the collapse.
What we found was a feedback resonance in the precision update rule. Precision weights at adjacent layers were coupled through the prediction error signals, and over thousands of steps, small oscillations in one layer’s precision would amplify through the coupling until the system hit a critical point and diverged. The fix was a damping term in the precision update — a gentle regularizer that prevented precision weights from oscillating faster than the representations they were modulating could adapt. The 15K wall disappeared. Training sailed past it without a hiccup.
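A toy model shows why this kind of failure hides for so long. Everything here is an assumption for illustration: two precision deviations coupled in a loop, a coupling gain barely above 1, and damping implemented as blending the previous value back into each update. It is not our precision update rule, only the same qualitative behaviour.

```python
import math

def final_amplitude(steps, gain, damping):
    """Two precision deviations that drive each other in a loop.
    damping=0.0 applies the raw coupled update; a value in (0, 1)
    keeps that fraction of the previous value on every step."""
    d1, d2 = 0.01, 0.0
    for _ in range(steps):
        raw1, raw2 = -gain * d2, gain * d1  # each deviation is pushed by the other
        d1 = damping * d1 + (1.0 - damping) * raw1
        d2 = damping * d2 + (1.0 - damping) * raw2
    return math.hypot(d1, d2)

early = final_amplitude(1000, gain=1.0005, damping=0.0)
undamped = final_amplitude(15000, gain=1.0005, damping=0.0)
damped = final_amplitude(15000, gain=1.0005, damping=0.1)
print("undamped, step 1,000:", early)
print("undamped, step 15,000:", undamped)
print("damped, step 15,000:", damped)
```

With a per-step growth of only 0.05%, the oscillation is still tiny after a thousand steps and enormous after fifteen thousand, while a small amount of damping makes it decay instead.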
The ulimit Kills
As if the algorithmic bugs weren’t enough, we had a running battle with the operating system itself.
Long training runs — the kind you need when you’re debugging issues that only appear at step 15,000 — were being killed by system resource limits. The process would simply vanish. No error message, no stack trace, no core dump. Just gone. Hours of training, evaporated.
The culprits were ulimit resource limits and the kernel’s OOM killer. Our training process, with full instrumentation and logging enabled for debugging, would gradually consume more memory as it accumulated metrics buffers and checkpoint data. Eventually it would cross a limit the system enforced, and the process would be terminated without ceremony.
The solution was segmented checkpoint-resume. Instead of running one monolithic training session, we broke training into segments of configurable length. At the end of each segment, the full training state — model weights, optimizer state, precision weights, settling dynamics, step counter, everything — was serialized to a checkpoint. The next segment would load the checkpoint and continue seamlessly. If a segment got killed, we lost at most one segment’s worth of work, and the previous checkpoint was always intact.
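The pattern can be sketched in a few lines. The state here is a toy dictionary saved as JSON and the file name is made up; a real run would serialize model weights, optimizer state, precision weights and the rest with the framework’s own tools. The part worth copying is writing to a temporary file and renaming it, so a kill in the middle of a save cannot damage the previous checkpoint.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file, then rename over the old checkpoint in one step."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)

def run_segment(path, segment_steps):
    """Resume from the latest checkpoint, train one segment, checkpoint again."""
    state = load_checkpoint(path)
    for _ in range(segment_steps):
        state["weight"] += 0.5  # placeholder for a real training step
        state["step"] += 1
    save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "train_state.json")
    for _ in range(3):  # three segments, each standing in for a separate process
        final_state = run_segment(ckpt, segment_steps=1000)
    print("resumed through step", final_state["step"])
```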
It’s not the kind of work that shows up in a paper. But it’s the kind of work that makes the paper possible.
The Breakthrough — The Numbers That Proved It
After weeks of debugging — the vanishing targets, the routing bug, the HPP instability, the ulimit kills — we finally had a clean training run. No bugs, no crashes, no system kills. Just the world model, learning through Difference Target Propagation, running for tens of thousands of steps.
And the numbers moved.
Entropy: 3.47 → 3.03. The world model’s predictions were becoming more confident. More structured. Entropy measures the uncertainty in the model’s output distribution — lower entropy means the model is concentrating its probability mass on fewer, more specific predictions rather than hedging across everything. A drop from 3.47 to 3.03 meant the world model was learning genuine structure in the data. It was developing expectations about what comes next, and those expectations were becoming sharper with training.
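For readers who want the definition in code, this is Shannon entropy on a toy distribution. The use of natural logarithms (nats) is an assumption of the sketch, and the four-outcome distributions are invented for illustration; they are not our model’s outputs.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]  # hedging equally across everything
peaked = [0.7, 0.1, 0.1, 0.1]       # probability mass concentrated on one outcome

print("uniform:", entropy(uniform))
print("peaked:", entropy(peaked))
```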
KL divergence: 0.001 → 0.144. This was the number that made us sit up. KL divergence measures how far the world model’s learned latent distribution has moved from its prior, the default, uninformed distribution it started with. A KL of 0.001 means the model has learned almost nothing: its latent distribution is nearly indistinguishable from the prior, so the latent state carries essentially no information about the data. A KL of 0.144 means the model is encoding meaningful information in its latent state. The learned distribution now differs measurably from the prior because the model has discovered structure worth representing.
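As a reference point, here is the closed-form KL for one common setup. The sketch assumes a diagonal Gaussian latent measured against a standard normal prior, which is an assumption for illustration and may not match the latent parameterization in our RSSM.

```python
import math

def kl_to_standard_normal(means, stds):
    """KL( N(mean, std^2) || N(0, 1) ) in nats, summed over independent dimensions."""
    return sum(
        0.5 * (m * m + s * s - 1.0) - math.log(s)
        for m, s in zip(means, stds)
    )

# A latent that still matches the prior carries no information.
print("at the prior:", kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
# Shifting a mean away from zero is enough to make the KL positive.
print("shifted mean:", kl_to_standard_normal([0.5, 0.0], [1.0, 1.0]))
```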
These two metrics together told an unambiguous story: the world model was learning. Through DTP. Without backpropagation. Using only top-down targets propagated through the hierarchy. The latent representations were becoming information-rich (rising KL) and the predictions were becoming confident and structured (falling entropy).
The moment we saw the KL curve inflect upward — after weeks of staring at flatlines and debugging phantom bugs — was one of those rare moments where months of work compress into a single data point that changes everything. The world model can learn. Not theoretically. Not on a whiteboard. In practice, in code, with numbers we could point to.
The Bitter Pill
But we couldn’t read the output.
Despite the world model demonstrably learning — the metrics were unambiguous — the text it generated remained incoherent. Fragments of words. Garbled syntax. Occasional flashes of structure that dissolved into noise within a few tokens.
The world model was building better internal representations. We could prove that. But the pathway from those internal representations to readable text — the decoder — was the bottleneck. The decoder was a relatively simple projection from the world model’s latent state to token probabilities, and it couldn’t faithfully translate the rich, multi-modal representations the world model was developing into coherent language output.
This isn’t a failure of DTP. It isn’t a failure of the world model. The learning is real. The internal representations carry genuine information. But the last mile — from learned representation to readable output — needs a fundamentally different approach. A simple linear decoder can’t bridge the gap between a 512-slot multi-modal world state and the sequential, structured nature of language.
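To show what “simple linear decoder” means here, the sketch below maps a latent vector to token probabilities with one matrix-vector product and a softmax. The sizes, weights and function name are invented for illustration and are far smaller than the real 512-slot state.

```python
import math

def linear_decoder(latent, weights, bias):
    """One matrix-vector product followed by a softmax over the vocabulary."""
    logits = [
        sum(w * z for w, z in zip(row, latent)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

latent = [0.2, -1.0, 0.5]          # toy latent state
weights = [[1.0, 0.0, 0.0],        # one row per vocabulary entry
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [0.5, 0.5, 0.5]]
bias = [0.0, 0.0, 0.0, 0.0]

token_probs = linear_decoder(latent, weights, bias)
print(token_probs)
```

Each token’s score is a fixed weighted sum of the latent and nothing else. The decoder has no view of the tokens it has already emitted, which is one plausible reading of why the output loses structure after a few tokens.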
It’s a bitter pill, but it’s also a clear architectural insight. The world model works. The training algorithm works. The decoder doesn’t. And knowing exactly where the problem is — rather than suspecting that the entire approach might be flawed — is its own kind of progress.
What Phase 3 Proved
Difference Target Propagation works as a biologically plausible training signal for the world model. That sentence took weeks of debugging to earn, and we don’t want to understate what it means.
The metrics are unambiguous. Entropy dropped, meaning the model learned to make structured, confident predictions. KL divergence rose, meaning the model’s latent representations carry genuine information about the data it was trained on. The world model learned through targets, not gradients; through local corrections, not global optimization. The bio-plausible training paradigm isn’t just theoretically sound. It’s empirically validated.
The limitation is equally clear. A learned world model is only as useful as its ability to communicate what it knows. The decoder architecture — the bridge from internal representation to external output — is the bottleneck. The world model can learn. It just can’t yet speak.
That’s Phase 4. The decoder isn’t a minor engineering fix — it’s an architectural rethinking of how you read out from a multi-modal world state. But we go into it knowing the foundation is solid. The world model learns. DTP works. Everything upstream of the decoder is proven.
The debugging marathon was worth it.
— The Sulphur Team