The Bottleneck We Didn’t See
By the time we reached Phase 3 of the NGI project, FENA was alive. Fourteen nodes settling through predictive coding dynamics, continuous-time evolution, memory consolidation, local learning — the architecture we’d spent months building was doing what it was supposed to do. The world model was learning. Entropy was dropping. KL divergence was climbing. We had proof that the system could form internal representations.
Then we tried to make it talk.
The approach seemed obvious at the time. We had a world model with rich internal representations. We needed language output. So we did what everyone does: we attached a decoder. An MLP that would read the world model’s internal state and translate it into token predictions. Extract the language from the thinking. Simple.
The loss curve told us otherwise. It plateaued at roughly 6.2, barely better than random chance. We tuned hyperparameters. We adjusted architectures. We debugged gradient flow. Nothing moved the needle. The world model was clearly learning something (we could see it in the representation metrics), but the decoder couldn’t turn that something into coherent language.
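One way to sanity-check a plateau like this is to compare it with the loss of a uniform guess, which is ln(V) for a vocabulary of V tokens. The sketch below assumes the loss is a cross-entropy measured in nats. The vocabulary sizes are hypothetical, since the tokenizer isn’t described here.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy (in nats) of guessing every token with equal probability."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be a positive integer")
    return math.log(vocab_size)


def perplexity(loss_nats: float) -> float:
    """Effective number of equally likely choices implied by a loss in nats."""
    return math.exp(loss_nats)


# Hypothetical vocabulary sizes, for comparison only.
for vocab_size in (512, 1024, 32000):
    baseline = uniform_cross_entropy(vocab_size)
    print(f"V={vocab_size}: uniform baseline = {baseline:.2f} nats")

print(f"perplexity at loss 6.2 = {perplexity(6.2):.0f}")
```

Under those assumptions, a loss of 6.2 corresponds to a perplexity of roughly 490. In other words, the decoder is about as uncertain as a uniform guess over a few hundred tokens.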
We spent weeks assuming it was an engineering problem. It wasn’t. It was a conceptual one.
Extraction Is a Dead End
The realization came slowly, then all at once.
A decoder sitting outside the world model is an extractor. It observes the world model’s state — a 512-dimensional slot-based representation — and tries to compress that into a sequence of discrete tokens. But here’s the fundamental issue: the world model wasn’t trained to represent language. It was trained to predict sensory input, minimize free energy, build a model of its world. Language wasn’t part of that world. So asking a decoder to extract language from those representations is like asking someone to read a book by staring at the author’s brain scan. The information isn’t there in the right form. Maybe it’s there in some latent sense — some shadow of linguistic structure buried in the activations — but no decoder can faithfully extract what isn’t natively represented.
This is the information bottleneck. The world model builds representations optimized for prediction. The decoder needs representations optimized for language. These are different objectives, and no amount of decoder architecture will bridge that gap cleanly.
Consider how language works in the brain. Broca’s area and Wernicke’s area aren’t decoder modules bolted onto a separate thinking system. They’re deeply integrated into the cortical hierarchy — receiving and sending prediction errors just like every other region. Language isn’t extracted from thought. Language is part of thought. You don’t think first and then translate into words. The words are part of the thinking process itself.
We had been treating language as a post-hoc report on cognition. A readout. A display. But the brain doesn’t have a display. The brain has modalities — vision, audition, proprioception, language — all woven into the same predictive fabric. Our first binding principle states it explicitly: all modalities native to the world model, no separate decoders. We had written the principle and then violated it.
What If the World Model Just… Spoke?
The fix, once we saw it, felt almost embarrassingly simple.
If the decoder-as-extractor paradigm is fundamentally limited, don’t extract. Don’t bolt a language module onto the outside of the world model. Put it inside. Make language a native modality — another node in the FENA network, settling alongside vision, memory, and everything else.
FENA grew from fourteen nodes to fifteen.
The fifteenth node is a language node with a token prediction head. It doesn’t sit outside the world model reading internal state through a narrow bottleneck. It participates in the world model. It settles. It sends prediction errors to its neighbors and receives prediction errors from them. It’s not asking the world model “what did you see?” — it is part of what the world model sees, thinks, and expresses.
This is PCLG: Predictive Coding Language Generation. Language generation as prediction error minimization, identical in mechanism to every other modality in FENA. The language node predicts the next token. That prediction generates an error signal. The error propagates laterally through the settling loop, influencing and being influenced by every other node in the network.
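As a minimal sketch of that loop (not FENA’s actual implementation), the token prediction error can be written as the difference between the observed token, encoded one-hot, and the predicted distribution. Surprise is the negative log-probability of the observed token. The function names, the three-token vocabulary, and the update rate are all illustrative assumptions, and only the local update is shown; the lateral propagation to other nodes is left out.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_prediction_error(logits, target_index):
    """Return (error vector, surprise) for one next-token prediction."""
    probs = softmax(logits)
    # Error = observed one-hot token minus predicted probability.
    error = [(1.0 if i == target_index else 0.0) - p for i, p in enumerate(probs)]
    surprise = -math.log(probs[target_index])
    return error, surprise


def settle_step(logits, error, rate=0.5):
    """Nudge the node's state in the direction that reduces its error."""
    return [value + rate * e for value, e in zip(logits, error)]


# Toy run: a three-token vocabulary where token 1 is the one observed.
logits = [0.0, 0.0, 0.0]
target = 1
surprises = []
for _ in range(5):
    error, surprise = token_prediction_error(logits, target)
    surprises.append(surprise)
    logits = settle_step(logits, error)

print([round(s, 3) for s in surprises])
```

Each step moves the state along the error vector, which is the negative gradient of the surprise with respect to the logits, so the surprise shrinks as the node settles.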
The elegance is hard to overstate. Instead of a world model that thinks and a decoder that reports, you have a world model that speaks. Language isn’t the output of cognition — it’s a dimension of cognition. The same settling dynamics that resolve visual ambiguity simultaneously resolve linguistic ambiguity. The same prediction error minimization that drives perception drives expression.
Prediction Errors All the Way Across
Here’s how it works in practice.
When FENA processes input, all fifteen nodes participate in the settling loop. Each node maintains predictions about what it expects, compares those predictions against incoming signals, and computes prediction errors. These errors propagate laterally — not just up and down a hierarchy, but across modalities.
The language node predicts the next token based on its current state. That prediction generates an error. But critically, that error doesn’t stay local. It propagates to neighboring nodes — the reasoning core, the memory system, the sensory processing layers. Those nodes adjust their settling in response. And their adjusted states send new lateral signals back to the language node, which updates its prediction.
The result is mutual constraint. The language node can’t settle on tokens that contradict what the visual nodes are settling on, because their prediction errors are coupled. If the visual system is settling on “dog” and the language node is drifting toward “cat,” the lateral error signals pull them into alignment. Perception and expression settle together, not sequentially.
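The “dog” versus “cat” case can be illustrated with a toy two-node model. This is a sketch under stated assumptions, not FENA’s settling dynamics: the update rule, the coupling constant, the evidence values, and every name below are hypothetical. Each node is pulled toward its own evidence and, through a lateral term, toward what the other node currently believes.

```python
import math

CONCEPTS = ["dog", "cat"]


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def settle(vision_evidence, language_prior, coupling, steps=200, rate=0.5):
    """Settle two nodes together; `coupling` scales the lateral error."""
    vision = [0.0] * len(CONCEPTS)
    language = [0.0] * len(CONCEPTS)
    for _ in range(steps):
        # Both nodes read each other's beliefs from the same time step.
        p_v, p_l = softmax(vision), softmax(language)
        for i in range(len(CONCEPTS)):
            own_v = vision_evidence[i] - p_v[i]       # vision's local error
            own_l = language_prior[i] - p_l[i]        # language's local error
            lateral_v = coupling * (p_l[i] - p_v[i])  # error received from language
            lateral_l = coupling * (p_v[i] - p_l[i])  # error received from vision
            vision[i] += rate * (own_v + lateral_v)
            language[i] += rate * (own_l + lateral_l)
    return softmax(vision), softmax(language)


def winner(probs):
    """Name of the concept with the highest probability."""
    return CONCEPTS[max(range(len(probs)), key=probs.__getitem__)]


vision_evidence = [0.9, 0.1]  # strong visual evidence for "dog"
language_prior = [0.4, 0.6]   # language node drifting toward "cat"

for coupling in (0.0, 2.0):
    p_v, p_l = settle(vision_evidence, language_prior, coupling)
    print(f"coupling={coupling}: vision={winner(p_v)}, language={winner(p_l)}")
```

With the coupling at zero, the two nodes settle on different answers. With it switched on, the language node is pulled to “dog” because the visual evidence is the stronger of the two, and the vision node’s confidence drops slightly in return.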
This is fundamentally different from how a typical transformer-based vision-language model generates language. There, generation is autoregressive: one token at a time, each conditioned on the previous tokens but divorced from any ongoing perceptual process. The model “sees” an image, encodes it once, and then generates tokens in a separate sequential pass. Perception is frozen while language unfolds.
In FENA with PCLG, perception and language are never frozen. They’re continuously settling together. The act of expressing something in language can change how the system perceives it, and a shift in perception can change what the system says. This bidirectional coupling is, as far as we can tell, closer to how language actually works in biological brains — where naming an object can literally change how you see it.
Where We Are
We want to be honest about this: PCLG is currently about half implemented.
What’s done is substantial. The architectural design is complete. The fifteenth node is integrated into FENA’s settling loop. The lateral prediction error propagation pathways are in place. The token prediction head is specified and connected. The modifications to the settling dynamics that allow language to participate as a full modality — those are working.
What remains is equally substantial. The training pipeline needs to be adapted for the new node. We need evaluation frameworks that can measure whether language generation through settling actually produces coherent output. There’s tuning work ahead — finding the right precision weights for the language node’s prediction errors relative to other modalities, calibrating how strongly language influences and is influenced by the rest of the network.
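To make the precision-weight question concrete, here is a toy sketch in which two modalities make competing predictions about one shared scalar quantity. It assumes the standard precision-weighted form, where each error is scaled by its precision before it updates the estimate. The function name, the precision values, and the scalar setup are hypothetical and are not taken from FENA.

```python
def settle_shared_estimate(predictions, precisions, steps=100, rate=0.1):
    """Settle a scalar estimate under precision-weighted prediction errors.

    Stable as long as rate * sum(precisions) stays below 2.
    """
    estimate = 0.0
    for _ in range(steps):
        weighted_error = sum(
            precisions[name] * (predictions[name] - estimate)
            for name in predictions
        )
        estimate += rate * weighted_error
    return estimate


# Two modalities disagree about the same quantity.
predictions = {"vision": 1.0, "language": 0.0}

low_language = settle_shared_estimate(predictions, {"vision": 4.0, "language": 1.0})
equal_weight = settle_shared_estimate(predictions, {"vision": 4.0, "language": 4.0})

print(round(low_language, 3), round(equal_weight, 3))
```

The estimate settles at the precision-weighted mean of the predictions, so raising the language node’s precision moves the shared estimate toward what language predicts. Choosing that balance across fifteen nodes is the tuning work described above.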
We haven’t proven this works at scale yet. But the architecture feels right in a way that the decoder approach never did. The decoder always felt like a hack — a concession to convention. PCLG feels like what the system was always supposed to do. When we wrote binding principle number one — all modalities native to the world model — we meant it. We just hadn’t figured out how to apply it to language yet.
The Direction Is Clear
This is Phase 4 of the NGI journey, and looking back, each phase was a step toward this moment. Phase 1 taught us that local learning alone isn’t enough without top-down signals. Phase 2 taught us that gradient decoders hit information bottlenecks. Phase 3 proved the world model can learn — and showed us exactly where the decoder paradigm breaks down. Phase 4 is the response: stop extracting, start integrating.
Every post in this blog series has been building toward a system where all modalities are native to the world model. Vision, memory, reasoning, temporal dynamics — we integrated them one by one, each time replacing a bolted-on component with something native. Language was the last holdout. The one modality we were still trying to extract rather than embed.
Not anymore.
What happens when language is truly native to the world model? Not just generation — comprehension, reasoning-in-language, the ability to think in words as naturally as the system thinks in images. We don’t have answers yet. We have an architecture that makes the question meaningful for the first time. And we have a lot of work ahead.
— The Sulphur Team