The Decoder Plateau: Why Extractors Don't Work

The Obvious Next Step

After building the world model — a 512-slot state representation that could track objects, relationships, and causal dynamics — the question felt almost rhetorical: how do we get language out of this thing?

The answer seemed obvious. Train an MLP decoder. Take the rich, structured world state as input, map it to token probabilities, and let gradient descent do what gradient descent does. This is a standard pattern in machine learning. Autoencoders do it. VAEs do it. Every encoder-decoder architecture in the last decade does it. You have a latent representation, you want an output modality, you train a decoder. Simple.

We were confident. The world model was demonstrably learning useful representations — its prediction accuracy on world dynamics was strong and improving. All we needed was a bridge from that internal understanding to language. The decoder would be that bridge.

It wasn’t.

The Decoder Architecture

The setup was straightforward. An MLP took the 512-dimensional world state as input — the full snapshot of what the world model believed about the current state of affairs. Hidden layers processed this representation through standard nonlinearities, and the output layer produced a probability distribution over the vocabulary, roughly 500 tokens.

We trained with cross-entropy loss on paired data: world states aligned with their corresponding target text sequences. The training objective was clean — given this world state, predict the next token. Standard supervised learning. Standard optimizer. Standard everything.
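For readers who want to see the shape of this setup, here is a minimal sketch in plain Python. It is not our training code. The hidden width, the initialization scheme, the ReLU activation, the random stand-in world state, and every function name below are assumptions made for illustration; only the 512-dimensional input, the three layers, and the roughly 500-token output come from the description in this post.

```python
import math
import random

STATE_DIM = 512    # world state size, from the post
HIDDEN_DIM = 64    # assumed; the post does not give the hidden width
VOCAB_SIZE = 500   # "roughly 500 tokens"; the exact size is assumed


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small uniform random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def decoder_forward(state, layers):
    """Hidden layers with ReLU, then log-probabilities over tokens."""
    h = state
    for layer in layers[:-1]:
        h = relu(linear(h, layer))
    return log_softmax(linear(h, layers[-1]))


def cross_entropy(log_probs, target):
    """Negative log-likelihood of the target token, in nats."""
    return -log_probs[target]


rng = random.Random(0)
layers = [
    init_layer(STATE_DIM, HIDDEN_DIM, rng),
    init_layer(HIDDEN_DIM, HIDDEN_DIM, rng),
    init_layer(HIDDEN_DIM, VOCAB_SIZE, rng),
]

# Stand-in for a real world state; a real run would use model output.
state = [rng.gauss(0.0, 1.0) for _ in range(STATE_DIM)]
log_probs = decoder_forward(state, layers)
loss = cross_entropy(log_probs, target=42)  # token id 42 is arbitrary
print(f"loss for one (state, token) pair: {loss:.3f} nats")
```

Training would average this loss over many (world state, next token) pairs and follow its gradient. The sketch stops at the forward pass and loss because that is all the rest of this post depends on.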

We started with a simple three-layer MLP. Nothing exotic, nothing clever. We wanted to establish a baseline before experimenting with architecture. The expectation was that even a simple decoder would show some signal — imperfect predictions, sure, but clearly better than random. A starting point to iterate from.

That starting point never came.

The Plateau

The loss dropped quickly in the first few hundred steps. This is normal — the network picks up trivial patterns, learns the token frequency distribution, exploits easy statistical shortcuts. We watched the loss curve fall and felt the familiar satisfaction of a training run making progress.

Then it stopped. The loss settled at approximately 6.2 and refused to move.

If that number doesn’t mean anything to you, here’s what it means to us. For a vocabulary of roughly 500 tokens, the cross-entropy loss of a uniform random guess is −ln(1/500) = ln(500), which comes out to about 6.2 nats. Our decoder, after training, was performing no better than rolling a 500-sided die. It had learned nothing. Every token was equally likely. The world state, as far as the decoder was concerned, contained no information about which word should come next.
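The arithmetic is easy to check. The snippet below is a small worked example, assuming a vocabulary of exactly 500 tokens (the real vocabulary was only roughly that size) and a loss measured with the natural logarithm, as most deep learning frameworks do by default.

```python
import math

VOCAB_SIZE = 500  # assumed exact; the post says "roughly 500"

# Cross-entropy of a uniform guess, in nats (natural log).
uniform_loss_nats = -math.log(1.0 / VOCAB_SIZE)

# The same quantity in bits, for readers used to log base 2.
uniform_loss_bits = math.log2(VOCAB_SIZE)


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target token."""
    return -math.log(probs[target])


# A uniform predictor pays the same loss whatever the target is.
uniform = [1.0 / VOCAB_SIZE] * VOCAB_SIZE
losses = {round(cross_entropy(uniform, t), 6) for t in range(VOCAB_SIZE)}

print(f"{uniform_loss_nats:.3f} nats")  # 6.215 nats
print(f"{uniform_loss_bits:.3f} bits")  # 8.966 bits
print(len(losses))                      # 1
```

A useful consequence: ln(vocabulary size) is the number to keep in mind for any classifier. A loss sitting at that value means the output distribution carries no preference among the classes.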

Our first assumption was a bug. We checked the data pipeline, verified the world states were correctly paired with their target sequences, confirmed the loss computation was correct. No bugs.

Our second assumption was hyperparameters. Learning rate too high? Too low? Wrong optimizer? We swept everything. Adam, SGD with momentum, cosine annealing, warmup schedules, gradient clipping. The plateau was immovable.

Our third assumption was architecture. Too shallow, too narrow, wrong activation functions. We went deeper, wider, added residual connections. Same plateau. 6.2. Every single time.

The loss curve became a kind of taunt. It would dip to 6.1 in the first epoch, flirt with 6.0 if you squinted, then settle back to 6.2 like a ball rolling into a valley it couldn’t escape. The decoder was learning absolutely nothing about the relationship between world states and language.

The Debugging Spiral

We refused to accept it. The world model was clearly encoding useful structure — its prediction accuracy on world dynamics proved that. The representations were rich and informative. A decoder should be able to find something in there.

So we threw everything at the wall.

Deeper MLPs — six layers, eight layers, with residual connections and layer normalization. Wider hidden dimensions — 1024, 2048, matching and exceeding the input dimensionality. Attention mechanisms over the 512 slots, treating them as a sequence and letting the decoder learn which slots to attend to. Different tokenizers, different training data splits, different batch sizes. We even tried training the decoder on synthetic data where the mapping between world state and text was trivially simple.

Nothing moved the needle. The plateau at 6.2 was not sensitive to any architectural or training decision we could make. It was robust in a way that hyperparameter problems are not. Hyperparameter problems shift when you change the hyperparameters. This didn’t shift. It was a wall, not a hill.

That’s when the suspicion started to form. This wasn’t an engineering problem. Something more fundamental was wrong.

The Information Bottleneck

The realization came slowly, then all at once. The world state doesn’t contain linguistic information in an extractable form — because it was never trained to encode it.

Think about what the world model was optimized to do. It learned to predict world dynamics: physical states, object positions, causal relationships, how things change over time. Its 512-slot representation was shaped entirely by that objective. Every dimension, every slot, every bit of representational capacity was devoted to encoding information that helps predict what happens next in the world.

Language was never part of that objective. Syntax, semantics, pragmatics, communicative intent — none of these were prediction targets during the world model’s training. The world model had no incentive to encode any of it, and so it didn’t.

Here’s an analogy that helped us grasp the depth of the problem. Imagine you have a photograph — a rich, detailed, high-resolution image. It contains enormous amounts of information about shape, color, texture, spatial relationships. Now try to extract a symphony from it. Not inspired by the photograph — actually extract the musical notes, the harmonic structure, the rhythm. You can’t. The photograph contains rich information, but not the kind of information needed to produce music. No decoder, however powerful, however cleverly architected, can extract information that simply isn’t there.

That’s what we were doing. The 512-slot world state is an excellent representation for prediction and planning. But language requires fundamentally different information — how concepts relate to words, how grammar structures meaning, how context shapes what should be said next. The world model had no reason to encode any of this, and no amount of decoder engineering could conjure it from a representation that doesn’t contain it.

This is a fundamental bottleneck, not an engineering limitation. A decoder is a function of its input, so it cannot produce information about the target that the input does not carry. More parameters won’t help. Better architectures won’t help. Cleverer training schedules won’t help. The information isn’t there. The loss plateau at 6.2 wasn’t a failure of the decoder. It was the decoder faithfully reporting that the input contained no linguistic signal it could extract. Chance-level performance was the ceiling because the representation provided no basis for doing better.
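The argument can be made concrete with a toy calculation. This is an illustration only, not a measurement we ran on the world model: the two joint distributions below are made up, with two "world states" and four "tokens". The fact it relies on is standard information theory: the best expected cross-entropy any predictor of a token Y from a state X can reach is the conditional entropy H(Y|X), and when X and Y are independent that equals H(Y), so seeing X buys nothing.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_marginal(joint):
    """p(y), where joint[x][y] = p(state x, token y)."""
    return [sum(col) for col in zip(*joint)]


def conditional_entropy(joint):
    """H(Y|X) = sum over x of p(x) * H(Y | X = x)."""
    total = 0.0
    for row in joint:
        p_x = sum(row)
        if p_x > 0:
            total += p_x * entropy([p / p_x for p in row])
    return total


def mutual_information(joint):
    """I(X;Y) = H(Y) - H(Y|X): how much the state tells us about the token."""
    return entropy(token_marginal(joint)) - conditional_entropy(joint)


# Hypothetical case 1: the state says nothing about the token.
independent = [
    [0.125, 0.125, 0.125, 0.125],
    [0.125, 0.125, 0.125, 0.125],
]

# Hypothetical case 2: the state narrows the token down to two options.
informative = [
    [0.25, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
]

for name, joint in [("independent", independent), ("informative", informative)]:
    h_y = entropy(token_marginal(joint))
    h_y_given_x = conditional_entropy(joint)
    mi = mutual_information(joint)
    print(f"{name}: H(Y)={h_y:.3f}  H(Y|X)={h_y_given_x:.3f}  I(X;Y)={mi:.3f}")
```

In the independent case the best achievable loss equals the entropy of the tokens themselves, ln(4) here, and no choice of decoder changes that. In the informative case the floor drops to ln(2). A plateau at chance is what the first case looks like from the outside.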

The Extractor Paradigm Is Dead

Once we understood the bottleneck, the implications rippled outward. This wasn’t just a problem with our specific decoder or our specific world model. It was a problem with the entire paradigm of extracting language from a world model.

Any attempt to bolt a decoder onto a representation that doesn’t carry the information the target modality needs will hit the same wall. It doesn’t matter how sophisticated the decoder is. If the upstream representation doesn’t encode the information needed for the downstream task, no amount of post-hoc processing can manufacture it. You can’t extract what isn’t there.

This killed an entire class of approaches for us. Every architecture that treats language as something to be decoded from a world model — decoded after the fact, translated from one modality to another — is fundamentally limited by this bottleneck. The extractor paradigm, intuitive as it seems, is a dead end.

The realization was painful but clarifying. We needed a completely different relationship between language and the world model.

What Came Next

The decoder plateau forced a rethink that changed everything. If language can’t be extracted from the world model, then it must be native to it. Language generation can’t happen after the fact, bolted on as a downstream task. It has to be integrated into the prediction process itself — part of the world model’s own dynamics, not an external module reading from the world model’s outputs.

That insight — language as native, not extracted — became the seed of a paradigm shift that restructured our entire architecture. The world model wouldn’t just model the world and then have language pulled from it. The world model would speak, natively, as part of its own settling and prediction process.

We’ll tell that story in a future post. For now, the lesson stands on its own: when the loss plateaus at random chance and nothing you try can move it, the problem isn’t your decoder. The problem is what you’re asking it to decode.

— The Sulphur Team