The Gap Between Theory and Reality
We had all the pieces. The predictive coding layer was passing tests. The hierarchical processing pipeline was wired up. The continuous-time dynamics engine was humming along, evolving node states according to their differential equations. Each component, in isolation, did exactly what the theory said it should.
But none of that matters until the system actually learns.
There’s a particular kind of dread that comes from staring at a novel architecture and asking: will this converge? The mathematics say yes — local learning rules approximate backpropagation under the right conditions. The neuroscience says yes — the brain has been doing this for hundreds of millions of years. But mathematics and neuroscience don’t have to debug NaN gradients at two in the morning. We do.
This is the story of getting FENA to train — what the pipeline actually looks like, what went wrong, and the moment we knew it was working.
The Training Loop — Nothing Like What You’re Used To
If you’ve trained a neural network before, you know the ritual. Forward pass. Compute loss. Call backward. Step the optimizer. Zero the gradients. Repeat. It’s so ingrained that it’s hard to imagine training any other way.
FENA doesn’t do any of that.
There is no global loss function. There is no global optimizer stepping every weight with a single learning rate and momentum. There is no backward pass. The concepts of “forward” and “backward” don’t even apply, because information flows in both directions simultaneously.
Instead, FENA’s training loop works like this: you present data to the lowest level of the hierarchy. The system begins to settle. Predictions cascade downward from higher layers while prediction errors flow upward from lower layers. Every layer, at every moment, is simultaneously predicting what it expects and comparing that prediction against what it actually receives. The discrepancy — the prediction error — drives local weight updates at each layer independently.
The training step is the settling process. Inference and learning are not separate operations — they are the same operation. The system learns by perceiving, and it perceives by predicting. Present data, let the hierarchy settle, and learning happens as a natural consequence of prediction error minimization.
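To make the idea concrete, here is a minimal sketch of a single settle-then-learn step. It is not FENA's actual code: it assumes one input layer and one latent layer with linear predictions, a zero-mean prior on the latent state, and hypothetical names and parameters (`settle_and_learn`, `n_settle`, `dt`, `lr`). The point is only that the weight update uses nothing but the local error and the local state.

```python
import random

def settle_and_learn(x, W, n_settle=30, dt=0.1, lr=0.05):
    """Settle a latent layer against input x, then update W from the local error.

    Hypothetical toy: one input layer, one latent layer, linear predictions.
    Returns the summed squared prediction error after settling.
    """
    n_in, n_lat = len(W), len(W[0])
    z = [0.0] * n_lat  # latent state, free to move while the system settles

    def prediction_error():
        # Top-down prediction of the input, compared against the actual input.
        pred = [sum(W[i][j] * z[j] for j in range(n_lat)) for i in range(n_in)]
        return [x[i] - pred[i] for i in range(n_in)]

    for _ in range(n_settle):
        err = prediction_error()
        for j in range(n_lat):
            # Bottom-up error drives the state; an assumed zero-mean prior pulls it back.
            drive = sum(W[i][j] * err[i] for i in range(n_in))
            z[j] += dt * (drive - z[j])

    err = prediction_error()
    for i in range(n_in):
        for j in range(n_lat):
            # Local rule: each weight sees only its own error and its own input state.
            W[i][j] += lr * err[i] * z[j]
    return sum(e * e for e in err)

rng = random.Random(0)
pattern = [1.0, -1.0, 0.5, 0.0]  # structure hidden in the toy data
W = [[0.1, -0.1], [0.05, 0.1], [0.1, 0.05], [-0.1, 0.1]]  # small fixed init
errors = []
for _ in range(600):
    scale = rng.uniform(-1.0, 1.0)
    errors.append(settle_and_learn([scale * p for p in pattern], W))

early = sum(errors[:100]) / 100
late = sum(errors[-100:]) / 100
print(f"mean squared error: first 100 steps {early:.4f}, last 100 steps {late:.4f}")
```

There is no loss object and no backward call anywhere in the loop. Presenting an input and letting the state settle is the whole training step, and the error after settling shrinks as the weights pick up the pattern.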
If that sounds strange, consider: this is exactly what the brain does. You don’t have a “learning mode” and an “inference mode.” You’re always doing both. Every time you perceive something surprising, you’re simultaneously recognizing what’s there and updating your model of the world.
The other critical piece is precision weighting. During training, the system doesn’t just learn to make better predictions; it also learns how much confidence to place in them. Layers that produce reliable predictions develop high precision weights, effectively telling the system “trust this signal.” Layers still figuring things out maintain lower precision, which scales down their error signals and so acts as a natural, self-regulating learning rate. The system learns how much to learn from each of its own errors.
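One common way to picture precision weighting is as an inverse variance: a layer whose errors have been small and steady earns a large multiplier, and a layer whose errors are still large gets a small one. The sketch below assumes exactly that definition. The class name, the moving-average rate, and the inverse-variance form are illustrative assumptions, not a description of FENA's internals.

```python
class PrecisionEstimator:
    """Running estimate of one layer's error variance; precision is its inverse.

    Hypothetical helper: the name, the update rate, and the inverse-variance
    definition are assumptions for illustration.
    """

    def __init__(self, init_var=1.0, rate=0.1, eps=1e-8):
        self.var = init_var
        self.rate = rate
        self.eps = eps  # guards against division by zero

    @property
    def precision(self):
        return 1.0 / (self.var + self.eps)

    def observe(self, error):
        # Exponential moving average of the squared prediction error.
        self.var += self.rate * (error * error - self.var)

    def weight(self, error):
        # Precision-weighted error: reliable layers speak loudly, noisy ones softly.
        return self.precision * error

reliable, unsettled = PrecisionEstimator(), PrecisionEstimator()
for step in range(100):
    sign = 1.0 if step % 2 == 0 else -1.0
    reliable.observe(0.1 * sign)   # small, consistent errors
    unsettled.observe(2.0 * sign)  # large errors from a layer still learning

print(f"reliable layer precision:  {reliable.precision:.2f}")
print(f"unsettled layer precision: {unsettled.precision:.2f}")
```

Multiplying each error by its layer's precision before it drives any update is what turns precision into a per-layer, self-adjusting step size.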
First Contact With Data
The first training runs used simple sequential data — text sequences with predictable patterns, repeated motifs, basic structure. Nothing that would challenge a transformer for even a millisecond. But for a novel architecture making its first contact with real data, it was plenty.
The initial behavior was, to put it charitably, chaos.
Prediction errors were enormous at every layer. The hierarchy hadn’t yet learned anything, so its predictions were essentially random — every input was maximally surprising. The settling process thrashed: predictions and errors ricocheting up and down the hierarchy, nodes oscillating wildly, the system nowhere close to equilibrium even after dozens of settling iterations.
This was expected. A randomly initialized predictive coding network is a system in maximum confusion, trying to predict a world it knows nothing about. The energy landscape was rough and jagged, full of sharp peaks and no clear valleys.
But then, over the course of the first few hundred training iterations, something shifted. The prediction errors at the lowest layer — the one closest to raw data — started to shrink. Slowly at first, then noticeably. The system was picking up on surface-level statistical patterns: which tokens tend to follow other tokens, the frequency distribution of the input.
More importantly, the settling dynamics began to change. Where the system had initially required thirty or forty settling iterations to reach anything resembling equilibrium, it was now getting there in fifteen. Then ten. The predictions were getting less wrong, which meant fewer iterations were needed to reconcile them with reality. The system was, by any reasonable definition, learning.
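Settling efficiency is easy to measure: count iterations until the state stops moving. The sketch below is a deliberately tiny stand-in, assuming a single scalar state, a fixed equilibrium, and a hypothetical helper `settle_until_stable`. It shows only why a better starting prediction means fewer iterations, not how FENA's settling is implemented.

```python
def settle_until_stable(error_fn, z0, dt=0.2, tol=1e-3, max_iters=200):
    """Relax a scalar state against its prediction error until it stops moving.

    Returns (final_state, iterations_used). Hypothetical helper for illustration.
    """
    z = z0
    for iteration in range(1, max_iters + 1):
        step = -dt * error_fn(z)
        z += step
        if abs(step) < tol:  # state has effectively stopped changing
            return z, iteration
    return z, max_iters

def error_fn(z):
    return z - 1.0  # toy error: equilibrium sits at z = 1.0

# Settling starts from the top-down prediction, so a better prediction starts closer.
_, iters_untrained = settle_until_stable(error_fn, z0=-4.0)
_, iters_trained = settle_until_stable(error_fn, z0=0.9)
print("iterations from a poor prediction:", iters_untrained)
print("iterations from a good prediction:", iters_trained)
```

Logged once per training step, this count gives a curve that plays the role a loss curve plays elsewhere.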
What Went Wrong (And How We Fixed It)
Novel architectures don’t fail in familiar ways. When a transformer doesn’t train, you know the usual suspects: learning rate too high, vanishing gradients, data preprocessing bug. When a predictive coding hierarchy doesn’t train, you’re in uncharted territory.
We hit three major challenges, each specific to the local learning paradigm.
The first was precision collapse. Remember those precision weights that modulate how much the system trusts each prediction error? Early in training, some layers discovered a perverse shortcut: if you drive your precision weights toward zero, your prediction errors effectively disappear. No errors means no surprise. No surprise means low free energy. The system was “cheating” — achieving low free energy not by making good predictions but by turning down the volume on its own error signals.
The fix was precision regularization — maintaining a healthy floor on precision weights so that the system couldn’t simply mute itself. Think of it as forcing the system to keep its eyes open, even when what it sees is uncomfortable.
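The collapse and the floor can both be shown in a few lines. This sketch assumes a simplified energy of 0.5 * precision * error**2 with no term that rewards confidence, which is one way the shortcut described above can arise. The function name, the learning rate, and the floor value are hypothetical.

```python
def update_precision(precision, error, lr=0.025, floor=0.0):
    """One descent step on the weighted-error energy 0.5 * precision * error**2.

    Assumed toy objective: with nothing rewarding confidence, the gradient
    0.5 * error**2 is never negative, so descent can only shrink precision.
    """
    grad = 0.5 * error * error
    return max(floor, precision - lr * grad)

error = 1.0  # the layer's predictions never improve in this toy run
unprotected, protected = 1.0, 1.0
for _ in range(200):
    unprotected = update_precision(unprotected, error)
    protected = update_precision(protected, error, floor=0.1)

print("without floor:", unprotected, "-> weighted error", unprotected * error)
print("with floor:   ", protected, "-> weighted error", protected * error)
```

Without the floor, the weighted error reaches zero even though the raw error never changed, which is the muted-layer failure. With the floor, the error signal stays audible and the only way left to lower the energy is to predict better.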
The second challenge was layer synchronization. In a standard network trained with backpropagation, every layer receives its gradient from the same loss and updates simultaneously. The global signal acts as a coordinator. In FENA, each layer learns independently, using only local information. Nothing stops one layer from learning much faster than its neighbors.
What we observed was oscillatory instability. Lower layers would adapt rapidly to the raw input statistics, sending up prediction errors that higher layers couldn’t yet make sense of. The higher layers would then adjust, changing the predictions flowing downward, which invalidated what the lower layers had just learned. The result was a kind of architectural argument — layers fighting each other rather than cooperating.
The solution was careful tuning of time constants and adaptive modulation of per-layer learning rates. Faster layers were gently slowed down; slower layers were given slightly more aggressive updates. The goal was not to synchronize them perfectly — biological brains certainly don’t — but to keep them within a range where their independent learning trajectories remained compatible.
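As an illustration of the learning-rate half of that fix, the sketch below rescales each layer's rate according to how fast the layer has recently been changing relative to the average. The heuristic itself, the function name, and the `strength` and `max_ratio` parameters are all assumptions chosen for clarity, not the tuning rule we actually used.

```python
def modulate_learning_rates(base_lrs, update_norms, strength=0.5, max_ratio=2.0):
    """Nudge per-layer learning rates so no layer changes far faster than the rest.

    update_norms holds a recent average weight-change magnitude for each layer.
    """
    mean_norm = sum(update_norms) / len(update_norms)
    new_lrs = []
    for lr, norm in zip(base_lrs, update_norms):
        ratio = mean_norm / norm if norm > 0 else max_ratio
        # Partial correction (strength < 1) keeps layers compatible, not identical.
        scale = ratio ** strength
        # Clamp so a single noisy measurement cannot swing a layer's rate wildly.
        scale = min(max_ratio, max(1.0 / max_ratio, scale))
        new_lrs.append(lr * scale)
    return new_lrs

base_lrs = [0.01, 0.01, 0.01]
update_norms = [4.0, 1.0, 0.25]  # lowest layer is racing ahead, top layer lags
new_lrs = modulate_learning_rates(base_lrs, update_norms)
print([round(lr, 5) for lr in new_lrs])
```

The fast bottom layer is slowed, the lagging top layer is sped up to the clamp, and the spread between layers narrows without being forced to zero.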
The third problem was energy landscape traps. During early training, the system would sometimes settle into stable but useless configurations — local minima in the energy landscape where predictions were internally consistent but bore no relationship to the actual data. The system wasn’t learning; it was hallucinating in agreement with itself.
We addressed this with noise injection during settling, analogous to simulated annealing. By adding controlled stochastic perturbations to the settling process, we gave the system the ability to escape shallow energy minima and continue exploring until it found deeper, more meaningful attractors. As training progressed and the representations stabilized, we gradually reduced the noise, letting the system settle cleanly into the structures it had discovered.
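A one-dimensional toy shows the mechanism. The sketch below assumes a hand-picked energy with one shallow and one deep minimum, Gaussian noise, and a linear annealing schedule. None of these choices come from FENA itself, and the names (`energy_grad`, `settle`, `noise0`) are hypothetical.

```python
import random

def energy_grad(z):
    """Gradient of the toy energy E(z) = (z**2 - 1)**2 - z.

    E has a shallow minimum near z = -0.84 and a deeper one near z = 1.11.
    """
    return 4.0 * z * (z * z - 1.0) - 1.0

def settle(z, n_steps=2000, dt=0.05, noise0=0.0, rng=None):
    """Gradient settling with Gaussian noise that anneals linearly toward zero."""
    for step in range(n_steps):
        noise_scale = noise0 * (1.0 - step / n_steps)
        noise = rng.gauss(0.0, noise_scale) if noise_scale > 0.0 else 0.0
        z += -dt * energy_grad(z) + noise
    return z

start = -1.0  # begins inside the shallow basin
plain = settle(start)
noisy = [settle(start, noise0=0.3, rng=random.Random(seed)) for seed in range(20)]
escaped = sum(1 for z in noisy if z > 0.0)
print(f"no noise: z = {plain:.2f}")
print(f"annealed noise: {escaped} of 20 runs end in the deep basin")
```

Plain settling stays wherever it starts. Early noise is large enough to cross the low barrier out of the shallow basin, and by the time it fades the state is far more likely to be resting in the deeper one.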
These were growing pains, not fundamental flaws. Every novel architecture has its own failure modes. The interesting part is that FENA’s are ones computational neuroscience already studies in biological neural circuits: precision dysregulation, cross-layer desynchronization, and pathological attractor states. We weren’t just debugging software. We were recapitulating problems that evolution already solved.
The First Real Results
After working through the early challenges, the system began to show genuine learning — not just decreasing error metrics, but qualitative changes in how the hierarchy organized itself.
The most telling indicator was the structure that emerged across layers. The lowest layers, closest to raw data, developed representations tuned to local statistical regularities — short-range token co-occurrences, common subsequences, surface-level patterns. Higher layers, receiving only prediction errors from below, developed progressively more abstract representations. They weren’t encoding what the data looked like — they were encoding the structure that explained why the data looked that way.
This hierarchical specialization wasn’t programmed. Nobody told layer three to handle syntax while layer five handles semantics. It emerged naturally from the architecture — each layer learning to explain away the prediction errors passed to it from below, which are precisely the errors that lower layers couldn’t resolve with their simpler representations.
Prediction errors across the hierarchy dropped by roughly an order of magnitude over the course of training. But the more important metric was settling efficiency. Where the untrained system needed dozens of iterations to reach equilibrium, the trained system settled in five or six. The predictions were close enough to reality that only minor corrections were needed. The system had become, in a very real sense, a good model of its data.
We also saw the first evidence of generalization. When presented with sequences it hadn’t seen during training — novel combinations of familiar patterns — the system produced accurate predictions. Not perfect, but meaningfully better than chance. The higher-level representations had captured genuine structure, not just memorized training examples.
The moment that made it real: watching the settling dynamics on a held-out sequence and seeing the system confidently predict patterns it had never encountered, with prediction errors barely rising above their baseline. No backward pass. No global loss function. Just local prediction error minimization, propagating through a hierarchy, and the system genuinely understood something about the structure of its input.
Training on Consumer Hardware
One of FENA’s design goals is that it should run on hardware you can actually buy — a single consumer GPU with five to eight gigabytes of VRAM. The training pipeline put this claim to the test.
The results were encouraging. Because FENA uses local learning rules, there is no need to store the full computation graph for a backward pass. In standard deep learning, training consumes dramatically more memory than inference — you need to cache every intermediate activation for gradient computation. In FENA, each layer computes and applies its weight update using only its own local state: its prediction, the incoming signal, and the resulting error. Once the update is applied, that information can be released.
The practical consequence is remarkable: FENA’s memory footprint during training is essentially the same as during inference. There is no memory cliff when you switch from eval mode to training mode. The system that runs on your GPU for inference also trains on your GPU, with no additional memory overhead for gradient storage or activation caching.
VRAM usage during our training runs stayed comfortably within the target range. The bottleneck, such as it was, came from the settling process itself — each training step requires multiple settling iterations, which means more sequential computation per example than a single forward-backward pass in a transformer. Training throughput, measured in examples per second, was lower than a comparably sized transformer. But the memory efficiency more than compensated, allowing us to train models on a single card that would require multi-GPU setups under standard training paradigms.
This is an architectural advantage that scales. Under backpropagation, training memory includes cached activations for every layer, and those scale with depth, batch size, and sequence length on top of the parameters themselves. Because FENA stores no activations for a backward pass, its training memory stays tied to its inference footprint as the model grows, so the savings become increasingly significant for larger models.
What This Means
FENA can train. That sentence is simple, but its implications are not.
Local learning rules — each layer updating its own weights based solely on its own prediction errors, with no global coordinator, no backward pass, no omniscient optimizer — produce genuine learning on real data. The system develops meaningful hierarchical representations, generalizes to novel inputs, and does it all on consumer hardware.
The training pipeline is fundamentally unlike anything in modern machine learning. There is no loss curve to watch in the traditional sense. Instead, you watch settling dynamics evolve — the system becoming faster, more confident, more efficient in its predictions. Training and inference are unified. The distinction between “learning” and “thinking” dissolves, just as it does in the brain.
We’re still early. The training runs described here used a single module on relatively simple data. Ahead of us lie scaling to richer training data, training the full multi-module architecture, and implementing memory consolidation: the process by which short-term learning gets compressed into long-term knowledge structures during offline “rest” periods, mirroring what the brain does during sleep.
But the foundation is proven. The theory works in practice, not just on paper. Local learning rules converge. Hierarchical representations emerge. Predictions improve. The system learns.
The brain never needed backpropagation. Now neither do we.