The End of Backpropagation
For decades, nearly every artificial neural network has been trained the same way: compute a global error, then propagate it backward through every connection in the network. This algorithm, backpropagation, is the backbone of modern AI. It works. But as a model of learning it has a serious flaw: there is no convincing evidence that any brain uses it.
Backpropagation requires something called “weight transport”: to compute its gradient, each layer needs the exact connection strengths of the layers downstream of it, because the error signal travels backward through a copy of the forward weights. This is like requiring every employee in a company to know the salary of everyone further down the chain before deciding how to improve their own performance. It’s mathematically elegant. It’s also biologically implausible, because neurons have no known way to read the synaptic strengths of other neurons.
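To make the weight transport requirement concrete, here is a minimal sketch of the backward step for a tiny two-layer linear network in plain Python. The network shape, the numbers, and the helper names (`matvec`, `transpose`, `hidden_error_backprop`) are illustrative assumptions, not part of the NGI codebase. The thing to notice is that the hidden layer's error cannot be computed without the transpose of `W2`, the weights of the layer after it.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    """Return the transpose of matrix W as a list of rows."""
    return [list(col) for col in zip(*W)]

def hidden_error_backprop(W2, output_error):
    """Backprop's error signal for the hidden layer of a linear network.

    It needs W2, the weights of the *next* layer: this is weight transport.
    """
    return matvec(transpose(W2), output_error)

# Illustrative values (assumptions): x -> h = W1 x -> y = W2 h
W1 = [[0.5, -0.2], [0.1, 0.3]]
W2 = [[1.0, -1.0]]
x = [1.0, 2.0]
target = [0.5]

h = matvec(W1, x)
y = matvec(W2, h)
# dL/dy for the loss L = 0.5 * (y - target)^2
output_error = [yi - ti for yi, ti in zip(y, target)]
# dL/dh: only computable with access to W2
hidden_error = hidden_error_backprop(W2, output_error)
# dL/dW1: outer product of the hidden error and the input
grad_W1 = [[d * xi for xi in x] for d in hidden_error]
```

The update for `W1` depends on `W2` even though the two sets of weights sit on different synapses. That dependency is what a local learning rule has to avoid.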
Today, we completed a foundational milestone in the NGI project: the Predictive Coding Layer, the building block of a system that learns without backpropagation, using only locally available information, as leading theories in neuroscience suggest the brain does.
What Is Predictive Coding?
Predictive coding is a theory of brain function first formalized by Rajesh Rao and Dana Ballard in 1999, building on ideas stretching back to Hermann von Helmholtz in the 1860s. The core insight is profound in its simplicity:
The brain is not a passive receiver of information. It is a prediction machine.
Every level of the cortical hierarchy is constantly generating predictions about what the level below will report. Your visual cortex doesn’t wait to see what’s in front of you — it predicts what it expects to see based on context, memory, and higher-level understanding. Only when reality deviates from prediction does a signal propagate upward: the prediction error.
This isn’t just metaphor. It has anatomical support: in many cortical pathways, the feedback connections from higher areas to lower ones (which, on this theory, carry predictions downward) are at least as numerous as the feedforward connections (which carry sensory data upward), and by some estimates outnumber them. The brain appears to dedicate at least as much wiring to prediction as to perception.
How It Replaces Backpropagation
In standard neural networks, learning requires a global signal — the loss — computed at the output and threaded backward through the entire network. Every layer adjusts based on how it contributed to the global error.
In predictive coding, there is no global signal. Each layer operates autonomously:
- Predict: Each layer generates a prediction of what it expects to receive from the layer below
- Compare: When the actual signal arrives, the layer computes the prediction error — the difference between expectation and reality
- Update: The layer adjusts its own internal model to reduce this error, using only locally available information
- Propagate: Only the prediction error (not the raw signal) is passed upward to the next level
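The four steps above can be sketched in a few lines of plain Python. This is a toy, assuming a linear prediction from a small internal state. The class name `ToyPredictiveLayer`, its methods, and the learning rates are hypothetical choices for illustration and do not describe the actual Predictive Coding Layer in NGI.

```python
import random

class ToyPredictiveLayer:
    """Toy layer that predicts the signal from below as W @ state (an assumption)."""

    def __init__(self, n_below, n_state, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(n_state)]
                  for _ in range(n_below)]
        self.state = [rng.uniform(-0.5, 0.5) for _ in range(n_state)]

    def predict(self):
        """Predict: what the layer expects to receive from below."""
        return [sum(w * s for w, s in zip(row, self.state)) for row in self.W]

    def compare(self, actual):
        """Compare: prediction error = actual signal minus prediction."""
        return [a - p for a, p in zip(actual, self.predict())]

    def update(self, error, state_lr=0.1, weight_lr=0.05):
        """Update: reduce 0.5 * ||error||^2 using only this layer's own quantities."""
        # State update: state += lr * W^T error
        for j in range(len(self.state)):
            self.state[j] += state_lr * sum(
                self.W[i][j] * error[i] for i in range(len(error)))
        # Weight update: Hebbian-like product of local error and local state
        for i, e in enumerate(error):
            for j, s in enumerate(self.state):
                self.W[i][j] += weight_lr * e * s

    def step(self, actual):
        """One predict/compare/update cycle; returns the error to propagate upward."""
        error = self.compare(actual)
        self.update(error)
        return error

layer = ToyPredictiveLayer(n_below=3, n_state=2)
signal = [1.0, 0.5, -0.5]
first_error = layer.step(signal)
for _ in range(200):
    last_error = layer.step(signal)
```

Every quantity used in `update` (the error, the state, and the layer's own weights) lives inside the layer. On a repeated input, the error returned by `step` shrinks as the layer's model improves.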
This is local learning. No layer needs to know anything about any other layer’s weights. No signal needs to traverse the entire network. Each module improves independently by minimizing its own surprise.
The remarkable theoretical result, reported by Millidge, Tschantz, and Buckley in 2020, is that this local process approximates backpropagation: under certain conditions, the weight updates it produces converge to the gradients backprop would compute, without requiring the biologically implausible machinery that backprop demands.
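A toy comparison shows what "approximates" means here. The sketch below uses a scalar chain x0 -> x1 -> output with two weights and unit precisions. That setup is an assumption chosen so the math can be checked by hand, and it is not the construction used in the cited paper. The function names are hypothetical.

```python
def backprop_grads(w1, w2, x0, target):
    """Gradients of L = 0.5 * (y - target)^2 for y = w2 * (w1 * x0)."""
    h = w1 * x0
    y = w2 * h
    dL_dy = y - target
    return dL_dy * w2 * x0, dL_dy * h  # (dL/dw1, dL/dw2)

def predictive_coding_grads(w1, w2, x0, target, lr=0.1, steps=500):
    """Local gradients after settling, with input and output clamped.

    Energy: F = 0.5 * e1^2 + 0.5 * e2^2, where
    e1 = x1 - w1 * x0 and e2 = target - w2 * x1.
    """
    x1 = w1 * x0  # start the hidden value at its feedforward prediction
    for _ in range(steps):
        e1 = x1 - w1 * x0
        e2 = target - w2 * x1
        x1 -= lr * (e1 - w2 * e2)  # gradient descent on F with respect to x1
    e1 = x1 - w1 * x0
    e2 = target - w2 * x1
    # Each weight gradient uses only the error and activity at its own ends
    return -e1 * x0, -e2 * x1  # (dF/dw1, dF/dw2)

w1, w2, x0, target = 0.8, 0.5, 1.0, 1.0
bp = backprop_grads(w1, w2, x0, target)
pc = predictive_coding_grads(w1, w2, x0, target)
```

In this toy case the two methods agree on the direction of every update but not on its size: the predictive coding gradient for `w1` equals the backprop gradient divided by `1 + w2**2`. The local rule recovers the same descent direction without ever passing `w2` to the layer that owns `w1`.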
Why Precision Matters
Not all prediction errors are created equal. Imagine you’re walking through a forest. A rustling in the bushes (unexpected sound) should trigger a very different response than slightly different shading on a leaf (unexpected but irrelevant visual detail).
The brain is thought to solve this with precision weighting, a mechanism that scales each prediction error by its estimated reliability (formally, the inverse of its expected variance). An error on a high-precision channel (I confidently predicted silence but clearly heard a growl) demands immediate model updating. An error on a low-precision channel (the exact shade of green looks slightly off in dim, flickering light) is effectively ignored.
Our Predictive Coding Layer implements precision weighting as a core feature. Each layer doesn’t just compute prediction errors; it also estimates how reliable those errors are. This directly connects to Karl Friston’s Free Energy Principle, which frames all of perception, learning, and action as the minimization of a single quantity: variational free energy, which under Gaussian assumptions reduces roughly to precision-weighted prediction error.
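As a rough illustration, precision weighting can be written as dividing each error by an estimate of its variance. The sketch below assumes a running-average variance estimator. Both function names and the `rate` parameter are hypothetical, and the post does not say how the actual layer estimates precision.

```python
def precision_weighted_errors(errors, variances):
    """Scale each error by its precision (1 / variance)."""
    return [e / v for e, v in zip(errors, variances)]

def update_variance(variance, error, rate=0.05):
    """Running estimate of error variance (an assumed estimator).

    Uses an exponential moving average of the squared error.
    """
    return (1 - rate) * variance + rate * error * error

# Two channels report the same raw error, but with different reliability
raw_errors = [1.0, 1.0]
variances = [0.1, 10.0]  # channel 0 is reliable, channel 1 is noisy
weighted = precision_weighted_errors(raw_errors, variances)

# A channel that keeps producing large errors becomes less trusted over time
noisy_variance = 1.0
for _ in range(50):
    noisy_variance = update_variance(noisy_variance, error=3.0)
```

The same raw error produces a large update on the reliable channel and a small one on the noisy channel. A channel whose errors stay large sees its variance estimate rise, so its future errors count for less.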
The Settling Process: Thinking as Energy Minimization
Standard feedforward networks compute their answer in a single forward pass: input goes in one end, output comes out the other. The computation takes the same amount of time regardless of whether the input is trivial or profoundly ambiguous.
Predictive coding works differently. When input arrives, the network enters a settling process: an iterative exchange of predictions and errors flowing up and down the hierarchy until the system reaches equilibrium. This equilibrium is a (typically local) minimum of free energy, where predictions at every level are as accurate as they can be given the available evidence and the current weights.
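One common way to write the settling dynamics is shown below, as a sketch under assumed notation rather than the exact formulation used in NGI. Here x_l is the activity of layer l, W_l holds the weights with which layer l+1 predicts layer l, f is an activation function, and π_l is a scalar precision for layer l.

```latex
% Assumed notation: layer l+1 predicts layer l through W_l
\begin{align}
\varepsilon_l &= x_l - W_l\, f(x_{l+1}) \\
F &= \tfrac{1}{2} \sum_l \pi_l \,\lVert \varepsilon_l \rVert^2 \\
\dot{x}_l &= -\frac{\partial F}{\partial x_l}
  = -\pi_l\, \varepsilon_l
    + \pi_{l-1}\, f'(x_l) \odot W_{l-1}^{\top} \varepsilon_{l-1}
\end{align}
```

Equilibrium is the point where every \dot{x}_l is zero: the error a layer receives from above is balanced against the error it causes below. The only weights that appear in the update for x_l are W_{l-1}, the ones layer l itself uses to generate its prediction.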
The beautiful consequence: harder problems can automatically receive more computation. A simple, expected input settles in a few iterations. An ambiguous, surprising, or contradictory input requires many more cycles of prediction and correction before the system stabilizes. This is a form of “adaptive computation”, and it emerges naturally from the architecture as long as settling runs until the errors stop changing rather than for a fixed number of steps. No separate halting mechanism is needed.
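The effect is easy to reproduce in a toy model. The sketch below assumes a single hidden value with a prior expectation, a linear prediction of one observation, and a loop that stops once the update falls below a tolerance. The function `settle` and all of its parameters are hypothetical.

```python
def settle(observation, prior, weight=1.0, lr=0.1, tol=1e-4, max_steps=1000):
    """Settle one hidden value x by gradient descent on the energy
    F = 0.5 * (x - prior)^2 + 0.5 * (observation - weight * x)^2.

    Returns the settled value and the number of update steps taken.
    """
    x = prior  # start from what the system expects
    for step in range(max_steps):
        grad = (x - prior) - weight * (observation - weight * x)
        delta = lr * grad
        if abs(delta) < tol:  # equilibrium reached: stop computing
            return x, step
        x -= delta
    return x, max_steps

# An input that matches the expectation versus one that contradicts it
x_expected, steps_expected = settle(observation=1.0, prior=1.0)
x_surprising, steps_surprising = settle(observation=5.0, prior=1.0)
```

The expected input is already at equilibrium and needs no updates. The surprising one takes dozens of steps before it settles on a compromise between the prior and the evidence. Nothing in the loop measures difficulty; the extra computation comes entirely from the stopping rule.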
This mirrors what we observe in human cognition. You recognize a familiar face in a fraction of a second. But stare at an optical illusion, and you can feel your brain working: cycling through competing interpretations, unable to settle, because the prediction errors at multiple levels resist resolution.
What This Foundation Enables
The Predictive Coding Layer is not a complete system — it’s the fundamental building block. Like atoms that combine into molecules and molecules into cells, this layer will be composed into increasingly sophisticated structures:
- Hierarchical predictive processing: Stacks of predictive coding layers forming a cortical-like hierarchy, where each level operates at a different level of abstraction
- Continuous-time dynamics: Layers that evolve according to differential equations, processing information at multiple timescales simultaneously
- Oscillatory binding: Coordination between layers through neural oscillations, enabling the binding of distributed representations into coherent percepts
- Energy-based learning: Weight updates driven entirely by local free energy minimization, with no global loss function and no separate backward pass
Each of these will be the subject of future posts as we build them on this foundation.
The Deeper Significance
What we’ve built is more than a technical component. It represents a philosophical stance: intelligence is not computation in the traditional sense. Intelligence is prediction error minimization.
Every transformer, every LLM, every standard neural network treats intelligence as a function: input → computation → output. Predictive coding treats intelligence as a process: a continuous, never-ending cycle of prediction, surprise, and adaptation. The system doesn’t compute answers. It settles into understanding.
This is, we believe, a fundamentally more accurate model of what brains do. And if the brain’s approach to intelligence is any guide — and 500 million years of evolution suggests it should be — then predictive coding may be the key to building systems that don’t just process information, but genuinely understand the world they inhabit.
The foundation is laid. Now we build upward.