The Uncanny Valley of Neural Networks

Modern neural networks are called “neural,” but that’s mostly marketing. The individual components — the normalization layers, the attention mechanisms, the activation functions, the learning algorithm — share almost nothing with actual neurons. They’re engineering solutions to engineering problems, designed to make gradient descent work, not to model the brain.

Other posts on this blog cover the big ideas behind FENA: predictive coding, free energy minimization, continuous-time dynamics, memory consolidation. This post is different. This is the parts list. We’re going component by component through a standard neural network, explaining what’s biologically broken about each piece, and showing what FENA replaces it with.

Think of it as surgery. The patient is a transformer. We’re removing organs and replacing them with ones that actually belong in a biological body.

LayerNorm: The Normalization That Neurons Never Do

Every modern transformer uses Layer Normalization. The idea is simple: for each input, compute the mean and variance across all neurons in a layer, then normalize. This keeps activations from exploding or collapsing during training. It works beautifully for gradient flow.

It’s also biologically absurd.

LayerNorm requires every neuron in a layer to simultaneously know the activity of every other neuron in that layer. It’s a global computation — you need the mean and variance of the whole population before any single neuron can produce its output. No neural circuit in any brain has this capability. Neurons don’t have a conference call before they fire.
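
To make the global dependency concrete, here is a minimal sketch of Layer Normalization in plain Python. The function name `layer_norm` and the `eps` value are illustrative choices, and the learned gain and bias that real implementations add are left out.

```python
def layer_norm(activations, eps=1e-5):
    """Normalize one layer's activations using layer-wide statistics."""
    n = len(activations)
    mean = sum(activations) / n                           # needs every unit
    var = sum((a - mean) ** 2 for a in activations) / n   # needs every unit
    return [(a - mean) / (var + eps) ** 0.5 for a in activations]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
```

No single output can be computed until `mean` and `var` have been gathered from the whole layer.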

The biological replacement: Divisive Normalization. Instead of global statistics, each neuron divides its own response by the pooled activity of its local neighbors. Neuron fires strongly, but its neighbors are also firing strongly? Response gets suppressed. Neuron fires strongly while neighbors are quiet? Response stays high.

This is one of the most well-documented computations in all of neuroscience. Carandini and Heeger described it as a “canonical neural computation” in 2012 — it shows up everywhere. The retina uses it to adapt to lighting conditions. Primary visual cortex uses it for contrast normalization. The auditory cortex uses it to handle varying loudness. The olfactory system uses it to normalize across odor concentrations. It’s evolution’s favorite trick.

Divisive normalization achieves the same practical goals as LayerNorm — preventing runaway activation, maintaining dynamic range, improving signal quality — but through purely local computation. Each neuron only needs to know about its immediate neighbors, not the entire layer.
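
A rough sketch of the idea, assuming a one-dimensional row of units with non-negative responses. The names `divisive_normalize`, `radius`, and `sigma` are hypothetical, and the pool here is a plain sum over neighbors, which simplifies the exponentiated form used in the neuroscience literature. This is not FENA's actual implementation.

```python
def divisive_normalize(responses, radius=1, sigma=1.0):
    """Divide each unit's response by the pooled activity of nearby units."""
    n = len(responses)
    out = []
    for i, r in enumerate(responses):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        pool = sum(responses[lo:hi])   # local pool, includes the unit itself
        out.append(r / (sigma + pool))
    return out

busy = divisive_normalize([5.0, 5.0, 5.0])    # strong unit, strong neighbors
quiet = divisive_normalize([0.0, 5.0, 0.0])   # strong unit, silent neighbors
```

The middle unit receives the same input in both cases, but its normalized response is lower in `busy` than in `quiet`, and each unit only ever reads its immediate neighborhood.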

FENA also implements a second mechanism: homeostatic plasticity. Over longer timescales (not per-input, but over training time), each neuron adjusts its own excitability to maintain a target firing rate. If a neuron has been too active recently, it becomes less excitable. If it’s been too quiet, it becomes more excitable. This provides slow, stable self-regulation that complements the fast normalization of divisive inhibition.
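
One way to picture this is a single unit with an adjustable gain. The sketch below is an assumption-laden toy: `homeostatic_step`, `target_rate`, and `lr` are made-up names, and it uses the unit's current rate where a real system would use a running average of recent activity.

```python
def homeostatic_step(gain, recent_rate, target_rate=0.1, lr=0.05):
    """Nudge a unit's excitability (gain) toward a target firing rate."""
    return max(0.0, gain + lr * (target_rate - recent_rate))

gain = 1.0
drive = 0.5                # constant input, assumed for illustration
rate_trace = []
for _ in range(500):
    rate = gain * drive    # too active at first: 0.5 against a target of 0.1
    rate_trace.append(rate)
    gain = homeostatic_step(gain, rate)
```

The unit starts out far too active, lowers its own gain step by step, and settles at the target rate using only its own activity history.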

Two local mechanisms replacing one global hack. The result is more stable, more efficient, and actually resembles what’s happening in cortical tissue.

Attention: The Mechanism That Attends to Nothing Like the Brain Does

Transformer attention is one of the most successful innovations in AI history. It’s also one of the most misleadingly named.

“Attention” in a transformer means this: take every position in a sequence, compute a compatibility score with every other position, normalize those scores into weights, and use the weights to create a blended representation. It’s matrix multiplication. It’s a learned soft lookup table. What it isn’t is anything like how the brain selects, prioritizes, and focuses on information.

The problems are fundamental. Standard attention computes a global compatibility matrix between all positions, which is O(n²) in sequence length and requires every position to “see” every other position simultaneously. It is trained by backpropagating through that attention matrix, so it inherits the weight transport problem. And the actual mechanism, a scaled dot-product of learned projections, has no known biological correlate. No neuroscientist has ever found a circuit that computes query-key-value dot products.
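
For reference, here is a bare-bones sketch of scaled dot-product attention in plain Python, without the learned projection matrices, multiple heads, or masking that a real transformer adds. The function names and the tiny three-token input are illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Blend values using softmax-normalized query-key dot products."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Compatibility of this position with every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

The `weights` table has one row and one column per token, which is where the quadratic cost comes from.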

The biological replacement: Precision-Weighted Prediction Errors. In the brain, “attention” isn’t a separate mechanism bolted onto perception. It’s built into the prediction error signaling that drives all cortical processing.

Here’s how it works: every prediction error in the cortical hierarchy carries a precision estimate — a measure of how reliable that error signal is. High precision means “this error is trustworthy, update your model strongly.” Low precision means “this error is probably noise, mostly ignore it.” Attention, in the brain, is the process of adjusting these precision weights.

When you focus your attention on something — a voice in a crowded room, a moving object in your peripheral vision — your brain is cranking up the precision on prediction errors from that source. The errors get amplified. The model updates more aggressively. You perceive more detail. Meanwhile, precision on unattended channels gets turned down. Prediction errors from the background chatter get suppressed. You barely process them.
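
A small sketch of precision weighting, under two assumptions that are not from FENA itself: precision is estimated as the inverse variance of a channel's recent errors, and the update gain is the observation's precision relative to a fixed prior precision (the standard Gaussian form). All names are hypothetical.

```python
from statistics import pvariance

def estimate_precision(recent_errors, eps=1e-6):
    """Precision as the inverse variance of recent prediction errors."""
    return 1.0 / (pvariance(recent_errors) + eps)

def precision_weighted_update(estimate, observation, obs_precision, prior_precision=1.0):
    """Move the estimate toward the observation in proportion to its reliability."""
    error = observation - estimate
    gain = obs_precision / (obs_precision + prior_precision)
    return estimate + gain * error

reliable = estimate_precision([0.1, -0.1, 0.1, -0.1])   # small, consistent errors
noisy = estimate_precision([2.0, -2.0, 2.0, -2.0])      # large, erratic errors

attended = precision_weighted_update(0.0, 1.0, reliable)
ignored = precision_weighted_update(0.0, 1.0, noisy)
```

Both channels report the same error, but the reliable one moves the estimate almost all the way to the observation while the noisy one barely moves it. Nothing in the update compares one channel against another.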

These precision weights are thought to be set by neuromodulators: acetylcholine is proposed to increase the precision of sensory prediction errors (sensory attention), while dopamine is proposed to modulate the precision of reward-related predictions (motivational attention). It’s pharmacological gain control, not matrix multiplication.

The critical advantage: no global compatibility matrix. Each prediction error is locally weighted by its own estimated reliability. The system doesn’t need to compare everything to everything else. It just needs to know, at each point, how much to trust the signal it’s receiving. This is what Feldman and Friston formalized in their 2010 work on attention as precision optimization within the free energy framework.

FENA implements precision weighting directly in its predictive coding hierarchy. Attention isn’t a separate module — it’s an intrinsic property of how prediction errors propagate. The system naturally “attends” to surprising, reliable signals and ignores predictable or noisy ones.

Positional Encodings: The Hack That Replaced Spatial Understanding

Transformers have a peculiar architectural quirk: their core attention mechanism has no built-in sense of order. Strictly speaking it is permutation-equivariant: shuffle the tokens in a sentence, and the raw attention computation produces the same outputs, shuffled the same way. Before positional information is injected, each word in “The cat sat on the mat” gets the same representation it would get in “mat the on sat cat the.”

The standard fix is positional encodings — sinusoidal functions or learned embedding vectors that get added to each token to tell the model “you’re at position 3” or “you’re at position 47.” It works. But it’s a bizarre solution when you think about it: position is treated as a tag bolted onto the data, not as an intrinsic property of the representation.
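
Here is what that tagging looks like for the sinusoidal variant, as a plain-Python sketch. The function name, the eight-dimensional size, and the dummy token vector are illustrative, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_encoding(position, d_model=8):
    """Sine/cosine pairs at geometrically spaced frequencies."""
    enc = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc

token = [0.5] * 8   # stand-in for a token embedding
tagged = [t + p for t, p in zip(token, sinusoidal_encoding(3))]
```

The position vector is computed with no reference to the token's content and is simply added on top of it.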

The brain doesn’t tag things with position numbers. It doesn’t need to, because position is woven into the fabric of neural representation itself.

The biological replacement: Phase-Based Coding and Grid Cells. The brain uses at least two elegant mechanisms for encoding position and sequence.

Phase coding encodes position through timing. In the hippocampus, neurons representing locations fire at specific phases of the ongoing theta oscillation (roughly 4-12 Hz in rodents). As an animal moves through a place field, the neuron fires progressively earlier in each theta cycle, a phenomenon called theta phase precession, first described by O’Keefe and Recce in 1993. Position isn’t a label. It’s a temporal relationship between a spike and an oscillation.
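
As a toy illustration, phase precession can be idealized as a linear map from position within a place field to firing phase. Real precession is noisier and need not span a full cycle. The function names and the 10-to-20 field boundaries below are made up for the example.

```python
def firing_phase(position, field_start, field_end):
    """Theta phase (degrees) of firing for an animal inside a place field."""
    frac = (position - field_start) / (field_end - field_start)
    frac = min(1.0, max(0.0, frac))   # clamp to the field
    return 360.0 * (1.0 - frac)       # late phase on entry, early phase on exit

def decode_position(phase, field_start, field_end):
    """Invert the toy mapping: recover position from firing phase alone."""
    return field_start + (1.0 - phase / 360.0) * (field_end - field_start)

phases = [firing_phase(x, 10.0, 20.0) for x in (10.0, 12.5, 15.0, 17.5, 20.0)]
recovered = decode_position(firing_phase(15.0, 10.0, 20.0), 10.0, 20.0)
```

The phase falls steadily as the animal crosses the field, so a downstream reader can recover position from spike timing relative to the oscillation, with no explicit position tag.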

Grid cells, discovered in 2005 in the lab of May-Britt and Edvard Moser (work that earned them a share of the 2014 Nobel Prize), provide intrinsic coordinate systems. These neurons fire in regular hexagonal patterns as an animal moves through space, creating a spatial map that can be maintained from self-motion cues rather than being read off external landmarks. Different grid cells have different spacings and orientations, forming a multi-scale coordinate system. Computational work has shown that grid-cell-like representations can emerge spontaneously in neural networks trained on spatial navigation tasks, which suggests this isn’t an evolutionary accident but a computationally efficient solution.

For temporal sequences (the closer analog to language), the brain uses slowly drifting temporal context signals. Neural populations in the lateral entorhinal cortex and prefrontal cortex maintain gradually evolving representations that encode “when” — not as a discrete position index, but as a continuously shifting context. Items close in time share similar contexts. Items far apart have different contexts. Sequence information emerges from the dynamics, not from injected tags.
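
A drifting context is easy to simulate. In the sketch below, the context vector keeps most of its previous value at each step and mixes in a little fresh noise. The decay factor `rho`, the 64 dimensions, and the fixed random seed are arbitrary choices for illustration, not measured values.

```python
import math
import random

def drift(context, rng, rho=0.9):
    """One step of slow drift: mostly old context plus a little new noise."""
    mix = math.sqrt(1.0 - rho * rho)
    return [rho * c + mix * rng.gauss(0.0, 1.0) for c in context]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

rng = random.Random(0)
context = [rng.gauss(0.0, 1.0) for _ in range(64)]
history = [context]
for _ in range(30):
    context = drift(context, rng)
    history.append(context)

near = cosine(history[10], history[11])   # one step apart
far = cosine(history[10], history[30])    # twenty steps apart
```

Contexts one step apart stay much more similar than contexts twenty steps apart, so "when" can be read from similarity alone, without an index.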

In FENA, position and sequence information emerge from the continuous-time dynamics and phase relationships between oscillating nodes. There are no positional encoding vectors. Position is intrinsic to the computation — a property of when and how nodes activate relative to each other, not a number added to their output.

Backpropagation: The Specific Local Rules That Replace It

The case against backpropagation has been made elsewhere on this blog and across decades of neuroscience literature. The short version: it requires a global error signal propagated backward through every layer, and the backward pass must use exact copies of the forward weights at each layer (the weight transport problem). The brain doesn’t have the wiring for this.

What hasn’t been covered in detail is exactly which local learning rules FENA uses instead, and why each one matters. This isn’t a single replacement — it’s a toolkit of complementary mechanisms, each addressing a different aspect of learning.

Hebbian learning is the foundation: neurons that fire together wire together. When two connected neurons are active at the same time, their connection strengthens. Simple, local, and well-documented since Hebb proposed it in 1949. But raw Hebbian learning has a fatal flaw — it’s unstable. Connections only get stronger, activity only increases, and the network eventually saturates into useless maximal activation.
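
The instability is easy to reproduce. This sketch assumes a single linear unit with one input and a made-up learning rate. `hebbian_step` is a hypothetical name.

```python
def hebbian_step(w, pre, lr=0.1):
    """Strengthen the weight in proportion to pre- and post-synaptic activity."""
    post = w * pre              # linear unit: output driven by the input
    return w + lr * pre * post

w = 0.1
trace = [w]
for _ in range(50):
    w = hebbian_step(w, pre=1.0)
    trace.append(w)
```

With a constant input the weight grows at every step and never levels off, which is the runaway behavior described above.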

Anti-Hebbian learning provides the necessary counterbalance. While Hebbian learning strengthens connections between co-active neurons, anti-Hebbian learning decorrelates them — ensuring that different neurons learn to represent different things rather than all converging on the same features. This is essential for forming efficient, non-redundant representations.
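
A two-unit sketch of decorrelation, assuming a single lateral inhibitory weight from unit 1 onto unit 2 that grows whenever the two outputs are co-active. The input sequence is hand-built so that half of unit 2's drive is shared with unit 1. The names and constants are illustrative only.

```python
def anti_hebbian_step(w, x1, x2, lr=0.005):
    y1 = x1
    y2 = x2 - w * y1          # lateral inhibition from unit 1 onto unit 2
    w = w + lr * y1 * y2      # co-activity strengthens the inhibition
    return w, y1, y2

x1s = [1.0, -1.0, 2.0, -2.0]
private = [1.0, 1.0, -1.0, -1.0]   # unit 2's own signal, uncorrelated with x1s
x2s = [0.5 * a + p for a, p in zip(x1s, private)]

def output_covariance(w):
    """Summed product of the two outputs over one pass through the inputs."""
    return sum(x1 * (x2 - w * x1) for x1, x2 in zip(x1s, x2s))

w = 0.0
for epoch in range(400):
    for x1, x2 in zip(x1s, x2s):
        w, y1, y2 = anti_hebbian_step(w, x1, x2)

before, after = output_covariance(0.0), output_covariance(w)
```

The inhibitory weight settles near the shared fraction of the input, leaving unit 2 to represent mostly what unit 1 does not.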

Spike-Timing-Dependent Plasticity (STDP) adds directionality. Pure Hebbian learning is symmetric — it doesn’t care about the order of activation. STDP does. If neuron A fires just before neuron B, the A→B connection strengthens (A might be causing B). If A fires just after B, the connection weakens (A isn’t causing B). This millisecond-precision timing rule creates directional, causal learning — the connection strengthens in the direction of information flow.
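
The timing rule is usually written as a pair of exponential windows. The sketch below follows that common textbook form. The amplitudes and the 20 ms time constant are placeholder values, not parameters taken from FENA.

```python
import math

def stdp_delta(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for one pre/post spike pair (times in milliseconds)."""
    dt = t_post - t_pre
    if dt > 0:     # pre fired first: potentiate, more for tighter timing
        return a_plus * math.exp(-dt / tau)
    if dt < 0:     # post fired first: depress
        return -a_minus * math.exp(dt / tau)
    return 0.0

causal = stdp_delta(t_pre=0.0, t_post=5.0)
acausal = stdp_delta(t_pre=5.0, t_post=0.0)
distant = stdp_delta(t_pre=0.0, t_post=50.0)
```

Swapping the order of the two spikes flips the sign of the change, and widening the gap shrinks its size.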

Prediction error modulation scales learning rates locally. Neurons in regions with high prediction error — where the model is failing — learn faster. Neurons in regions with low prediction error — where predictions are already accurate — learn slowly or not at all. This is efficient: don’t waste plasticity on what you already know.
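
In its simplest form this is a learning rate scaled by local error magnitude. The linear ramp and the `saturation` cap in this sketch are assumptions chosen for illustration. The actual scaling function could take many shapes.

```python
def error_modulated_lr(base_lr, local_error, saturation=1.0):
    """Scale plasticity by local prediction error, capped at the base rate."""
    return base_lr * min(1.0, abs(local_error) / saturation)

settled = error_modulated_lr(0.1, local_error=0.0)    # predictions accurate
partial = error_modulated_lr(0.1, local_error=0.5)
failing = error_modulated_lr(0.1, local_error=2.0)    # model is failing here
```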

BCM theory (Bienenstock-Cooper-Munro) prevents saturation through a sliding threshold. Each neuron maintains a dynamic threshold: if recent activity is high, the threshold for potentiation rises (making it harder to strengthen connections). If recent activity is low, the threshold drops (making strengthening easier). This elegantly prevents the runaway excitation problem without requiring any global coordination.
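
A minimal sketch of the sliding threshold, assuming a single linear unit with constant input and a threshold that tracks a running average of the squared output. The rates are picked so the threshold adapts faster than the weight. All names and constants are illustrative.

```python
def bcm_step(w, theta, pre, lr=0.01, theta_rate=0.1):
    post = w * pre
    w = w + lr * pre * post * (post - theta)           # sign set by the threshold
    theta = theta + theta_rate * (post ** 2 - theta)   # threshold tracks recent post^2
    return w, theta

w, theta = 0.5, 0.0
for _ in range(5000):
    w, theta = bcm_step(w, theta, pre=1.0)
```

Under plain Hebbian learning with constant input, the weight keeps growing. Here the rising threshold catches up with it, and both settle at a finite value without any signal from outside the unit.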

These rules aren’t approximations cobbled together to mimic backpropagation. They’re the actual mechanisms evolution discovered over hundreds of millions of years of neural development. Recent theoretical work has shown that predictive coding networks using local Hebbian rules can approximate the results of backpropagation under certain conditions — but the local rules are more fundamental. Backpropagation, if anything, is the approximation.

The Activation Function Nobody Questioned

Deep learning researchers spend enormous effort choosing architectures, optimizers, and training schedules. But the activation function — the nonlinearity applied at each neuron — is usually picked from a short list (ReLU, GELU, SiLU) based primarily on one criterion: how well does it propagate gradients?

ReLU (Rectified Linear Unit) passes positive values unchanged and zeros out negatives. GELU weights each input by the probability that a standard normal variable falls below it, which gives a smooth gate. Both are popular because they train well under backpropagation. Neither captures how a biological neuron’s response depends on its recent history.
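
Both functions fit in a few lines. This sketch uses the exact erf-based form of GELU rather than the tanh approximation some libraries default to.

```python
import math

def relu(x):
    """Pass positive values unchanged, zero out negatives."""
    return max(0.0, x)

def gelu(x):
    """Scale x by the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

same_input_twice = (relu(0.7), relu(0.7))   # no memory: identical outputs
```

Each is a pure function of the current input, so calling it twice with the same value always gives the same answer.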

Real neurons are vastly more complex. They exhibit adaptation — their response to sustained input decreases over time. Present the same stimulus continuously, and a neuron’s firing rate drops. It gets “bored.” This isn’t a bug; it’s a feature. Adaptation makes neurons sensitive to changes rather than absolute values, dramatically improving the efficiency of information coding.

Real neurons also have refractory dynamics — a brief period after firing during which firing again is difficult or impossible. This creates natural rate limits and introduces history-dependence into the neuron’s response. The output doesn’t just depend on the current input; it depends on what happened milliseconds ago.

The biological replacement: FENA uses activation dynamics rather than activation functions. The distinction matters. A function is a static mapping: input in, output out, memoryless. Dynamics are state-dependent: the neuron’s response depends on its recent history, its current adaptation level, and its refractory state.

These properties emerge naturally from FENA’s continuous-time formulation. Because each node evolves according to differential equations rather than being evaluated as a static function, history-dependent behaviors like adaptation and refractoriness arise from the mathematics without being explicitly programmed. The neuron doesn’t need a special “adaptation module” — it’s a natural consequence of having dynamics instead of functions.
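
To show the difference between a function and dynamics, here is a toy unit with one extra state variable, integrated with simple Euler steps. The equation, the time constant, and the explicit adaptation variable `a` are assumptions made for this sketch. FENA's actual equations are not given in this post and may produce adaptation differently.

```python
def relu(x):
    return max(0.0, x)

def simulate_adapting_unit(inputs, dt=1.0, tau=20.0, gain=1.0):
    """Euler-integrate da/dt = (-a + gain * r) / tau with r = relu(input - a)."""
    a = 0.0                    # adaptation state, builds up with activity
    rates = []
    for x in inputs:
        r = relu(x - a)        # response depends on input and on history
        rates.append(r)
        a += dt / tau * (-a + gain * r)
    return rates

rates = simulate_adapting_unit([1.0] * 200)   # constant stimulus
```

The input never changes, yet the response starts high and relaxes to about half its initial value. The same input produces different outputs depending on what came before.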

Why Component-Level Plausibility Matters

It would be easy to dismiss all of this as aesthetics — making things “brain-like” for the sake of it. But each replacement solves a real engineering problem that conventional components handle poorly.

Divisive normalization instead of LayerNorm enables normalization without global coordination — critical for systems that need to scale without centralized computation. It also provides richer normalization that adapts to local context rather than treating the entire layer as a uniform population.

Precision-weighted prediction errors instead of attention eliminate the O(n²) scaling bottleneck and provide a principled mechanism for resource allocation. The system doesn’t need a learned router to decide what’s important — importance is determined by prediction error magnitude and precision. This is more robust, more efficient, and naturally adapts to novel inputs.

Phase-based coding instead of positional encodings means that position and structure are intrinsic to the representation rather than bolted on. The system can handle variable-length sequences, continuous spatial inputs, and multi-scale temporal structure without architectural modifications.

Local learning rules instead of backpropagation eliminate the need for weight transport, enable online learning (no separate training and inference phases), and naturally resist catastrophic forgetting — because learning is local, updating one part of the network doesn’t systematically distort representations elsewhere.

Dynamic activations instead of static functions provide richer computational primitives that naturally encode temporal context and change sensitivity.

The conventional components weren’t designed because they were good models of computation. They were designed because they made backpropagation work. They’re scaffolding around a specific optimization algorithm. When you remove that algorithm — when you replace backpropagation with local learning rules — you can remove the scaffolding too. And what you put in its place turns out to be simpler, more efficient, and more capable.

FENA isn’t a neural network with biological paint. It’s a biological system implemented in silicon. Every component earns its place not by tradition, but by working the way intelligence actually works.