Let’s Have an Honest Conversation

The transformer architecture has conquered AI. From language models to image generators to code assistants, a single architectural idea — attention over sequences — has become the default answer to almost every problem in machine learning. And for good reason.

But here’s a question worth sitting with: is one architecture really the right answer for everything?

We’re building FENA, a brain-inspired alternative. And we think the most productive thing we can do is lay out a genuinely fair comparison — not a sales pitch, not a takedown, but an honest look at what both approaches do well and where each one struggles.

What Transformers Got Right

Credit where it’s due: transformers earned their dominance.

Scaling works. One of the most remarkable discoveries in modern AI is that transformers get reliably, predictably better as you add more data and more compute. Scaling laws give you something close to a recipe — spend X, get Y improvement. That predictability is enormously valuable for organizations making billion-dollar investment decisions.

One architecture, many problems. The same basic transformer design handles natural language, code generation, image understanding, protein structure prediction, and multimodal reasoning. That kind of generality is rare in engineering. You don’t usually get a single tool that works this well across so many domains.

The ecosystem is deep. Tooling, libraries, deployment infrastructure, fine-tuning pipelines, community knowledge — the transformer ecosystem is mature and battle-tested. If you want to ship an AI product today, the path of least resistance runs through transformers, and there’s nothing wrong with choosing a well-paved road.

The attention mechanism is elegant. The ability to relate any token to any other token in a sequence — dynamically, based on learned relationships — is a genuinely beautiful idea. It solved limitations that held back earlier recurrent architectures and unlocked capabilities that surprised even its creators.

They simply work. This matters more than any theoretical argument. Transformers produce results that are useful, impressive, and commercially viable across a staggering range of applications. Empirical success is the hardest thing to argue with, and we’re not going to try.

The Cracks in the Foundation

But architectural dominance doesn’t mean architectural perfection. Some of these limitations are well-known; others are only now becoming clear as we push transformers further.

Compute scales poorly with context. The standard attention mechanism is quadratic in sequence length, so doubling your context window roughly quadruples the cost of attention. There are workarounds (sparse attention, sliding windows, various approximations), but they’re patches on a fundamental constraint, not solutions to it.
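To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It counts only the query-key scores in one full self-attention head and ignores everything else in the model; the function name is ours, purely for illustration.

```python
def attention_score_entries(seq_len: int) -> int:
    """Number of query-key scores one full self-attention head computes.

    Every token attends to every token, so the score matrix is
    seq_len x seq_len.
    """
    return seq_len * seq_len


# Each doubling of the context multiplies the score count by four.
for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_score_entries(n):>12,} scores")
```

The absolute numbers matter less than the ratio between rows: the context doubles, the attention work quadruples.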

No world model, just statistics. Transformers are extraordinary pattern matchers, but they are trained to predict the next likely token from statistical regularities in training data, not to maintain an explicit model of how the world works. This helps explain why they hallucinate with total confidence: nothing in that objective provides an internal consistency check, or a model of reality to flag that something doesn’t make sense.

Training concentrates power. When a single training run costs tens of millions of dollars and requires thousands of specialized GPUs, only a handful of organizations can play the game. This isn’t just an economic concern — it shapes what gets built, who benefits, and whose values get embedded into these systems.

Frozen at inference. Once a transformer is trained, its weights are fixed. It can condition on whatever is in its context window, but it can’t update what it knows from the conversation it’s having with you right now, and anything it picked up in context is gone once that context is. Beyond the prompt, it draws from what it learned during training and nothing else. Techniques like fine-tuning and RAG help, but they’re external scaffolding, not native learning.

Fixed compute per token. A transformer spends essentially the same computational effort generating the word “the” as it does working through a subtle logical deduction. Short of generating more tokens, it can’t allocate more thinking time to harder problems: every token gets the same-sized forward pass, regardless of difficulty.

FENA: A Different Bet

FENA isn’t an incremental improvement on transformers. It’s a fundamentally different wager about how intelligence should be structured — one inspired by how biological brains actually process information.

Where transformers allocate fixed computation per token, FENA settles dynamically. Its energy-based processing continues until the network reaches a stable state. Simple inputs resolve quickly; complex ones get more processing time automatically. The architecture adapts its compute budget to the problem, rather than forcing every problem through the same fixed pipeline.
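The idea of settling can be illustrated with a toy example. This is not FENA’s code: the quadratic energy, the step size, the tolerance, and the name settle are all assumptions we picked to keep the sketch small. What it shows is the general pattern of relaxing a state downhill on an energy function and stopping once the updates become negligible, so the number of steps depends on the input rather than being fixed in advance.

```python
def settle(target, step=0.5, tol=1e-6, max_steps=10_000):
    """Relax a state on the toy energy E(s) = 0.5 * sum((s - target)^2).

    Returns the settled state and the number of update steps it took.
    The energy function is an illustrative stand-in, not FENA's.
    """
    state = [0.0] * len(target)
    for steps in range(1, max_steps + 1):
        # Gradient of the toy energy with respect to each state variable.
        grad = [s - t for s, t in zip(state, target)]
        update = [-step * g for g in grad]
        state = [s + u for s, u in zip(state, update)]
        # Stop once the largest update is below the tolerance.
        if max((abs(u) for u in update), default=0.0) < tol:
            return state, steps
    return state, max_steps


# An input close to the resting state settles in fewer steps
# than one that is far from it.
_, easy_steps = settle([0.1])
_, hard_steps = settle([100.0])
print("easy input steps:", easy_steps)
print("hard input steps:", hard_steps)
```

In this toy, "difficulty" is just distance from the starting state. The point is only that the stopping rule, not a fixed layer count, decides how much work gets done.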

Where transformers freeze after training, FENA learns continuously. Local learning rules — inspired by how biological synapses strengthen and weaken — allow the network to update its understanding during inference. It doesn’t need a separate training phase to incorporate new information. Learning and reasoning aren’t separate modes; they’re the same process.
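For readers who haven’t met local learning rules before, here is a textbook Hebbian update with weight decay. We are using it as a generic illustration of what "local" means, not as a description of FENA’s actual rule; the function name, learning rate, and decay value are hypothetical. Each weight changes using only the activity of the two units it connects, with no global backward pass.

```python
def hebbian_update(weights, pre, post, lr=0.1, decay=0.01):
    """One Hebbian step with decay: units that are active together
    strengthen their connection; unused connections slowly fade.

    weights[i][j] connects presynaptic unit i to postsynaptic unit j.
    Each update reads only pre[i], post[j], and the weight itself.
    """
    return [
        [w + lr * pre_i * post_j - decay * w for w, post_j in zip(row, post)]
        for row, pre_i in zip(weights, pre)
    ]


weights = [[0.0, 0.0], [0.0, 0.0]]
# Unit 0 on each side is active; unit 1 on each side is silent.
weights = hebbian_update(weights, pre=[1.0, 0.0], post=[1.0, 0.0])
print(weights)
```

Only the connection between the two co-active units changes. Because the rule needs nothing beyond local activity, it can in principle run while the network is being used, which is the property the paragraph above relies on.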

Where transformers predict the next token, FENA builds a world model. Its predictive coding foundation means the network is constantly generating expectations about its inputs and updating itself based on prediction errors. This isn’t pattern matching — it’s modeling. When predictions and reality diverge, the network knows something is wrong and adjusts accordingly.
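A heavily simplified sketch of that loop is below. It tracks a single scalar expectation, which is far smaller than anything a real predictive coding network does, and the names, learning rate, and surprise threshold are our own assumptions. It shows the two behaviors described above: a large prediction error flags that something has changed, and the expectation then moves toward the new reality.

```python
def predictive_step(expectation, observation, lr=0.2):
    """Compare an expectation with an observation and nudge the
    expectation toward the observation by a fraction of the error."""
    error = observation - expectation
    return expectation + lr * error, error


def run(observations, lr=0.2, surprise_threshold=1.0):
    """Process a stream, recording whether each prediction error
    exceeded the (hypothetical) surprise threshold."""
    expectation = 0.0
    flags = []
    for obs in observations:
        expectation, error = predictive_step(expectation, obs, lr)
        flags.append(abs(error) > surprise_threshold)
    return expectation, flags


# The input sits at 0, then jumps to 5 and stays there.
stream = [0.0] * 5 + [5.0] * 31
final_expectation, flags = run(stream)
print("surprised before the jump:", any(flags[:5]))
print("surprised at the jump:", flags[5])
print("surprised at the end:", flags[-1])
```

The error signal does double duty here: it is both the alarm and the learning signal.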

Where transformers need data center hardware, FENA targets consumer devices. The architecture is designed from the ground up to run efficiently on the hardware people already own. No API keys, no subscriptions, no renting intelligence from someone else’s cloud.

And where transformers use quadratic attention to bind information across a sequence, FENA uses oscillatory binding — a mechanism inspired by how biological neurons synchronize through rhythmic firing patterns. It’s a different solution to the same problem, with fundamentally different scaling properties.
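To give a feel for synchronization as a binding signal, here is a sketch using the Kuramoto model, a standard textbook model of coupled oscillators. We are swapping it in for illustration only: it is not FENA’s binding mechanism, the parameter values are arbitrary, and this all-to-all toy says nothing about how FENA scales. It shows oscillators with scattered phases pulling each other into step, with the order parameter measuring how synchronized the group is (near 0 for scattered, 1 for fully in phase).

```python
import math


def order_parameter(phases):
    """Synchrony measure: length of the mean unit vector of the phases."""
    n = len(phases)
    re = sum(math.cos(p) for p in phases) / n
    im = sum(math.sin(p) for p in phases) / n
    return math.hypot(re, im)


def kuramoto_step(phases, coupling=2.0, dt=0.05, natural_freq=1.0):
    """One Euler step of the Kuramoto model with identical frequencies."""
    n = len(phases)
    new_phases = []
    for p_i in phases:
        # Each oscillator is pulled toward the phases of the others.
        pull = sum(math.sin(p_j - p_i) for p_j in phases) / n
        new_phases.append(p_i + dt * (natural_freq + coupling * pull))
    return new_phases


def simulate(phases, steps=400, **kwargs):
    for _ in range(steps):
        phases = kuramoto_step(phases, **kwargs)
    return phases


start = [0.0, 0.5, 1.0, 1.5]
end = simulate(start)
print("synchrony before:", round(order_parameter(start), 3))
print("synchrony after: ", round(order_parameter(end), 3))
```

In binding-by-synchrony accounts, units that fire in phase are treated as belonging together, so "which things are related" is carried by timing rather than by a table of pairwise scores.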

Where FENA Is Still Behind

Honesty cuts both ways. FENA has real limitations that we’d be foolish to gloss over.

It hasn’t proven itself at scale. Transformers have been tested at billions of parameters, trillions of tokens, across thousands of applications. FENA is early-stage. The architecture is principled, the theory is sound, but large-scale empirical validation is still ahead of us. Scaling properties that look good in theory need to be confirmed in practice.

The ecosystem is nascent. There’s no equivalent of Hugging Face for FENA. No pre-trained model zoo, no community-maintained fine-tuning scripts, no Stack Overflow threads about common gotchas. Anyone working with FENA today is pioneering, with all the friction that implies.

Best practices don’t exist yet. Transformers benefit from years of accumulated knowledge about learning rates, batch sizes, regularization strategies, and architecture variants. FENA is still in the phase where fundamental questions about training methodology are being answered for the first time.

Some capabilities are unproven. Transformers have demonstrated impressive in-context learning, instruction following, and chain-of-thought reasoning. FENA’s theoretical framework suggests these capabilities should emerge, but “should” and “does” are separated by a lot of engineering work.

Why Architectural Diversity Matters

Step back from the comparison for a moment and consider the bigger picture.

Biology doesn’t use one neural architecture for every organism. Insect brains, octopus brains, and mammalian brains solve intelligence differently — because different environments and constraints demand different solutions. Monoculture is fragile, whether you’re talking about agriculture, financial systems, or AI architectures.

The transformer monoculture has real costs. When everyone builds on the same foundation, everyone inherits the same blind spots. The same failure modes propagate everywhere. And the incentive to explore genuinely different approaches — approaches that might solve problems transformers can’t — dries up.

Competition drives innovation. The existence of viable alternatives forces the dominant paradigm to improve, and sometimes reveals that the dominant paradigm was the wrong tool for certain jobs all along. The best future AI systems may well combine ideas from both approaches — transformer-style attention for some problems, brain-inspired dynamics for others.

Diversity isn’t just nice to have. It’s how complex systems stay healthy.

What This Means for You

If you’re a developer or researcher, we’re not asking you to abandon transformers. They’re powerful, proven, and often the right choice for the problem in front of you today.

What we are saying is that the landscape is wider than it looks. Brain-inspired architectures like FENA represent a genuinely different approach to intelligence — one that makes different tradeoffs, solves different problems naturally, and opens possibilities that the transformer paradigm structurally can’t reach.

We’re building FENA in the open because we think the AI community deserves options. Not promises — working alternatives that people can evaluate, test, and build on for themselves.

The best architecture is the one that solves your problem. We’re working to make sure transformers aren’t the only one available.

The Sulphur Team