Why a Roadmap Matters
Most AI labs follow the same playbook: train the biggest model you can afford, benchmark it against everything, ship whatever sticks. There’s no public theory of how capabilities should emerge, no explicit ordering, no deliberate progression. Scale is the strategy. Hope is the roadmap.
FENA takes a different path. We believe intelligence is structured — that higher capabilities genuinely depend on lower ones, that you can’t reason about causes if you can’t perceive the world, and that you can’t solve novel problems if you can’t plan or reflect on your own thinking. This isn’t just philosophy. It’s an engineering constraint. Build the foundation wrong, and the upper floors collapse.
That’s why FENA follows a deliberate sixteen-milestone capability ladder. Each milestone represents a genuine cognitive capability — not a benchmark score, not a marketing feature, but a functional ability that serves as a prerequisite for the next. The ladder starts with basic text comprehension and climbs, rung by rung, to ARC-AGI-3: a benchmark specifically designed to resist pattern matching and demand genuine general intelligence.
The Ladder
The sixteen milestones aren’t arbitrary. They fall into three natural tiers that reflect the progression from perception to cognition to mastery:
Foundation (Milestones 1–6) covers perception and multimodal integration — the system’s ability to take in the world through text, images, audio, and video, and to fuse those streams into a unified understanding.
Cognition (Milestones 7–11) covers higher reasoning — abstract thought, planning, self-directed learning, causal inference, and the ability to monitor and evaluate its own thinking.
Mastery (Milestones 12–16) covers the capabilities that emerge when perception and cognition work together at scale — creative generation, tool use, complex problem solving, autonomous self-improvement, and ultimately the kind of flexible general intelligence that ARC-AGI-3 is designed to measure.
Each tier builds on the last. You can’t skip tiers. You can’t fake the lower rungs by scaling the upper ones. The ladder is the plan.
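The gating rule can be made concrete with a small sketch. It assumes each milestone requires the one directly below it, and adds only the cross-dependencies this post states explicitly (Milestone 12 on 7 and 8; Milestone 13 on 8, 10, and 11). The names `EXTRA_PREREQS`, `prerequisites`, and `unlocked` are hypothetical and are not part of FENA.

```python
# Illustrative sketch only: names and structure are assumptions, not FENA code.
MILESTONES = {
    1: "Text Understanding",
    2: "Conversational Ability",
    3: "Image Perception",
    4: "Audio Processing",
    5: "Video Understanding",
    6: "Full Multimodal Integration",
    7: "Abstract Reasoning",
    8: "Planning and Prediction",
    9: "Self-Directed Learning",
    10: "Causal Reasoning",
    11: "Meta-Cognition",
    12: "Creative Generation",
    13: "Tool Use",
    14: "Multi-Step Problem Solving",
    15: "Autonomous Learning and Adaptation",
    16: "ARC-AGI-3",
}

# Cross-dependencies named in the milestone descriptions.
EXTRA_PREREQS = {12: {7, 8}, 13: {8, 10, 11}}


def prerequisites(n):
    """Milestone n needs the rung below it plus any stated cross-dependencies."""
    chain = {n - 1} if n > 1 else set()
    return chain | EXTRA_PREREQS.get(n, set())


def unlocked(completed):
    """Milestones not yet completed whose prerequisites are all completed."""
    return [n for n in MILESTONES
            if n not in completed and prerequisites(n) <= completed]


print(unlocked(set()))       # [1]
print(unlocked({1, 2, 3}))   # [4]
print(unlocked({1, 2, 12}))  # [3]
```

The last call shows the "no skipping" rule: marking Milestone 12 as done does not open Milestone 13, because its lower prerequisites are still missing.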
Foundation — Milestones 1 Through 6
Milestone 1: Text Understanding
Every intelligence starts somewhere, and for FENA, it starts with language. But not language as token prediction — language as comprehension. The goal at this milestone is for the system to build genuine internal representations of written text through predictive coding, where understanding emerges from the iterative process of predicting, encountering errors, and refining mental models.
This is a different standard than “generate plausible text.” A system can generate fluent paragraphs while understanding nothing. Milestone 1 requires that the system develop internal representations that capture meaning — relationships between concepts, logical structure, implicit assumptions. The predictive coding hierarchy is the engine here: the system doesn’t just read text, it builds a layered model of what the text is about.
Success at this stage means strong performance on standard natural language understanding benchmarks — not through memorization or surface-level pattern matching, but through representations that generalize to unseen text and novel phrasings.
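The predict, compare, refine loop behind predictive coding can be shown in miniature. This is a toy sketch under strong assumptions: a single scalar latent, a fixed linear generative model (`weight * latent`), and plain gradient steps on the squared prediction error. The function name `infer_latent` and all parameter values are hypothetical; FENA's actual hierarchy is not specified here.

```python
# Toy predictive-coding inference loop (illustrative; not FENA's implementation).
def infer_latent(observation, weight, steps=50, lr=0.1):
    """Refine a latent so that weight * latent predicts the observation."""
    latent = 0.0
    errors = []
    for _ in range(steps):
        prediction = weight * latent       # top-down prediction
        error = observation - prediction   # bottom-up prediction error
        errors.append(abs(error))
        # Gradient step on 0.5 * error**2 with respect to the latent.
        latent += lr * weight * error
    return latent, errors


latent, errors = infer_latent(observation=3.0, weight=2.0)
print(round(latent, 4))       # 1.5
print(errors[0], errors[-1])  # the error shrinks from 3.0 toward zero
```

The latent settles where the prediction matches the input (here 3.0 / 2.0), which is the sense in which the representation is "explained" rather than memorized.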
Milestone 2: Conversational Ability
Understanding a single passage is one thing. Maintaining coherent, contextually aware dialogue across multiple turns is something else entirely. Milestone 2 requires the system to track the evolving state of a conversation — what’s been said, what’s been implied, what the speaker’s goals are, and how the context shifts as the dialogue progresses.
This is where working memory becomes critical. The system must hold a dynamic representation of the conversation’s state, update it with each turn, and use it to generate responses that are not just locally coherent but globally consistent. It’s the difference between answering each message in isolation and actually following the thread.
Success means sustained, multi-turn dialogue where the system maintains context, resolves ambiguities, and tracks the conversation’s trajectory — without losing the thread or contradicting itself five turns in.
Milestone 3: Image Perception
With language in place, the system opens its eyes. Milestone 3 introduces visual understanding through JEPA-based encoding — the system learns abstract visual representations by predicting masked regions of images in latent space, not by reconstructing pixels. This distinction matters: pixel reconstruction forces the system to model irrelevant details (exact textures, lighting conditions), while latent prediction forces it to capture the abstract structure of what it’s seeing.
The goal isn’t image classification. It’s scene understanding — recognizing objects, understanding spatial relationships, inferring context, and building the kind of rich visual representation that supports downstream reasoning. A system that can label an image “dog” but can’t tell you the dog is sitting on a porch looking at a squirrel has missed the point.
Success means accurate, structured scene understanding that goes beyond classification to capture relationships, context, and visual semantics.
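The difference between latent prediction and pixel reconstruction can be illustrated with a toy example. The assumptions are heavy and stated plainly: a "patch" is one shared structure value plus random per-pixel texture, the encoder is a hand-written mean (a real JEPA learns its encoder and predictor), and the predictor simply reuses the context latent. All names are hypothetical.

```python
import random


def make_patch(structure, rng, size=64, texture=0.5):
    """A patch is a shared structure value plus independent per-pixel texture."""
    return [structure + rng.uniform(-texture, texture) for _ in range(size)]


def encode(patch):
    """Hypothetical fixed encoder: the mean keeps structure and drops texture."""
    return sum(patch) / len(patch)


rng = random.Random(0)
context = make_patch(1.0, rng)  # visible region
target = make_patch(1.0, rng)   # masked region: same structure, new texture

predicted_latent = encode(context)  # trivial predictor: reuse the context latent

# Latent-space objective: compare abstract codes only.
latent_loss = (predicted_latent - encode(target)) ** 2

# Pixel-space objective: the same structural guess is also charged for
# every pixel of texture it could not have known.
pixel_prediction = [predicted_latent] * len(target)
pixel_loss = sum((p - t) ** 2 for p, t in zip(pixel_prediction, target)) / len(target)

print(latent_loss < pixel_loss)  # True
```

In this setup the pixel loss equals the latent loss plus the texture variance of the target, so a pixel objective keeps pushing the model to explain detail that carries no structure.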
Milestone 4: Audio Processing
Intelligence doesn’t live in text and images alone. Milestone 4 adds auditory perception — the system processes speech and environmental sound as continuous streams, not as transcribed text fed into a language model. This is a crucial distinction: real audio understanding means processing temporal dynamics, tone, rhythm, emphasis, and environmental context directly from the waveform.
The continuous-time dynamics engine is essential here. Audio is inherently temporal, and the system must track patterns that unfold over milliseconds (phonemes) to seconds (words, sentences) to minutes (conversations, musical pieces). The multi-timescale processing that emerges from neural ODE dynamics is tailor-made for this.
Success means robust speech recognition and environmental sound understanding — processing audio natively, not as a detour through text transcription.
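Multi-timescale processing can be sketched with a handful of leaky integrators that share one input but have different time constants. This is an assumption-laden stand-in: the dynamics are a fixed linear ODE, dx/dt = (drive - x) / tau, integrated with Euler steps, whereas a neural ODE would learn its dynamics. The function name `simulate` and the time constants are hypothetical.

```python
def simulate(taus, duration=0.1, dt=0.001, drive=1.0):
    """Euler-integrate dx/dt = (drive - x) / tau for each time constant."""
    states = [0.0 for _ in taus]
    for _ in range(round(duration / dt)):
        states = [x + dt * (drive - x) / tau for x, tau in zip(states, taus)]
    return states


# Time constants of 10 ms, 100 ms, and 1 s responding to the same 100 ms input.
fast, medium, slow = simulate(taus=[0.01, 0.1, 1.0])
print(fast, medium, slow)
```

After 100 ms the fast unit has almost fully tracked the input, the medium unit is partway there, and the slow unit has barely moved, so the same signal is represented at three temporal resolutions at once.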
Milestone 5: Video Understanding
Video combines the challenges of visual perception and temporal processing. The system must track objects, events, and changes as they unfold over time — understanding not just what’s in each frame but how the scene evolves, what’s happening, and what might happen next.
This milestone is where the continuous-time dynamics engine truly proves its value. Video understanding requires modeling at multiple temporal scales simultaneously: fast changes (a hand reaching for a cup), medium dynamics (a person walking across a room), and slow context (a meeting that unfolds over minutes). The system must integrate all of these into a coherent representation of what’s going on.
Success means understanding sequences of events in video — tracking objects, recognizing actions, inferring goals, and maintaining a temporally coherent model of the scene.
Milestone 6: Full Multimodal Integration
The first five milestones build separate perceptual channels. Milestone 6 fuses them. This isn’t a simple concatenation — “text features plus image features plus audio features” — it’s genuine multimodal cognition, where the system maintains a unified world state that integrates information from all modalities simultaneously.
This is one of the hardest milestones in the foundation tier. Current AI systems that claim multimodal capability are typically language models with vision adapters bolted on — text is the primary modality, and everything else is translated into text before it can be processed. FENA’s architecture treats all modalities as first-class inputs to a shared world model. The oscillatory binding mechanism is what makes this possible: different modalities are bound together through phase synchronization, creating unified percepts that are genuinely multimodal rather than sequentially processed.
Success means coherent understanding that draws on visual, auditory, and textual information simultaneously — not three separate answers stitched together, but one integrated understanding.
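Phase synchronization itself is easy to demonstrate with two coupled oscillators. The sketch below uses the standard Kuramoto model as a stand-in; whether FENA's oscillatory binding works this way is an assumption, and the frequencies, coupling strength, and function name `phase_gap` are all hypothetical.

```python
import math


def phase_gap(coupling, steps=20000, dt=0.001):
    """Simulate two Kuramoto oscillators and return their final phase difference."""
    theta_a, theta_b = 0.0, 2.0
    omega_a, omega_b = 10.0, 10.5  # slightly different natural frequencies
    for _ in range(steps):
        pull = coupling * math.sin(theta_b - theta_a)
        theta_a, theta_b = (theta_a + dt * (omega_a + pull),
                            theta_b + dt * (omega_b - pull))
    return theta_b - theta_a


locked = phase_gap(coupling=1.0)    # coupled: the gap settles to a constant
drifting = phase_gap(coupling=0.0)  # uncoupled: the gap keeps growing
print(locked, drifting)
```

With coupling, the two phases lock at a fixed offset despite their different natural frequencies; without it, they drift apart indefinitely. A constant offset of this kind is the signal a binding mechanism could use to mark two streams as belonging to the same percept.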
Cognition — Milestones 7 Through 11
Milestone 7: Abstract Reasoning
With a solid perceptual foundation, the system begins to think. Milestone 7 targets abstract reasoning — the ability to recognize patterns, draw logical inferences, and identify structural similarities across different domains. This isn’t about answering logic puzzles that appeared in the training data. It’s about genuine generalization: seeing a pattern in one context and recognizing the same structure in a completely different context.
The reasoning core drives this capability — but the key insight is that abstract reasoning depends on the quality of the representations built in the foundation tier. You can’t reason abstractly about things you perceive poorly. The layered, predictive representations from milestones 1 through 6 provide the raw material that the reasoning engine operates on.
Success means strong performance on abstract reasoning benchmarks that require genuine generalization — tasks where memorization and surface-level pattern matching demonstrably fail.
Milestone 8: Planning and Prediction
Reasoning about the present is milestone 7. Reasoning about the future is milestone 8. Planning requires the system to imagine multiple possible futures, evaluate their likelihood and desirability, and select a course of action, all before taking a single step. This is world-model reasoning: using an internal model of how things work to simulate what would happen if a given action were taken.
The RSSM-based world model is the architectural backbone here. The system maintains a latent state that represents its current understanding of the world, and it can roll that state forward to predict what will happen next — under different actions, different assumptions, different contingencies. Planning becomes a search through imagined futures.
Success means multi-step plans that account for contingencies, adapt to changing conditions, and demonstrate genuine forward thinking rather than reactive pattern matching.
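"Planning as a search through imagined futures" can be sketched in a few lines. The assumptions matter here: a hand-written deterministic transition function stands in for a learned RSSM latent model (which would be stochastic and recurrent), the world is a number line, and the search is exhaustive. The names `world_model`, `reward`, and `plan` are hypothetical.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # step left, stay, step right


def world_model(state, action):
    """Hypothetical stand-in for a learned latent transition model."""
    return state + action


def reward(state, goal):
    """Closer to the goal is better."""
    return -abs(goal - state)


def plan(state, goal, horizon=3):
    """Roll every action sequence forward in imagination and keep the best one."""
    best_return, best_plan = float("-inf"), None
    for candidate in product(ACTIONS, repeat=horizon):
        imagined, total = state, 0.0
        for action in candidate:
            imagined = world_model(imagined, action)  # no real step is taken
            total += reward(imagined, goal)
        if total > best_return:
            best_return, best_plan = total, candidate
    return best_plan


print(plan(state=0, goal=2))   # (1, 1, 0)
print(plan(state=0, goal=-3))  # (-1, -1, -1)
```

Exhaustive search only works for tiny action spaces and short horizons; it is used here because it makes the "imagine, score, choose" structure easy to see.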
Milestone 9: Self-Directed Learning
Most AI systems are passive learners — they learn from whatever data they’re fed. Milestone 9 makes the system an active learner. Driven by curiosity (operationalized as active inference — seeking out experiences that maximally reduce uncertainty), the system doesn’t just learn from what it encounters. It decides what to learn next.
This is a profound shift. A self-directed learner can identify gaps in its own knowledge, seek out information that fills those gaps, and prioritize learning experiences based on their expected information gain. It’s the difference between a student who reads whatever textbook is handed to them and a student who identifies their weaknesses and goes to the library.
Success means autonomous exploration that leads to measurable capability improvements without human guidance — the system gets better on its own because it’s actively seeking out the experiences it needs.
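Choosing what to learn next by expected information gain can be shown with a small worked example. The setup is an assumption chosen for clarity: the learner holds a uniform belief over eight hypotheses and may ask one yes/no question of the form "is it below this threshold?". The function names are hypothetical, and this is ordinary Bayesian experimental design, not a claim about how FENA implements active inference.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(belief, threshold):
    """belief maps hypothesis -> probability; the query asks 'is it < threshold?'."""
    yes = {h: p for h, p in belief.items() if h < threshold}
    no = {h: p for h, p in belief.items() if h >= threshold}
    gain = entropy(belief.values())
    for branch in (yes, no):
        p_answer = sum(branch.values())
        if p_answer > 0:
            posterior = [p / p_answer for p in branch.values()]
            gain -= p_answer * entropy(posterior)  # expected remaining uncertainty
    return gain


belief = {h: 1 / 8 for h in range(8)}
queries = range(1, 8)
best = max(queries, key=lambda t: expected_information_gain(belief, t))
print(best)  # 4
```

The question that splits the hypotheses in half is worth a full bit, while a lopsided question such as "is it below 1?" is worth much less. A self-directed learner ranks candidate experiences the same way.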
Milestone 10: Causal Reasoning
Correlation is not causation. Every statistics student learns this, but current AI systems rarely internalize it. Milestone 10 targets causal reasoning: the ability to understand cause-and-effect relationships, answer counterfactual questions (“what would have happened if…”), and distinguish genuine causal mechanisms from mere statistical associations.
This is where the Contrastive Causal Discovery mechanism earns its place. The system learns to identify causal structure by comparing scenarios, isolating variables, and testing interventions in its world model. This isn’t just a nice-to-have — causal reasoning is the foundation of scientific thinking, practical problem-solving, and robust generalization. A system that only knows correlations will fail catastrophically in novel environments where the correlations change but the causes don’t.
Success means correct causal inference on novel scenarios — identifying causes, predicting the effects of interventions, and answering counterfactual questions that require genuine understanding of mechanism.
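The gap between correlation and intervention can be reproduced in a small simulation. The setup is an assumption chosen to make the point: a hidden variable `z` drives both `x` and `y`, and `x` has no effect on `y` at all. Setting `x` by an external coin flip (an intervention) breaks the link that passive observation suggests. This is a generic confounding demo, not the Contrastive Causal Discovery mechanism, whose details are not given here.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def sample(rng, intervene=False):
    z = rng.gauss(0, 1)  # hidden common cause
    # Observationally x follows z; under intervention x is set independently.
    x = rng.gauss(0, 1) if intervene else z + rng.gauss(0, 0.3)
    y = z + rng.gauss(0, 0.3)  # y depends on z, never on x
    return x, y


rng = random.Random(0)
observed = [sample(rng) for _ in range(5000)]
intervened = [sample(rng, intervene=True) for _ in range(5000)]

obs_corr = pearson(*zip(*observed))
do_corr = pearson(*zip(*intervened))
print(obs_corr, do_corr)
```

The observational correlation is strong, while the interventional one is close to zero. A system that learned only the first number would wrongly predict that changing `x` changes `y`.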
Milestone 11: Meta-Cognition
The most underrated capability in the ladder. Meta-cognition means the system can monitor its own thinking — it knows what it knows, knows what it doesn’t know, and adjusts its behavior accordingly. When it’s confident, it acts decisively. When it’s uncertain, it seeks more information, considers alternatives, or signals its uncertainty honestly.
This is the prerequisite for trustworthy AI. A system without meta-cognition will give wrong answers with supreme confidence and right answers with unnecessary hedging. A system with meta-cognition calibrates: its confidence tracks its actual accuracy, its uncertainty signals are meaningful, and its behavior adapts to the reliability of its own reasoning.
Success means well-calibrated confidence scores, appropriate uncertainty signaling, and demonstrable behavioral adaptation based on self-assessment — the system acts differently when it’s sure versus when it’s guessing.
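"Confidence tracks accuracy" has a standard measurement: expected calibration error (ECE), which bins predictions by stated confidence and compares each bin's average confidence to its actual accuracy. The sketch below is a minimal version with equal-width bins; the function name and the example data are hypothetical, and the post does not say which calibration metric FENA will use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_confidence - accuracy)
    return ece


# Overconfident: claims 90% but is right half the time.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
# Calibrated: claims 80% and is right 8 times out of 10.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
print(overconfident, calibrated)
```

The overconfident system scores about 0.4 and the calibrated one scores about 0, which is the kind of gap this milestone's success criterion is asking to close.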
Mastery — Milestones 12 Through 16
Milestone 12: Creative Generation
Creativity isn’t randomness, and it isn’t remix. Milestone 12 targets genuine novelty — the ability to generate outputs across modalities (text, images, audio, structured data) that are original, coherent, and grounded in world understanding. The system doesn’t create by randomly recombining elements from its training data. It creates by understanding the space of possibilities and exploring regions that are meaningful but unexplored.
This milestone depends heavily on the world model (milestone 8) and abstract reasoning (milestone 7). Genuine creativity requires understanding what’s possible, what’s interesting, and what’s coherent — then finding something that satisfies all three constraints while being genuinely new.
Success means outputs that human evaluators judge as creative and original — not because they’re random or surprising, but because they demonstrate understanding and imagination working together.
Milestone 13: Tool Use
No intelligence operates in a vacuum. Milestone 13 gives the system the ability to extend its own capabilities by learning to use external tools, APIs, and environments. This isn’t about hardcoding API calls — it’s about the system understanding what tools are available, what they can do, and when to use them to accomplish goals it couldn’t achieve with its built-in capabilities alone.
Tool use requires planning (milestone 8), understanding cause and effect (milestone 10), and meta-cognition (milestone 11 — knowing when your own capabilities are insufficient). It’s a natural integration point for the cognitive milestones: the system must reason about its own limitations, plan sequences of tool interactions, and adapt when tools don’t behave as expected.
Success means autonomous tool selection and use for novel tasks — the system figures out which tool to use and how, without being told.
Milestone 14: Multi-Step Problem Solving
This is where everything comes together. Milestone 14 targets complex, multi-stage tasks that require planning, execution, monitoring, and adaptation, all in concert. Think of a task like “research a topic, synthesize findings, write a report, and revise it based on feedback.” Each step requires different capabilities, the output of each step constrains the next, and things can go wrong at any point, forcing the system to replan.
Multi-step problem solving is the integration test for the entire architecture. It demands perception (understanding the problem), reasoning (figuring out an approach), planning (sequencing steps), tool use (leveraging external resources), meta-cognition (monitoring progress and catching errors), and adaptation (replanning when things don’t go as expected).
Success means end-to-end completion of complex real-world tasks that require sustained, coherent effort across multiple stages — without hand-holding, without step-by-step instructions, and with graceful recovery from unexpected obstacles.
Milestone 15: Autonomous Learning and Adaptation
Milestone 9 introduced self-directed learning within a bounded scope. Milestone 15 extends it to continuous, open-ended self-improvement across all capabilities. The system refines its own world model, strengthens its reasoning, improves its perceptual accuracy, and develops new skills — all without human supervision, indefinitely.
This is the milestone that separates a tool from an agent. A tool does what you tell it. An agent that continuously learns and adapts becomes more capable over time, develops expertise in domains it’s deployed in, and improves in ways its designers didn’t explicitly program. The local learning rules — Hebbian plasticity, STDP, prediction-error-driven updates — make this architecturally possible: every experience updates the system, and the updates accumulate into genuine capability growth.
Success means measurable improvement in capabilities over extended autonomous operation — the system demonstrably gets better at its tasks over time, without human intervention.
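Of the local learning rules named above, the prediction-error-driven update is the simplest to sketch. The example below uses the classic delta rule on a two-weight linear unit: each weight changes using only its own input and the shared error, with no global backward pass. It is a stand-in under stated assumptions (a tiny consistent dataset, a fixed learning rate, hypothetical names) and does not model Hebbian plasticity or STDP.

```python
def online_learning(stream, lr=0.2):
    """Update weights after every single experience, using a local error signal."""
    weights = [0.0, 0.0]
    errors = []
    for inputs, target in stream:
        prediction = sum(w * x for w, x in zip(weights, inputs))
        error = target - prediction
        errors.append(abs(error))
        # Local update: each weight sees only its own input and the error.
        weights = [w + lr * error * x for w, x in zip(weights, inputs)]
    return weights, errors


# The same three experiences, encountered repeatedly over time.
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
weights, errors = online_learning(samples * 200)
print(weights)
print(sum(errors[:3]), sum(errors[-3:]))
```

The weights approach 2 and -1, and the error on the same experiences falls from its initial level to near zero. That is "measurable improvement over extended operation" in its smallest possible form.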
Milestone 16: ARC-AGI-3
The summit. ARC-AGI-3 is a benchmark specifically designed to be unsolvable through memorization, pattern matching, or statistical correlation. Every task is novel. Every task requires genuine abstraction — identifying the underlying rule from a handful of examples and applying it to new inputs. It’s the closest thing we have to a test of genuine general intelligence.
Why make this the final milestone rather than just another benchmark? Because ARC-AGI-3 is a capabilities test that can only be passed by a system that has genuinely mastered the lower fifteen milestones. It requires perception (understanding the visual inputs), abstract reasoning (identifying the rule), causal reasoning (understanding why the rule produces the outputs it does), meta-cognition (knowing when you’ve found the right rule versus when you’re guessing), and creative problem-solving (applying the rule to inputs you’ve never seen). It’s not a test of one capability. It’s a test of all of them working together.
Success means competitive performance on ARC-AGI-3 — not through brute-force search or massive pretraining, but through the kind of flexible, general reasoning that the entire architecture was designed to enable.
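The "infer the rule from a few examples, then apply it" pattern described for this milestone can be shown with a deliberately tiny toy. It is not an ARC-AGI-3 task and says nothing about how hard the real benchmark is: the hypothesis space is four hand-written grid transforms, and the names `CANDIDATE_RULES` and `consistent_rules` are hypothetical.

```python
def identity(grid):
    return [row[:] for row in grid]


def flip_horizontal(grid):
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    return [row[:] for row in grid[::-1]]


def transpose(grid):
    return [list(column) for column in zip(*grid)]


CANDIDATE_RULES = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def consistent_rules(examples):
    """Names of every candidate rule that reproduces all example outputs."""
    return [name for name, rule in CANDIDATE_RULES.items()
            if all(rule(inp) == out for inp, out in examples)]


clear = consistent_rules([([[1, 0], [0, 0]], [[0, 1], [0, 0]])])
ambiguous = consistent_rules([([[1, 1], [0, 0]], [[1, 1], [0, 0]])])

print(clear)      # ['flip_horizontal']
print(ambiguous)  # ['identity', 'flip_horizontal']
print(CANDIDATE_RULES[clear[0]]([[1, 2, 3]]))  # [[3, 2, 1]]
```

The second example is the interesting one. When more than one rule fits the evidence, the honest answer is "still guessing", which is where meta-cognition enters. The real difficulty lies in the part this toy skips: nobody hands the system a list of candidate rules.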
Where We Stand
Let’s be honest about current progress. FENA is still in its early days. The architectural foundations are being built: the predictive coding hierarchy is in place, the continuous-time dynamics engine is operational, the three-tier memory system is complete, and the reasoning core is live. These are the building blocks — the engine that will power the climb.
But having an engine is not the same as having climbed the mountain. The individual components exist and function. The deep integration work — making them all cooperate smoothly, feeding each module’s outputs into the next, tuning the dynamics so the whole system settles into coherent behavior — that work is underway but far from finished.
Think of it as building a brain one region at a time. We’ve built the cortical columns, the hippocampal memory circuits, the prefrontal reasoning structures, and the thalamic relay. Now we need to wire them together and see what emerges when they all run in concert. The early foundation milestones — text understanding, conversation, basic perception — are within reach. The upper milestones are the horizon we’re building toward.
The Road Ahead
We won’t pretend this is easy. Sixteen milestones spanning the full range of cognitive capabilities is an enormously ambitious agenda. The challenges are real: integration complexity grows with every module that comes online, evaluation is hard (measuring genuine understanding versus sophisticated pattern matching is an unsolved problem in its own right), and the gap between architectural theory and practical capability is always wider than you hope.
But the architecture is designed for this climb. Each milestone genuinely enables the ones above it. This isn’t a marketing ladder where capabilities are independent features dressed up as a progression. You cannot do causal reasoning without perception and abstract reasoning. You cannot do meta-cognition without having cognition to monitor. The dependencies are real, and building bottom-up means each new capability has a solid foundation to stand on.
And because FENA is designed to run on consumer hardware — a single GPU, not a data center — iteration is fast. We can experiment, fail, learn, and try again without waiting for cluster access or burning through millions in compute. The community will see progress as it happens: each milestone reached, each capability demonstrated, each failure analyzed and shared. This is an open climb.
The Climb Has Begun
Sixteen milestones. Three tiers. One direction: upward.
The foundation tier is where the immediate work happens — building robust perception across modalities and fusing them into unified understanding. The cognition tier is where things get interesting — abstract reasoning, planning, causal inference, and the system’s ability to reflect on its own thinking. The mastery tier is where it all pays off — creative generation, autonomous problem-solving, continuous self-improvement, and ultimately the kind of flexible general intelligence that ARC-AGI-3 demands.
We’re at the base of the ladder. The engine is built. The route is mapped. The climb has begun.
— The Sulphur Team