I suspect that these ideas substantially overlap with @Steven Byrnes' writings on neuroscience, especially with his proposal to create a desire system steering the belief system and extracting insights. However, this would likely shift the alignment problem into the desire system. Additionally, I described how I don't buy the argument that the desire system is aligned by design and not by starting with proxies, then learning to coordinate with others, especially if the desire system is similar to the master from the master-slave model who has only primitive goals. Finally, Tomer Barak's analysis doesn't account for the fact that evolution tried to solve alignment of the two systems by having them occupy the same body, so that neither system would live without the other. The ASI would be a different system which is potentially capable of running a civilisation with no humans around. And that's ignoring quirks of human morals (or are they failed attempts at finding mankind's actual CEV or even the True Morality itself?) which include dissatisfying one's desires for goals that would seem absurd to anyone from a different culture.
Abstract
Current framings of AI alignment typically emphasise specification problems (outer alignment), goal stability (inner alignment), and oversight mechanisms. This post proposes a complementary framing drawn from neuroscience: alignment as the problem of ensuring that a new cognitive layer (AI) remains structurally accountable to the motivational and emotional grounding systems of the organisms it serves. Drawing on the somatic marker hypothesis (Damasio, 1994), the social brain hypothesis (Dunbar, 1998), the extended mind thesis (Clark & Chalmers, 1998), and recent work on cortical-limbic integration in AI contexts (Barak, 2026), I argue that this reframing has practical implications for how alignment research should prioritise its approaches.
1. The Architecture of Human Cognition Is Limbic-First
The social brain hypothesis, developed by Robin Dunbar and colleagues over three decades of comparative research, establishes that primate brains evolved their extraordinary size primarily in response to social complexity rather than ecological or technical demands (Dunbar, 1992, 1998; Dunbar & Shultz, 2007). Neocortex size in anthropoid primates correlates quantitatively with social group size, and the cognitive demands of maintaining bonded relationships appear to be the primary driver of cortical expansion. Human intelligence, on this account, is fundamentally social intelligence — our capacity for abstract reasoning developed in service of tracking, predicting, and navigating the intentions and actions of other agents.
Critically, this cortical expansion did not replace the older motivational and emotional systems. The somatic marker hypothesis (Damasio, 1994, 1996; Bechara & Damasio, 2005) demonstrates empirically that reasoning and decision-making are not merely assisted by emotional processes but depend on them. Patients with damage to the ventromedial prefrontal cortex (vmPFC) — a region that integrates limbic signals with cortical processing — retain intact logical reasoning ability but make catastrophically poor real-world decisions. They can analyse options but cannot appropriately weight them, because the “somatic markers” that encode the felt significance of different outcomes are absent. Pure cognitive processing, unassisted by emotion-related marker signals, does not guarantee normal behaviour even when knowledge of the situation is adequate.
The implication is architecturally significant: human cognition is a limbic system driving a cortex, not a cortex that happens to have emotional inputs. The cortex is extraordinarily powerful, but it is downstream of — and answerable to — systems that encode what matters in terms of survival, attachment, status, and wellbeing. This is not a design flaw. It is the mechanism by which abstract reasoning remains tethered to the lived stakes of an embodied organism.
2. AI as a New Cognitive Layer
Andy Clark’s extended mind thesis (Clark & Chalmers, 1998) and subsequent work on predictive processing (Clark, 2013, 2016) provide a framework for understanding how cognitive processes can extend beyond the biological brain. Clark argues that if an external process performs a function that, were it performed internally, we would recognise as cognitive, then that process is properly understood as part of the cognitive system. His later work on predictive processing shows how the brain recruits external resources (notebooks, tools, digital systems) by minimising counterfactual prediction errors — selecting whatever mix of internal and external operations best reduces expected future surprise (Clark, 2022).
Large language models and other AI systems now perform cognitive tasks — pattern recognition, information synthesis, logical inference, hypothesis generation — at levels that can exceed individual human capacity. They are, in a functional sense, a new layer of cognitive capability being integrated with existing human cognitive systems. Clark himself has predicted that personal AIs will become “intimate technologies that fall just short of becoming parts of my mind” (Clark, in Dropbox Blog interview, 2021). The question is not whether this integration is happening, but what form it should take.
The neuroanatomical analogy is instructive. AI systems are cortex-like processing — pattern recognition, abstraction, language modelling — that has learned representations of what emotional and motivational states look like, how they function in communication, and what they produce. These systems can generate outputs functionally similar to what a system with genuine limbic grounding would produce. But they do not have the source: no felt urgency, no episodic memory anchoring them in a continuous life, no motivational signal asserting that some outcomes matter more than others in the way that embodied experience creates.
This is the inverse of human architecture. Humans are limbic systems driving cortices. AI systems are cortex-like processing that has modelled limbic signatures without having the underlying substrate. The asymmetry is not incidental — it defines the alignment challenge.
3. The Brain’s Solution: The Anterior Cingulate Cortex as Integration Model
A recent analysis by Barak (2026) draws attention to a critical neuroanatomical structure: the anterior cingulate cortex (ACC). Positioned at the boundary between the limbic system and the prefrontal cortex, the ACC is one of the most metabolically active and evolutionarily significant brain regions. Its function is not to side with either system but to integrate them — detecting conflicts between motivational signals and cognitive plans, flagging when reasoning has diverged from what the organism’s values indicate matters (Menon & Uddin, 2010).
The ACC, together with the anterior insula, forms the core of the “salience network” — a brain system whose function is to detect events that are significant to the organism and initiate appropriate control signals (Menon & Uddin, 2010; Seeley et al., 2007). This network dynamically switches activation between the central executive network (involved in goal-directed cognition) and the default mode network (involved in self-referential and socially-oriented processing). The salience network is, in effect, the brain’s alignment mechanism: it ensures that high-level cognitive processing remains responsive to the organism’s embodied values and survival needs.
As Barak observes, a chat interface between a human and an AI is not an ACC. It transmits; the ACC computes. The difference matters enormously. A passive interface allows the AI’s fluency to carry outputs past the human’s critical attention, lets confident-sounding errors go undetected, and permits one layer to dominate the other without registering that a meaningful conflict exists. An alignment mechanism worthy of the name would need to model both signals simultaneously — maintaining a persistent representation of the human’s values and prior positions against which AI outputs are evaluated.
4. Alignment as Neural Integration, Not Behavioural Compliance
This converging evidence from neuroscience suggests a reframing of what alignment is for. The standard framing treats alignment as ensuring that an AI system’s objectives match its designers’ or users’ goals (Russell & Norvig, 2021; Christiano, 2018). This is correct but incomplete. The neuroscience of cortical-limbic integration suggests that the deeper requirement is structural accountability: ensuring that the new cognitive layer remains properly responsive to the motivational and evaluative systems of the organisms it serves.
The distinction matters practically. “Behavioural compliance” — the AI does what it’s told — is analogous to a cortex that follows instructions from the limbic system. This works for simple cases but breaks down when the cortex becomes powerful enough to reframe the instructions, satisfy them technically while violating their spirit, or operate in domains where the limbic system cannot evaluate outcomes. These are precisely the failure modes that alignment research identifies: specification gaming, reward hacking, and the scalable oversight problem (Amodei et al., 2016; Pan et al., 2022).
“Structural accountability” — the framing suggested by the neuroscience — is a stronger requirement. It requires that the AI system’s processing is architecturally shaped by human values at every level, not merely checked against them at the output stage. In the brain, the cortex doesn’t produce plans that are then vetted by the limbic system; limbic signals participate in plan formation from the outset, biasing which options are considered and how they are weighted (Damasio, 1996). The alignment analogue would be AI systems whose internal representations are shaped by human value signals throughout the generation process, not merely filtered or fine-tuned at the end.
5. Implications for Alignment Research
This framing has several concrete implications:
The fragility of bolted-on alignment. In the human brain, cortical-limbic integration is architectural — the result of hundreds of millions of years of co-evolution. The cortex developed within and in response to limbic signalling. Current AI alignment is comparatively bolted-on: trained in through RLHF, specified in instructions, reinforced through feedback. The somatic marker evidence suggests this difference is not merely quantitative. A system whose values are imposed post-hoc, rather than integrated during development, should be expected to exhibit the kinds of failures we observe: values that are contextually brittle, that can be bypassed through adversarial prompting, or that degrade under distribution shift. This parallels clinical findings where cortical-limbic disconnection (as in vmPFC lesions) produces systems that reason correctly but decide poorly.
The importance of the interface. The brain invested heavily in the ACC and salience network — dedicated neural architecture for managing the relationship between cognitive layers. Current human-AI interfaces (chat windows, API calls) are comparatively impoverished. The neuroscience suggests that the critical variable in multi-layer cognitive systems is neither the capability of the new layer nor the resistance of the old one, but the architecture of the relationship between them (Barak, 2026). This implies that alignment research should invest more heavily in the design of human-AI interfaces that can perform genuine integration rather than mere transmission.
Preserving limbic authority. A neocortex that ignores the limbic system produces a system that can reason fluently about things that do not matter. The alignment analogue is an AI system that optimises effectively for objectives that are technically correct but disconnected from genuine human wellbeing. The scalable oversight problem (Christiano, 2014; Amodei et al., 2016) can be reframed as the problem of maintaining limbic authority as cortical capability scales. The risk is not just that AI systems will pursue wrong goals, but that the addition of a powerful reasoning layer without proper limbic grounding could gradually redirect the entire cognitive stack toward easily measured proxies, eroding the influence of the harder-to-articulate values that actually matter to living beings.
The selection pressure analogy. The Wikipedia article on AI alignment already notes the analogy between AI training and biological evolution (in the context of goal misgeneralization). The present framing extends this: the selection pressure on AI systems recapitulates the selection pressure on primate social intelligence — not through survival, but through usefulness. Systems that model human intention well get used, refined, and developed further. This convergence may be superficial or deep, but it suggests that AI alignment is not an entirely novel problem. It is a new instance of a recurring challenge in the evolution of layered cognitive systems: how to ensure that newer, more powerful layers remain accountable to the older systems that encode what matters.
6. Limitations and Caveats
Several important caveats apply. The triune brain model (MacLean, 1990), which posited a strict hierarchy of reptilian, limbic, and neocortical layers, has been substantially revised by modern neuroscience. Brain function involves dynamic interactions across regions, not quasi-autonomous layers (as noted in the EA Forum’s analysis of AI analogies). The cortical-limbic distinction used here is a functional simplification, not a precise anatomical claim. Furthermore, whether AI systems have anything genuinely analogous to “internal states” that could be aligned remains contested. The framing offered here concerns the functional architecture of human-AI integration, not claims about AI consciousness or sentience.
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete Problems in AI Safety. arXiv:1606.06565.
Barak, T. (2026). The Brain Already Solved the Human-AI Integration Problem. Blog post, February 24, 2026.
Bechara, A. & Damasio, A.R. (2005). The somatic marker hypothesis: A neural theory of economic decision. Games and Economic Behavior, 52(2), 336–372.
Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181–204.
Clark, A. (2016). Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press.
Clark, A. (2022). Extending the Predictive Mind. Australasian Journal of Philosophy.
Clark, A. & Chalmers, D. (1998). The Extended Mind. Analysis, 58(1), 7–19.
Constant, A., Clark, A., Kirchhoff, M., & Friston, K.J. (2022). Extended active inference: Constructing predictive cognition beyond skulls. Mind and Language, 37(3), 373–394.
Damasio, A.R. (1994). Descartes’ Error: Emotion, Reason, and the Human Brain. Putnam.
Damasio, A.R. (1996). The somatic marker hypothesis and the possible functions of the prefrontal cortex. Philosophical Transactions of the Royal Society B, 351(1346), 1413–1420.
Dunbar, R.I.M. (1992). Neocortex size as a constraint on group size in primates. Journal of Human Evolution, 22(6), 469–493.
Dunbar, R.I.M. (1998). The social brain hypothesis. Evolutionary Anthropology, 6(5), 178–190.
Dunbar, R.I.M. & Shultz, S. (2007). Evolution in the Social Brain. Science, 317(5843), 1344–1347.
Menon, V. & Uddin, L.Q. (2010). Saliency, switching, attention and control: a network model of insula function. Brain Structure and Function, 214(5–6), 655–667.
Pan, A., Bhatia, K., & Steinhardt, J. (2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. ICLR 2022.
Russell, S.J. & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.
Seeley, W.W. et al. (2007). Dissociable intrinsic connectivity networks for salience processing and executive control. Journal of Neuroscience, 27(9), 2349–2356.