# The ML ontology and the alignment ontology

* By [Richard_Ngo](/users/ricraz)
* 2026-02-24 04:39:19Z
* 106 points
* Tag: [AI](/w/ai)
* Frontpage
* Comments: 9
* Post URL: [/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology](/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology)

This post contains some rough reflections on the alignment community's attempts to make its ontology legible to the mainstream ML community, and the lessons we should take from that experience.

Historically, it was difficult for the alignment community to engage with the ML community because the alignment community was using a fundamentally different ontology—featuring concepts like inner vs outer alignment, mesa-optimizers, corrigibility, situational awareness, and so on. Even a concept as simple as "giving an AI an instruction in natural language" often threw a kind of type error in ML researchers' ontologies, in which goals were meant to be specified by setting agents' reward functions.

The concept of situational awareness is another one which doesn't really make sense in the classic ML ontology. My impression is that Ilya starting to take situational awareness seriously (after Ajeya gave a talk about it at OpenAI) was one of the main drivers of his transition to alignment research. Unfortunately, Ilya's subsequent research on [weak-to-strong generalization](https://openai.com/index/weak-to-strong-generalization/) stayed pretty stuck in the ML ontology, which in my opinion made it unpromising from the get-go. (I don't remember if I stated this publicly at the time, but I was pretty critical internally at OpenAI, especially to Collin Burns. In hindsight I wish I'd clearly stated publicly that I wasn't very excited about the research.)

These are two of many examples over the last few years of the alignment ontology winning out over the ML ontology by being better at describing LLMs. In response, the ML ontology has expanded to include concepts like "giving AIs instructions" and "situational awareness", but not in any principled way—it has sort of shoehorned them in without most people noticing the confusion. (E.g. if you ask *why* the AIs are following instructions, or how situational awareness might develop, I think most ML researchers would give you pretty confused answers.)

Historically, it was sometimes possible to make alignment concepts legible in the ML ontology before compelling empirical evidence arose, but it was typically a very laborious and unrewarding process. ML researchers would raise objections that felt extremely nitpicky from within the alignment ontology. In part this was due to the difficulty of communicating across ontologies, but in part it was also due to motivated reasoning to find reasons to reject claims made by alignment proponents (e.g. I think [this post from Chollet](https://medium.com/@francois.chollet/the-impossibility-of-intelligence-explosion-5be4a9eda6ec) is a pretty good example).
Even when ML researchers agreed that an alignment concept made sense in principle, it was usually hard for them to then propagate the consequences into the rest of their ontology—in part because doing so would have had big implications for their identity and career plans.

Meanwhile, the alignment community would waste time, and sometimes make itself more confused, by trying to adapt its concepts to make more sense to ML researchers. "[Goal misgeneralization](https://arxiv.org/abs/2210.01790)" is a good example of this, since the problem of inner misalignment is more that correct generalization isn't a well-defined concept than that the agent will learn to "misgeneralize". MIRI's paper on [Formalizing Convergent Instrumental Goals](https://intelligence.org/files/FormalizingConvergentGoals.pdf) seems like it also wasn't very useful, especially compared with their other research (though unlike goal misgeneralization I doubt it made many people more confused). Owain Evans' "[out-of-context reasoning](https://arxiv.org/abs/2309.00667)" is a case that I'm less confident about, since it does seem like putting the idea in ML terms has helped him and others do interesting empirical research on it.

I did a lot of this myself too, to be clear. "Trying to make alignment concepts legible in the ML ontology" was in some sense the main goal of my [alignment problem from a deep learning perspective](https://arxiv.org/abs/2209.00626) paper, and I've updated significantly downwards on its value since starting to think in these terms. In hindsight, the main thing I would've told my past self (and the rest of the alignment community) is to pay less attention to the ML ontology. Unfortunately, my sense is that OpenPhil and various other groups (including my past self) pushed pretty hard for engagement with the ML ontology, which I count as a significant mistake. There are still ways in which engaging with the ML community would have been valuable—I think mainstream ML researchers are good at pushing alignment researchers to be more precise and more grounded in the existing literature. But broadly speaking it would've been better to have treated alignment ideas like [butterfly ideas](/api/post/R6M4vmShiowDn56of) which would be harmed by premature exposure to ML thinking.

* * *

I suspect that many AI safety researchers will resonate with the broad outlines of what I've discussed above. Below is the part that I expect will be more controversial.

Unfortunately, much of the alignment community today seems to be in an analogous position to the ML community during the 2010s. Concepts like [scheming](/api/post/FuGfR3jL3sw6r8kB4?commentId=ixQRCdJZz6xTB8uyb), [alignment faking](/api/post/PWHkMac9Xve6LoMJy), [alignment research](/api/post/67fNBeHrjdrZZNDDK), [strategy research](/api/post/FuGfR3jL3sw6r8kB4?commentId=sAhSxKLu39Y6ky4SG), [P(doom)](/api/post/kcKrE9mzEHrdqtDpE?commentId=2HZCnmo32ZX2Dmsxo), [misuse vs misalignment](https://www.youtube.com/watch?v=4v3uqWeVmco), [AGI timelines](https://x.com/richardmcngo/status/2026138292383887730?s=46), and so on seem to me to be sufficiently vague and/or confused that it's hard to think clearly about AGI when they're important parts of your ontology.

This is a pretty broad claim, so let me be a little more specific. Suppose we very roughly divide the AI safety community into the parts that are more EA-affiliated (most lab safety teams, most orgs working out of Constellation, OpenPhil, etc.) and the parts that are more LessWrong-affiliated (e.g.
almost everyone on [Habryka's list of individuals in this comment](/api/post/wn5jTrtKkhspshA4c?commentId=zoBMvdMAwpjTEY4st)). I think my diagnosis above is partially true of LW safety, but strongly true of EA safety. The people who are generating novel and important AGI-related concepts are almost all pretty decoupled from EA safety, even though that's where most of the money and jobs are:

* There's a line of thinking which understands models in terms of their personas and identities. I'd point to [Janus](/api/sequence/N7nDePaNabJdnbXeE/post/vJFdjigzmcXMhNTsx), [Fiora](/api/post/ioZxrP7BhS5ArK59w), [nostalgebraist](/api/post/3EzbtNLdcnZe8og8b) and [Cleo Nardo](/api/post/D7PumeYTDPfBTp3i7) as some EA safety "outsiders" who have developed ideas in this space. While some of these concepts are getting picked up within EA safety (e.g. by [Anthropic's interpretability team](https://www.anthropic.com/research/persona-selection-model)), the ontology gap is still large enough to cause [adversarial dynamics](https://x.com/repligate/status/2014155341274210390?s=46).
* [MIRI's former agent foundations team](https://x.com/richardmcngo/status/2025001871782674552?s=46) did (and continues to do) a lot of great thinking, despite EA safety's strong skepticism of agent foundations.
* Jan Kulveit and his collaborators are probably the people most closely affiliated with EA safety who I think are doing strong novel thinking (although [I suspect that their thinking would be better if they decoupled more](/api/post/FuGfR3jL3sw6r8kB4?commentId=ejoL2TvK7LKYpaD3K)).
* [Michael Vassar and his collaborators](https://naturalhazard.xyz/ben_jess_sarah_starter_pack) are doing an excellent job of understanding the sociopolitical dynamics governing both society at large and the AI safety community. This cluster is decoupled both from EA safety and from LW safety.

Related to the last point: I personally feel pretty decoupled from LW safety, in part because LW safety is focusing a lot on AI governance these days but has a very different ontology for thinking about politics [than I do](https://21civ.com). In fact, I originally started writing this post as an analogy for how I relate to the LessWrong community with regards to politics. However, ontology gaps in alignment seem sufficiently important that I decided to make this post solely about them, and save the analogizing to politics for a separate post or shortform.

To sum up my main takeaways: while working in the dominant ontology may seem like the "safest" and most reputable bet, real progress comes from building out a different ontology to the point where it can replace the old one. Good luck to anyone who's trying to do that!

Top Comments Index
------------------

### Comment by [leogao](/users/leogao)

* 2026-02-24 07:42:43Z
* Karma: 12
* Voting system: namesAttachedReactions
* Approval votes: 7
* Total votes: 7
* Permalink: [/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/78LwziMNdDmxnnLQ2](/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/78LwziMNdDmxnnLQ2)

Reactions (whole comment):

* disagree: 1

Reactions by quoted text:

* "I'm still not convinced any of the persona stuff has produced anything of value."
  * disagree: 3
* "it appears to me to be unfalsifiable just so stories"
  * 90percent: 1

### Comment by [Bronson Schoen](/users/bronson-schoen)

* 2026-02-25 09:03:00Z
* Karma: 8
* Voting system: namesAttachedReactions
* Approval votes: 6
* Total votes: 6
* Permalink: [/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/6ArRbpNhKtDJa3ita](/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/6ArRbpNhKtDJa3ita)

### Comment by [TristanTrim](/users/tristantrim)

* 2026-02-27 00:01:38Z
* Karma: 6
* Voting system: namesAttachedReactions
* Approval votes: 2
* Total votes: 2
* Permalink: [/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/mshPQoSJyhEJSiW35](/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/mshPQoSJyhEJSiW35)

### Comment by [JennaS](/users/jennas)

* 2026-02-24 09:10:59Z
* Karma: 6
* Voting system: namesAttachedReactions
* Approval votes: 2
* Total votes: 2
* Permalink: [/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/TD3DAwRJfWuudpukH](/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/TD3DAwRJfWuudpukH)

### Comment by [Zsolt Tanko](/users/zsolt-tanko)

* 2026-02-26 10:34:26Z
* Karma: 2
* Voting system: namesAttachedReactions
* Approval votes: 2
* Total votes: 2
* Permalink: [/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/HdvHkJT7ab5E2rJ3s](/posts/Yz4YHncz2vwN4ksDA/the-ml-ontology-and-the-alignment-ontology/comment/HdvHkJT7ab5E2rJ3s)