fwiw, my impression at the time was that weak-to-strong was explicitly intended to take alignment and frame it in the ML ontology, so that ML researchers could contribute to it. (in retrospect this didn't pan out, imo somewhat predictably, though I wasn't ultra confident it wouldn't pan out.) also, in general, I think the final paper ended up predominantly reflecting Collin's vision, more so than Ilya's or anyone else's.
I'm still not convinced any of the persona stuff has produced anything of value. it appears to me to be unfalsifiable just-so stories. as cynical as I am about typical lab safety work, I've come to believe that even typical lab safety work is more positively impactful than the persona stuff (because in short-timeline worlds, the alignedness of current models is more directly relevant to the base case of RSI). the persona people like to complain about how hamfisted the lab safety work is or whatever, but it's always easier to complain from the sidelines.
My guess is that thinking about the pretrained base model as a prior over personas, which is conditioned by fine-tuning, has helped guide and explain the kind of research that Owain Evans does (e.g. emergent misalignment, science of LLM generalisation), but I might be wrong about how useful it has been there. And maybe it was upstream of some other stuff.
And it was kinda off-target in a bunch of ways that weren't obvious at the time.
I think it's mostly priced in, so I would say it's neither overrated nor underrated by the community currently. Overall, I would give it maybe a B+ for conceptual work.
I worry this contributes to polarizing people along "group" boundaries.
While some of these concepts are getting picked up within EA safety (e.g. by Anthropic's interpretability team),
I strongly disagree with the claim that the idea that "personas are an important way of thinking about the model" is "just getting picked up" by Anthropic. I view the recent work as largely trying to make this legible to labs that are less convinced (ex: Leo's comment), and to potentially empirically understand the boundaries of this persona framing.
the ontology gap is still large enough to cause adversarial dynamics.
I think framing "EA Safety at Anthropic" as a monolith that has adversarial dynamics with Janus, for example, isn't a particularly useful framing. I would be somewhat surprised if Janus described the source of their disagreements as being about actual underlying ontology. (You could argue "what do we mean by alignment" is a disagreement about ontology, but that seems broad enough to encompass any disagreement in the field.)
Unfortunately much of the alignment community today seems to be in an analogous position to the ML community during the 2010s. Concepts like scheming [...] seem to me to be sufficiently vague and/or confused that it's hard to think clearly about AGI when they're important parts of your ontology.
I can't speak for the prevalence of the other concepts, but I disagree with your characterization of how researchers think about scheming in my experience. For example:
Yet generally people treat "schemer" as a fairly binary classification. To be clear, I'm not confident that even "a spectrum of scheminess" is a good way to think about the concept. There are likely multiple important dimensions that could be disentangled; and eventually I'd like to discover properly scientific theories of concepts like honesty, deception and perhaps even "scheming". - link
It has not been my experience, with researchers, in papers on this, etc., that this is ever talked about as a fairly binary classification. I don't think it's realistic to block empirical work on a scientific understanding of honesty.
This definition? If so, it seems vastly underspecified to be fit for scientific inquiry. For one thing, the definition of "selection" is pretty vague [...] If people had tried to pin down more carefully what "schemer" means they would have been forced to develop a more nuanced understanding of what we even mean by "alignment" and "goals" and so on, which is the kind of thinking I want to see more of. - link
I think Alex Mallen is doing good work in the exact domain of having a more nuanced delineation of various ontologies related to scheming, ex: Fitness-Seekers: Generalizing the Reward-Seeking Threat Model. Or Vivek Hebbar, ex: How training-gamers might function (and win) (the top comment is yours and seems to agree).
I actually disagree that it would be helpful to make figuring out what we mean by "alignment" a blocking requirement for having a working concept that's used the way scheming is currently used.
progress comes from building out a different ontology to the point where it can replace the old one. Good luck to anyone who's trying to do that!
I'm trying to do that! And thanks! It's much appreciated. I can't tell if I'm succeeding and don't expect other people can either, which is a kinda stressful place to be.
I wonder if there could be a name for the new ontology you’re trying to gesture at? I have some ideas, but I don’t know if it’s too early to pin the butterfly to the board?
Do you know any concrete resources, even just discussions in other settings, that address how to tangibly attempt to bridge this gap?
I've run into a similar problem, to the degree that it stops projects dead: no realistic means of reaching an audience translates into low motivation, despite what I believe are nontrivial new ideas.
To take an example: 4E cogsci concretely applied to the alignment/subjecthood dialog. I find myself in some (worse?) netherworld of trying to bridge an epistemic gap, where the audience capable of applying the technical downstreams (ML certainly, but even alignment/safety folks) finds 4E epistemics foreign and antithetical to the grammar of formalist thought, while the audience able to engage with the 4E basics lacks the technical training to pursue the line of thought.
I've never heard of 4E before, but I checked the wikipedia page and it sounds cool, and like it relates to an ontology I've been trying to develop called "outcome influencing systems" (OISs), particularly in my observation that OISs are composed of one another and overlap with one another.
What would you suggest for someone who wants to engage more with 4E?
This paper is a solid and readable introduction to the concepts. The abstract does it more justice than I would:
The emerging viewpoint of embodied cognition holds that cognitive processes are deeply rooted in the body’s interactions with the world. This position actually houses a number of distinct claims, some of which are more controversial than others. This paper distinguishes and evaluates the following six claims: (1) cognition is situated; (2) cognition is time-pressured; (3) we off-load cognitive work onto the environment; (4) the environment is part of the cognitive system; (5) cognition is for action; (6) offline cognition is body based. Of these, the first three and the fifth appear to be at least partially true, and their usefulness is best evaluated in terms of the range of their applicability. The fourth claim, I argue, is deeply problematic. The sixth claim has received the least attention in the literature on embodied cognition, but it may in fact be the best documented and most powerful of the six claims.
For something more in depth, Andy Clark's (of Surfing Uncertainty) Supersizing the Mind: Embodiment, Action, and Cognitive Extension should be legible and appealing to a LW reader.
Here's a quick TLDR:
The 4E approach views cognition as Embodied, Embedded, Enacted, and Extended, meaning that cognition is realized through a living system’s body, environment, and ongoing activity within that environment. The 4E view treats these as constitutive, not derivative qualities of cognition. The classical definitions are as follows:
• Embodied: Cognition is shaped by the physical body and sensorimotor experience.
• Embedded: Cognitive processes are inseparable from environmental interactions.
• Enacted: Cognition emerges through a history of structural coupling with the environment.
• Extended: Cognitive processes can extend beyond biological boundaries into tools and the environment.
If you're curious, here's a draft of my connection to alignment/subjecthood. If you come from an ML angle, it might be easier to interpret; there is indeed an epistemic move to make to understand the 4E thread.
This post contains some rough reflections on the alignment community trying to make its ontology legible to the mainstream ML community, and the lessons we should take from that experience.
Historically, it was difficult for the alignment community to engage with the ML community because the alignment community was using a fundamentally different ontology—featuring concepts like inner vs outer alignment, mesa-optimizers, corrigibility, situational awareness, and so on. Even a concept as simple as "giving an AI an instruction in natural language" often threw a kind of type error in ML researchers' ontologies, in which goals were meant to be specified by setting agents' reward functions.
The concept of situational awareness is another one which doesn't really make sense in the classic ML ontology. My impression is that Ilya starting to take situational awareness seriously (after Ajeya gave a talk about it at OpenAI) was one of the main drivers of his transition to alignment research. Unfortunately, Ilya's subsequent research on weak-to-strong generalization stayed pretty stuck in the ML ontology, which in my opinion made it unpromising from the get-go. (I don't remember if I stated this publicly at the time, but I was pretty critical internally inside OpenAI, especially to Collin Burns. In hindsight I wish I'd clearly stated publicly that I wasn't very excited about the research.)
These are two of many examples over the last few years of the alignment ontology winning out over the ML ontology by being better at describing LLMs. In response, the ML ontology has expanded to include concepts like "giving AIs instructions" and "situational awareness", but not in any principled way—it's sort of shoehorned them in without most people noticing the confusion. (E.g. if you ask why the AIs are following instructions, or how situational awareness might develop, I think most ML researchers would give you pretty confused answers.)
Historically, it was sometimes possible to make alignment concepts legible in the ML ontology before compelling empirical evidence arose, but it was typically a very laborious and unrewarding process. ML researchers would raise objections that felt extremely nitpicky from the alignment ontology. In part this was due to the difficulty of communicating across ontologies, but in part it was also due to motivated reasoning to find reasons to reject claims made by alignment proponents (e.g. I think this post from Chollet is a pretty good example). Even when ML researchers agreed that an alignment concept made sense in principle, it was usually hard for them to then propagate the consequences into the rest of their ontology—in part because doing so would have had big implications for their identity and career plans.
Meanwhile, the alignment community would waste time, and sometimes make itself more confused, by trying to adapt their concepts to make more sense to ML researchers. "Goal misgeneralization" is a good example of this, since the problem of inner misalignment is more that correct generalization isn't a well-defined concept, than that the agent will learn to "misgeneralize". MIRI's paper on Formalizing Convergent Instrumental Goals seems like it also wasn't very useful, especially compared with their other research (though unlike goal misgeneralization I doubt it made many people more confused). Owain Evans' "out-of-context reasoning" is a case that I'm less confident about, since it does seem like putting the idea in ML terms has helped him and others do interesting empirical research on it.
I did a lot of this myself too, to be clear. "Trying to make alignment concepts legible in the ML ontology" was in some sense the main goal of my "alignment problem from a deep learning perspective" paper, and I've updated significantly downwards on its value since starting to think in these terms. In hindsight, the main thing I would've told my past self (and the rest of the alignment community) is to pay less attention to the ML ontology. Unfortunately, my sense is that OpenPhil and various other groups (including my past self) pushed pretty hard for engagement with the ML ontology, which I count as a significant mistake.
There are still ways that engaging with the ML community would have been valuable—I think mainstream ML researchers are good at pushing alignment researchers to be more precise and more grounded in the existing literature. But broadly speaking it would've been better to have treated alignment ideas like butterfly ideas which would be harmed by premature exposure to ML thinking.
I suspect that many AI safety researchers will resonate with the broad outlines of what I've discussed above. Below is the part that I expect will be more controversial.
Unfortunately much of the alignment community today seems to be in an analogous position to the ML community during the 2010s. Concepts like scheming, alignment faking, alignment research, strategy research, P(doom), misuse vs misalignment, AGI timelines, and so on seem to me to be sufficiently vague and/or confused that it's hard to think clearly about AGI when they're important parts of your ontology.
This is a pretty broad claim, so let me be a little more specific. Suppose we very roughly divide the AI safety community into the parts that are more EA-affiliated (most lab safety teams, most orgs working out of Constellation, OpenPhil, etc) and the parts that are more LessWrong-affiliated (e.g. almost everyone on Habryka's list of individuals in this comment). I think my diagnosis above is partially true of LW safety, but strongly true of EA safety. The people who are generating novel and important AGI-related concepts are almost all pretty decoupled from EA safety, even though that's where most of the money and jobs are:
Related to the last point: I personally feel pretty decoupled from LW safety in part because LW safety is focusing a lot on AI governance these days, but has a very different ontology for thinking about politics than I do. In fact, I originally started writing this post as an analogy for how I relate to the LessWrong community with regards to politics. However, ontology gaps in alignment seem sufficiently important that I decided to make this post solely about them, and save the analogizing to politics for a separate post or shortform.
To sum up my main takeaways: while working in the dominant ontology may seem like the "safest" and most reputable bet, real progress comes from building out a different ontology to the point where it can replace the old one. Good luck to anyone who's trying to do that!