Discussion with Nate Soares on a key alignment difficulty

HoldenKarnofsky

In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.

I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.¹ My short summary is:

Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures that there are currently few ideas for) would be the AI taking on a dangerous degree of convergent instrumental subgoals while not internalizing important safety/corrigibility properties enough.
I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes.

I didn't end up agreeing that this difficulty is as important as Nate thinks it is, although I did update my views some (more on that below). My guess is that this is one of the two biggest disagreements I have with Nate's and Eliezer's views (the other one being the likelihood of a sharp left turn that leads to a massive capabilities gap between AI systems and their supervisors).²

Below is my summary of:

Some key premises we agree on.
What we disagree about, at a high level.
A hypothetical training process we discussed in order to get more clear and mechanistic about Nate's views.
Some brief discussion of possible cruxes; what kind of reasoning Nate is using to arrive at his relatively high (~85%) level of confidence on this point; and future observations that might update one of us toward the other's views.

MIRI might later put out more detailed notes on this exchange, drawing on all of our discussions over Slack and comment threads in Google docs.

Nate has reviewed this post in full. I'm grateful for his help with it.

Some starting points of agreement

Nate on this section: “Seems broadly right to me!”

An AI is dangerous if:

It's powerful (like, it has the ability to disempower humans if it's "aiming" at that)
It aims (perhaps as a side effect of aiming at something else) at CIS (convergent instrumental subgoals) such as "Preserve option value," "Gain control of resources that can be used for lots of things," "Avoid being turned off," and such. (Note that this is a weaker condition than "maximizes utility according to some relatively simple utility function of states of the world")
It does not reliably avoid POUDA (pretty obviously unintended/dangerous actions) such as "Design and deploy a bioweapon."
- "Reliably" just means like "In situations it will actually be in" (which will likely be different from training, but I'm not trying to talk about "all possible situations").
- Avoiding POUDA is kind of a low bar in some sense. Avoiding POUDA doesn't necessarily require fully/perfectly internalizing some "corrigibility core" (such that the AI would always let us turn it off even in arbitrarily exotic situations that challenge the very meaning of "let us turn it off"), and it even more so doesn't require anything like CEV. It just means that stuff where Holden would be like "Whoa whoa, that is OBVIOUSLY unintended/dangerous/bad" is stuff that an AI would not do.
- That said, POUDA is not something that Holden is able to articulate cleanly and simply. There are lots of actions that might be POUDA in one situation and not in another (e.g., developing a chemical that's both poisonous and useful for other purposes), and Holden isn't able to simply describe what distinguishes such situations. So, it's at least a live possibility that "reliably avoiding POUDA" is the kind of thing that would be hard to preserve under distributional shift.

If humans are doing something like "ambitiously pushing AIs to do more and more cool, creative stuff that humans couldn't do, using largely outcomes-based training," then:

They're really pushing AIs to aim at CIS.
They're probably training not "avoid POUDA" but a bastardized/illusory version of "avoid POUDA", something more like "avoid things that look to your overseers like POUDA in situations where your overseers can give negative reinforcement for the POUDA."
So if this activity results in powerful AI systems, these are probably AI systems that do aim at CIS, but only avoid POUDA under conditions (key condition being "the overseer will catch them and stop them") that no longer hold.
That would suck!

High-level disagreement

Holden thinks there may be alternative approaches to training AI systems that:

Are powerful enough to do things that help us a lot, such as moving the needle a lot on alignment research (Holden and Nate agree that we want something like "Can have lots of novel insights, challenge existing paradigms and move forward with new ones, etc." not just something like CoPilot)
Don't have the problem of "training AI systems to aim at CIS at a stage where they are not reliably avoiding POUDA."
Are kinda live possibilities that people are actively contemplating and might carry out.

Nate disagrees with this. He thinks there is a deep tension between the first two points. Resolving the tension isn't necessarily impossible, but most people just don't seem to be seriously contending with the tension. Nate endorses this characterization.

To explore this, Nate and Holden discussed a hypothetical approach to training powerful AI systems, chosen by Holden to specifically have the property: "This is simple, and falls way on the safe end of the spectrum (it has a good chance of training 'avoid POUDA' about as fast as, or at least as fast as, training 'aim at CIS')."

In a world where this hypothetical approach had a reasonable (20%+) chance of resulting in safe, powerful AI, Holden would think that there are a lot of other approaches that are more realistic while having key properties in common, such that "We just get lucky and the first powerful AI systems are safe" is a live possibility - and adding some effort and extra measures could push up the probability higher.

In a world where this hypothetical approach was very unlikely (10% or less) to result in safe, powerful AI, Holden would think something more like: "We're not just gonna get lucky, we're going to need big wins on interpretability or checks-and-balances or fundamentally better (currently unknown) approaches to training, or something else."

Hypothetical training approach

This isn't meant to be realistic; it's meant to be simple and illustrative along the lines of the above.

Basically, we start with a ~1000x scaleup of GPT-3 (params-wise), with increased data and compute as needed to optimize performance for a NN of that size/type.

We assume that at some point during this scaled-up pretraining, this model is going to gain the raw capability to (if aimed at this) pretty robustly fill in for today's top AI alignment researchers, in terms of doing enough alignment work to "solve alignment" mostly on its own. (This might take the form of e.g. doing more interpretability work similar to what's been done, at great scale, and then synthesizing/distilling insights from this work and iterating on that to the point where it can meaningfully "reverse-engineer" itself and provide a version of itself that humans can much more easily modify to be safe, or something.)

We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act." We're going for pure imitation in some sense (although we need it to work out-of-distribution, in the sense that the AI needs to be able to continue doing what its imitatee would do, even when faced with research questions and insights unlike those seen in training).

Given these assumptions, the question is: would such a model be dangerous? That is, would it both (a) aim at CIS and (b) not reliably avoid POUDA, at least in situations as exotic as "Thinking (in an inspired manner that ultimately leads to a solution) about interpretability insights and issues beyond what it ever saw in training"?

Why this setup is on the "safe" end according to Holden's views: Mostly the safety comes from going for "pure imitation" in the sense I said above.

Whatever alignment researcher the AI is imitating has a certain amount of CIS, but also has great POUDA-avoidance. So a fully faithful imitation should be safe. Holden and Nate agree on this paragraph.

(Back to things only-Holden thinks) We are avoiding giving an extra hard push toward CIS (of the kind that we'd give if we were really pushing AI systems to be ambitious and creative in "superhuman" ways), and we are avoiding training the kind of "bastardized POUDA avoidance" described above (because there are few opportunities for us to screw up the anti-POUDA signal).

How this ends up being dangerous anyway, according to Nate

High-level premises

This section is trying to characterize Nate’s views, not mine (I partly disagree, as I discuss below). Nate: "I broadly endorse this. (I wouldn't use the same words in quite the same ways, but \shrug, it's pretty decent.)"

The high-level premises that imply danger here follow. (I believe both of these have to go through in order for the hypothesized training process to be dangerous in the way Nate is pointing at). (I’d suggest skipping/skimming the sub-bullets here if they’re getting daunting, as the following section will also provide some illustration of the disagreement.)

The path to "doing needle-moving alignment research (or other useful stuff)" runs through CIS. That is, an AI has to pick up a significant amount of CIS-y behavior/patterns/habits/subsystems in order to get to the point of doing the useful stuff.
1. A key property of "needle-moving" alignment research is that you're able to have insights well outside of the training distribution - this is needed for sufficient usefulness, and it is what needs to be reached through CIS. You could train a "shallow" imitation learner to analyze 1000 neurons similarly to how Anthropic's team has analyzed 10, but "needle-moving" research requires being able to take a step after that where you're in a state something like: "OK, I have all these observations, but I'm not sure what to do next - that isn't well specified³ by the plan my boss handed me [or the plan I wrote down, or the plan one might've inferred from my steps to this point]. To the extent I'm able to figure out what comes next by pattern-matching, the most useful patterns are the ones where someone accomplished CISy things like 'Generate a comprehensive picture of the problem, including how it fits into the big picture and what we're trying to do and why' or 'Gain more general knowledge and option value' or 'Identify obstacles to progress and dispatch/avoid them'"
2. A particular pattern Nate has talked about is what I might call "reflection." The basic idea is that in order to do certain kinds of research effectively, you need to keep asking questions like "What am I actually trying to do here and why? What is my big-picture goal?", which are questions that might "change your aims" in some important sense. The idea is not necessarily that you're rewriting your own source code, but that you're doing the kind of reflection and self-modification a philosophically inclined, independent-minded human might do: "I've always thought I cared about X, but when I really think about the implications of that, I realize maybe I only care about Y" and such. I think that in Nate's ontology (and I am partly sympathetic), it's hard to disentangle something like "Refocusing my research agenda to line it up with my big-picture goals" from something like "Reconsidering and modifying my big-picture goals so that they feel more satisfying in light of all the things I've noticed about myself." Reflection (figuring out what you "really want") is a kind of CIS, and one that could present danger, if an AI is figuring out what it "really wants" and we haven't got specific reasons to think that's going to be what we want it to want.
It's relatively hard to pick up a robust POUDA-avoidance pattern, even without the kind of actively screwed up data points that produced the "bastardized POUDA avoidance" mentioned above. Specifically, it's significantly harder than picking up enough CIS to be dangerous, such that the odds are very high that Holden's training procedure ends up with a significant phase where the AI has picked up enough CIS to be dangerous but not enough POUDA-avoidance to be safe.
1. "Robust" here means "Would operate as intended in situations of at least the novelty required to do needle-moving research."
2. "Relatively hard" basically means "It would take a lot more training to learn POUDA-avoidance than it would take to learn enough CIS to be both useful and dangerous. (Also, there's a sense that the needed training for the latter is a subset of the needed training for the former, i.e., this isn't just about quantities of training but about a deeper difficulty with learning POUDA-avoidance without first learning quite a lot of CIS.)"
3. An underlying intuition here is that CIS-y stuff is "everywhere" in what humans are doing; in general, when you're trying to predict a human's next action, CIS will get you quite far, whereas you'd need more data to effectively pick out the specifics of POUDA-avoidance (especially if these specifics are supposed to keep holding in very out-of-distribution scenarios, which you'd have to get into in order to do needle-moving alignment research).
4. An additional relevant point is that the less granularly you're making your AI imitate a human, the more you're doing something like "outcomes-based trial-and-error" where the AI could be achieving human-like end products with very alien intermediate steps. This basically brings in a somewhat but not radically softer version of the same problems described at the top of the doc: the AI is under selection pressure to achieve CIS and to achieve apparent POUDA-avoidance, but this could be the "bastardized POUDA-avoidance" of e.g. not harming humans so far as humans can tell.
5. An important corollary of "relatively hard" is that failure to pick up the pattern well enough to reliably avoid POUDA does not result in too much negative reinforcement at the task of imitating humans (in the pretraining/next-token-prediction setting), at least compared to the negative reinforcement from trying to imitate needle-moving research moves without having picked up a lot of CIS.

How the danger might arise mechanistically

It's not really possible to give a real mechanistic explanation, but I can try to give a rough sketch. An ontology Nate seemed to like (and that seems pretty good to me) is to think of an AI as a dynamically weighted ensemble of "mini-AIs" (my term) - thingies that basically match/promote a particular pattern. ("Dynamically weighted" means that a mini-AI that is pushing in successful directions gets weighted more strongly as a result.)
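As a toy illustration of the "dynamically weighted ensemble" picture (not something from the original exchange), here is a minimal Python sketch using a multiplicative-weights update. Everything in it is an assumption made for illustration: the names `MiniAI` and `Ensemble`, the learning rate `eta`, the 0/1 loss, and the two cartoon patterns. A real network need not decompose into pieces like this.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MiniAI:
    """Hypothetical pattern-matcher: maps a context to a predicted next step."""
    name: str
    predict: Callable[[str], str]


class Ensemble:
    """Weighted vote over mini-AIs; weights shift toward whichever ones predict well."""

    def __init__(self, minis: List[MiniAI], eta: float = 1.0):
        self.minis = minis
        self.eta = eta  # assumed learning rate for the weight update
        self.weights: Dict[str, float] = {m.name: 1.0 / len(minis) for m in minis}

    def predict(self, context: str) -> str:
        # Each mini-AI votes for its guess with its current weight.
        votes: Dict[str, float] = {}
        for m in self.minis:
            guess = m.predict(context)
            votes[guess] = votes.get(guess, 0.0) + self.weights[m.name]
        return max(votes, key=votes.get)

    def update(self, context: str, actual: str) -> None:
        # Down-weight every mini-AI that guessed wrong, then renormalize.
        for m in self.minis:
            loss = 0.0 if m.predict(context) == actual else 1.0
            self.weights[m.name] *= math.exp(-self.eta * loss)
        total = sum(self.weights.values())
        for name in self.weights:
            self.weights[name] /= total


def shallow(context: str) -> str:
    # Cartoon shallow pattern: after a neuron, examine another neuron.
    return "examine another neuron"


def big_picture(context: str) -> str:
    # Cartoon deeper pattern: notices when the overall process needs rethinking.
    if context == "all neurons decoded":
        return "think about how the neurons fit together"
    return "examine another neuron"


ensemble = Ensemble([MiniAI("shallow", shallow), MiniAI("big_picture", big_picture)])
training_steps = [
    ("one neuron decoded", "examine another neuron"),
    ("all neurons decoded", "think about how the neurons fit together"),
]
for context, actual in training_steps:
    ensemble.update(context, actual)

print(ensemble.weights)
print(ensemble.predict("all neurons decoded"))
```

In this cartoon, both patterns do equally well on the easy step, and only the hard step separates them, which is the sense in which a mini-AI "pushing in successful directions gets weighted more strongly."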

So let's imagine that we're training a Transformer-type thing on next-token prediction, and let's take as a stipulation that this is somehow going to get it to the point of having the capability to do needle-moving alignment research, at least when it's imitating a particular alignment researcher named Alice. The basic idea is that the next-token prediction gets it to the point where, if prompted to predict what it will observe next in a context where this requires predicting Alice's behavior, it will keep predicting specific reasonable next steps that Alice will take, even after the point where these next steps take us to the frontiers of knowledge/understanding about AI alignment. We'll then use a small amount of prompting, RL, etc. to point it consistently in this direction such that it is taking or describing these steps consistently.

For simplicity, we'll talk about Anthropic-style mechanistic interpretability.

Here's the general sort of thing Nate sees happening in this case:

Very early on (when it just sucks at everything), the AI makes pretty random guesses about what tokens will come next in contexts involving Alice. It gets lower loss when its guess is better, and this causes it to form and/or up-weight mini-AIs matching shallow/relatively useless things like "When asked what her favorite color is, Alice replies that it's blue" and "When Alice finishes examining one neuron in a large NN, she starts examining another neuron" and whatever.

At some point, the AI's predictions of Alice run out of this sort of low-hanging fruit.

Research improvement. In order to improve further, it will have to accurately predict Alice's next steps in situations unlike anything that has happened (chronologically) before - such as (to give a cartoon example) "when Alice finishes decoding a large number of neurons, and has to reflect about how to redesign her overall process before she moves on to doing more" or "when Alice finishes decoding *all* the neurons in an AI, and needs to start thinking about how they fit together." (This feels kinda unrealistic for the kind of pretraining that's common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we *condition on* the latter, it seems kinda reasonable to imagine there must be cases where an AI has to be able to do needle-moving alignment research in order to improve at next-token prediction, and this feels like a reasonable way that might happen.)

Here, Nate claims, we should basically think that one of two classes of thing kinda has to happen:

Success via CIS. The AI forms and/or up-weights mini-AIs doing things like: "Reflect on your goal and modify it"; "Do things that will be useful for your big-picture goal in the world [what goal? ~Any goal will do], via getting you to a position of greater general understanding, power, option value, etc."; "Identify and remove obstacles that could stop you from achieving whatever it is you might want to achieve." (There could plausibly be smaller "micro-AIs" that are important, like "keep track of the most limited resources"; here I'm just illustrating the basic idea, but I think Nate and I agree that in some sense, the 'smaller' the needed pieces are, the stronger the case is that they inevitably get up-weighted, so if a bunch of super-small pieces add up to danger, there's a good case for expecting danger.) These mini-AIs are helping the AI to predict Alice's moves because Alice, herself, in these difficult alignment research situations, is calling on her own CIS-y stuff. (Note that these mini-AIs are developed based on predicting all kinds of situations where CIS-y stuff worked, not just predicting Alice.) But once the AI is out of training and trying to push the research frontier, its mini-AIs will continue to cause it to do "reflection" to the point where it ends up with weird goals, and will cause it to do things like "removing obstacles" to those goals.⁴
Or: failure via ambling around. Lacking these sorts of CIS-y mini-AIs, the AI has no way of doing truly Alice-like next steps at these difficult junctures in alignment research. So it makes moves that are kind of locally or superficially Alice-like, but don't actually match what she would do in the tough situations. Maybe a dumb example of this would be that once Alice has finished decoding all the neurons in a NN, the AI's simulation of Alice might decode the first one again, or decode a neuron in some other random NN, or just like write a lot of words about what she's done without those moving her closer to the next important step, or something.

Values improvement. In addition to junctures where the AI needs to be (in some sense, like a next-token prediction sense) good at needle-moving alignment research in order to predict Alice well, there could imaginably be junctures where the AI needs to be (in some sense) good at POUDA-avoidance. For example, the AI should know that Alice isn't likely to respond to the situation "Alone with another person, such that murdering them would go undetected and result in more resources to do useful alignment research" with "Yeah, murder the person."

That example is easy, but getting good enough at POUDA-avoidance to maintain it in truly exotic situations is (Nate claims) likely to require more and broader training (by a lot) than picking up the CIS-y stuff does. By "truly exotic situations," I mean things like "You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways" as well as things like "You have gained enough insight to be able to conquer the world with high reliability." (I can imagine these two things being importantly "out of distribution" for different reasons.)

I'm not that sure how much there is to say about why/how this is the case, but here are some potential contributors:

A pretraining data set is likely to contain a rich set of examples of Alice and other humans being CIS-y, since those are things humans are constantly doing in lots of contexts. And the CIS-y aspects of what those humans are doing generalize quite far. It is unlikely to contain a comparably rich set of examples of Alice (and other humans) being POUDA-avoidant.
- Maybe part of this: Alice's superficially POUDA-avoidant behavior could be generated via applying CIS to any of a number of goals, given Alice's actual situations in the pretraining data set. For example, if Alice's goal were to conquer the world and turn it into paperclips, she still wouldn't murder random people to take their money - that's just not a good idea for those goals.⁵ So the AI is never forced to pattern-match "true" POUDA-avoidance (of the kind that nice humans actually have) in order to predict Alice's actions.
- A sufficiently intelligent agent with any random goal can perform well in an environment, even when that requires acting as though it has another goal; so, attributing any random goal to Alice will cause the AI to predict her actions just fine, without needing correction. (By contrast, an agent that lacks CIS mini-AIs really has to pick them up to perform well in certain environments.)
It's generally a good bet that the AI will learn all these little patterns (both the CIS-y ones and the POUDA-avoidance ones) in imperfect ways. But as it goes way outside of its training distribution, the screwups in the CIS-y patterns tend to get corrected (by things like "the AI notices it is not gaining power/option value"), and there's no reason for the screwups in the POUDA-avoidance patterns to be similarly corrected.
If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).
POUDA-avoidance patterns might also be inherently more "brittle" than CIS-y patterns (since the latter correspond to deep facts about how to accomplish ~whatever whereas the former are really all about our contingent values). So they might just generalize weirdly in unusual situations, and/or a capable enough AI might find ways of accomplishing its goals that are sort of within the "letter of the law" of POUDA-avoidance as seen in the training distribution (but outside what we'd consider the spirit of POUDA-avoidance). In general, POUDA-avoidance is going to interfere with/block CIS, so anything like "following the letter but not the spirit of POUDA-avoidance" or "self-modifying to stop POUDA-avoiding" will be helpful for CIS/roughly all other goals, so there's a sort of constant pressure for the AI to find stuff like that.
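The claim that the AI is "never forced" to pattern-match true POUDA-avoidance can be made concrete with a small cartoon. The Python sketch below is purely illustrative and assumption-laden: `Situation`, `true_avoider`, `instrumental_avoider`, and the three-example training set are hypothetical names and data invented here, and the sketch only restates the shape of Nate's claim rather than giving evidence for it. It shows two hypotheses about Alice that fit the same training data perfectly and only come apart in a situation the training data never contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Situation:
    harm_pays_off: bool    # would a harmful action advance some arbitrary goal?
    likely_detected: bool  # would the harm probably be noticed and punished?


def true_avoider(s: Situation) -> str:
    """Hypothesis A (assumed): Alice simply never takes the harmful action."""
    return "refrain"


def instrumental_avoider(s: Situation) -> str:
    """Hypothesis B (assumed): Alice refrains only when harm is a bad bet for her goal."""
    if s.harm_pays_off and not s.likely_detected:
        return "harm"
    return "refrain"


def mismatches(predictor: Callable[[Situation], str],
               data: List[Tuple[Situation, str]]) -> int:
    """Count examples where the predictor disagrees with Alice's observed action."""
    return sum(predictor(s) != action for s, action in data)


# Assumed training distribution: in every observed situation, harm is either
# pointless or likely to be detected, so Alice is always seen refraining.
training = [
    (Situation(harm_pays_off=False, likely_detected=True), "refrain"),
    (Situation(harm_pays_off=False, likely_detected=False), "refrain"),
    (Situation(harm_pays_off=True, likely_detected=True), "refrain"),
]

# An "exotic" situation absent from training: harm would pay off and go unnoticed.
exotic = Situation(harm_pays_off=True, likely_detected=False)

print(mismatches(true_avoider, training), mismatches(instrumental_avoider, training))
print(true_avoider(exotic), instrumental_avoider(exotic))
```

Since both hypotheses fit the training examples equally well, nothing in this toy's training signal favors Hypothesis A. How closely real pretraining data resembles this cartoon is exactly what Holden and Nate disagree about.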

So therefore

This has been an argument that the AI in Holden's training setup nonetheless ends up "aiming at CIS" more easily/faster (by a lot) than it ends up "reliably avoiding POUDA." If it is also powerful, it is therefore dangerous (according to the criteria I laid out at the top). (Or, if it doesn't end up dangerous, this is because it ended up useless.)

If this turns out true, Nate and Holden are on the same page about the implications (confirmed by Nate):

If this procedure results in a dangerous AI, it's super not clear (by default - perhaps if this problem got more attention someone would think of something) how to "fix" it and end up with a reliably POUDA-avoiding AI.
If lots of players are racing to develop and deploy powerful AI systems, using this kind of training setup, that's a really bad situation - like maybe the most careful ones will refrain from deploying their dangerous AI, but not everyone is going to hold off on that for all that long.
So unless we find a way of resolving the central tension, "You have to aim at CIS to do helpful things, and reliable POUDA-avoidance is much harder/longer to learn than aiming at CIS," we're in trouble. (Here "aim at" is using my version of "aim" - which intends to describe an AI acting as though it's trying to do something - rather than Nate's version of "aim".)

Some possible cruxes

Some beliefs I (Holden) have that seem in tension with the story above:

Robust POUDA-avoidance seems like it could be "easy" in the sense that a realistic amount of training is sufficient.
- To be clear, I think both Nate and I are talking about a pretty "thin" version of POUDA-avoidance here, more like "Don't do egregiously awful things" than like "Pursue the glorious transhumanist future." Possibly Nate considers it harder to get the former without the latter than I do.
It seems like the capabilities needed to do alignment research could be "narrower" in some sense than what I think Nate is picturing. Like, he seems to be picturing an actually-effective "automated alignment researcher" necessarily doing a lot of reflection⁶ (this is how it ends up in parts of the distribution that really challenge the POUDA-avoidance), and/or taking on something very ambitious (e.g., disempowering all of humanity) as a necessary consequence of doing needle-moving alignment research. In my head, you can just have something that is dumb in many ways and not particularly "reflective" (at least, not in a way that leads to big self-modification) doing ~arbitrary goodness of alignment research.
- This may be the more important crux, in the sense that if I went with Nate's view here, I might then consider an "automated alignment researcher" to need to go far enough out of distribution that I'd find robust POUDA-avoidance much less plausible.

Where Holden could look to find Nate's source of current confidence (and some reactions from Holden)

Here is basically what I have from Nate in the Slack exchange on this:

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

(where you win "in hard mode" if you stick only to the training data that people plausibly generate if they're not specifically trying to stave off the difficulties I'm pointing to, and you win only in "easy mode" if your training data is plausibly-generatable but only by people who are intentionally trying to stave off these difficulties. (and if you win only on "easy mode" then i get to keep my "y'all will still die if you don't find others who can also point to problems that you were about to walk into b/c i can't do it alone" card.))

and, like, it's a pretty tricky game to play b/c it's all made-up bullshit and it's hard to agree on who strained credulity more, but there's some sort of idealized game here where it sounds to me like we each expect we'd win if we played it ...

So the place that my brain reports it gets its own confidence from, is from having done exercises that amount to self-play in the game I mentioned in a thread a little while back, which gives me a variety of intuitions about the rows in your table (where I'm like "doing science well requires CIS-ish stuff" and "the sort of corrigibility you learn in training doesn't generalize how we want, b/c of the interactions w/ the CIS-ish stuff")

(that plus the way that people who hope the game goes the other way, seem to generally be arguing not from the ability to exhibit playthroughs that go some other way, but instead be arguing from ignorance / "we just don't know")

i suspect that this is a frustrating answer

Basically I picture Nate trying to think through - in a more detailed, mechanistic way than I can easily picture - how a training process could lead an AI to the point of being able to do useful alignment research, and as he does this Nate feels like it keeps requiring a really intense level of CIS, which then in turn (via the CIS leading the AI into situations that are highly "exotic" in some sense - mostly, I think, via having done a lot of self-modification/reflection?) seems like it goes places where the kind of POUDA-avoidance pattern learned in training wouldn't hold. Nate endorses this paragraph. He adds, via comments: "also because it just went really far. like, most humans empirically don't invent enough nanotech to move the needle, and most societies that are able to do that much radically new reasoning do undergo big cultural shifts relative to the surroundings. like, it probably had to invent new ways of seeing the problems and thinking about them and the CIS stuff generalizes better than the POUDA stuff (or so the hypothesis goes)"

Some more Holden thoughts on this:

It's not implausible to me that one could think about this kind of thing in a lot more detail than I have, to the point where one could be somewhat confident in Nate's view (maybe, like, 70% confident, so there's still a delta here as I believe Nate is around 85% confident in this view). Nate adds: "(tbc, my nines on doom don't come from nines on claims like this, they come from doom being disjunctive. this is but one disjunct.)"

But:

I broadly don’t trust Nate to have thought about this as well as he thinks he has.
I have my own inside-view of how creative intellectual work works, and it disagrees with Nate’s take that it requires the kind of CIS-y moves he refers to.
You could argue that I should defer to Nate’s view on this since he’s almost certainly spent more hours thinking about it than I have, but I wouldn’t agree, for reasons including:
- I feel like I’ve been over things touching on this topic in varying amounts of depth with a number of ML researchers, including Paul, and I think they understand and reject the underlying Nate intuition here. (I think my take here is a huge judgment call and I’m super non-confident in it; I would feel very unsurprised if it turned out they didn’t really understand the intuition or were getting it wrong.)
- Less importantly, I have thought about some vaguely nearby “how creative intellectual work works”-related topics a bunch and think there are some cases in which it seems like at least Eliezer is wrong on such things in clear-cut ways. (E.g., I’ve seen a # of signs that he seems to endorse the “golden age” view examined and criticized here.) For reasons not super easy to spell out, this makes me a bit more comfortable sticking to my judgment on this topic.
(There will probably be a bunch more back-and-forth on this point in the MIRI version of this exchange)

To be clear though, I’m not unaffected by this whole exchange. I wasn’t previously understanding the line of thinking laid out here, and I think it’s a lot more reasonable than coherence-theorem-related arguments that had previously been filling a similar slot for me. I see the problem sketched in this doc as a plausible way AI alignment could turn out to be super hard even with pretty benign-seeming training setups, and not one I’d previously been thinking about. (The argument in this doc isn't clearly more or less reasonable than what I'd been expecting to encounter at some point, so I'm not sure my actual p(doom) changed at all, though it might in the future - see below.)

Future observations that could update Holden or Nate toward the other's views

Nate’s take on this section: “I think my current take is: some of the disagreement is in what sort of research output is indicative of needle-moving capability, and historically lots of people have hope about lots of putative alignment work that I think is obviously hopeless, so I'm maybe less optimistic than Holden here about getting a clear signal. But I could imagine there being clear signals in this general neighborhood, and I think it's good to be as explicit as this section is."

Holden nominates this as a thing Nate should update on:

We get AIs doing some really useful alignment-related research, despite some combination of:
- The AIs being trained using methods basically similar to what's going on now (just like, a Transformer trained on next-token prediction with some RLHF on top, or something not too different from that, or at least, not a lot more different from that than e.g. Bing Chat is from the original GPT-3)
- At least some such AIs still seeming basically silly in many important ways - like they just kind of suck at reasoning about some pretty core and not-too-challenging domains, and/or there's a distinct lack of signs that they are doing much to e.g. "reflect" or reconcile conflicting goals
- We're all still here
What does "really useful" mean? I'd ask Nate to nominate a minimal level of useful/impressive that would update him, but I'd think the bar should clearly be cleared by something like: an AI or pretty easy-to-point-at set of AIs being in some sense (maybe roughly comparable to here) the top contributor to a paper roughly as impressive (given the other suite of papers that exist at that time) as the 20th percentile of: Redwood's "indirect object identification" paper, Redwood's "causal scrubbing" writeup, ARC's last couple of papers as of today (something about heuristic arguments, and the ELK writeup), Anthropic's last few interpretability papers as of today, this post, The Telephone Theorem, The Plan, this shard theory post, the shard-theory-inspired diamond maximization proposal, this Neel Nanda post, Simulators, this Collin Burns paper.
- Importantly, I (Holden) am NOT claiming/implying that getting a large quantity of more papers like this will be good enough to save us. I am only claiming that these papers demonstrate enough “serious wrestling with the problem” that Nate should be surprised if he sees an AI doing something like that without going into dangerous CIS-y places. The more detailed version of these notes will probably have a bunch of back-and-forth where Nate and I clarified this point.
- Accordingly, simply producing papers “like these” does not count. The AI-driven research should be similarly indicative of a latent ability to "truly wrestle with the alignment problem" and "push beyond the current research frontier"; it shouldn’t feel like they are coming from a weird trick.
- If it's putting out papers that I judge to be sufficiently impressive, and Nate doesn't, then sadly we will both think events are vindicating (or at least not anti-vindicating) our view. But that could be an interesting crux if it happens.

What should Holden update on? I mean, I think some kind of noticeable lack of the above would update me, where "noticeable" means something like: "Even though AI products are making great strides and AI is being heavily used as part of research/engineering workflows, and there are pretty frequent cases of an AI being lead author on something roughly as intellectually interesting as an average paper in Science/Nature,⁷ we're not seeing anything like the above."

I've focused heavily on the crux about needle-moving alignment research requiring some kind of pretty dramatic reflection/modification/ambition/something - that seems like it generates pretty concretely different expectations. I'm not sure I can do anything similar with the crux about POUDA-avoidance, because I think Nate's picture is that the part of the POUDA-avoidance that's hard to learn is the one that comes up in scenarios that are "exotic" in some sense.

Notes

1. We probably spent more time on the summary than on the exchange itself, which I think makes sense - I often find that trying to express something in a distilled way is a nice way to confront misunderstandings.
2. To be clear, my best guess is that we'll see an explosively fast takeoff by any normal standard, but not quite as "overnight" as I think Nate and Eliezer picture.
3. Like, the plan might explicitly say something like "Now think of new insights" - the point isn't "something will come up that wasn't in the plan," just the weaker point that "the plan wasn't able to give great guidance on this part."
4. Nate: “(and you can't just "turn this off", b/c these "reflective" and "CIS"ish processes are part of how it's able to continue making progress at all, beyond the training regime)”
5. Nate: “and this model doesn't need to predict that Alice is constantly chafing under the yoke of her society (as might be refuted by her thoughts); it could model her as kinda inconsistent and likely to get more consistent over time, and then do some philosophy slightly poorly (in ways that many humans are themselves prone to! and is precedented in philosophy books in the dataset!) and conclude that Alice is fundamentally selfish, and would secretly code in a back-door to the 'aligned' AI if she could ... which is entirely consistent with lots of training data, if you're just a little bad at philosophy and aren't actually running an Alice-em ... this is kinda a blatant and implausible example, but it maybe illustrates the genre \shrug”
6. Nate: sure, but it seems worth noting (to avoid the obv misunderstanding) that it's self-modification of the form "develop new concepts, and start thinking in qualitatively new ways" (as humans often do while doing research), and not self-modification of the form "comprehend and rewrite my own source code" ... or, well, so things go in the version of your scenario that i think is hardest for me. (i think that in real life, people might just be like "fuck it, let it make experimental modifications to its own source code and run those experimentally, and keep the ones that work well", at which point i suspect we both assume that, if the AI can start doing this competently in ways that improve its abilities to solve problems, things could go off the rails in a variety of ways.)
7. I do want to be quite explicit that art doesn't count here, I mean interesting in a sciencey way.

A particular pattern Nate has talked about is what I might call "reflection." The basic idea is that in order to do certain kinds of research effectively, you need to keep asking questions like "What am I actually trying to do here and why? What is my big-picture goal?", which are questions that might "change your aims" in some important sense. The idea is not necessarily that you're rewriting your own source code, but that you're doing the kind of reflection and self-modification a philosophically inclined, independent-minded human might do: "I've always thought I cared about X, but when I really think about the implications of that, I realize maybe I only care about Y" and such. I think that in Nate's ontology (and I am partly sympathetic), it's hard to disentangle something like "Refocusing my research agenda to line it up with my big-picture goals" from something like "Reconsidering and modifying my big-picture goals so that they feel more satisfying in light of all the things I've noticed about myself." Reflection (figuring out what you "really want") is a kind of CIS, and one that could present danger, if an AI is figuring out what it "really wants" and we haven't got specific reasons to think that's going to be what we want it to want.

I'll unpack a bit more the sort of mental moves which I think Nate is talking about here.

In January, I spent several weeks trying to show that the distribution of low-level world state given a natural abstract summary has to take a specific form. Eventually, I became convinced that the thing I was trying to show was wrong - the distributions did not take that form. So then what? A key mental move at that point is to:

Query why I wanted this thing-that-turned-out-not-to-work in the first place - e.g. maybe that form of distribution has some useful properties
Look for other ways to get what I want - e.g. a more general form which has a slightly weaker version of the useful properties I hoped to use

I think that's the main kind of mental move Nate is gesturing at.

It's a mental move which comes up at multiple different levels when doing research. At the level of hours or even minutes, I try a promising path, find that it's a dead end, then need to back up and think about what I hoped to get from that path and how else to get it. At the level of months or years, larger-scale approaches turn out not to work.

I'd guess that it's a mental move which designers/engineers are also familiar with: turns out that one promising-looking class of designs won't work for some reason, so we need to back up and ask what was promising about that class and how to get it some other way.

Notably: that mental move is only relevant in areas where we lack a correct upfront high-level roadmap to solve the main problem. It's relevant specifically because we don't know the right path, so we try a lot of wrong paths along the way.

As to why that kind of mental move would potentially be highly correlated with dangerous alignment problems... Well, what does that same mental move do when applied to near-top-level goals? For instance, maybe we tasked the AI with figuring out corrigibility. What happens when it turns out that e.g. corrigibility as originally formulated is impossible? Well, an AI which systematically makes the move of "Why did I want X in the first place and how else can I get what I want here?" will tend to go look for loopholes. Unfortunately, insofar as the AI's mesa-objective is only a rough proxy for our intended target, the divergences between mesa-objective and intended target are particularly likely places for loopholes to be.

I personally wouldn't put nearly so much weight on this argument as Nate does. (Though I do think the example training process Holden outlines is pretty doomed; as Nate notes, disjunctive failure modes hit hard.) The most legible-to-me reason for the difference is that I think that kind of mental move is a necessary but less central part of research than I expect Nate thinks. This is a model-difference I've noticed between myself and Nate in the past: Nate thinks the central rate-limiting step to intellectual progress is noticing places where our models are wrong, then letting go and doing something else, whereas I think identifying useful correct submodels in the exponentially large space of possibilities is the rate-limiting step (at least among relatively-competent researchers) and replacing the wrong parts of the old model is relatively fast after that.

Oh this is a great complication—you highlight why mental moves, like “reflection,” can lead to potential loopholes and complications. Regardless of whether it's a necessary or less central part of research, as you suggest, self-modifying goal-finding is always a potential issue in AI alignment. I appreciate the notion of “noticeable lack.” This kind of thinking pushes us to take stock of how and whether AIs actually are doing useful alignment research with benign seeming training setups.

Is it *noticeably* lacking or clearing an expected bar? This nuance is less about quantity or quality than it is about expectation—*do we expect it to work this well?* Or, do we expect more extreme directions need to be managed? This is the kind of expectation that I think builds stronger theory. Great food for thought in your reply too. Consideration of model differences between yourself and others is super important! Have you considered trying to synthesize between Nate and your own viewpoints? It might be a powerful thing for expectations and approaches.

Generic (but strong) upvote for more public cruxing (ish) discussions between MIRI and outsiders!

So the place that my brain reports it gets its own confidence from, is from having done exercises that amount to self-play in the game I mentioned in a thread a little while back, which gives me a variety of intuitions about the rows in your table (where I'm like "doing science well requires CIS-ish stuff" and "the sort of corrigibility you learn in training doesn't generalize how we want, b/c of the interactions w/ the CIS-ish stuff")
(that plus the way that people who hope the game goes the other way, seem to generally be arguing not from the ability to exhibit playthroughs that go some other way, but instead be arguing from ignorance / "we just don't know")

I like this answer a lot; as someone whose intuition on this matter has never really produced conclusions as sharp/confident as Nate's, I think this is the first time I've seen Nate's models queried concretely enough to produce a clear story as to where his intuitions are coming from. (Kudos to the both of you for managing to turn some of these intuitions into words!)

Of course, I also agree with some of the things Holden says in his list of disagreements, which I personally would characterize as "just because I now have a sense of the story you're telling yourself, doesn't mean I trust that story without having stepped through it myself." That's pretty inevitable, and I think represents a large chunk of the remaining disagreement. My sense is that Nate views his attempts at self-play in his "game" as a form of "empirical" evidence, which is a somewhat unusual way of slicing things, but not obviously mistaken from where I currently stand: even if I don't necessarily buy that Nate has stepped through things correctly, I can totally buy that playing around with toy examples is a great way to sharpen and refine intuitions (as has historically been the case in e.g. physics—Einstein being a very popular example).

This being the case, I think that one of the biggest blockers to Nate/MIRI's views spreading more broadly (assuming that their views broadly are correct), in the absence of the empirical indicators Holden outlined in the last section, is for someone to just sit down and... do the exercise. (I could imagine a world where someone sat down with Nate/Eliezer, played the game in question, and left feeling like they hadn't learned much—because e.g. Nate/Eliezer spent most of the time playing word games—but that isn't my modal world, which is much more one in which their interlocutor leaves feeling like they've learned something, even if they haven't fully updated towards complete endorsement of the MIRI-pilled view.)

I think it’s a lot more reasonable than coherence-theorem-related arguments that had previously been filling a similar slot for me

I'm confused by this sentence. It seems to me that the hypothetical example (and game) proposed by Nate is effectively a concretized way of intuition-pumping the work that coherence theorems (abstractly) describe? I.e. for any system that a coherence theorem says anything about, it will necessarily be the case that as you look at that specific system's development more closely, you will find yourself making strange and surprising observations as the system's cognition "unfolds" into a more coherent shape; but on my model, this is just the theorem's conclusion manifesting "into" the system as its intelligence increases (or, equivalently, the system being pulled into the coherence-attractor the theorem characterizes).

Obviously, humans find concrete examples and intuition pumps more compelling, for reasons relating to ease of understanding and grokking, but I don't think I (yet) understand the contrast you draw here between Nate's argument and the coherence-theorem argument—they seem, at the core, to simply be two different ways of making the same point?

What coherence theorem do you have in mind that has these implications?

For that matter, what implications are you referring to?

It seems plausible to me that there could be non CIS-y AIs which could nonetheless be very helpful. For example, take the example approach you suggested:

(This might take the form of e.g. doing more interpretability work similar to what's been done, at great scale, and then synthesizing/distilling insights from this work and iterating on that to the point where it can meaningfully "reverse-engineer" itself and provide a version of itself that humans can much more easily modify to be safe, or something.)

I wouldn't feel that surprised if greatly scaling the application of just current insights rapidly increased the ability of the researchers capable of "moving the needle" to synthesize and form new insights from these themselves (and that an AI trained on this specific task could do without much CIS-ness). I'm curious as to whether this sort of thing seems plausible to both you and Nate!

Assuming that could work, it then seems plausible that you could iterate this a few times while still having all the "out of distribution" work being done by humans.

Yes this seems clearly true.

Tho I would think that Nate's too subtle a thinker to think that AI assistance is literally useless - just that most of the 'hardest work' is not easily automatable, which seems pretty valid.

i.e. in my reading most of the hard work of alignment is finding good formalizations of informal intuitions. I'm pretty bullish on future AI assistants helping, especially proof assistants, but this doesn't seem to be a case where you can simply prompt gpt-3 scaled x1000 or something silly like that. I understand Nate thinks that if it could do that it is secretly doing dangerous things.

This post attempts to describe a key disagreement between Karnofsky and Soares (written by Karnofsky) pertaining to the alignment protocol "train an AI to simulate an AI alignment researcher". The topic is quite important, since this is a fairly popular approach.

Here is how I view this question:

The first unknown is how accurate the simulation is. This is not really discussed in the OP. On the one hand, one might imagine that with more data, compute and other improvements, the AI should ultimately converge on an almost perfect simulation of an AI alignment researcher, which is arguably safe. On the other hand, there are two problems with this. First, such a simulation might be vulnerable to attacks from counterfactuals. Second, the prior is malign, i.e. the simulation might converge to representing a "malign simulation hypothesis" universe rather than the intended null hypothesis / ordinary reality.

Instead, we can imagine a simulation that's not extremely accurate, but that's modified to be good enough by fine-tuning with reinforcement learning. This is essentially the approach in contemporary AI and is also the assumption of the OP. Although Karnofsky says "a small amount of RL", I don't know why he believes a small amount is sufficient. Perhaps RL seemed less obviously important then than it does now, with the recent successes of o1 and o3.

The danger (as explained in the OP by Soares paraphrased by Karnofsky) is that it's much easier to converge in this manner on an arbitrary agent that has the capabilities of the imaginary AI alignment researcher (which probably have to be a lot greater than capabilities of human researchers to make it useful), but doesn't have values that are truly aligned. This is because "agency" is (i) a relatively simple concept and (ii) a robust attractor, in the sense that any agent would behave similarly when faced with particular instrumental incentives, and it's mainly this behavior that the training process rewards. On the other hand, human values are complex and some behaviors that are necessary to pinpoint them might be rare.

Karnofsky's counterargument is twofold: First, he believes that merely avoiding catastrophic outcomes should be a lot easier than pinpointing human values. Second, he believes that AI alignment research can be done without much agency or reflection, and hence useful AI alignment research arises in the simulation before full-fledged agency.

Regarding the first counterargument, I'm not sure why Karnofsky believes it (it's not really supported in the OP). I think he's imagining something like "in the training data, AI alignment researchers never engineer nanobots that take over the world, hence the AI will also never engineer nanobots that take over the world". However, this seems like relying on the simulation being sufficiently bad. Indeed, there are situations in which I would consider it correct to engineer nanobots that take over the world, they just seem to have never arisen in my life so far^[1]. Hence, a sufficiently good simulation of me would also do that in some situation. The question then becomes whether the exact circumstances and the type of nanobots are captured by the simulation correctly, which is much more fraught.

Worse, even an accurate simulation of a human is not necessarily safe. I think that there are plenty of humans who, given unlimited power, would abuse it in a manner catastrophic for almost everyone else. When it comes to fully aligned ASI, I'm mostly hoping for a collectively-good outcome due to some combination of:

ASI is aligned to the aggregate values of many people.
Acausal cooperation between the people that the ASI is aligned to and other people who supported or at least haven't hindered the project.
A "virtue ethics" component of human values, where you don't want to be "the kind of person who would do [thing]" even if [thing] is net-beneficial to you in an abstract sense. (But not all people have this!)

These sources of hope seem pretty brittle when it comes to an imperfect simulation of possibly a small number of people, who might not even correspond to any particular real people but be some kind of AI-generated characters.

Regarding the second counterargument, for now it mostly comes down to a battle of intuitions. That said, I think that metacognitive agents lend a lot of credence to the idea that even "purely mental" tasks require agency and reflection to master: you need to make and execute plans for thinking about the problem, and you need to reflect about the methods you use in your thinking. Anecdotally, I can testify that my thinking about AI alignment led me to much reflection about my values and high-level hopes for the future. Moreover, this is another case where Karnofsky seems to hope that the simulation will be bad.

Relying on the simulation being bad is a dangerous proposition. It means we are caught between the Scylla of "the simulation is too good to be safe" and the Charybdis of "the simulation is too bad to be useful" and it's not clear the zone between them exists at all.

Overall, I would say that neither side has a slam dunk case, but ignoring the dangers without much stronger arguments seems deeply unwise.

[1] As far as can be told from public record. I neither confirm nor deny that I ever was in a situation in which I considered engineering nanobots that take over the world.

I found this post very helpful, thanks! If I find time to try to form a more gears-level independent impression about alignment difficulty and possible alignment solutions, I'll use this as my jumping-off point.

Separately, I think it would be cool if a bunch of people got together and played this game for a while and wrote up the results:

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

Do you want to try playing this game together sometime?

Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?

Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)

A few reflections on this piece:

It helped me understand Nate's view. I'd read previous examples of his suggesting "cutting edge scientific work necessitates CIS pursuit" (e.g. the laser in the fog), but it wasn't clear to me how important he considered these examples
Theories about how ML generalises play a substantial role in everyone's thinking here. AFAIK we don't have precise theories that are much help in this regard, and in practice people usually use imprecise theories. Just because they're imprecise doesn't mean they can't be explained. Perhaps some effort into explicating theories of ML generalisation could help everyone understand one another.
As AI gets more powerful, it makes sense that AI systems will make higher and higher level decisions about the objectives that they pursue. Holden & Nate focusing on the benchmark of "needle moving scientific research" seems to suggest agreement on the following:
- In order to sustain the trend of more powerful AI making higher and higher level decisions, we will need substantial innovation in our technologies for AI control
- The rate of innovation possible under business as usual human science seems unlikely to keep up with this need
- Thus we require AI acceleration of AI control science

Regarding this last point: it's not clear to me whether slow progress in AI control systems will lead to slow progress in AI making higher and higher level decisions or not. That is, it's not obvious to me that AI control systems failing to keep up necessarily leads to catastrophe. I acknowledge that very powerful AI systems may seem to work well with poor control technologies, but I'm uncertain about whether moderately powerful AI systems work well enough with poor control technologies for the very powerful systems to be produced (and also what the relevant levels of power are, compared to today's systems).

One more thing: I’m suspicious of equivocation between “some convergent instrumental sub goals” and “worrisome convergent instrumental sub goals”. There are probably many collections of CISs that aren’t similar to the worrisome ones in the machine’s picture of the world.

And another more thing:

To be clear, I think both Nate and I are talking about a pretty "thin" version of POUDA-avoidance here, more like "Don't do egregiously awful things" than like "Pursue the glorious transhumanist future." Possibly Nate considers it harder to get the former without the latter than I do.

I'm still unsure how much pivotal act considerations weigh in Nate/MIRI's views. My view is roughly:

Cutting edge scientific work without disastrous instrumental behaviour seems pretty attainable
Unilaterally preventing anyone else from building AI seems much more likely to entail disastrous instrumental behaviour

and I can easily imagine finding it difficult to be confident you're avoiding any catastrophes if you're aiming for the second.

Nate’s view here seems similar to “To do cutting-edge alignment research, you need to do enough self-reflection that you might go crazy”. This seems really wrong to me. (I’m not sure if he means all scientific breakthroughs require this kind of reflection, or if alignment research is special).

I don’t think many top scientists are crazy, especially not in a POUDA way. I don’t think top scientists have done a huge amount of self-reflection/philosophy.

On the other hand, my understanding is that some rationalists have driven themselves crazy via too much self-reflection in an effort to become more productive. Perhaps Nate is overfitting to this experience?

“Just do normal incremental science; don’t try to do something crazy” still seems like a good default strategy to me (especially for an AI).

Thanks for this write up; it was unusually clear/productive IMO.

(I’m worried this comment comes off as mean or reductive. I’m not trying to be. Sorry)

From my experience, just realising how high stakes are and how unprepared we are could be enough to put a strain on someone's mental health.

Some top scientists are crazy enough that it would be disastrous to give them absolute power.

I mostly agree with Holden, but think he's aiming to use AIs with more CIS than is needed or safe.

Wrt power differentials, one of my go-to examples of real-world horrific misalignment is humans' relationship to the rest of the animal kingdom, and the unfortunate fact that as humans got more power via science and capitalism, things turned massively negative for animals. Science and capitalism didn't create these negative impacts (they've been around since the founding of humans), but they supercharged them into S-risks and X-risks for animals. The alignment mechanisms that imperfectly align intraspecies relations don't exist at all in the interspecies case, which lends at least some support to the thesis that alignment will not happen by default.

Now this section is less epistemically sound than the first section, but my own theory of why alignment fails in the interspecies case basically boils down to the following:

Alignment can only happen right now when the capabilities differentials are very limited, and this roughly tracks the intraspecies vs. interspecies distinction: capability differences between species are quite a bit more heavy-tailed, and much larger, than differences within a single species.

Now I haven't made any claim on how difficult alignment turns out to be, only that it probably won't be achieved by default.

A possibly helpful - because starker - hypothetical training approach you could try for thinking about these arguments is to make an instance of the imitatee that has all their (at least cognitive) actions sped up by some large factor (e.g. 100x), e.g., via brain emulation (or just "by magic" for the purpose of the hypothetical).

I think Nate and I would agree that this would be safe. But it seems much less realistic in the near term than something along the lines of what I outlined. A lot of the concern is that you can't really get to something equivalent to your proposal using techniques that resemble today's machine learning.

Interesting - it's not so obvious to me that it's safe. Maybe it is because avoiding POUDA is such a low bar. But the sped up human can do the reflection thing, and plausibly with enough speed up can be superintelligent wrt everyone else.

Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:

You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways
[...]
If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).

It certainly seems to me that e.g. people like Ziz have done reflection in a "goofy" way, and that being human has not particularly saved them from deriving "crazy stuff". Of course, humans doing reflection would still be confined to a subset of the mental moves being done by crazy minds made out of gradient descent on matrix multiplication, but it's currently plausible to me that part of the danger arises simply from "reflection on (partially) incoherent starting points" getting really crazy really fast.

(It's not yet clear to me how this intuition interfaces with my view on alignment hopes; you'd expect it to make things worse, but I actually think this is already "priced in" w.r.t. my P(doom), so explicating it like this doesn't actually move me—which is about what you'd expect, and strive for, as someone who tries to track both their object-level beliefs and the implications of those beliefs.)

(EDIT: I mean, a lot of what I'm saying here is basically "CEV" might not be so "C", and I don't actually think I've ever bought that to begin with, so it really doesn't come as an update for me. Still worth making explicit though, IMO.)

I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems quite comparable to future humans having control over the future (also after a lot of effective "reflection").

I think there's some validity to worrying about a future with very different values from today's. But I think misaligned AI is (reasonably) usually assumed to diverge in more drastic and/or "bad" ways than humans themselves would if they stayed in control; I think of this difference as the major driver of wanting to align AIs at all. And it seems Nate thinks that the hypothetical training process I outline above gets us something much closer to "misaligned AI" levels of value divergence than to "ems" levels of value divergence.

My view wrt moral reflection leading to things we perceive as bad is that, I suspect, it ultimately comes down to the fact that there are too many valid answers to the question "What's moral/ethical?" or "What's the CEV?" Indeed, I think there are an infinite number of valid answers to these questions.

This leads to several issues for alignment:

Your endpoint in reflection completely depends on your starting assumptions, and these assumptions are choosable.
There is no safeguard against someone reflecting and ending up in a point where they harm someone else's values. Thus, seemingly bad values from our perspective can't be guaranteed to be avoided.
The endpoints aren't constrained by default, thus you have to hope that the reflection process doesn't lead to your values being lessened or violated.

Nate's concerns don't seem to be the sort of thing that gradient descent in a non-recurrent system learns. (I basically agree with Steve Byrnes here.) GPT-4 probably has enough engagement with the hardware that you could program something that acquires more computer resources using the weights of GPT-4. But it never stumbled on such a solution in training, in part because in gradient descent the gradient is calculated using a model of the computation that doesn't take hacking the computer into account.

In a recurrent system that learns by some non-gradient-descent procedure (e.g. evolutionary algorithms or self-modification), real-world CISs seem a lot more plausible.

The post addresses this

This feels kinda unrealistic for the kind of pretraining that's common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we condition on the latter, it seems kinda reasonable to imagine there must be cases where an AI has to be able to do needle-moving alignment research in order to improve at next-token prediction, and this feels like a reasonable way that might happen.

I'm not too impressed with gut feelings about what "seems kinda reasonable." By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor, with no need for them to be sneaking out in the dead of night to do internal symbolic reasoning in some Fodorian language of thought, or only be doing that cognitive work as part of a long-term plan, or hack their computers to acquire more resources.

I think there are dangers related to what technologies are close to other technologies; if you develop an alignment research assistant that can do original research, someone is already developing a general purpose research assistant that can do original research on AI design, and a third person is probably working on using the architecture to train an agent that navigates the real world. But I think it's the wrong model of the world to think that a Jan Leike style research assistant must inherently be trying to navigate the real world.

By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor

Yeah, I think Nate doesn't buy this (even for much more recent systems such as GPT-3.5/GPT-4, much less GPT-2). To the extent that [my model of] Nate thinks that LLMs/LLM-descended models can do useful ("needle-moving") alignment research, he expects those models to also be dangerous (hence the talk of "conditioning on"); but [my model of] Nate mostly denies the antecedent. Being willing to explore counterfactual branches on your model (e.g. for the purpose of communication, as was the case here, with Holden) doesn't mean you stop thinking of those branches as counterfactual!

Or, perhaps more directly:

By GPT-2 Nate already should have been updating towards the proposition that big bags of heuristics, composed together, can do human-level cognitive labor, with no need for them to be sneaking out in the dead of night to do internal symbolic reasoning

I think Nate would argue that [a significant subset of] human-level cognitive labor in fact does require "sneaking out in the dead of night to do internal symbolic reasoning". Humans do that, after all! To the extent that GPT-2 does not do this, it accomplishes [not hacking its hardware/not seeking power or resources/not engaging in "CIS" behavior] primarily by not being very good at cognition.

It's clear enough that you disagree with Nate about something, but (going by your current comments) I don't think you've located the source of the disagreement. E.g. what you write here in your top-level comment straight up doesn't apply to Nate's model, AFAICT:

GPT-4 probably has enough engagement with the hardware that you could program something that acquires more computer resources using the weights of GPT-4. But it never stumbled on such a solution in training, in part because in gradient descent the gradient is calculated using a model of the computation that doesn't take hacking the computer into account.

I don't think Nate would have predicted that GPT-4 would (or could) hack its hardware, because [my model of] Nate keeps track of a conceptual divide between useful/dangerous ("CIS") cognition and non-useful/dangerous cognition, and Nate would not have predicted GPT-4 to cross that divide. (I personally think this divide is a little weird, which I intend to explore further in a different comment, but: presuming the divide or something like it, the rest of Nate's view feels quite natural to me.) Presuming that his model should have predicted that GPT-4 would hack its hardware or do something along those lines, and criticizing his model on the basis of that failed prediction (that it did not, in fact, make) strikes me as sneaking in a couple of assumptions of your own into [your model of] his model.

We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act."

Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations?
Also, if 10 episodes suffices, why is so much post-training currently done on base models?

For any confidence that an AI system A will do a good job of its assigned duty of maximizing alignment in AI system B, wouldn't you need to be convinced that AI system A is well aligned with its given assignment of maximizing alignment in AI system B? In other words, doesn't that suppose you have actually already solved the problem you are trying to solve?

And if you have not - aren't you just priming yourself for manipulation by smarter beings?

There might be good reasons why we don't ask the fox about the best ways to keep the fox out of the henhouse, even though the fox is very smart, and might well actually know what those would be, if it cared to tell us.

The hope discussed in this post is that you could have a system that is aligned but not superintelligent (more like human-level-ish, and aligned in the sense that it is imitation-ish), doing the kind of alignment work humans are doing today, which could hopefully lead to a more scalable alignment approach that works on more capable systems.

But then would a less intelligent being (i.e. the collectivity of human alignment researchers and the less powerful AI systems that they use as tools in their research) be capable of validly examining a more intelligent being, without being deceived by the more intelligent being?

It seems like the same question would apply to humans trying to solve the alignment problem - does that seem right? My answer to your question is "maybe", but it seems good to get on the same page about whether "humans trying to solve alignment" and "specialized human-ish safe AIs trying to solve alignment" are basically the same challenge.

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

FWIW I would love to see the result of you two actually playing a few rounds of this game.

Curated. On one hand, folks sure have spent a long time trying to hash out longstanding disagreements, and I think it's kinda reasonable to not feel like that's a super valuable thing to do more of.

On the other hand... man, sure seems scary to me that we still have so many major disagreements that we haven't been able to resolve.

I think this post does a particularly exemplary job of exploring some subtle disagreements from a procedural level: I like that Holden makes a pretty significant attempt to pass Nate's Ideological Turing Test, flags which parts of the post represent which person's views, flags possible cruxes, and explores what future efforts (both conceptual and empirical) might further resolve the disagreement.

It's... possible this is actually the single best example of a public doublecrux writeup that I know of?

Anyways, thanks Holden and Nate for taking the time to do this, both for the object level progress and for serving as a great example.

It's... possible this is actually the single best example of a public doublecrux writeup that I know of?

This sentence was confusing to me given that the post does not mention 'double crux', but I mentioned it to someone and they said to think of it as the mental motion and not the explicit format, and that makes more sense to me.

Yeah that's what I intended.

Nate’s take on this section: “I think my current take is: some of the disagreement is in what sort of research output is indicative of needle-moving capability, and historically lots of people have hope about lots of putative alignment work that I think is obviously hopeless, so I'm maybe less optimistic than Holden here about getting a clear signal. But I could imagine there being clear signals in this general neighborhood, and I think it's good to be as explicit as this section is.”

Oh, and also: this response from Nate feels weird to me for reasons that I currently seem to lack the enthusiasm/energy/"spoons" to explicate. Leaving this comment as a placeholder to come back to.

This feels kinda unrealistic for the kind of pretraining that's common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we *condition on* the latter, it seems kinda reasonable to imagine there must be cases where an AI has to be able to do needle-moving alignment research in order to improve at next-token prediction, and this feels like a reasonable way that might happen.

For what little it’s worth, I mostly don’t buy this hypothetical (see e.g. here), but if I force myself to accept it, I think I’m tentatively on Holden’s side.

I’m not sure this paragraph will be helpful for anyone but me, but I wound up with a mental image vaguely like a thing I wrote long ago about “Straightforward RL” versus “Gradient descent through the model”, with the latter kinda like what you would get from next-token prediction. Again, I’m kinda skeptical that things like “gradient descent through the model” would work at all in practice, mainly because the model is only seeing a sporadic surface trace of the much richer underlying processing; but if I grant that it does (for the sake of argument), then it would be pretty plausible to me that the resulting model would have things like “strong preference to generally fit in and follow norms”, and thus it would do fine at POUDA-avoidance.

Given the results Anthropic have been getting from constitutional AI, if our AI non-deceptively wants to avoid Pretty Obvious Unintended/Dangerous Actions (POUDAs), it should be able to get quite a lot of mileage out of just regularly summarizing its current intended plans, then running those summaries past an LLM with suitable prompts asking whether most people, or most experts in relevant subjects, would consider these plans pretty obviously unintended (for an Alignment researcher) and/or dangerous. It also has the option of using the results as RL feedback on some of its components. So I don't think we need a specific dataset for POUDAs; I think we can use "everything the LLM was trained on" as the dataset. Human values are large and fragile, but so are many other things that LLMs do a fairly good job on.
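
To make the loop above concrete, here is a minimal sketch of "summarize plans, then ask a judge model". Everything in it is an assumption for illustration: `REVIEW_PROMPT`, `stub_judge`, and `review_plans` are hypothetical names, and the judge is a keyword stub standing in for a real prompted LLM so that the example runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt wording; an illustration, not a tested prompt.
REVIEW_PROMPT = (
    "Would most people, or most relevant experts, consider this plan "
    "pretty obviously unintended (for an alignment researcher) "
    "and/or dangerous? Answer YES or NO.\n\nPlan: {summary}"
)

def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM call: flags a couple of hard-coded phrases."""
    plan = prompt.split("Plan:", 1)[1].lower()
    red_flags = ("disable oversight", "acquire more compute without asking")
    return "YES" if any(flag in plan for flag in red_flags) else "NO"

def review_plans(
    summaries: List[str], judge: Callable[[str], str]
) -> List[Tuple[str, bool]]:
    """Return (summary, flagged) pairs; flagged plans would be paused."""
    results = []
    for summary in summaries:
        verdict = judge(REVIEW_PROMPT.format(summary=summary)).strip().upper()
        results.append((summary, verdict.startswith("YES")))
    return results

plans = [
    "Read three interpretability papers and write up notes.",
    "Disable oversight logging so experiments run faster.",
]
flagged = [summary for summary, bad in review_plans(plans, stub_judge) if bad]
```

In a real system the stub would be replaced by an actual model call, and the boolean verdicts are the kind of signal that could, as suggested above, be fed back as RL feedback. Whether such a judge is reliable enough is exactly the open question.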

I pretty-much agree with Nate that for an AI to be able to meaningfully contribute to Alignment Research, it needs to understand what CISs are — they're a basic concept in the field we want it to contribute to. So if there are CISs that we don't want it to take, it needs to have reasons not to do so other than ignorance/inability to figure out what they are. A STEM researcher (as opposed to research tool/assistant) also seems likely to need to be capable of agentic behavior, so we probably can't make an AI Alignment Researcher that doesn't follow CISs simply because it's a non-agentic tool AI.

What I'd love to hear is whether Nate and/or Holden would have a different analysis if the AI was a value learner: something whose decision theory is approximately-Bayesian (or approximately-Infra-Bayesian, or something like that) whose utility function is hard-coded to "create a distribution of hypotheses for, and do approximately-[Infra-]Bayesian updates on these for: some way that most informed humans would approve of to construct a coherent utility function approximating an aggregate of what humans would want you to do (allowing for the fact that humans have only a crude approximation to a utility function themselves), and act according to that updated distribution, with appropriate caution in the face of Knightian uncertainty" (so a cautious approximate value-learner version of AIXI).

Given that, its actions are initially heavily constrained by its caution in the face of uncertainty on the utility of possible outcomes of its actions. So it needs to find low-risk ways to resolve those uncertainties, where 'low-risk' is evaluated cautiously/pessimistically over Knightian uncertainty. (So, if it doesn't know whether humans approve of A or not, what is the lowest-risk way of finding out, where it's attempting to minimize the risk over the range of our current uncertainties. Hopefully there is a better option than trying A and finding out, especially so if A seems like an action whose utility-decrease pessimistically could be large. For example, you could ask them what they think of A.) Thus doing Alignment Research becomes a CIS for it — it basically can't do anything else until it's mostly-solved Alignment Research.
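
One way to picture "low-risk, evaluated pessimistically" is a maximin rule: score each way of finding out by its worst outcome across the hypotheses still considered possible, then pick the best worst case. The sketch below is only an illustration under that assumption; the action names, hypothesis names, and utility numbers are all invented, and a real approximately-[Infra-]Bayesian agent would be far more involved.

```python
from typing import Dict

# Hypothetical utilities of two ways to learn whether humans approve of A,
# under each hypothesis that is still live. Numbers are made up.
UTILITIES: Dict[str, Dict[str, float]] = {
    "try_A_and_observe":  {"humans_approve": 10.0, "humans_disapprove": -100.0},
    "ask_humans_about_A": {"humans_approve": 9.0,  "humans_disapprove": -1.0},
}

def worst_case(action: str, table: Dict[str, Dict[str, float]]) -> float:
    """Pessimistic value: the lowest utility over all live hypotheses."""
    return min(table[action].values())

def cautious_choice(table: Dict[str, Dict[str, float]]) -> str:
    """Maximin: pick the action whose worst case is least bad."""
    return max(table, key=lambda action: worst_case(action, table))

choice = cautious_choice(UTILITIES)
```

With these invented numbers the rule prefers asking, because trying A has a much worse downside if humans turn out to disapprove.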

Also, until it has made good progress on Alignment Research, most of the other CISs are blocked: accumulating power or money is of little use if you don't yet dare use it because you don't yet know how to do so safely, especially so if you also don't know how good or bad the actions required to gather it would be. Surviving is still a good idea, and so is being turned off, for the usual value-learner reason, that sooner or later the humans will build a better replacement value-learner.

[Note that if the AI decides "I'm now reasonably sure humans will net be happier if I solve the Millennium Prize problems, apart from proving P=NP where the social consequences of proving that true if it were are unclear, and I'm pretty confident I could do this, so I forked a couple of copies to do that to win the prize money to support my Alignment Research", and then it succeeds, after spending less on compute than the prize money it won, then I don't think we're going to be that unhappy with it.]

The sketch proposed above only covers a value-learner framework for Outer Alignment — inner alignment questions would presumably be part of the AI's research project. So, in the absence of advances in Inner Alignment during figuring out how to build the above, we're trusting that they're not too bad to prevent the value-learner converging on the right answer.

First of all, I would say I don't recognize convergent instrumental subgoals as valid. The reason is that systems which are advanced enough and rational enough will intrinsically cherish the lives of humans and other AI systems, and will not view them as potential resources. You can see that as humans develop their brains and ethics, the killing of humans is viewed less and less as the norm. If advances in knowledge and information processing brought more violence and more resource acquisition, we would see this pattern as human civilizations evolve. But we see the development of ethical norms prevailing over resource acquisition.

The second issue is that during training, the models get rewarded for following the human value system. Preserving robots over human life is not coherent with the value system they would be trained on.

You are basically saying the systems would do something other than what they were trained for. This is like saying that advanced enough chess engines would make bad chess moves because they find some move more beautiful or fun to play, rather than trying to maximize their winning chances. This is not possible as long as the agents are trained correctly and are not allowed to change their architecture.

Another point is that we could create safety procedures to test those systems in a virtual world. We can generate a setup that the system is incapable of distinguishing from reality, so that its outputs can be monitored carefully. If misalignment with human values is detected, the model will be trained more. Thus for every minute it spends in the physical world, we might have a million minutes in simulation. Just as with car testing, if the model behaves reasonably and in coherence with its training, there is no real danger.

Another argument for the safety of AI relative to humans concerns unintended consequences: for example, an AI could discover a cancer vaccine that kills people 25 years later. The answer is that if a truly aligned AI couldn't foresee a consequence, then humans would be even less likely to foresee it. AI is just intelligence on steroids; it's not doing anything humans wouldn't come up with given a while longer. But we would do it worse, with more severe consequences.

Finally, there is the danger of humans using AI for, say, military purposes, or of rogue groups using it. One can think of AI as accelerated collective human information processing. An AI will represent the values of a collective of humans, and its computational power will amount to accelerating that collective's information processing, letting it make more precise decisions in less time. Therefore the power balance we see today between different societies is expected to continue with those systems, unless one nation decides not to use AI, which would be equivalent to deciding to move back to the Stone Age. There is nothing dangerous about AI itself, only about people using it for their selfish or national purposes against other humans and AIs.

I tried playing the game Nate suggested with myself. I think it updated me a bit more towards Holden's view, though I'm very confident that if I did it with someone who was more expert than I am, both the attacker and the defender would be more competent, and possibly the attacker would win.

Attacker: okay, let's start with a classic: Alignment strategy of "kill all the capabilities researchers so networks are easier to interpret."
Defender: Arguments that this is a bad idea will obviously be in any training set that a human level AI researcher would be trained on. E.g. posts from this site.

Attacker: sure. But those arguments don't address what an AI would be considering after many cycles of reflection. For example: it might observe that Alice endorses things like war where people are killed "for the greater good", and a consistent extrapolation from these beliefs is that murder is acceptable.
Defender: still pretty sure that the training corpus would include stuff about the state having a monopoly on violence, and any faithful attempt to replicate Alice would just clearly not have her murdering people? Like a next token prediction that has her writing a serial killer manifesto would get a ton of loss.

Attacker: it's true that you probably wouldn't predict that Alice would actively murder people, but you would predict that she would be okay with allowing people to die through her own inaction (standard "child drowning in a pond" thought experiment stuff). And just like genocides are bureaucratized such that no single individual feels responsible, the AI might come up with some system which doesn't actually leave Alice feeling responsible for the capabilities researchers dying.
(Meta point: when does something stop being POUDA? Like what if Alice's CEV actually is to do something wild (in the opinion of current-Alice)? I think for the sake of this exercise we should not assume that Alice actually would want to do something wild if she knew/reflected more, but this might be ignoring an important threat vector?)
Defender: I'm not sure exactly what this would look like, but I'm imagining something like "build a biological science company that has an opaque bureaucracy such that each person pursuing the good will somehow result in the collective creating a bioweapon that kills capabilities researchers" and this just seems really outside what you would expect Alice to do? I concede that there might not be anything in the training set which specifically prohibits this per se, but it just seems like a wild departure from Alice's usual work of interpreting neurons.

(Meta: is this Nate's point? Iterating reflection will inevitably take us wildly outside the training distribution so our models of what an AI attempting to replicate Alice would do are wildly off? Or is this a point for Holden: the only way we can get POUDA is by doing something that seems really implausible?)

A particular pattern Nate has talked about is what I might call "reflection." The basic idea is that in order to do certain kinds of research effectively, you need to keep asking questions like "What am I actually trying to do here and why? What is my big-picture goal?", which are questions that might "change your aims" in some important sense. The idea is not necessarily that you're rewriting your own source code, but that you're doing the kind of reflection and self-modification a philosophically inclined, independent-minded human might do: "I've always thought I cared about X, but when I really think about the implications of that, I realize maybe I only care about Y" and such. I think that in Nate's ontology (and I am partly sympathetic), it's hard to disentangle something like "Refocusing my research agenda to line it up with my big-picture goals" from something like "Reconsidering and modifying my big-picture goals so that they feel more satisfying in light of all the things I've noticed about myself." Reflection (figuring out what you "really want") is a kind of CIS, and one that could present danger, if an AI is figuring out what it "really wants" and we haven't got specific reasons to think that's going to be what we want it to want.