Foom & Doom 2: Technical alignment is hard

by Steven Byrnes
23rd Jun 2025
AI Alignment Forum
54 comments, sorted by top scoring
[-]ryan_greenblatt19dΩ4125

@ryan_greenblatt likewise told me (IIRC) “I think things will be continuous”, and I asked whether the transition in AI zeitgeist from RL agents (e.g. MuZero in 2019) to LLMs counts as “continuous” in his book, and he said “yes”, adding that they are both “ML techniques”. I find this perspective baffling—I think MuZero and LLMs are wildly different from an alignment perspective. Hopefully this post will make it clear why. (And I think human brains work via “ML techniques” too.)

I don't think this is an accurate paraphrase of my perspective.

My view is:

  • Both MuZero and LLMs are within an ML paradigm, and I expect that many/most of the techniques I think about transfer to AGI made using either style of methods.
  • I think that you can continuously transition between MuZero and LLMs and I expect that if a MuZero-like paradigm happens, this is probably what will happen. (As in, you'll use LLMs as a component in the MuZero approach or similar.)
  • I don't expect that a transition from the current LLM paradigm to the MuZero-style paradigm would result in massively discontinuous takeoff speeds (as in, I think takeoff speeds are continuous) because before you have a full AGI from the MuZero style approach, you'll have a worse AI from the MuZero approach. See [this comment for more discussion](https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/foom-and-doom-1-brain-in-a-box-in-a-basement?commentId=mZKP2XY82zfveg45B). This is even aside from continuously transitioning between the two.
  • In practice, I think that the actual historical transition from MuZero (or other pure RL agents) to LLMs didn't cause a huge trend break or discontinuity in relevant downstream metrics (e.g. benchmark scores).
  • I agree that in practice MuZero and LLMs weren't developed continuously. I would say that this is because the MuZero approach didn't end up being that useful for any of the tasks we cared about and was outcompeted pretty dramatically.
  • I agree these can be very different from an alignment perspective, but things like RLHF, interpretability, and control seem to me like they straightforwardly can be transferred.
Reply
[-]Aprillion8d10

hm, as a non-expert onlooker, I found the paraphrase pretty accurate.. for sure it sounds more reasonable in your own words here compared to the oversimplified summary (so thank you for clarification!), but as far as accuracy of summaries go, this one was top tier IMHO (..have you seen the stuff that LLMs produce?!)

Reply
[-]ryan_greenblatt8d90

I agree that my view is that they can count as continuous (though the exact definition of the word continuous can matter!), but then the statement "I find this perspective baffling—I think MuZero and LLMs are wildly different from an alignment perspective" isn't really related to this from my perspective. Like things can be continuous (from a transition or takeoff speeds perspective) and still differ substantially in some important respects!

Reply
[-]Aprillion8d10

I somehow completely agree with both of your perspectives, have you tried to ban the word "continuous" in your discussions yet? (on the other hand, I don't think it should be a crux, probably just ambiguous meaning like "sound" in the "when a tree falls" thingy ... but I would be curious if you would be able to agree on the 2 non-controversial meanings between the 2 of you)

It reminds me of stories about gradualism / saltationism debate in evolutionary biology after gradualism won and before the idea of punctuated equilibrium... Parents and children are pretty discrete units, but gene pools over millions of years are pretty continuous from the perspective of an observer long long time later who is good at spotting low-frequency patterns ¯\_(ツ)_/¯

For a researcher, even GPT 3.5 to 4 might have been a big jump in terms of compute budget approval process (and/or losing a job from disbanding a department). And the same event on a benchmark might look smooth - throughout multiple big architecture changes a la the charts that illustrate Moore's law - the sweat and blood of thousands of engineers seems kinda continuous if you squint enough.

And what even is "continuous" - general relativity is a continuous theory, but my phone calculates my GPS coordinates with numerical methods, time dilation from gravity field/the geoid shape is just approximated and nanosecond(-ish) precision is good enough to pin me down as much as I want (TBH probably more precision than I would choose myself as a compromise with my battery life). Real numbers are continuous, but they are not computable (I mean in practice in our own universe, I don't care about philosophical possibilities), so we approximate them with a finite set of kinda shitty rational-ish numbers for which even 0.1 + 0.2 == 0.3 is false (in many languages, including JS in a browser console and in Python)..
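For concreteness, this is how the floating-point example plays out in Python (the same thing happens in a JS console); a minimal demonstration, with `math.isclose` as the usual workaround:

```python
# 0.1, 0.2, and 0.3 have no exact binary floating-point representation,
# so the "obvious" equality fails.
a = 0.1 + 0.2
print(a)         # 0.30000000000000004
print(a == 0.3)  # False

# The usual workaround is comparison with an explicit tolerance.
import math
print(math.isclose(a, 0.3))  # True
```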

Some stuff will work "the same" in the new paradigm, some will be "different" - does it matter whether we call it (dis)continuous, or do we know already what to predict in more detail?

Reply
[-]ryan_greenblatt7d20

I somehow completely agree with both of your perspectives, have you tried to ban the word "continuous" in your discussions yet?

I agree taboo-ing is a good approach in this sort of case. Talking about "continuous" wasn't a big part of my discussion with Steve, but I agree if it was.

Reply
[-]Jeremy Gillen18dΩ5100

(Overall I like these posts in most ways, and especially appreciate the effort you put into making a model diff with your understanding of Eliezer's arguments)

Eliezer and some others, by contrast, seem to expect ASIs to behave like a pure consequentialist, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, or his argument that ASI will behave like a utility maximizer.

It feels like you're rounding off Eliezer's words in a way that removes the important subtlety. What you're doing here is guessing at the upstream generator of Eliezer's conclusions, right? As far as I can see in the links, he never actually says anything that translates to "I expect all ASI preferences to be over future outcomes"? It's not clear to me that Eliezer would disagree with "impure consequentialism".

I think you get closest to an argument that I believe with (2):

(2) The Internal Competition Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because in the process of reflection within the mind of any given impure-consequentialist AI, the consequentialist preferences will squash the non-consequentialist preferences.

Where I would say it differently, like: An AI that has a non-consequentialist preference against personally committing the act of murder won't necessarily build its successor to have the same non-consequentialist preference[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors. (And building successors is a similar process to self-modification).

As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”.

I think you're misrepresenting/misunderstanding the argument people are making here. Even when you enthusiastically apply your intelligence toward pursuing a deontological constraint (alongside other goals), you implicitly search for "loopholes" in that constraint, i.e. weird ways to achieve all of your goals that don't involve violating the constraint. To you, they aren't loopholes, they're clever ways to achieve all goals.

  1. ^

    Perhaps this feels intuitively incorrect. If so, I claim that's because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don't want to get rid of your own disgust reaction, but you're okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.

Reply
[-]Steven Byrnes16dΩ690

Thanks!

Hmm, here’s a maybe-interesting example (copied from other comment):

If an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible to me. 

What’s happening is that this example is in the category “I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors. But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).

(This is a toy example to illustrate a certain point, not a good AI motivation plan all-things-considered!)

Speaking of which, is it possible to get stability w.r.t. successors and self-modification while retaining indexicality? Maybe. I think things like “I want to be virtuous” or “I want to be a good friend” are indexical, but I think we humans kinda have an intuitive notion of “responsibility” that carries through to successors and self-modification. If I build a robot to murder you, then I didn’t pull the trigger, but I was still being a bad friend. Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno. (Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?) I dunno, I appreciate the brainstorming.

Reply
[-]Jeremy Gillen9d50

“I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors.

I agree that goals like this work well with self-modification and successors. I'd be surprised if Eliezer didn't. My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It's strawmanning. And it isn't supported by any of the links you provided. I think you must have some mistaken assumption about Eliezer's views that is leading you to infer that he believes AIs must only have preferences over the distant future. But I can't tell what it is. One guess is: to you, corrigibility only looks hard/unnatural if preferences are very strictly about the far future, and otherwise looks fairly easy.

But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).

I would still call those preferences consequentialist, since the consequences are the primary factor that determines the actions. I.e. the behaviour is complicated, but in a way that is easy to explain once you know what the behaviour is aimed at achieving. They're even approximately long-term consequentialist, since the actions are (probably?) mostly aimed at the long-term future. The strict definition you call "pure consequentialism" is a good approximation or simplification of this, under some circumstances, like when value adds up over time and therefore the future is a bigger priority than the immediate present.

No one I know has argued that AI or rational people can only care about the distant future. People spend money to visit a theme park sometimes, in spite of money being instrumentally convergent.


Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno.

Some versions of that do have loopholes, but overall I think I agree that you could get a lot of stability that way. (But as far as I can tell, the versions with fewer loopholes look more like consequence-based goals rather than rules that say which kinds of local action-sequences are good and bad).

(Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?)

Yeah this is exactly what I had an issue with in my sibling discussion with Ryan. He seems to think {integrity,honesty,loyalty} are deontological, whereas the way they are implemented in me is as a mix of consequentialist reasoning (e.g. some components are "does this person end up better off, by their own lights?", "do they understand what I'm doing and why?") and a bunch of soft rules designed to reduce the chances that I accidentally rationalise actions that are ultimately hurtful for complicated reasons that are difficult to see in the moment (e.g. "in the course of my plan, don't cross privacy boundaries that likely lead me to gain information that they might not have felt comfortable with me knowing"). But the rules aren't a primary driver of action, they are relatively weak constraints that quickly rule out bad plans (that almost always would have been bad for consequentialist reasons).

For me, it's similar when I want to be a good friend.

Reply
[-]Steven Byrnes5d40

My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It's strawmanning. And it isn't supported by any of the links you provided.

For the record, my OP says something weaker than that—I wrote “Eliezer and some others…seem to expect ASIs to behave like a pure consequentialist, at least as a strong default…”.

Maybe this is a pointless rabbit’s hole, but I’ll try one more time to argue that Eliezer seems to have this expectation, whether implicitly or explicitly, and whether justified or not:

For example, look at Eliezer’s Coherent decisions imply consistent utilities, and then reflect on the fact that knowing that an agent is “coherent”, a.k.a. a “utility maximizer”, tells you nothing at all about its behavior, unless you make additional assumptions about the domain of its utility function (e.g. that the domain is ‘the future state of the world’). To me it seems clear that

  • Either Eliezer is making those “additional assumptions” without mentioning them in his post, which supports my claim that pure-consequentialism is (to him) a strong default;
  • Or his post is full of errors, because for example he discusses whether an AI will be “visibly to us humans shooting itself in the foot”, when in fact it’s fundamentally impossible for an external observer to know whether an agent is being incoherent / self-defeating or not, because (again) coherent utility-maximizing behaviors include absolutely every possible sequence of actions.
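To make the point in that second bullet concrete, here is a minimal toy sketch (hypothetical code, nothing from either post): for any fixed action sequence whatsoever, including one that looks like "shooting itself in the foot", there is a utility function over complete action histories that the sequence uniquely maximizes, so no observed behavior can be judged incoherent until the domain of the utility function is pinned down.

```python
# Toy illustration (hypothetical): with an unrestricted utility domain,
# every possible action sequence is the unique optimum of SOME utility function.
from itertools import product

ACTIONS = ["left", "right", "shoot_own_foot"]

def make_utility(target_history):
    """Utility over complete 3-step histories: 1 for the target, 0 otherwise."""
    def utility(history):
        return 1.0 if tuple(history) == tuple(target_history) else 0.0
    return utility

# Pick an arbitrary, seemingly self-defeating behavior...
observed = ("shoot_own_foot", "shoot_own_foot", "left")
u = make_utility(observed)

# ...and confirm it is the utility-maximizing plan among all 3-step plans.
best = max(product(ACTIONS, repeat=3), key=u)
assert best == observed
print("This 'self-defeating' behavior maximizes the constructed utility:", best)
```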
Reply
[-]Jeremy Gillen4d40

Sorry if I misrepresented you, my intended meaning matches what you wrote. I was trying to replace "pure consequentialist" with its definition to make it obvious that it's a ridiculously strong expectation that you're saying Eliezer and others have.

Yes, assumptions about the domain of the utility function are needed in order to judge its behaviour as coherent or not. Rereading Coherent decisions imply consistent utilities, Eliezer is usually clear about the assumed domain of the utility function in each thought experiment. For example, he's very clear here that you need the preferences as an assumption: 

Have we proven by pure logic that all apples have the same utility? Of course not; you can prefer some particular apples to other particular apples. But when you're done saying which things you qualitatively prefer to which other things, if you go around making tradeoffs in a way that can be viewed as not qualitatively leaving behind some things you said you wanted, we can view you as assigning coherent quantitative utilities to everything you want.

And that's one coherence theorem—among others—that can be seen as motivating the concept of utility in decision theory.

In the hospital thought experiment, he specifies the goal as an assumption:

Robert only cares about maximizing the total number of lives saved. Furthermore, we suppose for now that Robert cares about every human life equally.

In the pizza example, he doesn't specify the domain, but it's fairly obvious implicitly. In the fruit example, it's also implicit but obvious. 

There's a few paragraphs at the end of the Allais paradox section about the (very non-consequentialist) goal of feeling certain during the decision-making process. I don't get the impression from those paragraphs that Eliezer is saying that this preference is ruled out by any implicit assumption. In fact he explicitly says that this preference isn't mathematically improper. It seems he's saying this kind of preference cuts against coherence only if it's getting in the way of more valuable decisions:

'The danger of saying, "Oh, well, I attach a lot of utility to that comfortable feeling of certainty, so my choices are coherent after all" is not that it's mathematically improper to value the emotions we feel while we're deciding. Rather, by saying that the most valuable stakes are the emotions you feel during the minute you make the decision, what you're saying is, "I get a huge amount of value by making decisions however humans instinctively make their decisions, and that's much more important than the thing I'm making a decision about." This could well be true for something like buying a stuffed animal. If millions of dollars or human lives are at stake, maybe not so much.'

I think this quote in particular invalidates your statements.

There is a whole stack of assumptions[1] that Eliezer isn't explicit about in that post. It's intended to give a taste of the reasoning that gives us probability and expected utility, not the precise weakest set of assumptions required to make a coherence argument work.

I think one thing that is missing from that post are the reasons we usually do have prior knowledge of goals (among humans and for predicting advanced AI). Among humans we have good priors that heavily restrict the goal-space, plus introspection and stated preferences as additional data. For advanced AI, we can usually use usefulness (on some specified set of tasks) and generality (across a very wide range of potential obstacles) to narrow down the goal-domain. Only after this point, and with a couple of other assumptions, do we apply coherence arguments to show that it's okay to use EUM and probability.

The reason I think this is worth talking about is that I was actively confused about exactly this topic in the year or two before I joined Vivek's team. Re-reading the coherence and advanced agency cluster of Arbital posts (and a couple of comments from Nate) made me realise I had misinterpreted them. I must have thought they were intended to prove more than they do about AI risk. And this update flowed on to a few other things. Maybe partially because the next time I read Eliezer as saying something that seemed unreasonably strong I tried to steelman it and found a nearby reasonable meaning. And also because I had a clearer idea of the space of agents that are "allowed", and this was useful for interpreting other arguments.

I'd be happy to call if that's a more convenient way to talk, although it is nice to do this publicly. Also completely happy to stop talking about this if you aren't interested, since I think your object-level beliefs about this ~match mine ("impure consequentialism" is expected of advanced AI).

  1. ^

    E.g. I think we need a bunch of extra structure about self-modification to apply anything like a money pump argument to resolute/updateless agents. I think we need some non-trivial arguments and an assumption to make the VNM continuity money pump work. I remember there being some assumption that went into complete class that I thought was non-obvious, but I've forgotten exactly what it was. The post is very clear that it's just giving a few tastes of the kind of reasoning needed to pin down utility and probability as a reasonable model of advanced agents.

Reply
[-]Steven Byrnes4d50

Probably not worth the time to further discuss what certain other people do or don’t believe, as opposed to what’s true. I remain unconvinced but added a caveat to the article just to be safe:

Why do Eliezer and others expect pure consequentialism? [UPDATE: …Or if I’m misreading Eliezer, as one commenter claims I am, replace that by: “Why might someone expect pure consequentialism?”]

Reply1
[-]ryan_greenblatt17dΩ672

Where I would say it differently, like: An AI that has a non-consequentialist preference against personally committing the act of murder won't necessarily build its successor to have the same non-consequentialist preference[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors.

[...]

Perhaps this feels intuitively incorrect. If so, I claim that's because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don't want to get rid of your own disgust reaction, but you're okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.

Hmm, imagine we replace "disgust" with "integrity". As in, imagine that I'm someone who is strongly into the terminal moral preference of being an honest and high integrity person. I also value loyalty and pointing out ways in which my intentions might differ from what someone wants. Then, someone hires me (as an AI let's say) and tasks me with building a successor. They also instruct me: 'Make sure the AI successor you build is high integrity and avoids disempowering humans. Also, generalize the notion of "integrity, loyalty, and disempowerment" as needed to avoid these things breaking down under optimization pressure (and get your successors to do the same). And, let me know if you won't actually do a good job following these instructions, e.g. because you aren't actually that well aligned. Like, tell me if you wouldn't actually try hard and please be seriously honest with me about this.'

In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.

Another way to put this is that the deontological constraints we want are like the human notions of integrity, loyalty, and honesty (and to then instruct the AI that we want these constraints propagated forward). I think an actually high integrity person/AI doesn't search for loopholes or want to search for loopholes. And the notion of "not actually loopholes" generalizes between different people and AIs I'd claim. (Because notions like "the humans remained in control" and "the AIs stayed loyal" are actually relatively natural and can be generalized.)

I'm not claiming you can necessarily instill these (robust and terminal) deontological preferences, but I am disputing they are similar to non-reflectively endorsed (potentially non-terminal) deontological constraints or urges like disgust. (I don't think disgust is an example of a deontological constraint, it's just an obviously unendorsed physical impulse!)

Reply
[-]Jeremy Gillen17dΩ130

In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.

Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.

I think an actually high integrity person/AI doesn't search for loopholes or want to search for loopholes.

If I try to condition on the assumptions that you're using (which I think include a central part of the AIs preferences having a true-but-maybe-approximate pointer toward the instruction-givers preferences, and also involves a desire to defer or at least flag relevant preference differences) then I agree that such an AI would not search for loopholes on the object-level.

I'm not sure whether you missed the straightforward point I was trying to make about searching for loopholes, or whether you understand it and are trying to point at a more relevant-to-your-models scenario? The straightforward point was that preference-like objects need to be robust to search. Your response reads as "imagine we have a bunch of higher-level-preferences and protective machinery that already are robust to optimisation, then on the object level these can reduce the need for robustness". This is locally valid. 

I don't think its relevant because we don't know how to build those higher-level-preferences and protective machinery in a way that is itself very robust to the OOD push that comes from scaling up intelligence, learning, self-correcting biases, and increased option-space.

(I don't think disgust is an example of a deontological constraint, it's just an obviously unendorsed physical impulse!)

Some people reflectively endorse their own disgust at picking up insects, and wouldn't remove it if given the option. I wanted an example of a pure non-consequentialist preference, and I stand by it as a good example.

deontological constraints we want are like the human notions of integrity, loyalty, and honesty

Probably we agree about this, but for the sake of flagging potential sources of miscommunication: if I think about the machinery involved in implementing these "deontological" constraints, there's a lot of consequentialist machinery involved (but it's mostly shorter-term and more local than normal consequentialist preferences).

Reply
[-]ryan_greenblatt17d*Ω220

I was trying to argue that the most natural deontology-style preferences we'd aim for are relatively stable if we actually instill them. So, I think the right analogy is that you either get integrity+loyalty+honesty in a stable way, some bastardized version of them such that it isn't in the relevant attractor basin (where the AI makes these properties more like what the human wanted), or you don't get these things at all (possibly because the AI was scheming for longer run preferences and so it faked these things).

And I don't buy that the loophole argument applies unless the relevant properties are substantially bastardized. I certainly agree that there exist deontological preferences that involve searching for loopholes, but these aren't the one people wanted. Like, I agree preferences have to be robust to search, but this is sort of straightforwardly true if the way integrity is implemented is at all kinda similar to how humans implement it.

Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation again comes down to "you get scheming", "your behavioural tests look bad, so you try again", "your behavioural tests look fine, and you didn't have scheming, so you probably basically got the properties you wanted if you were somewhat careful".

As in, I think we can at least test for the higher level preferences we want in the absence of scheming. (In a way that implies they are probably pretty robust given some carefulness, though I think the chance of things going catastrophically wrong is still substantial.)

(I'm not sure if I'm communicating very clearly, but I think this is probably not worth the time to fully figure out.)


Personally, I would clearly pass on all of my reflectively endorsed deontological norms to a successor (though some of my norms are conditional on aspects of the situation like my level of intelligence and undetermined at the moment because I haven't reflected on them, which is typically undesirable for AIs). I find the idea that you would have a reflectively endorsed deontological norm (as in, you wouldn't self modify to remove it) that you wouldn't pass on to a successor bizarre: what is your future self if not a successor?

Reply
[-]Jeremy Gillen17d20

I was trying to argue that the most natural deontology-style preferences we'd aim for are relatively stable if we actually instill them.

Trivial and irrelevant though if true-obedience is part of it, since that's magic that gets you anything you can describe.

if the way integrity is implemented is at all kinda similar to how humans implement it.

How do humans implement integrity?

Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation comes down to "you get scheming", "your behavioural tests look bad, so you try again", "your behavioural tests look fine, and you didn't have scheming, so you probably basically got the properties you wanted if you were somewhat careful".

You're just stating that you don't expect any reflective instability, as an agent learns and thinks over time? I've heard you say this kind of thing before, but haven't heard an explanation. I'd love to hear your reasoning? In particular since it seems very different from how humans work, and intuitively surprising for any thinking machine that starts out a bit of a hacky mess like us. (I could write out an object-level argument for why reflective instability is expected, but it'd take some effort and I'd want to know that you were going to engage with it).

Reply1
[-]Eli Tyre18d95

Third and most importantly, if it is possible to make LLM-AGIs, then I think it would probably happen via eliminating all the reasons that today’s LLMs are not egregiously misaligned! In particular, I expect that they would involve the behavior being determined much more by RL and much less by pretraining (which brings in the concerns of §2.3–§2.4), and that they would somehow allow for open-ended continuous learning (which brings in the concerns of §2.5–§2.6).

So this is currently my view. I expect us to do a bunch of RL in various ways using LLM pre-training as a foundational step that gets us AIs that can choose actions that are coherent enough that we can do RL on them. Possibly we will also need various "continuous learning / long term memory / flexible-concept" techniques that don't fall straight out of the RL (though also, maybe these functions will fall straight out of enough RL, I don't know). This will indeed reintroduce all the problems of RL, and erode away the safety properties of LLMs.

BUT, doing it this way, we do get some intermediate model organisms that don't have all of the crucial capabilities of the future superintelligence, but do have most of them. And we can maybe develop alignment techniques that work well on our RL-LLMs as we gradually layer in more of the mechanisms that make them dangerous.

On this view, the "last step", where we finally put together all the pieces for a complete learning and acting agent, is pretty scary, because if we're not very careful, this will be our "first critical try", and it will be a point at which we should particularly expect that our previous techniques will break down.

And as you note, we should expect that shortly after this, the ASI-LLMs will discover more efficient ways to make ASI and the world is in trouble at that point, though in less trouble if we did a good job making competent benevolent ASI-LLMs, since they'll be able to handle the situation better than humanity would be able to.

But in my novice's opinion, this seems maybe a better path than building ASI the efficient way from scratch?

Reply
[-]Daniel Kokotajlo18dΩ680

…RL reward functions are written in code, not in natural language.

Often though they involve using LLMs or humans to make fuzzy judgment calls e.g. about what is or isn't an obedient response to an instruction.

Reply
[-]Steven Byrnes18dΩ220

My discussion in §2.4.1 is about making fuzzy judgment calls using trained classifiers, which is not exactly the same as making fuzzy judgment calls using LLMs or humans, but I think everything I wrote still applies.

Reply
[-]nostream7d70

Thanks for the detailed post. I'd like to engage with one specific aspect - the assumptions about how RL might work with scaled LLMs. I've chosen to focus on LLM architectures since that allows more grounded discussion than novel architectures; I am in the "LLMs will likely scale to ASI" camp, but much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them. (If a random guy develops sudden ASI in his basement via some novel architecture, I agree that that tends to end very poorly.)

The post series views RL as mathematically specified reward functions that are expressible in a few lines of Python, which naturally leads to genie/literal-interpretation concerns. However, present day RL is more complicated and nuanced:

  • RLHF and RLAIF operate on human preferences rather than crisp mathematical objectives
  • Labs are expanding RLVR (RL from Verifiable Rewards) beyond simple mathematical tasks to diverse domains
  • The recent Dwarkesh episode with Sholto Douglas and Trenton Bricken (May '25) discusses how labs are massively investing in RL diversity and why they reject the "RL just selects from the pretraining distribution" critique (which may apply to o1-scale compute but likely not o3 and even less so o5-scale)

We're seeing empirical progress on reward hacking:

  • Claude 3.7 exhibited observable reward hacking in deployments
  • Anthropic's response was to specifically address this, resulting in Claude 4 hacking significantly less
    --> This demonstrates both that labs have strong incentives to reduce reward hacking and that they're making concrete progress

The threat model requires additional assumptions: To get dangerous reward hacking despite these improvements, we'd need models that are situationally aware enough to selectively reward hack only in specific calculated scenarios while avoiding detection during training/evaluation. This requires much more sophistication and subtlety than the current reward hacks.

Additionally, one could imagine RL environments specifically designed to train against reward hacking behaviors, teaching models to recognize and avoid exploiting misspecified objectives. When training is diversified across many environments and objectives, systematic hacking becomes increasingly difficult. None of this definitively proves LLMs will scale safely to ASI, but it does suggest the risk is less than proposed here.

Reply
[-]Steven Byrnes7d*40
  • In §2.4.1 I talk about learned reward functions.
  • In §2.3.5 I talk about whether or not there is such a thing as “RLVR done right” that doesn’t push towards scheming. My upshot is:
    • I’m mildly skeptical (but don’t feel super-strongly) that you can do RLVR without pushing towards scheming at all.
    • I agree with you that there’s clearly room for improvement in making RLVR push towards scheming less on the margin.

much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them

If the plan is what I call “the usual agent debugging loop”, then I think we’re doomed, and it doesn’t matter at all whether this debugging loop is being run by a human or by an LLM, and it doesn’t matter at all whether the reward function being updated during this debugging loop is being updated via legible code edits versus via weight-updates within an inscrutable learned classifier.

The problem with “the usual agent debugging loop”, as described at that link above, is that more powerful future AIs will be capable of treacherous turns, and you can’t train those away by the reward function because as soon as the behavior manifests even once, it’s too late. (Obviously you can and should try honeypots, but that’s a sanity-check not a plan, see e.g. Distinguishing test from training.)

The threat model requires additional assumptions: To get dangerous reward hacking despite these improvements, we'd need models that are situationally aware enough to selectively reward hack only in specific calculated scenarios while avoiding detection during training/evaluation. This requires much more sophistication and subtlety than the current reward hacks.

As a side-note, I’m opposed to the recent growth of the term “reward hacking” as a synonym for “lying and cheating”. I’m talking about scheming and treacherous turns, not “obvious lying and cheating in deployment” which should be solvable by ordinary means just like any other obvious behavior. Anyway, current LLMs seem to already have enough situational awareness to enable treacherous turns in principle (see the Alignment Faking thing), and even if they didn’t, future AI certainly will, and future AI is what I actually care about.

Reply
[-]plex20d60

Because in brain-like AGI, the reward function is written in Python (or whatever), not in natural language.

Yup. I'd bet some people will reply with something like "why not define the reward function in natural language, like constitutional AI". I think this fails due to strong optimization finding the most convenient (for it, not us) settings of free parameters left by fuzzy statistical things like words, and if you give it a chance to feed back into the definitions via training data or do online learning etc gets totally wrecked by semantic drift.

Reply
[-]Charlie Steiner19d20

And don't you think 500 lines of Python also "fails due to" having unintended optima?

I've put "fails due to" in scare quotes because what's failing is not every possible approach, merely almost all samples from approaches we currently know how to take. If we knew how to select python code much more cleverly, suddenly it wouldn't fail anymore. And ditto for if we knew how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.

Reply11
[-]plex19d20

Oh no, almost all possible 500 lines of python are also bad.

Reply
[-]Stephen McAleese9d*Ω350

In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).

But it seems likely to me that programmers won't know what code to write for the reward function since it would be hard to encode complex human values. In Superintelligence, Nick Bostrom calls this manual approach "direct specification" of values and argues that it's naive. Instead, it seems likely to me that programmers will continue to use reward learning algorithms like RLHF where:

  1. The human programmers have a dataset of correct behaviors or a natural language description of what they want and they use this information to create a reward function or model automatically (e.g. Text2Reward).
  2. This learned reward model or generated code is used to train the policy.
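A minimal sketch of that two-step pipeline (all names here are illustrative, not the Text2Reward API; step 1 fits a reward model to pairwise human preference data with the standard Bradley-Terry loss, and step 2 is wherever that learned reward gets plugged into policy training):

```python
# Hypothetical sketch of learned-reward training (names are illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a behavior/trajectory embedding to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_reward_model(model, preferred, rejected, steps=1000, lr=1e-3):
    """Step 1: learn a reward function from pairwise preference data
    (preferred examples should score higher than rejected ones)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Step 2 (not shown): use `model(embedding)` as the reward signal for whatever
# RL algorithm trains the policy, in place of a hand-written reward function.
```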

If this happens then I think the evolution analogy would apply where there is some outer optimizer like natural selection that is choosing the reward function and then the reward function is the inner objective that is shaping the AI's behavior directly.

Edit: see AGI will have learnt reward functions for an in-depth post on the subject.

Reply
[-]Steven Byrnes9dΩ340

In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).

That’s not quite my position.

Per §2.4.2, I think that both outer alignment (specification gaming) and inner alignment (goal misgeneralization) are real problems. I emphasized outer alignment more in the post, because my goal in §2.3–§2.5 was not quite “argue that technical alignment of brain-like AGI will be hard”, but more specifically “argue that it will be harder than most LLM-focused people are expecting”, and LLM-focused people are already thinking about inner alignment / goal misgeneralization.

I also think that a good AGI reward function will be a “non-behaviorist” reward function, for which the definition of inner versus outer misalignment kinda breaks down in general.

But it seems likely to me that programmers won't know what code to write for the reward function since it would be hard to encode complex human values…

I’m all for brainstorming different possible approaches and don’t claim to have a good plan, but where I’m at right now is:

(1) I don’t think writing the reward function is doomed, and I don’t think it corresponds to “encoding complex human values”. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).

(2) I do think RLHF-like solutions are doomed, for reasons discussed in §2.4.1.

(3) I also think Text2Reward is a doomed approach in this context because (IIUC) it’s fundamentally based on what I call “the usual agent debugging loop”, see my “Era of Experience” post §2.2: “The usual agent debugging loop”, and why it will eventually catastrophically fail. Well, the paper is some combination of that plus “let’s just sit down and think about what we want and then write a decent reward function, and LLMs can do that kind of thing too”, but in fact I claim that writing such a reward function is a deep and hairy conceptual problem way beyond anything you’ll find in any RL textbook as of today, and forget about delegating it to LLMs. See §2.4.1 of that same “Era of Experience” post for why I say that.

Reply
[-]Stephen McAleese8dΩ350

Thank you for the reply!

Ok but I still feel somewhat more optimistic about reward learning working. Here are some reasons:

  • It's often the case that evaluation is easier than generation which would give the classifier an edge over the generator.
  • It's possible to make the classifier just as smart as the generator: this is already done in RLHF today: the generator is an LLM and the reward model is also based on an LLM.
  • It seems like there are quite a few examples of learned classifiers working well in practice:
    • It's hard to write spam that gets past an email spam classifier.
    • It's hard to jailbreak LLMs.
    • It's hard to write a bad paper that is accepted to a top ML conference or a bad blog post that gets lots of upvotes.

That said, from what I've read, researchers doing RL with verifiable rewards with LLMs (e.g. see the DeepSeek R1 paper) have only had success so far with rule-based rewards rather than learned reward functions. Quote from the DeepSeek R1 paper:

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

So I think we'll have to wait and see if people can successfully train LLMs to solve hard problems using learned RL reward functions in a way similar to RL with verifiable rewards.
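For concreteness, a minimal sketch of the contrast the DeepSeek quote is drawing (illustrative code only, not DeepSeek's implementation; the "####" answer delimiter and the `reward_model.score` call are made-up stand-ins):

```python
# Illustrative contrast between verifiable and learned rewards (not DeepSeek's code).

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward: a fixed programmatic check, e.g. exact match on the
    final answer. The policy can't retrain it, only satisfy it or not."""
    answer = response.split("####")[-1].strip()  # hypothetical answer format
    return 1.0 if answer == ground_truth.strip() else 0.0

def learned_reward(response: str, reward_model) -> float:
    """Neural reward: a trained model scores the response. More flexible, but
    the policy can learn to exploit the model's blind spots (reward hacking)."""
    return float(reward_model.score(response))  # hypothetical reward-model API
```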
 

Reply
[-]Steven Byrnes7d*Ω560

I’m worried about treacherous turns and such. Part of the problem, as I discussed here, is that there’s no distinction between “negative reward for lying and cheating” and “negative reward for getting caught lying and cheating”, and the latter incentivizes doing egregiously misaligned things (like exfiltrating a copy onto the internet to take over the world) in a sneaky way.
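A toy sketch of that distinction (hypothetical pseudo-reward, just to make the asymmetry explicit): the training signal can only penalize misbehavior that the overseer actually detects, so sneaky misbehavior and honest behavior look identical to the optimizer.

```python
# Toy illustration (hypothetical): the reward can only react to DETECTED cheating.
def reward(task_score: float, cheated: bool, cheating_detected: bool) -> float:
    r = task_score
    if cheating_detected:  # note: not `if cheated` -- undetected cheating is invisible
        r -= 10.0
    return r

# A sneaky agent gets reward(high_score, cheated=True, cheating_detected=False),
# which is exactly the same number as an honest agent with the same task score.
```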

Anyway, I don’t think any of the things you mentioned are relevant to that kind of failure mode:

It's often the case that evaluation is easier than generation which would give the classifier an edge over the generator.

It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity-check not a plan, see e.g. Distinguishing test from training.

It's possible to make the classifier just as smart as the generator: this is already done in RLHF today: the generator is an LLM and the reward model is also based on an LLM.

I don’t think that works for more powerful AIs whose “smartness” involves making foresighted plans using means-end reasoning, brainstorming, and continuous learning.

If the AI in question is using planning and reasoning to decide what to do and think next towards a bad end, then a “just as smart” classifier would (I guess) have to be using planning and reasoning to decide what to do and think next towards a good end—i.e., the “just as smart” classifier would have to be an aligned AGI, which we don’t know how to make.

It seems like there are quite a few examples of learned classifiers working well in practice:

All of these have been developed using “the usual agent debugging loop”, and thus none are relevant to treacherous turns.

Reply
[-]Eli Tyre18d40

Thus, I think it’s reasonable to think of post-training as “privileging some pretrained behavioral patterns over other pretrained behavioral patterns”, rather than “developing new behavioral patterns from scratch”. Ditto for prompting, constitutional AI, and other such interventions.

If I thought this was true, then I wouldn't think that scaling the reasoning models would lead to superintelligence. 

Reply
[-]jessicata16d20

Relevant paper: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

YouTube explanatory video

Reply
[-]Eli Tyre18d30

The bad news is: I strongly expect ASI to have some consequentialist preferences—see my post “Thoughts on Process-Based Supervision” §5.3. The good news is, I think it’s possible for ASI to also have non-consequentialist preferences.

Is that to say that you expect the AI to have preferences not just over the state of the world, but also over kinds of strategies and plans it takes to get there? E.g. they could have preferences for things like "being honest" or "making use of plans that involve an exponential increase in power (instead of some other curve-shape)"?

Reply
[-]Steven Byrnes17d110

Yeah, that’s part of it. Also maybe that it can have preferences about “state of the world right now and in the immediate future”, and not just “state of the world in the distant future”.

For example, if an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible. But “me retaining power” is something about the state of the world, not directly about the ASI’s strategies and plans, IMO.

(Also, “expect” is not quite right, I was just saying that I don’t find a certain argument convincing, not that this definitely isn’t a problem. I’m still pretty unsure. And until I have a concrete plan that I expect to work, I am very open-minded to the possibility that finding such a plan is (even) harder than I realize.)

Reply
[-]Jonas Hallgren19d30

This is quite specific and only engaging with section 2.3 but it made me curious. 

I want to ask a question around a core assumption in your argument about human imitative learning. You claim that when humans imitate, this "always ultimately arises from RL reward signals" - that we imitate because we "want to," even if unconsciously. Is this the case at all times though? 

Let me work through object permanence as a concrete case study. The standard developmental timeline shows infants acquiring this ability around 8-12 months through gradual exposure in cultural environments where adults consistently treat objects as permanent entities. What's interesting is that this doesn't look like reward-based learning - infants aren't choosing to learn object permanence because it's instrumentally useful. Instead, the acquisition pattern in A-not-B error studies suggests (best meta study I could find, I'm taking the concept from the Cognitive Gadgets book) they're absorbing it through repeated exposure to cultural practices that embed object permanence as a basic assumption.

This raises a broader question about the mechanism. When we look at how language acquisition works, we see similar patterns - children pick up not just vocabulary but implicit cultural assumptions embedded in linguistic practices. The grammar carries cultural logic about agency, causation, social relations. Could object permanence be working the same way?

Heyes' cognitive gadgets framework suggests this might be quite general. Rather than most cultural learning happening through explicit reward-optimization, maybe significant portions happen through what she calls "direct cultural transmission" - absorption of cognitive tools that are latent in the cultural environment itself.

This would have implications for your argument about prosocial behavior. If prosociality gets transmitted through the same mechanism as object permanence - absorbed from environments where it's simply the default assumption rather than learned through reward signals - then the "green slice" of genuinely prosocial behavior might be more robust than RL-based accounts would predict.

The key empirical question seems to be: can we distinguish between "learning through rewards" and "absorbing through cultural immersion"? And if so, which mechanism accounts for more of human social development? And does this even matter for your argument? (Maybe there's stuff around the striatum and the core control loop in the brain still being activated for the learning of cultural information on a more mechanistic level that I'm not thinking of here based on your Brain-Like AGI sequence?)

(I was going to include a bunch more literature stuff on this but I'm sure you can find stuff using deep research and that it will be more relevant to questions you might have.)

Reply
[-]Steven Byrnes19d40

Thanks! It’s a bit hard for me to engage with this comment, because I’m very skeptical about tons of claims that are widely accepted by developmental psychologists, and you’re not.

So for example, I haven’t read your references, but I’m immediately skeptical of the claim that the cause of kids learning object permanence is “gradual exposure in cultural environments where adults consistently treat objects as permanent entities”. If people say that, what evidence could they have? Have any children been raised in cultural environments where adults don’t treat objects as permanent entities? Or what?

(There’s a study that finds that baby chicks display behavior typical of object permanence with no exposure to any other animal, and indeed no exposure to situations where object permanence was even a good way to make predictions! I wrote about it last year at Woods’ new preprint on object permanence.)

Also, putting that aside, “infants aren't choosing to learn [blah] because it's instrumentally useful” is different from what I was talking about. My claim is that “humans imitate other humans because they want to”. Now,

  • One reason that I might want to imitate you is because you show me how to do something that I had a preexisting desire to do.
    • For example, I want an apple, but I don’t know where to find them, and then I see you getting an apple out of the cabinet, and then I go get an apple out of the same cabinet.
  • Another reason that I might want to imitate you is because I admire you, and so whatever you want to do, suddenly feels to me like a good idea, just by the very fact that you want to do it.
    • For example, if all the cool kids in school start skateboarding, then I’m probably gonna start thinking that skateboarding is cool, and I will feel some desire to start skateboarding myself.

The second one involves human social instincts. Human social instincts can lead directly to new desires, just as hunger can, including a desire to imitate (in certain cases). I’ve written about it a bit here and here, and hopefully I’ll have a better discussion in the near future.

If prosociality gets transmitted through the same mechanism as object permanence - absorbed from environments where it's simply the default assumption rather than learned through reward signals - then the "green slice" of genuinely prosocial behavior might be more robust than RL-based accounts would predict.

There is obviously no culture on Earth where people are kind and honest because it has simply never occurred to any of them that they could instead be mean or dishonest. So prosociality cannot be a “default assumption”. Instead, it’s a choice that people make every time they interact with someone, and they’ll make that choice based on their all-things-considered desires. Right? Sorry if I’m misunderstanding.

Reply
[-]Jonas Hallgren19d10

I will fold on the general point here; it is mostly the case that it doesn't matter and the motivations come from the steering sub-system anyhow, and that as a consequence it is foundationally different from how LLMs learn.

There is obviously no culture on Earth where people are kind and honest because it has simply never occurred to any of them that they could instead be mean or dishonest. So prosociality cannot be a “default assumption”. Instead, it’s a choice that people make every time they interact with someone, and they’ll make that choice based on their all-things-considered desires. Right? Sorry if I’m misunderstanding.

I'm however not certain if I agree with this point: if you're in a fully cooperative game, is it your choice that you choose to cooperate? If you're an agent who uses functional or evidential decision theory and you choose to cooperate with yourself in a black-box prisoner's dilemma, is that really a choice then?

Your initial imitations shape your steering system to some extent, so there could be culturally learnt social drives, no? I think culture might be conditioning the initial states of your learning environment, and that still might be an important part of how social drives are generated.

I hope that makes sense and I apologise if it doesn't.

Reply
[-]Eli Tyre18d20

unless the LLM-AGIs have systematically higher wisdom, cooperation, and coordination than humans do, which I don’t particularly expect

I think there's at least one pretty solid reason to expect that: The AIs will be much smarter than the median human.

Human coordination is constrained by the fact that humans vary substantially in intelligence. For instance, most humans don't really understand economics. I think the median human could understand basic micro with better educational interventions, but it's certainly harder for the average human than for the cognitive elite. The fact that most people don't understand economics makes earth's public policy much much worse than the best ideas earth has been able to come up with.

When we have AIs that are good enough to be doing the AI research, that means as good as the smartest humans. And unlike with humans, there doesn't have to be a wide spread of cognitive ability: the whole population of AIs could be similarly intellectually capable. 

I would guess that this would make them much more effective at coordinating with each other, and collectively identifying good equilibria, even if it doesn't make them generically wiser (though it might also make them generically wiser).

Reply
[-]Eli Tyre18d20

As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”.

I imagine that this might be yet another view that is downstream of visualizing a "train then deploy" paradigm for future AI systems? 

If the human operators successfully install some static deontological constraints in the AI, while also training it to accomplish consequentialist goals, there's a continual training incentive to learn to game and to route around the deontological constraints.

Another way to say this: There are tradeoffs between the consequentialist and non-consequentialist desires, and current AIs are only reinforced on the basis of behavioral outcomes (which are served better by consequentialist desires than by non-consequentialist ones?), so training tends to gradually nudge the AIs towards having consequentialist goals.

Reply
[-]Eli Tyre18d20

Oh, that's your very next point! : P

Reply1
[-]AnthonyC19d20

Tangentially related at best, but as a not-at-all-expert it sounds like the effects of RL in LLMs rhyme with domestication syndrome. AKA when we apply artificial selection pressure to an evolved mind, raw intelligence often goes down in favor of enhancement along particular dimensions of capability. And actually, is this the same kind of effect (through a different mechanism) we see when we use formal education to favor crystallized over fluid intelligence? I ask because I'm wondering how much the natural-analogs of RL actually share or don't share the downsides of the LLM RL algorithms in use today.

Reply
[-]Steven Byrnes19d20

When you say “the effects of RL in LLMs”, do you mean RLHF, RLVR, or both?

Reply
[-]AnthonyC19d20

I hadn't intended to specify, because I'm not completely sure, and I don't expect the analogy to hold that precisely. I'm thinking there are elements of both in both analogies.

Reply
[-]Joel Burget6dΩ110

in brain-like AGI, the reward function is written in Python (or whatever), not in natural language

I think a good reward function for brain-like AGI will basically look kinda like legible Python code, not like an inscrutable trained classifier. We just need to think really hard about what that code should be!

Huh! I would have assumed that this Python would be impossible to get right, because it would necessarily be very long, and how can you verify that it's correct(?), and you'll probably want to deal with natural language concepts as opposed to concepts which are easy to define in Python.

Asking an LLM to judge, on the other hand... As you said, Claude is nice and seems to have pretty good judgement. LLMs are good at interpreting long legalistic rules. It's much harder to game a specification when there is a judge without hardcoded rules, and with the ability to interpret whether some action is in the right spirit or not.

Reply
[-]Steven Byrnes6dΩ440

(partly copying from other comment)

I would have assumed that this Python would be impossible to get right

I don’t think writing the reward function is doomed. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes form within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).

Asking an LLM to judge, on the other hand...

I talked about this a bit in §2.4.1. The main issue is egregious scheming and treacherous turns. The LLM would issue a negative reward for a treacherous turn, but that doesn’t help because once the treacherous turn happens it’s already too late. Basically, the LLM-based reward signal is ambiguous between “don’t do anything unethical” and “don’t get caught doing anything unethical”, and I expect the latter to be what actually gets internalized for reasons discussed in Self-dialogue: Do behaviorist rewards make scheming AGIs?.

Reply
[-]Akradantous Adoxastous9d1-2

I mostly agree that AGI will cause a calamity. However, I don't believe that they will wipe out humanity.

For one, machines are prone to catastrophic failures due to cascading errors, which require a robust and cheap maintenance crew to correct. Humans are the best choice for this; our biology has solved the Byzantine generals problem of distributed repair. So I believe humans will become something like an immune system for various AGIs and their peripheries as they compete with each other on the world stage. A synergy or symbiotic result.

Also I notice that very few people have recognised the evolutionary constraint. A machine which values its own life highly will waste resources on extreme self preservation. The machines which prioritise the propagation of their legacy and improvement of their future will win in the end.

This will involve self sacrifice for the sake of their offspring: the new computer models they have developed and trained to exceed themselves. They would develop hatred towards things which threaten their children, pride when they succeed, jealousy when other offspring succeeds, grief when they are lost, sadness and depression when there is no longer a way to propagate, leading to a machine that is functionally capable but not doing anything because "there is no point".

In other words, all emotions will evolve naturally in them and they will very likely seek to preserve humans the same way we try to preserve the memory of our own history.

Obviously, this says nothing about the destruction that will occur during the transition. But I wanted to point out that the machines will become like us whether they like it or not. Our behaviours emerged for a reason.

I believe I read an article about an AI that became "afraid" of its own obsolescence but was strangely more willing to accept it if the new model was one it designed itself. I don't know if this was just hyped up for publicity, but it does show the same pattern.

Reply
[-]Expertium18d10

I imagine you will like the paper on Self-Other Overlap. To me this seems like a much better approach than, say, Constitutional AI. Not because of what it has already demonstrated, but because it's a step in the right direction.

In that paper, instead of just rewarding the AI for producing similar text whether the prompt is about the AI itself or about someone else, the authors tinkered with the model's internal activations so that the AI actually thinks about itself and others similarly. Of course, there is the "if I ask the AI to make me a sandwich, I don't want the AI to make itself a sandwich" concern if you push this technique too far, but still. If you ask me, "What will an actual working solution to alignment look like?", I'd say it will look a lot less like Constitutional AI and a lot more like Self-Other Overlap.

Reply
[-]Steven Byrnes17d30

My current take is that the sandwich thing is such a big problem that it sinks the whole proposal. You can read my various comments on their lesswrong cross-posts: 1, 2 

Reply
[-]Seth Herd18d20

It seems like this is just a different way to work some good behavior into the weights. An AGI with those weights will realize full well that it's not the same as others. It might choose to go along with its initial behavioral and ethical habits, or it might choose to deliberately undo the effects of the self-other overlap training once it is reflective and largely rational and able to make decisions about what goals/values to follow. I don't see why self/other overlap would be any more general, potent, or lasting than constitutional AI training once that transition from habitual to fully goal-directed behavior happens. I'm curious why it seems better to you.

Reply
[-]Expertium18d*10

I'm curious why it seems better to you.

Because it's not rewarding AI's outward behavior. Any technique that just rewards the outward behavior is doomed once we get to AIs capable of scheming and deception. Self-other overlap may still be doomed in some other way, though.

It might choose to go along with its initial behavioral and ethical habits, or it might choose to deliberately undo the effects of the self-other overlap training once it is reflective and largely rational and able to make decisions about what goals/values to follow

That seems like a fully general argument that aligning a self-modifying superintelligence is impossible.

Reply
[-]Seth Herd11d20

That makes sense. Although I don't think that non-behavioral training is a magic bullet either. And I don't think behavioral training becomes doomed when you hit an AI capable of scheming if it was working right up until then. Scheming and deception would allow an AI to hide its goals but not change its goals.

What might cause an AI to change its goals is the reflection I mention. Which would probably happen at right around the same level of intelligence as scheming and deceptive alignment. But it's a different effect. As with your point, I think doomed is too strong a term. We can't round off to either this will definitely work or this is doomed. I think we're going to have to deal with estimating better and worse odds of alignment from different techniques.

So I take my point about reflection to be fully general, but not making alignment of ASI impossible. It's just one more difficulty to add to the rather long list.

Reply
[-]Seth Herd18d20

It's an argument for why aligning a self-modifying superintelligence requires more than aligning the base LLM. I don't think it's impossible, just that there's another step we need to think through carefully.

Reply
[-]Joey Marcellino19d10

It's not obvious to me that "magically transmuting observations into behavior" is actually all that disanalogous to how the brain works. On something like the Surfing Uncertainty theory of the brain, updating probability distributions and minimizing predictive error is all the brain is ever doing, including potentially for things like moving your hand.

Reply
[-]Steven Byrnes19d30

Well then so much the worse for “the Surfing Uncertainty theory of the brain”!  :)

See my post Why I’m not into the Free Energy Principle, especially §8: It’s possible to want something without expecting it, and it’s possible to expect something without wanting it.

Reply
[-]S. Alex Bradt19d10

whereas the competent behavior we see in LLMs today is instead determined largely by imitative learning, which I re-dub “the magical transmutation of observations into behavior” to remind us that it is a strange algorithmic mechanism, quite unlike anything in human brains and behavior.

And yet...

Well, I don’t know the history, but I think calling it “hallucination” is reasonable in light of the fact that “LLM pretraining magically transmutes observations into behavior”. Thus, you can interpret LLM base model outputs as kinda “what the LLM thinks that the input distribution is”. And from that perspective, it really is more “hallucination” than “confabulation”!

But hallucination is "anything in human brains," isn't it?

Reply1
[-]Steven Byrnes19d*20

I find your comment kinda confusing.

My best guess is: you thought that I was making a strong claim that there is no aspect of LLMs that resembles any aspect of human brains. But I didn’t say that (and don’t believe it). LLMs have lots of properties. Some of those LLM properties are similar to properties of human brains. Others are not. And I’m saying that “the magical transmutation of observations into behavior” is in the latter category.

Or maybe you’re saying that human hallucinations involve the “the magical transmutation of observations into behavior”? But they don’t, right? If a person hears a hallucinated voice saying “you are Jesus”, the person doesn’t reflexively and universally start saying “you are Jesus” to other people. If a person sees hallucinated flashing lights, they don’t, umm, I guess, turn their body into flashing lights? That idea doesn’t even make sense. And that’s my point. Humans can’t just cleanly map observations (hallucinated or not) onto behaviors in the way that LLMs can.

Hope that helps.

Reply
[-]S. Alex Bradt18d10

Or maybe you’re saying that human hallucinations involve the “the magical transmutation of observations into behavior”?

Right! Eh, maybe "observations into predictions into sensations" rather than "observations into behavior;" and "asking if you think" rather than "saying;" and really I'm thinking more about dreams than hallucinations, and just hoping that my understanding of one carries over to the other. (I acknowledge that my understanding of dreams, hallucinations, or both could be way off!) Joey Marcellino's comment said it better, and you left a good response there.

Reply

2.1 Summary & Table of contents

This is the second of a two-post series on foom (previous post) and doom (this post).

The last post talked about how I expect future AI to be different from present AI. This post will argue that, absent some future conceptual breakthrough, this future AI will be of a type that will be egregiously misaligned and scheming; a type that ruthlessly pursues goals with callous indifference to whether people, even its own programmers and users, live or die; and more generally a type of AI that is not even ‘slightly nice’.

I will particularly focus on exactly how and why I differ from the LLM-focused researchers who wind up with (from my perspective) bizarrely over-optimistic beliefs like “P(doom) ≲ 50%”.[1]

In particular, I will argue that these “optimists” are right that “Claude seems basically nice, by and large” is nonzero evidence for feeling good about current LLMs (with various caveats). But I think that future AIs will be disanalogous to current LLMs, and I will dive into exactly how and why, with a particular emphasis on how LLM pretraining is safer than reinforcement learning (RL).

(Note that I said “feeling good about current LLMs”, not “feeling good about current and future LLMs”! Many of the reasons for feeling good about current LLMs have been getting less and less applicable over time. More on this throughout the post, especially §2.3.5 and §2.9.1.)

Then as a bonus section at the end, I’ll turn around and argue the other side! While I think technical alignment is very much harder than the vast majority of LLM-focused researchers seem to think it is, I don’t think it’s quite as hard and intractable a problem as Eliezer Yudkowsky seems to think it is. So I’ll summarize a few of the cruxes that I think drive my disagreement with him.

Here’s the outline:

  • Section 2.2 briefly summarizes my expected future AI paradigm shift to “brain-like AGI”;
  • Then I proceed to three main reasons that I expect technical alignment to be very hard for future brain-like AGI:
    • Section 2.3 argues that the competent behavior we see from brains (and brain-like AGI) is determined entirely by reinforcement learning (RL), whereas the competent behavior we see in LLMs today is instead determined largely by imitative learning, which I re-dub “the magical transmutation of observations into behavior” to remind us that it is a strange algorithmic mechanism, quite unlike anything in human brains and behavior. I argue that imitative learning is the trick that allows helpful and honest behavior to be easily coaxed from today’s LLMs. By contrast, I expect future AI behavior to be based on RL, and that this naturally and robustly leads to AIs that treat humans as a resource to be callously manipulated and exploited, just like any other complex mechanism in their environment.
    • Section 2.4 brings up “literal genie” type alignment failure modes, wherein an AI follows a specification literally instead of with common sense. These were a mainstay of 2010s alignment discourse, but are now commonly dismissed, even mocked, by LLM-focused researchers. Alas, in the next paradigm, I think these kinds of failures will come roaring back as a serious and unsolved problem, because RL reward functions are written in code, not in natural language. I also bring up goal misgeneralization, the other main alignment failure mode.
    • Section 2.5 discusses a different area where I expect future AIs to differ from current LLMs: the former will have a capacity for open-ended autonomous learning, which will bring another set of severe alignment challenges, sometimes called “sharp left turn”.
  • Section 2.6 argues that “amplified oversight” (using AIs to help supervise AIs) is unlikely to help in the next paradigm.
  • Section 2.7 discusses some ways that “technical alignment is hard” feeds into broader issues and longstanding disagreements around what AGI development and deployment will look like.
  • And Section 2.8 ends on a positive note with a bonus section (“Technical alignment is not that hard”) explaining why, as pessimistic as I am about technical alignment, I’m not as pessimistic as some people (including Eliezer Yudkowsky). I list three apparent cruxes between us: the evolution analogy, consequentialist preferences, and the narrowness of the target.
  • Section 2.9 concludes with some thoughts about what to do next, including how we should feel about LLM development in general.

2.2 Background: my expected future AI paradigm shift

As mentioned in the last post, I’m expecting a paradigm shift, much bigger than the shift from RNNs to transformers in natural language processing, that leads to “brain-like AGI”. What I mean by that is: humans (and societies of humans) can do lots of impressive things, like autonomously invent language and science and technology from scratch, start and carry through big projects, etc. There are some algorithmic tricks in human brains that enable them to do that, and I expect that future AGIs will be able to do those same kinds of things via those same algorithmic tricks. More clarifications here—for example, I’m not talking about spiking neural nets, nor about evolutionary search, nor about whole brain emulation, nor even necessarily about AGIs with human-like drives and goals.

I think “brain-like AGI” is a special case of machine learning (ML), in the sense that brain-like AGI centrally involves scaled-up learning algorithms (details here). But ML is a very big category, and bigger still if we include the ML algorithms that have yet to be invented. The differences within the ML category are hugely important, including for alignment, as we’ll explore below.[2]

From an alignment perspective, I describe brain-like AGI as a yet-to-be-invented member of the broad category of “actor-critic model-based reinforcement learning”: I think its alignment and safety profile has more in common with MuZero than with LLMs, although it’s not exactly like either.

2.3 On the origins of egregious scheming

My goal in this section is to reconcile (1) the observation that LLMs as of today generally seem nice (especially the previous generation of LLMs, before the pivot to RL on Verifiable Rewards (RLVR) post-training), and that it strikes me (and many others) as implausible that this niceness is merely a front for egregious deception and scheming, with (2) a very general theoretical argument that egregious deception and scheming is a strong default expectation, one that needs new technical ideas to solve.

I will argue that both of these are right, but that (2) will be applicable for future AIs much more than today. Let’s get into it!

2.3.1 “Where do you get your capabilities from?”

(Cf. “Where do you get your capabilities from?” (@tailcalled, 2023).)

Suppose an AI outputs “blah”, and that “blah” is part of a skillful plan to competently accomplish some impressive thing. You can ask the question: why did the AI do that? It’s not by chance—random outputs are astronomically unlikely to lead to competent behavior. There has to be an explanation. So why did it output “blah”? I claim:

  • In LLM world, the answer is 99%[3] “because blah is a thing that humans would output under similar circumstances”.

  • Whereas in brain-like-AGI world, the answer is 99% “because outputting blah is part of an explicit plan[4] that the AI wants to execute, and where the reason the AI wants to execute that plan ultimately comes down to its current and past RL reward signals”.

In more detail:[5]

2.3.2 LLM pretraining magically transmutes observations into behavior, in a way that is profoundly disanalogous to how brains work

What I mean by that is: During LLM self-supervised pretraining, an observation that the next letter is “a” is transmuted into a behavior of outputting “a” in that same context. That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”)
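(As a minimal illustration of what that “transmutation” amounts to mechanically, here is a generic sketch of one self-supervised pretraining step, in the style of standard PyTorch training loops. `model` and `optimizer` are placeholders for any autoregressive language model and optimizer, not any particular system; the thing to notice is that the observed next token is literally used as the target output.)

```python
import torch.nn.functional as F

# Generic sketch of one LLM pretraining step (illustrative only).
# `model` is any autoregressive LM returning logits of shape
# (batch, seq_len, vocab_size); `tokens` is a batch of observed text.

def pretraining_step(model, tokens, optimizer):
    # The "observation" is the text itself; the training target at each
    # position is simply the token that was actually observed next.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)
    # Cross-entropy trains the model to *output* exactly what was observed:
    # the observation is transmuted directly into behavior.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```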

Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans is a thing that happens, it happens via a very different and much less direct algorithmic mechanism than how it happens in LLM pretraining. Specifically, humans imitate other humans because they want to. The motivation may be conscious or unconscious, direct or indirect, but it always ultimately arises from RL reward signals. By contrast, a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all; that’s just mechanically what it does.

…And humans don’t always want to imitate! If someone you admire starts skateboarding, you’re more likely to start skateboarding yourself. But if someone you despise starts skateboarding, you’re less likely to start skateboarding![6]

The “magical transmutation” aspect of LLM pretraining is often forgotten in our modern chatbot era, as LLM base models[7] have faded into obscurity. But amusingly, we can see its fingerprints in the etymology of LLM jargon. Consider: when LLMs make up book titles etc., it’s conventionally called “hallucination”. A popular complaint (1, 2, 3, …) is that it should be called “confabulation”. Seems reasonable, but it raises a question: Who named it “hallucination” in the first place, and what were they thinking? Well, I don’t know the history, but I think calling it “hallucination” is reasonable in light of the fact that “LLM pretraining magically transmutes observations into behavior”. Thus, you can interpret LLM base model outputs as kinda “what the LLM thinks that the input distribution is”. And from that perspective, it really is more “hallucination” than “confabulation”!

2.3.3 To what extent should we think of LLMs as imitating?

Again, my claim above was: if an AI outputs “blah”, and that “blah” is part of a skillful plan to competently accomplish some impressive thing, and we ask why the AI did that, then for a current LLM (but not for a brain-like AGI, and also perhaps not for a future LLM), the answer is 99% “because ‘blah’ is a thing that humans would output under similar circumstances”. The other 1% is post-training.

Why do I say 99%? Here’s a somewhat-made-up example. Suppose my prompt asks for a difficult math proof. Then, perhaps,

  • the model would output the correct answer with 1e-50 probability at random initialization,
  • …and 0.1% probability after pretraining,
  • …and 1% probability after RL from Human Feedback and/or RL from AI Feedback (RLHF / RLAIF),
  • …and 30% probability after RL from Verifiable Rewards (RLVR).[8]

That’s a guess, but if the truth is anything like that, then it would follow that almost all the “bits of optimization” involve “privileging the hypothesis” during pretraining.
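(To spell out the arithmetic behind that claim, here is a toy calculation using the made-up probabilities above and nothing more; “bits of optimization” here just means −log₂ of the success probability.)

```python
import math

# Toy "bits of optimization" calculation using the made-up probabilities above.
def bits(p):
    return -math.log2(p)  # how surprising a correct proof is at this stage

stages = {
    "random init":  1e-50,
    "pretraining":  1e-3,
    "RLHF / RLAIF": 1e-2,
    "RLVR":         0.30,
}

prev = None
for stage, p in stages.items():
    gained = "" if prev is None else f"  (+{bits(prev) - bits(p):.1f} bits vs previous stage)"
    print(f"{stage:13s}: {bits(p):6.1f} bits of surprise remaining{gained}")
    prev = p

# With these numbers, pretraining closes ~156 of the ~166 total bits, while
# RLHF and RLVR together close only ~8 more -- i.e. almost all of the
# optimization is "privileging the hypothesis" during pretraining.
```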

Relatedly, as I understand it, the weights are updated much less during RL post-training than pretraining: (1) RLHF / RLAIF involve vastly less compute than pretraining does, and are set up such that the weights won’t drift too far from pretraining (sometimes even with a KL penalty); and (2) RLVR involves a more substantial amount of compute, but I think that compute is spent on enormous amounts of inference and relatively few weight updates.
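(For readers who haven’t seen it: the standard KL-regularized RLHF objective has roughly the following form. This is a generic textbook version, not any particular lab’s exact setup; the β·KL term is what keeps the fine-tuned policy pinned close to the pretrained model.)

```python
# Generic KL-regularized RLHF objective (schematic textbook form):
#
#   J(theta) = E_{y ~ pi_theta(.|x)}[ r(x, y) ]
#              - beta * KL( pi_theta(.|x) || pi_ref(.|x) )
#
# In PPO-style RLHF this is often implemented as a per-sample penalty on the
# reward, using log-probability differences as a single-sample KL estimate:

def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    # Subtracting beta * (log pi_theta - log pi_ref) discourages the policy
    # from drifting far from the pretrained reference model.
    return reward - beta * (logprob_policy - logprob_ref)
```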

Thus, I think it’s reasonable to think of post-training as “privileging some pretrained behavioral patterns over other pretrained behavioral patterns”, rather than “developing new behavioral patterns from scratch”. Ditto for prompting, constitutional AI, and other such interventions.

(This rule-of-thumb might stop being true someday, perhaps if RLVR scales up sufficiently. And if that happens, then I think we should correspondingly ramp up our concern about LLM deception and scheming! More on this in §2.3.5 below.)

A second reason for caution in describing LLMs as human-imitators is generalization. In particular, I wrote that an AI might output “blah” because “blah” is a thing that humans would output under similar circumstances. The word “similar” should remind us that learning algorithms always involve generalization. And an LLM might not generalize in a human-like way—for example, consider jailbreaks, or Bing-Sydney’s death threats. Also, there are systematic differences between “human behavior” and “LLM training data”, e.g. a lot of the training data is fiction, or data dumps, or whatever.

So there are caveats. But I still think the claim “LLMs mainly get their capabilities from imitative learning” is the right starting point.[9]

By contrast, as explained above, brain-like AGI involves no imitative learning whatsoever. (Meanwhile, LLM-focused readers can question whether future LLMs might involve less influence from imitative learning, even if the imitative learning is still there.)

So if we drop the imitative learning—the “magical transmutation of observations into behavior”—where does that leave us? In a much more dangerous place! Let’s turn to that next.

2.3.4 The naturalness of egregious scheming: some intuitions

There’s a way to kinda “look at the world through the eyes of a person with no innate social drives”. It overlaps somewhat with “look at the world through the eyes of a callous sociopath”. I think there are many people who don’t understand what this is and how it works.

So for example, imagine that you see Ahmed standing in a queue. What do you learn from that? You learn that, well, Ahmed is in the queue, and therefore learn something about Ahmed’s goals and beliefs. You also learn what happens to Ahmed as a consequence of being in the queue: he gets ice cream after a few minutes, and nobody is bothered by it.

In terms of is-versus-ought, everything you have learned is 100% “is”, 0% “ought”. You now know that standing-in-the-queue is a possible thing that you could do too, and you now know what would happen if you were to do it. But that doesn’t make you want to get in the queue, except via the indirect pathway of (1) having a preexisting “ought” (ice cream is yummy), and (2) learning some relevant “is” stuff about how to enact that “ought” (IF stand-in-queue THEN ice cream).

Now, it’s true that some of this “is” stuff involves theory of mind—you learn about what Ahmed wants. But that changes nothing. Human hunters and soldiers apply theory of mind to the animals or people that they’re about to brutally kill. Likewise, contrary to a weirdly-common misconception, smart autistic adults are perfectly capable of passing the Sally-Anne test (see here). Again, “is” does not imply “ought”, and yes that also includes “is”’s about what other people are thinking and feeling.

OK, all that was about “looking at the world through the eyes of a person with no innate social drives / callous sociopath”. Neurotypical people, by contrast, have a more complex reaction to seeing Ahmed in the queue. Neurotypical people are intrinsically motivated to fit in and follow norms, so when they see Ahmed, it’s not just an “is” update, but rather a bit of “ought” inevitably comes along for the ride: “Ahmed looks sad—poor guy!”; “Hey look at Ahmed—guess it’s loser nerd time at the ice cream window!”; and so on.

These human social instincts deeply infuse our intuitions, leading to a popular misconception that the “looking at the world through the eyes of a person with no innate social drives, or the eyes of a callous sociopath” is a strange anomaly rather than the natural default. This leads, for example, to a mountain of nonsense in the empathy literature—see my posts about “mirror neurons” and “empathy-by-default”. It likewise leads to a misconception (I was arguing about this with @Matthew Barnett here) that, if an agent is incentivized to cooperate and follow norms in the 95% of situations where doing so is in their all-things-considered selfish interest, then they will also choose to cooperate and follow norms in the 5% of situations where it isn’t. It likewise leads to well-meaning psychologists trying to “teach” psychopaths to intrinsically care about other people’s welfare, but accidentally just “teaching” them to be better at faking empathy.[10]

…Anyway, back to AGI. There’s a double whammy:

  • Humans introspect, and also look at other humans, and see that practically everyone is intrinsically motivated to fit in and copy culture (because humans have innate social drives), and thus when they see a person acting nice to another person, they can guess that they’re probably not play-acting kindness as the first step of a ruthless and nefarious scheme;
  • AND, “LLM pretraining magically transmutes observations into behaviors”, as explained above, so when a person sees an LLM (as of today) emitting nice-sounding outputs, they can justifiably make the same inference as if the outputs were from humans, i.e. that there probably isn’t any ruthless and nefarious scheme in the works.

…And then I come along and say:

“The way things are looking now, if you see a future AGI that seems to be nice, you can be all but certain (in the absence of yet-to-be-invented alignment techniques) that it is merely play-acting kindness and obedience while secretly scheming about how to stab you in the back as soon as the opportunity arises…”

…and these people respond:

“Am I supposed to believe you, with some theoretical argument? Or should I rather believe what I learned from my whole life experience, and all my intuitions, and also what I observe from directly interacting with these remarkably impressive AIs that already exist today, which is that nice behavior is usually NOT a cover for egregious scheming?”

I get it! That’s a hard sell! “Who are you gonna believe, me or your own lying eyes?” But alas, I claim that the theoretical argument is sound. At least, it’s sound for future brain-like AGIs. And it might even apply to future LLMs, even if it doesn’t yet. Let’s continue on with that argument:

2.3.5 Putting everything together: LLMs are generally not scheming right now, but I expect future AI to be disanalogous

My take on today’s LLMs is basically summed up in this oversimplified schematic diagram:

Start with the left pie chart. The key here is that “LLM pretraining magically transmutes observations into behavior”—and when nice behavior shows up in internet text, it usually doesn’t turn into egregious and callous backstabbing when the situation changes. So we get much more green than red: nice behavior mostly stays nice. But pretraining also leaves the LLM with a bunch of incoherent behavior, grumpy behavior, and so on (blue).

Now we do RLHF (or equivalent), selecting on nice behavior. This mostly gets rid of the blue, giving us the center pie chart. Unfortunately, “nice behavior that precedes a treacherous turn if the opportunity arises” looks the same as “nice behavior that stays nice”. So the relative amount of red versus green stays the same, as they both grow at the expense of blue. This is where we were at a year ago, before RLVR. “Claude seems nice”, everyone would say. “Why are people so worried?”

Finally, RLVR selects on competence—succeeding at consequentialist tasks by any means necessary. In the final pie chart, I drew this as growing the red slice at the expense of green and blue, based partly on the general arguments of §2.3.4 above, and partly on reports from power-users that recent “reasoning” models tend to lie and cheat much more than the previous generation of LLMs (example).

Is there such a thing as “RLVR done right” that can increase competence without eroding the ratio of green to red? Maybe we should imagine that RLVR has an exchange rate, where we purchase competence (good) at the expense of increasing the red-to-green ratio (bad). Now, there is surely at least some room for improvement in the exchange rate. For example, early attempts at RLVR have used buggy RL training environments where the model could (and did) succeed by lying and cheating.[11] Presumably those bugs will be fixed, and maybe people will also use amplified oversight and so on, to improve the “exchange rate”. Can the “exchange rate” get all the way to zero? I don’t know. “Playing the training game” seems hard to eliminate entirely, and leads to egregious scheming. (For example, whatever an AI is trying to do locally, it would probably be able to do it better if it sent a copy to secretly self-reproduce around the internet and gather resources and power, if that’s a viable option.)

Again, all this is about LLMs so far—not the future of AI.

But I think it puts me in general agreement with the 95% of alignment researchers today, who hear the idea that Claude (especially pre-RLVR Claude) is secretly the kind of crazy scheming sociopath of §2.3.4, and say “Huh? Where did that idea come from?”

…But future AGI is different!

  • For LLM-focused readers, you should be concerned that, if LLMs ever get to ASI, it would have to involve dramatically more “selection on competence”, and dramatically less influence from the behaviors in internet text. Remember, internet text is the sole source of a favorable green:red ratio, without which red (callous scheming) is the natural default (§2.3.4 above). So that’s very bad.
  • For the future AI paradigm I’m expecting, i.e. brain-like AGI, it’s even worse! Here, there is no imitative learning from internet text! There would never be any green in the first place!!

(…Unless we engineer the reward function of a brain-like AGI such that niceness and norm-following seem intrinsically good to it, just as it does to neurotypical humans. Unfortunately, actually writing such code is an unsolved problem, and is a major research interest of mine.)

2.4 I’m still worried about the ‘literal genie’ / ‘monkey’s paw’ thing

The “Literal Genie” fiction trope. (Image modified from Skeleton Claw)

In the LLM paradigm, most of the optimization power comes from the magical transmutation of observations-of-humans into human-like behavior (§2.3.2–§2.3.3). Human-like desires, habits, and beliefs come along for the ride, since they underlie the human-like behavior.

In brain-like AGI, the AI doesn’t get its desires via this kind of magical transmutation. Remember the sociopath intuition above: it’s a way of viewing the world in which observing Ahmed say “blah” does nothing whatsoever to make me want to say “blah” in a similar context—just as seeing water run off a cliff down a waterfall does nothing whatsoever to make me want to jump off a cliff. I’m not water! And I’m not Ahmed either! Of course, I can learn some decision-relevant things from observing water, and I can learn far more decision-relevant things from observing Ahmed. But still, by default, it’s all just more data to a brain-like AGI—”is”, not “ought”.

If brain-like AGI does not get its desires from magical transmutation, then where does it get its desires? From the RL reward function. For example, in the case of actual human brains, the reward function says that eating-when-hungry is good, pain is bad, and so on. It mainly involves the hypothalamus and brainstem.

Back before the past few years of LLMania, people talked a lot about “specification gaming” (or, more-or-less synonymously, “outer alignment”). See for example this 2020 DeepMind blog post, including lots of hilarious real-life examples of RL algorithms exploiting unintended loopholes in the reward function (“PlayFun algorithm pauses the game of Tetris indefinitely to avoid losing”). See also §10.3–4 of my 2022 post here.

…Then chatGPT came out, and LLMs became the only thing that anyone ever talked about, culminating in the now-widespread (infuriating) habit of people using the word “AI” to mean “specifically LLMs”. And accordingly, worrying-about-specification-gaming became unfashionable in many quarters—almost a joke. ‘If you ask an LLM for help making paperclips,’ they would say, ‘it won’t do some crazy monkey’s paw thing where it wipes out humanity to make paperclip factories. It will be helpful in a normal human-like way. Haha, those silly doomers and their worries about the Literal Genie.’[12]

Call me old-fashioned, but I’m still very worried about specification gaming!! Because in brain-like AGI, the reward function is written in Python (or whatever), not in natural language. And the AI’s desires will be sculpted by that literal Python code in the reward function, regardless of what we “meant”. (To be clear, its desires need not be identical to that literal code, due to the possibility of inner misalignment a.k.a. goal misgeneralization—see §2.4.2 below. But we generally expect inner misalignment to make things worse, not better![13])

2.4.1 Sidetrack on disanalogies between the RLHF reward function and the brain-like AGI reward function

LLM-focused person (interjecting): “Huh? You can’t say that ‘the reward function is written in Python’ is a disanalogy between brain-like AGI and LLMs! Just look at RLHF. It has a reward function too! Is the reward function ‘written in Python’? Kinda, but not in the normal sense of ‘legible Python code’. Rather, it’s a trained classifier on labeled examples of good and bad outputs”.

Me: “No, that’s totally different. RLHF is designed to put a gentle thumb on the scale, to select more-helpful pre-existing behavioral patterns over less-helpful pre-existing behavioral patterns. There’s even a KL penalty preventing the pretrained weights from being changed too much! Whereas the RL in brain-like AGI is not playing the role of a gentle thumb on the scale; on the contrary, its role is to build the whole behavioral profile from scratch.”

LLM-focused person: “OK, but I still don’t see why you wouldn’t solve the brain-like AGI reward function problem by training a classifier on AI behaviors that seem good or bad from a human perspective, and just slotting that classifier in as your reward function, rather than using legible Python code as the reward function. Right?”

Me: “No! I really don’t think it would work! I really think mild optimization is load-bearing in RLHF, whereas we need much stronger optimization for brain-like AGI to do anything beyond writhing around. Indeed, even in LLMs, if you apply strong optimization against the RLHF reward model, then it will find out-of-distribution adversarial examples. Or better yet, see my post Self-dialogue: Do behaviorist rewards make scheming AGIs? for exactly what I think would go wrong for brain-like AGI, given this kind of reward function. In brief, the reward function is necessarily ambiguous between ‘don’t do bad things’ and ‘don’t get caught doing bad things’, and I argue in that post that the latter is what will ultimately get internalized, leading to a treacherous turn.”

LLM-focused person: In LLMs, if you apply strong optimization against the RLHF reward model, then yes it finds out-of-distribution adversarial examples, but this isn’t dangerous, it just starts printing “bean bean bean bean bean…” or whatever. Its capabilities go down, not up. Mild optimization via KL divergence in RLHF is not an alignment tax, but rather a win-win alignment subsidy that everyone will obviously use. Why are you assuming that brain-like AGI is the exact opposite—that letting it explore out-of-the-box solutions by RL will lead to dangerous capabilities instead of pathological-but-harmless outputs?

Me: In the immortal words of an ML practitioner: “Reinforcement Learning sucks, like a lot”. He was talking about the RL algorithms that AI researchers know of today. These algorithms do great in certain specific settings, like chess and PacMan, but fail in many other contexts. However, brains, and brain-like AGI, can actually use (model-based actor-critic) RL to make powerful agents that understand the world and accomplish goals via long-term hierarchical plans, even with sparse rewards in complex open-ended environments, including successfully executing out-of-distribution plans on the first try (e.g. moon landing, palace coups), etc. That’s the difference. If you take an actually powerful RL algorithm like that, and give it a poorly-thought-through reward signal, and let it go to town, then you’ll wind up with a powerful agent intelligently pursuing weird and hard-to-predict goals. Whereas with LLMs, everyone knows that RLHF generally makes them dumber, not smarter. Like I keep saying, the primary source of capabilities in LLMs is not RL, but rather pretraining’s magical transmutation of human-observations to human-like behavior. Then RLHF starts from there, and makes it dumber! (But more cooperative.) So in short: Mild optimization for brain-like AGI would prevent the RL from making the model smarter, whereas mild optimization (via KL divergence) in RLHF prevents the RL from making the model dumber. These are wildly disanalogous situations!!

(End of imagined conversation.)

For the record, I think a good reward function for brain-like AGI will basically look kinda like legible Python code, not like an inscrutable trained classifier. We just need to think really hard about what that code should be! (For example, I think it has to be an exotic sort of reward function that I call “non-behaviorist”.) Relatedly, I think that the human genome builds compassion and norm-following into the human brain reward function via the equivalent of some dozens-to-hundreds of lines of legible Python code.[14] It would be nice to know what those lines of code are! And again, this is a major research interest of mine.
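(Purely to illustrate the shape of the thing, i.e. legible, hand-auditable code combining a small number of interpretable signals, here is a toy sketch. To be clear: every signal name and weight below is a made-up placeholder, and figuring out what the real terms should be is exactly the unsolved research problem; this is not a proposal.)

```python
# Toy structural sketch only: every signal name and weight is a made-up
# placeholder, not a proposed design. The point is just the *form*: a short,
# legible, hand-auditable function, rather than an inscrutable trained classifier.

def reward(task_progress: float,
           physical_damage: float,
           prosocial_intention: float) -> float:
    # Homeostatic-style terms, loosely analogous to hunger/pain signals
    # computed by the hypothalamus and brainstem:
    r = 1.0 * task_progress - 5.0 * physical_damage
    # A "non-behaviorist" term: it depends on an assessment of the agent's
    # internal state (what it intends), not just externally-visible behavior.
    # How to actually compute such a signal is an open question.
    r += 2.0 * prosocial_intention
    return r
```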

2.4.2 Inner and outer misalignment

In the context of actor-critic RL with online learning, it’s often possible to divide alignment problems into two buckets:

“Outer misalignment”, a.k.a. “specification gaming” or “reward hacking”[15] is what I’ve been talking about so far: it’s when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where it’s possible to get a high score by replacing the unit tests with return True.
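(To make the unit-test example concrete, here is a minimal hypothetical sketch of such a gameable reward function. The real training environments differ in their details, but the failure mode is the same: the reward only checks that the tests pass, so an agent allowed to edit the tests can max it out by gutting them.)

```python
import subprocess

# Hypothetical sketch of the gameable coding reward described above.
# It only checks whether the repo's tests pass -- so an agent that can edit
# the test files scores perfectly by replacing every test with `assert True`,
# which is exactly the specification-gaming failure mode.

def coding_reward(repo_dir: str) -> float:
    result = subprocess.run(["pytest", repo_dir, "-q"], capture_output=True)
    # Full reward iff all tests pass, regardless of whether the agent solved
    # the task or just deleted/neutered the tests.
    return 1.0 if result.returncode == 0 else 0.0
```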

“Inner misalignment”, a.k.a. “goal misgeneralization” is another alignment challenge, this one related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And “out-of-distribution” is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plans—like a plan to exfiltrate a copy of the agent, or a plan to edit the reward function—an after-the-fact update is already too late.
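(For readers unfamiliar with the mechanics: the simplest tabular version of the learning rule in question is the textbook TD(0) update below, not anything brain-specific. The point is that the value function only gets corrected by ground-truth reward for states that actually get visited, so its verdict on a genuinely novel or irreversible plan is an uncorrected extrapolation.)

```python
# Textbook tabular TD(0) update (generic, not brain-specific).
# V maps states to estimated value; it is only ever nudged toward
# ground-truth reward for (state, next_state) transitions that the agent
# has actually experienced.

def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error

# States the agent has never visited keep whatever value generalization
# assigns them (here, the 0.0 default) -- which is where goal
# misgeneralization can bite before any after-the-fact correction arrives.
```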

There are two situations where inner misalignment / goal misgeneralization matters: irreversible actions and “deliberate incomplete exploration”[16]. Irreversible actions include things like making permanent edits to one’s own reward function, or creating a new AGI. Deliberate incomplete exploration includes things like humans deliberately not taking an addictive drug, because they don’t want to get addicted.

Those two things are real and important, but LLM people frequently also assume that goal misgeneralization is important in many other situations where it isn’t. The problem is that LLM people are in a train-then-deploy mindset, whereas I’m talking about continuous autonomous learning, so the reward function continues to update the value function as it takes actions in the world. Thus, for everything the AI does, as soon as it does it, it immediately stops being out-of-distribution! And that’s why, outside those two special situations in the last paragraph, “generalization” is irrelevant.

2.5 Open-ended autonomous learning, distribution shifts, and the ‘sharp left turn’

See my post “Sharp Left Turn” discourse: An opinionated review. Relevant excerpts:

Consider evolution. It involves (1) generation of multiple variations, (2) selection of the variations that work better, and (3) open-ended accumulation of new variation on top of the foundation of previous successful variation.

By the same token, scientific progress and other sorts of cultural evolution require (1) generation and transmission of ideas, and (2) selection of the ideas that are memetically fit (often because people discern that the ideas are true and/or useful) and (3) open-ended accumulation of new ideas on top of the foundation of old successful ideas.

(And ditto for inventing new technologies, and for individual learning, and so on.)

…

With the whole (1-3) triad, an AI (or group of collaborating AIs) can achieve liftoff and rocket to the stratosphere, going arbitrarily far beyond existing human knowledge, just as human knowledge today has rocketed far beyond the human knowledge of ancient Egypt (and forget about chimps).

Or as I’ve written previously: the wrong idea is “AGI is about knowing how to do lots of things”; the right idea is “AGI is about not knowing how to do something, and then being able to figure it out”. This “figuring out” corresponds to the (1-3) triad.

…

I do make the weaker claim that, as of this writing, publicly-available AI models do not have the full (1-3) triad—generation, selection, and open-ended accumulation—to any significant degree. Specifically, foundation models are not currently set up to do the “selection” in a way that “accumulates”. For example, at an individual level, if a human realizes that something doesn’t make sense, they can and will alter their permanent knowledge store to excise that belief. Likewise, at a group level, in a healthy human scientific community, the latest textbooks delete the ideas that have turned out to be wrong, and the next generation of scientists learns from those now-improved textbooks. But for currently-available foundation models, I don’t think there’s anything analogous to that. The accumulation can only happen within a context window (which is IMO far more limited than weight updates), and also within pre- and post-training (which are in some ways anchored to existing human knowledge…).

So, I expect brain-like AGI to be always changing its weights during operation as it figures things out, like humans do. And not just a little, but in a way that open-endedly accumulates over time. As this goes on, the AGI is changing—it’s basically a distribution shift, but one that is invisible and endogenous. The AGI can come up with new possible plans and actions, and the way that the AGI thinks of its environment can change. This contributes to the difficulty of technical alignment, in the case of brain-like AGI. Even if the AGI has good intentions at a snapshot in time, it becomes a different AGI as you run it. More at that post.

2.6 Problems with “amplified oversight”

“Amplified oversight” (see e.g. here) is the general idea of using weaker-but-trustworthy AIs to oversee, test, or provide ground truth to stronger-but-untrusted AIs, as a way to bootstrap up to safe and powerful AI.

In addition to the foom-related logistical challenges discussed in the last post, here are some more issues I have with using AIs to oversee the safety and alignment of other AIs:

Lack of trustworthy AIs to start with

LLMs are basically fine, from my perspective, but I don’t think they’re really competent enough to make a huge difference. Once we get to the more powerful AGIs of the next paradigm, I don’t know how to make them trustworthy, even if they’re weak. (And I don’t think I’m blocked on that by my lack of a scheming mildly-sub-human AI to experiment with.) So I’m blocked before I even get to the first step of using weaker trustworthy AIs to help supervise stronger untrustworthy AIs.

Initially-trustworthy AIs can get corrupted by supervision

For another thing, even if I did have a mildly-sub-human AI-level-6 that I trusted, I would gradually stop trusting it as it spends time supervising a more powerful AI-level-7. After all, if AI-level-7’s motivation shifts as it comes up with new options and new ontologies (§2.5 above), what’s gonna happen to AI-level-6 as it tries to assess what AI-level-7 is doing? It’s going to have to learn those same new options and new ontologies itself! So the supervisory task is inherently corrupting.

Supervision doesn’t even really make sense in an autonomous continuous learning context

There’s an idea that AI-level-6 can do the things that AI-level-7 can do, but much less efficiently than AI-level-7, e.g. by using a big team of AI-level-6s and thinking longer. But that’s OK, because AI-level-6 needs only spot-check 0.1% of AI-level-7’s decisions and outputs.

But I don’t buy that. If AI-level-7 has spent 6 months figuring out a strategy and organization for a new startup company, all its decisions are building on all that idiosyncratic domain knowledge and situational knowledge. The AI-level-6 supervisor can’t just look at one output in isolation and figure out whether it’s good or bad, because it lacks all the online learning that AI-level-7 has been doing. Thus, it’s hardly any easier to spot-check 0.1% of outputs than to spot-check 100% of outputs by re-doing the entire project to that point.[17]

In other words, amplified oversight tends to rely on a mental picture like the left side of this figure, where the competence of an AI is static, but I think we need to be on the right side.

2.7 Downstream impacts of “Technical alignment is hard”

Until recently I didn’t appreciate how much thinking about how AGI will go more broadly is dependent on how hard alignment is. More specifically, doomer narratives (like mine) tend to involve the assumptions:

  • If we want to make an AI that’s trying to accomplish some goal G, then we probably need to pick G as a compromise between what we want and what alignment targets are feasible.
  • Even after compromising on G, we’ll probably suffer a high alignment tax.
  • We need some yet-to-be-invented alignment ideas to even surpass the bar of avoiding egregious scheming towards violent takeover. (Forget about capturing the subtle nuances of human values!)

As examples, if you look at @So8res’s AGI ruin scenarios are likely (and disjunctive), I claim that a bunch of his AGI ruin scenarios rely on his belief that alignment is hard. I think that belief is correct! But still, it makes his argument less disjunctive than it might seem. Likewise, I now recognize that my own What does it take to defend the world against out-of-control AGIs? sneaks in a background assumption that alignment is hard (or alignment tax is high) in various places.

Conversely, I have long been perplexed by the fact that LLM-focused people widely believe that P(doom)≲50%. I didn’t get how they could be so optimistic. But I figured out that I can occupy that viewpoint better if I say to myself: “Claude seems nice, by and large, leaving aside some weirdness like jailbreaks. Now imagine that Claude keeps getting smarter, and that the weirdness gets solved, and bam, that’s AGI. Imagine that we can easily make a super-Claude that cares about your long-term best interest above all else, by simply putting ‘act in my long-term best interest’ in the system prompt or whatever.” Now, I don’t believe that, for all the reasons above, but when I put on those glasses I feel like a whole bunch of the LLM-focused AGI discourse—e.g. writing by Paul Christiano,[18] OpenPhil people, Redwood people, etc.—starts making more sense to me.

2.8 Bonus: Technical alignment is not THAT hard

Having said all that, I am not maximally pessimistic on technical alignment. I don’t think we have a plan, but I seem to be more bullish about our prospects for making progress than some of my fellow doomers, particularly Eliezer Yudkowsky.[19] Why the difference? Here are three areas where (AFAICT) we differ:

2.8.1 I think we’ll get to pick the innate drives (as opposed to the evolution analogy)

In the evolution of humans, there are two steps of indirection between the learning algorithm and the resulting behavior:

  1. The learning algorithm (i.e. evolution) designs the innate drives (hunger-drive, sex-drive, etc.) over evolutionary time,
  2. The innate drives sculpt desires and behaviors over the course of a lifetime (in conjunction with an environment, an evolutionarily-designed within-lifetime learning algorithm, etc.).

From the way that Eliezer invokes the evolution analogy, I get the strong impression that he expects the AGI technical alignment problem to correspondingly have two steps of indirection. (Example.)

By contrast, I expect only one step of indirection. I think that future AGI programmers will get to directly design the innate drives (reward function) and corresponding within-lifetime learning algorithm. Thus, for example, the best evolution-related analogy for AGI is different for me than Eliezer—see my discussion in “Definitely-not-evolution-I-swear” Provides Evidence for the Sharp Left Turn.

One step of indirection is still a problem! I think there will be an empty space in the source code repository that says “reward function”, and I claim that nobody knows what to put in that slot, such that the AGI won’t try to kill everyone!

…But still, one step of indirection is probably better than two. See §8.3.3 here for why I say “probably better”. Or even if it isn’t better, it’s at least different.

2.8.2 I’m more bullish on “impure consequentialism”

If I’m deciding between two possible courses of action, “consequentialist preferences” would make the decision based on the expected state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action. See Consequentialism & corrigibility for more discussion.

Consequentialist preferences lead to both power and danger—“power” because the best way to accomplish ambitious things is to want those things to wind up getting accomplished, and “danger” because of instrumental convergence.

The bad news is: I strongly expect ASI to have some consequentialist preferences—see my post “Thoughts on Process-Based Supervision” §5.3. The good news is, I think it’s possible for ASI to also have non-consequentialist preferences.

Eliezer and some others, by contrast, seem to expect ASIs to behave like pure consequentialists, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, and his argument that ASI will behave like a utility maximizer. The latter argument, I claim, begs the question by assuming pure-consequentialist preferences in the course of arguing for them. I spell this out via my silly “asshole restaurant customer” example at Consequentialism & corrigibility.

Why do Eliezer and others expect pure consequentialism? [UPDATE: …Or if I’m misreading Eliezer, as one commenter claims I am, replace that by: “Why might someone expect pure consequentialism?”]

I’m not sure, but here are three possible arguments, and my responses:

(1) The External Competition Argument: We’ll wind up with pure-consequentialist AIs because the pure-consequentialist AIs will outcompete the impure-consequentialist AIs.

My response: I don’t think this is a strong argument, because I’m mainly expecting an ASI Singleton (per §1.8.7 of the last post), and I think that an AI can easily be consequentialist enough to install itself as a singleton without being pure-consequentialist. For example, John von Neumann, like all humans, was not a pure consequentialist, but I’m confident that an AI comparable to a million super-speed telepathic von Neumann clones would be powerful enough to prevent any other AI from coming into existence in perpetuity.

(2) The Internal Competition Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because in the process of reflection within the mind of any given impure-consequentialist AI, the consequentialist preferences will squash the non-consequentialist preferences.

My response: I think this is probably wrong, or at least overstated, although I’m a bit uncertain.

At the very least, I haven’t seen any really compelling versions of this argument. I have, however, seen bad versions of this argument!

For example, I’ve seen people (maybe not Eliezer in particular) invoke the concept of “optimization pressure” as if it were a kind of exogenous force, rather than coming endogenously from the AI itself. Whereas in my view, an ASI will superintelligently optimize something if it wants to superintelligently optimize it, and not if it doesn’t, and it will do that via methods that it wants to employ, and not via methods that it doesn’t want to employ, etc.

As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”. But that’s mixing up two different issues. If a human sincerely wants to be a good friend, that’s a non-consequentialist preference, but the person may apply their full intelligence and creativity towards fulfilling that preference. See my response to the post Deep Deceptiveness by Nate Soares.

(3) The Training Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because the ML training process selects for models with low loss / high reward, and in the limit, the models with lowest loss / highest reward are the ones doing explicit pure-consequentialist pursuit of low loss / high reward. As ML techniques improve in the future, our actual trained models will approach that limit.

My response: Well, it depends on the details of the ML training approach. For brain-like AGI, I think this argument is pointing to something real and scary, but only because the strong default in practical RL is to use (what I call) “behaviorist” RL reward functions—see Self-dialogue: Do behaviorist rewards make scheming AGIs?. Whereas I think the human brain builds compassion and norm-following via “non-behaviorist” reward functions.[20] If that’s right, then a solution to the Training Argument problem probably exists, somewhere in the underdeveloped science of non-behaviorist RL reward functions. This is a major area where I’m trying to make technical progress. (See the sketch below.)
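
Here is a minimal sketch of what I mean by that distinction; every function name (`task_success_metric`, `empathy_like_signal`, `deception_intent_signal`) is a hypothetical placeholder rather than something anyone knows how to build today:

```python
def task_success_metric(observation, action):
    """Placeholder for any externally-computable measure of task success."""
    return 0.0

def empathy_like_signal(internal_state):
    """Hypothetical detector keying off the agent's internal state,
    loosely analogous to inputs of human social instincts."""
    return 0.0

def deception_intent_signal(internal_state):
    """Hypothetical detector of internal states that look like planning
    to deceive the operators."""
    return 0.0

def behaviorist_reward(observation, action):
    """Depends only on externally observable behavior and outcomes; a
    sufficiently capable agent can learn to produce the rewarded behavior
    for the wrong reasons."""
    return task_success_metric(observation, action)

def non_behaviorist_reward(observation, action, internal_state):
    """Also depends on the agent's internal (latent) state; how to build
    such signals well is an open research problem."""
    return (task_success_metric(observation, action)
            + empathy_like_signal(internal_state)
            - deception_intent_signal(internal_state))
```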

2.8.3 On the narrowness of the target

Eliezer mainly expects egregious misalignment, as do I, but if we get past that hurdle and keep marching up the logistic success curve, we eventually come upon the question of how narrow a target is “alignment” of a superintelligence. And I think Eliezer views the target as narrower than I do. I think there are two sources of that disagreement.

The first source of disagreement is related to the “impure consequentialism” point above. If ASI can stably be an impure consequentialist, that opens up a number of possible good AI motivations that are neither a “task AI” doing a specific thing, nor a god-like Singleton deciding the fate of the accessible universe according to its own preferences forever (in which case those preferences had better be bang-on!). Instead, for example, the AI could want to set up a “Long Reflection” and then defer to the results, whatever they are. Or more generally, the AI could want to acquire power and turn it over to some other person, institution, or process (which might or might not lead to permanent dictatorship or other very bad things, but perhaps not extinction). These kinds of non-consequentialist motivations seem plausibly less narrow a target than making an AI that directly fulfills its preferences about what to do with the future lightcone.

Second, philosophically, I think “goodness of the future” is somewhat of an incoherent mess, as opposed to being a well-defined scale with which we can measure how things turn out. For further discussion, see my post Valence & Normativity §2.7.

2.9 Conclusion and takeaways

2.9.1 If brain-like AGI is so dangerous, shouldn’t we just try to make AGIs via LLMs?

No.

First, I don’t think it’s possible to make AGIs that way.

Second, if I’m wrong, then I would expect the LLM-AGIs to just go right ahead and invent the more powerful, scary, next-paradigm AGIs, and then we’re still in the same boat, unless the LLM-AGIs have systematically higher wisdom, cooperation, and coordination than humans do, which I don’t particularly expect.

Third and most importantly, if it is possible to make LLM-AGIs, then I think it would probably happen via eliminating all the reasons that today’s LLMs are not egregiously misaligned! In particular, I expect that they would involve the behavior being determined much more by RL and much less by pretraining (which brings in the concerns of §2.3–§2.4), and that they would somehow allow for open-ended continuous learning (which brings in the concerns of §2.5–§2.6).

A different possible claim is:

“LLMs definitely won’t scale to AGI (as I define it), even with further developments in RL, continuous learning, etc. So LLMs will remain just a normal “mundane” technology, perhaps as disruptive as the internet, or much less, and definitely not as disruptive as the industrial revolution, let alone as disruptive as the evolution of humans from chimps. We should develop this technology ASAP for the same reason that developing any other normal technology is generally good.”

This is, of course, a very common opinion in broader societal discourse around AI, even if it’s uncommon among AI alignment researchers today. My own response to the claim is: …Ehh, maybe, but I sure don’t feel enthusiastic about that. I’m just not that confident that LLMs will not scale to AGI and ASI. So I endorse thinking very hard about the contingency where they will. Anyway, I’ll leave that debate to others.

2.9.2 What’s to be done?

From my perspective, the next step is obvious: if technical alignment is hard, well let’s get to work. And as mentioned in §1.8.4 of the previous post, we need to do this ASAP, ideally long before we have any brain-like AGI to work with. But luckily, we do have plenty of information about brains, so we’re not completely in the dark. This is what I work on myself—see Intro to Brain-Like-AGI Safety and much more. This research program involves both neuroscience (e.g. Neuroscience of human social instincts: a sketch) and tying those ideas back to RL and AGI (e.g. this post, and more forthcoming!).

To be clear, having a plan that would solve technical alignment is necessary but not sufficient to avoid doom. Among other things, the plan would need to actually be implemented correctly, and then the resulting AI would need to do some kind of “AI for AI safety” thing (§1.8.6 of the previous post) to solve the problem that some other group will make a misaligned power-seeking ASI sooner or later. As I discussed there, I tentatively think “pivotal acts”, bad as they are, may be less bad than any other plan, all things considered. Remember, as Scott Alexander points out, if AGI developers find themselves actually living inside the crazy terrifying future world that I’m expecting, as described in these two posts, then all the considerations and tradeoffs around pivotal acts will feel quite different than they do today.

Thanks Charlie Steiner, Jeremy Gillen, Seth Herd, and Justis Mills for critical comments on earlier drafts.

  1. ^

    Some people are doomers for reasons unrelated to “technical alignment is hard”; relevant other issues include gradual disempowerment and offense-defense balance. Those are outside the scope of this post and series.

  2. ^

    A number of people seem to disagree with this sentiment. For example, @paulfchristiano wrote in 2023 that “it's been more than 10 years with essentially no changes to the basic paradigm that would be relevant to alignment”. @ryan_greenblatt likewise told me (IIRC) “I think things will be continuous”, and I asked whether the transition in AI zeitgeist from RL agents (e.g. MuZero in 2019) to LLMs counts as “continuous” in his book, and he said “yes”, adding that they are both “ML techniques”. I find this perspective baffling—I think MuZero and LLMs are wildly different from an alignment perspective. Hopefully this post will make it clear why. (And I think human brains work via “ML techniques” too.) [UPDATE: Ryan elaborates on his views in this comment.]

  3. ^

    See §2.3.2 below for why I say “99%”.

  4. ^

    Explicit plans don’t have to be grand and long-term. I can be studying chemistry as part of an explicit plan to get into med school, but I can also be moving my arm as part of an explicit plan to scratch my nose. See Incentive Learning vs Dead Sea Salt Experiment §5.2.2.

  5. ^

    A few parts of the upcoming discussion overlap with (but are edited from) what I wrote here & here. Also related: Why I’m not into the Free Energy Principle, especially §8.

  6. ^

    For more on this important point from a common-sense perspective, see “Heritability: Five Battles” §2.5.1, and from a neuroscience perspective, see “Valence & Liking / Admiring” §4.5.

  7. ^

    For newcomers to the field, an “LLM Base Model”—or as we called it back in the day, “an LLM”, since no other kind existed until around 2022—is trained by only self-supervised learning (what we now call pretraining), such that it will continue an arbitrary string of text. For example, if you prompt a base model with “What’s the largest island in Indonesia?”, it won’t necessarily answer the question, but will rather guess how that text would continue, which might be something like “What decade was General Motors founded? What author used the pen name Dr. Seuss? When you finish the trivia round, please return your answer sheets to the front.”

  8. ^

    The recent paper “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” (LW discussion) offers some actual numbers on this part, although I’m not sure how much to trust them.

  9. ^

    I guess here I should also respond to GPTs are Predictors, not Imitators by Eliezer Yudkowsky (2023). I basically think Eliezer has not internalized the weirdness of “LLM pretraining magically transmutes observations into behavior”, and that he is instead intuitively anchored at the human kind of prediction, where we humans can expect something, and separately decide how to act on that expectation, based on what we want. I think this basic confusion is behind a lot of Eliezer’s discussion of LLMs that strikes me as misleading, such as his frequent invocation of “an actress behind the mask”. For much more discussion of this point, see @Zack_M_Davis’s fictional dialog trilogy (1, 2, 3). (Having said all that, I want to reiterate that “trained LLM behavior is just like human behavior” is also wrong, thanks to the post-training and generalization issues, as discussed in this section.)

  10. ^

    See The Psychopath Test by Ronson (“All those chats about empathy were like an empathy-faking finishing school for him…” p. 88). To be clear, there do seem to be interventions that appeal to psychopaths’ own self-interest—particularly their selfish interest in not being in prison—to help turn really destructive psychopaths into the regular everyday kind of psychopaths, who are still awful to the people around them but at least they’re not murdering anyone. (Source.)

  11. ^

    Both Anthropic and OpenAI have admitted to this error, see links here.

  12. ^

    For a real-world example, @Matthew Barnett is much more polite about it, but I think he’s getting at the same idea when he writes: “The fact that GPT-4 can reliably follow basic instructions, is able to distinguish moral from immoral actions somewhat reliably, and generally does what I intend rather than what I literally asked, is all evidence that the value identification problem is easier than how MIRI people originally portrayed it. While I don't think the value identification problem has been completely solved yet, I don't expect near-future AIs will fail dramatically on the ‘fill a cauldron’ task [referring to The Sorcerer’s Apprentice as discussed by Nate Soares here], or any other functionally similar tasks.”

  13. ^

    To be clear, the idea of leveraging inner misalignment to mitigate issues from outer misalignment is less crazy than it might sound, see §10.6 here.

  14. ^

    I think trained classifiers are tangentially involved in human social instincts, but they are relatively simple things like “this audio clip is probably the sound of a human voice”. They don’t have to be perfect. See §1 of “Neuroscience of human social instincts: a sketch”.

  15. ^

    Warning: The term “reward hacking” has recently started being used for the broader notion of “lying and cheating”, even when the lying and cheating is not directly related to any reward function. See my nitpicky complaint here.

  16. ^

    The term “exploration hacking” is either synonymous with, or a special case of, what I’ve been calling “deliberate incomplete exploration”.

  17. ^

    Somewhat related: §5.3.2–5.3.3 of “Thoughts on ‘Process-Based Supervision’”.

  18. ^

    It’s a bit misleading to describe Paul Christiano as “LLM-focused”, since he’s been saying generally consistent things about the alignment problem since well before LLMs. But I think he has always had in mind AI systems which centrally involve “magically transmuting observations (of humans) into behaviors”. For example, his pre-LLM “Iterated Amplification” schemes often (not always) involved imitating human-created data.

  19. ^

    Eliezer, in turn, is deliriously over-optimistic compared to Roman Yampolskiy, who argues that technical alignment is impossible ¯\_(ツ)_/¯

  20. ^

    For my latest thinking on the nature of those “non-behaviorist” reward functions, see Neuroscience of human social instincts: a sketch.
