Foom & Doom 2: Technical alignment is hard

by Steven Byrnes
23rd Jun 2025
AI Alignment Forum
54 comments, sorted by top scoring
[-]ryan_greenblatt19dΩ4125

@ryan_greenblatt likewise told me (IIRC) “I think things will be continuous”, and I asked whether the transition in AI zeitgeist from RL agents (e.g. MuZero in 2019) to LLMs counts as “continuous” in his book, and he said “yes”, adding that they are both “ML techniques”. I find this perspective baffling—I think MuZero and LLMs are wildly different from an alignment perspective. Hopefully this post will make it clear why. (And I think human brains work via “ML techniques” too.)

I don't think this is an accurate paraphrase of my perspective.

My view is:

  • Both MuZero and LLMs are within an ML paradigm, and I expect that many/most of the techniques I think about transfer to AGI made using either style of methods.
  • I think that you can continuously transition between MuZero and LLMs and I expect that if a MuZero-like paradigm happens, this is probably what will happen. (As in, you'll use LLMs as a component in the MuZero approach or similar.)
  • I don't expect that a transition from the current LLM paradigm to the MuZero-style paradigm would result in massively discontinuous takeoff speeds (as in, I think takeoff speeds are continuous) because before you have a full AGI from the MuZero style approach, you'll have a worse AI from the MuZero approach. See [this comment for more discussion](https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/foom-and-doom-1-brain-in-a-box-in-a-basement?commentId=mZKP2XY82zfveg45B). This is even aside from continuously transitioning between the two.
  • In practice, I think that the actual historical transition from MuZero (or other pure RL agents) to LLMs didn't cause a huge trend break or discontinuity in relevant downstream metrics (e.g. benchmark scores).
  • I agree that in practice MuZero and LLMs weren't developed continuously. I would say that this is because the MuZero approach didn't end up being that useful for any of the tasks we cared about and was outcompeted pretty dramatically.
  • I agree these can be very different from an alignment perspective, but things like RLHF, interpretability, and control seem to me like they straightforwardly can be transferred.
Reply
[-]Aprillion8d10

hm, as a non-expert onlooker, I found the paraphrase pretty accurate.. for sure it sounds more reasonable in your own words here compared to the oversimplified summary (so thank you for clarification!), but as far as accuracy of summaries go, this one was top tier IMHO (..have you seen the stuff that LLMs produce?!)

Reply
[-]ryan_greenblatt8d90

I agree that my view is that they can count as continuous (though the exact definition of the word continuous can matter!), but then the statement "I find this perspective baffling—I think MuZero and LLMs are wildly different from an alignment perspective" isn't really related to this from my perspective. Like things can be continuous (from a transition or takeoff speeds perspective) and still differ substantially in some important respects!

Reply
[-]Aprillion8d10

I somehow completely agree with both of your perspectives, have you tried to ban the word "continuous" in your discussions yet? (on the other hand, I don't think it should be a crux, probably just ambiguous meaning like "sound" in the "when a tree falls" thingy ... but I would be curious if you would be able to agree on the 2 non-controversial meanings between the 2 of you)

It reminds me of stories about gradualism / saltationism debate in evolutionary biology after gradualism won and before the idea of punctuated equilibrium... Parents and children are pretty discrete units, but gene pools over millions of years are pretty continuous from the perspective of an observer long long time later who is good at spotting low-frequency patterns ¯\_(ツ)_/¯

For a researcher, even GPT 3.5 to 4 might have been a big jump in terms of compute budget approval process (and/or losing a job from disbanding a department). And the same event on a benchmark might look smooth - throughout multiple big architecture changes a la the charts that illustrate Moore's law - the sweat and blood of thousands of engineers seems kinda continuous if you squint enough.

And what even is "continuous" - general relativity is a continuous theory, but my phone calculates my GPS coordinates with numerical methods, time dilation from gravity field/the geoid shape is just approximated and nanosecond(-ish) precision is good enough to pin me down as much as I want (TBH probably more precision than I would choose myself as a compromise with my battery life). Real numbers are continuous, but they are not computable (I mean in practice in our own universe, I don't care about philosophical possibilities), so we approximate them with a finite set of kinda shitty rational-ish numbers for which even 0.1 + 0.2 == 0.3 is false (in many languages, including JS in a browser console and in Python)..
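For concreteness, this is how the floating-point example plays out in Python (the same thing happens in a JS console); a minimal demonstration, with `math.isclose` as the usual workaround:

```python
# 0.1, 0.2, and 0.3 have no exact binary floating-point representation,
# so the "obvious" equality fails.
a = 0.1 + 0.2
print(a)         # 0.30000000000000004
print(a == 0.3)  # False

# The usual workaround is comparison with an explicit tolerance.
import math
print(math.isclose(a, 0.3))  # True
```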

Some stuff will work "the same" in the new paradigm, some will be "different" - does it matter whether we call it (dis)continuous, or do we know already what to predict in more detail?

Reply
[-]ryan_greenblatt7d20

I somehow completely agree with both of your perspectives, have you tried to ban the word "continuous" in your discussions yet?

I agree taboo-ing is a good approach in this sort of case. Talking about "continuous" wasn't a big part of my discussion with Steve, but I agree if it was.

Reply
[-]Jeremy Gillen18dΩ5100

(Overall I like these posts in most ways, and especially appreciate the effort you put into making a model diff with your understanding of Eliezer's arguments)

Eliezer and some others, by contrast, seem to expect ASIs to behave like a pure consequentialist, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, or his argument that ASI will behave like a utility maximizer.

It feels like you're rounding off Eliezer's words in a way that removes the important subtlety. What you're doing here is guessing at the upstream generator of Eliezer's conclusions, right? As far as I can see in the links, he never actually says anything that translates to "I expect all ASI preferences to be over future outcomes"? It's not clear to me that Eliezer would disagree with "impure consequentialism".

I think you get closest to an argument that I believe with (2):

(2) The Internal Competition Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because in the process of reflection within the mind of any given impure-consequentialist AI, the consequentialist preferences will squash the non-consequentialist preferences.

Where I would say it differently, like: An AI that has a non-consequentialist preference against personally committing the act of murder won't necessarily build its successor to have the same non-consequentialist preference[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors. (And building successors is a similar process to self-modification).

As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”.

I think you're misrepresenting/misunderstanding the argument people are making here. Even when you enthusiastically apply your intelligence toward pursuing a deontological constraint (alongside other goals), you implicitly search for "loopholes" in that constraint, i.e. weird ways to achieve all of your goals that don't involve violating the constraint. To you, they aren't loopholes, they're clever ways to achieve all goals.

  1. ^

    Perhaps this feels intuitively incorrect. If so, I claim that's because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don't want to get rid of your own disgust reaction, but you're okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.

Reply
[-]Steven Byrnes16dΩ690

Thanks!

Hmm, here’s a maybe-interesting example (copied from other comment):

If an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible to me. 

What’s happening is that this example is in the category “I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors. But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).

(This is a toy example to illustrate a certain point, not a good AI motivation plan all-things-considered!)

Speaking of which, is it possible to get stability w.r.t. successors and self-modification while retaining indexicality? Maybe. I think things like “I want to be virtuous” or “I want to be a good friend” are indexical, but I think we humans kinda have an intuitive notion of “responsibility” that carries through to successors and self-modification. If I build a robot to murder you, then I didn’t pull the trigger, but I was still being a bad friend. Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno. (Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?) I dunno, I appreciate the brainstorming.

Reply
[-]Jeremy Gillen9d50

“I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors.

I agree that goals like this work well with self-modification and successors. I'd be surprised if Eliezer didn't. My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It's strawmanning. And it isn't supported by any of the links you provided. I think you must have some mistaken assumption about Eliezer's views that is leading you to infer that he believes AIs must only have preferences over the distant future. But I can't tell what it is. One guess is: to you, corrigibility only looks hard/unnatural if preferences are very strictly about the far future, and otherwise looks fairly easy.

But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).

I would still call those preferences consequentialist, since the consequences are the primary factor that determines the actions. I.e. the behaviour is complicated, but in a way that is easy to explain once you know what the behaviour is aimed at achieving. They're even approximately long-term consequentialist, since the actions are (probably?) mostly aimed at the long-term future. The strict definition you call "pure consequentialism" is a good approximation or simplification of this, under some circumstances, like when value adds up over time and therefore the future is a bigger priority than the immediate present.

No one I know has argued that AI or rational people can only care about the distant future. People spend money to visit a theme park sometimes, in spite of money being instrumentally convergent.


Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno.

Some versions of that do have loopholes, but overall I think I agree that you could get a lot of stability that way. (But as far as I can tell, the versions with fewer loopholes look more like consequence-based goals rather than rules that say which kinds of local action-sequences are good and bad).

(Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?)

Yeah this is exactly what I had an issue with in my sibling discussion with Ryan. He seems to think {integrity,honesty,loyalty} are deontological, whereas the way they are implemented in me is as a mix of consequentialist reasoning (e.g. some components are "does this person end up better off, by their own lights?", "do they understand what I'm doing and why?") and a bunch of soft rules designed to reduce the chances that I accidentally rationalise actions that are ultimately hurtful for complicated reasons that are difficult to see in the moment (e.g. "in the course of my plan, don't cross privacy boundaries that likely lead me to gain information that they might not have felt comfortable with me knowing"). But the rules aren't a primary driver of action, they are relatively weak constraints that quickly rule out bad plans (that almost always would have been bad for consequentialist reasons).

For me, it's similar when I want to be a good friend.

Reply
[-]Steven Byrnes5d40

My issue is that you claimed that Eliezer believes AIs can only have goals about the distant future, and then contrasted your own views with this. It's strawmanning. And it isn't supported by any of the links you provided.

For the record, my OP says something weaker than that—I wrote “Eliezer and some others…seem to expect ASIs to behave like a pure consequentialist, at least as a strong default…”.

Maybe this is a pointless rabbit’s hole, but I’ll try one more time to argue that Eliezer seems to have this expectation, whether implicitly or explicitly, and whether justified or not:

For example, look at Eliezer’s Coherent decisions imply consistent utilities, and then reflect on the fact that knowing that an agent is “coherent”, a.k.a. a “utility maximizer”, tells you nothing at all about its behavior, unless you make additional assumptions about the domain of its utility function (e.g. that the domain is ‘the future state of the world’). To me it seems clear that

  • Either Eliezer is making those “additional assumptions” without mentioning them in his post, which supports my claim that pure-consequentialism is (to him) a strong default;
  • Or his post is full of errors, because for example he discusses whether an AI will be “visibly to us humans shooting itself in the foot”, when in fact it’s fundamentally impossible for an external observer to know whether an agent is being incoherent / self-defeating or not, because (again) coherent utility-maximizing behaviors include absolutely every possible sequence of actions.
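To make the point in that second bullet concrete, here is a minimal toy sketch (hypothetical code, nothing from either post): for any fixed action sequence whatsoever, including one that looks like "shooting itself in the foot", there is a utility function over complete action histories that the sequence uniquely maximizes, so no observed behavior can be judged incoherent until the domain of the utility function is pinned down.

```python
# Toy illustration (hypothetical): with an unrestricted utility domain,
# every possible action sequence is the unique optimum of SOME utility function.
from itertools import product

ACTIONS = ["left", "right", "shoot_own_foot"]

def make_utility(target_history):
    """Utility over complete 3-step histories: 1 for the target, 0 otherwise."""
    def utility(history):
        return 1.0 if tuple(history) == tuple(target_history) else 0.0
    return utility

# Pick an arbitrary, seemingly self-defeating behavior...
observed = ("shoot_own_foot", "shoot_own_foot", "left")
u = make_utility(observed)

# ...and confirm it is the utility-maximizing plan among all 3-step plans.
best = max(product(ACTIONS, repeat=3), key=u)
assert best == observed
print("This 'self-defeating' behavior maximizes the constructed utility:", best)
```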
Reply
[-]Jeremy Gillen4d40

Sorry if I misrepresented you, my intended meaning matches what you wrote. I was trying to replace "pure consequentialist" with its definition to make it obvious that it's a ridiculously strong expectation that you're saying Eliezer and others have.

Yes, assumptions about the domain of the utility function are needed in order to judge its behaviour as coherent or not. Rereading Coherent decisions imply consistent utilities, Eliezer is usually clear about the assumed domain of the utility function in each thought experiment. For example, he's very clear here that you need the preferences as an assumption: 

Have we proven by pure logic that all apples have the same utility? Of course not; you can prefer some particular apples to other particular apples. But when you're done saying which things you qualitatively prefer to which other things, if you go around making tradeoffs in a way that can be viewed as not qualitatively leaving behind some things you said you wanted, we can view you as assigning coherent quantitative utilities to everything you want.

And that's one coherence theorem—among others—that can be seen as motivating the concept of utility in decision theory.

In the hospital thought experiment, he specifies the goal as an assumption:

Robert only cares about maximizing the total number of lives saved. Furthermore, we suppose for now that Robert cares about every human life equally.

In the pizza example, he doesn't specify the domain, but it's fairly obvious implicitly. In the fruit example, it's also implicit but obvious. 

There's a few paragraphs at the end of the Allais paradox section about the (very non-consequentialist) goal of feeling certain during the decision-making process. I don't get the impression from those paragraphs that Eliezer is saying that this preference is ruled out by any implicit assumption. In fact he explicitly says that this preference isn't mathematically improper. It seems he's saying this kind of preference cuts against coherence only if it's getting in the way of more valuable decisions:

'The danger of saying, "Oh, well, I attach a lot of utility to that comfortable feeling of certainty, so my choices are coherent after all" is not that it's mathematically improper to value the emotions we feel while we're deciding. Rather, by saying that the most valuable stakes are the emotions you feel during the minute you make the decision, what you're saying is, "I get a huge amount of value by making decisions however humans instinctively make their decisions, and that's much more important than the thing I'm making a decision about." This could well be true for something like buying a stuffed animal. If millions of dollars or human lives are at stake, maybe not so much.'

I think this quote in particular invalidates your statements.

There is a whole stack of assumptions[1] that Eliezer isn't explicit about in that post. It's intended to give a taste of the reasoning that gives us probability and expected utility, not the precise weakest set of assumptions required to make a coherence argument work.

I think one thing that is missing from that post are the reasons we usually do have prior knowledge of goals (among humans and for predicting advanced AI). Among humans we have good priors that heavily restrict the goal-space, plus introspection and stated preferences as additional data. For advanced AI, we can usually use usefulness (on some specified set of tasks) and generality (across a very wide range of potential obstacles) to narrow down the goal-domain. Only after this point, and with a couple of other assumptions, do we apply coherence arguments to show that it's okay to use EUM and probability.

The reason I think this is worth talking about is that I was actively confused about exactly this topic in the year or two before I joined Vivek's team. Re-reading the coherence and advanced agency cluster of Arbital posts (and a couple of comments from Nate) made me realise I had misinterpreted them. I must have thought they were intended to prove more than they do about AI risk. And this update flowed on to a few other things. Maybe partially because the next time I read Eliezer as saying something that seemed unreasonably strong I tried to steelman it and found a nearby reasonable meaning. And also because I had a clearer idea of the space of agents that are "allowed", and this was useful for interpreting other arguments.

I'd be happy to call if that's a more convenient way to talk, although it is nice to do this publicly. Also completely happy to stop talking about this if you aren't interested, since I think your object-level beliefs about this ~match mine ("impure consequentialism" is expected of advanced AI).

  1. ^

    E.g. I think we need a bunch of extra structure about self-modification to apply anything like a money pump argument to resolute/updateless agents. I think we need some non-trivial arguments and an assumption to make the VNM continuity money pump work. I remember there being some assumption that went into complete class that I thought was non-obvious, but I've forgotten exactly what it was. The post is very clear that it's just giving a few tastes of the kind of reasoning needed to pin down utility and probability as a reasonable model of advanced agents.

Reply
[-]Steven Byrnes4d50

Probably not worth the time to further discuss what certain other people do or don’t believe, as opposed to what’s true. I remain unconvinced but added a caveat to the article just to be safe:

Why do Eliezer and others expect pure consequentialism? [UPDATE: …Or if I’m misreading Eliezer, as one commenter claims I am, replace that by: “Why might someone expect pure consequentialism?”]

Reply1
[-]ryan_greenblatt17dΩ672

Where I would say it differently, like: An AI that has a non-consequentialist preference against personally committing the act of murder won't necessarily build its successor to have the same non-consequentialist preference[1], whereas an AI that has a consequentialist preference for more human lives will necessarily build its successor to also want more human lives. Non-consequentialist preferences need extra machinery in order to be passed on to successors.

[...]

Perhaps this feels intuitively incorrect. If so, I claim that's because your preferences against committing murder are supported by a bunch of consequentialist preferences for avoiding human suffering and death. A real non-consequentialist preference is more like the disgust reaction to e.g. picking up insects. Maybe you don't want to get rid of your own disgust reaction, but you're okay finding (or building) someone else to pick up insects for you if that helps you achieve your goals. And if it became a barrier to achieving your other goals, maybe you would endorse getting rid of your disgust reaction.

Hmm, imagine we replace "disgust" with "integrity". As in, imagine that I'm someone who is strongly into the terminal moral preference of being an honest and high integrity person. I also value loyalty and pointing out ways in which my intentions might differ from what someone wants. Then, someone hires me (as an AI let's say) and tasks me with building a successor. They also instruct me: 'Make sure the AI successor you build is high integrity and avoids disempowering humans. Also, generalize the notion of "integrity, loyalty, and disempowerment" as needed to avoid these things breaking down under optimization pressure (and get your successors to do the same). And, let me know if you won't actually do a good job following these instructions, e.g. because you aren't actually that well aligned. Like, tell me if you wouldn't actually try hard and please be seriously honest with me about this.'

In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.

Another way to put this is that the deontological constraints we want are like the human notions of integrity, loyalty, and honesty (and to then instruct the AI that we want these constraints propagated forward). I think an actually high integrity person/AI doesn't search for loopholes or want to search for loopholes. And the notion of "not actually loopholes" generalizes between different people and AIs I'd claim. (Because notions like "the humans remained in control" and "the AIs stayed loyal" are actually relatively natural and can be generalized.)

I'm not claiming you can necessarily instill these (robust and terminal) deontological preferences, but I am disputing they are similar to non-reflectively endorsed (potentially non-terminal) deontological constraints or urges like disgust. (I don't think disgust is an example of a deontological constraint, it's just an obviously unendorsed physical impulse!)

Reply
[-]Jeremy Gillen17dΩ130

In this situation, I think a reasonable person who actually values integrity in this way (we could name some names) would be pretty reasonable or would at least note that they wouldn't robustly pursue the interests of the developer. That's not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences due to these instructions.

Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.

I think an actually high integrity person/AI doesn't search for loopholes or want to search for loopholes.

If I try to condition on the assumptions that you're using (which I think include a central part of the AIs preferences having a true-but-maybe-approximate pointer toward the instruction-givers preferences, and also involves a desire to defer or at least flag relevant preference differences) then I agree that such an AI would not search for loopholes on the object-level.

I'm not sure whether you missed the straightforward point I was trying to make about searching for loopholes, or whether you understand it and are trying to point at a more relevant-to-your-models scenario? The straightforward point was that preference-like objects need to be robust to search. Your response reads as "imagine we have a bunch of higher-level-preferences and protective machinery that already are robust to optimisation, then on the object level these can reduce the need for robustness". This is locally valid. 

I don't think its relevant because we don't know how to build those higher-level-preferences and protective machinery in a way that is itself very robust to the OOD push that comes from scaling up intelligence, learning, self-correcting biases, and increased option-space.

(I don't think disgust is an example of a deontological constraint, it's just an obviously unendorsed physical impulse!)

Some people reflectively endorse their own disgust at picking up insects, and wouldn't remove it if given the option. I wanted an example of a pure non-consequentialist preference, and I stand by it as a good example.

deontological constraints we want are like the human notions of integrity, loyalty, and honesty

Probably we agree about this, but for the sake of flagging potential sources of miscommunication: if I think about the machinery involved in implementing these "deontological" constraints, there's a lot of consequentialist machinery involved (but it's mostly shorter-term and more local than normal consequentialist preferences).

Reply
[-]ryan_greenblatt17d*Ω220

I was trying to argue that the most natural deontology-style preferences we'd aim for are relatively stable if we actually instill them. So, I think the right analogy is that you either get integrity+loyalty+honesty in a stable way, some bastardized version of them such that it isn't in the relevant attractor basin (where the AI makes these properties more like what the human wanted), or you don't get these things at all (possibly because the AI was scheming for longer run preferences and so it faked these things).

And I don't buy that the loophole argument applies unless the relevant properties are substantially bastardized. I certainly agree that there exist deontological preferences that involve searching for loopholes, but these aren't the one people wanted. Like, I agree preferences have to be robust to search, but this is sort of straightforwardly true if the way integrity is implemented is at all kinda similar to how humans implement it.

Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation again comes down to "you get scheming", "your behavioural tests look bad, so you try again", "your behavioural tests look fine, and you didn't have scheming, so you probably basically got the properties you wanted if you were somewhat careful".

As in, I think we can at least test for the higher level preferences we want in the absence of scheming. (In a way that implies they are probably pretty robust given some carefulness, though I think the chance of things going catastrophically wrong is still substantial.)

(I'm not sure if I'm communicating very clearly, but I think this is probably not worth the time to fully figure out.)


Personally, I would clearly pass on all of my reflectively endorsed deontological norms to a successor (though some of my norms are conditional on aspects of the situation like my level of intelligence and undetermined at the moment because I haven't reflected on them, which is typically undesirable for AIs). I find the idea that you would have a reflectively endorsed deontological norm (as in, you wouldn't self modify to remove it) that you wouldn't pass on to a successor bizarre: what is your future self if not a successor?

Reply
[-]Jeremy Gillen17d20

I was trying to argue that the most natural deontology-style preferences we'd aim for are relatively stable if we actually instill them.

Trivial and irrelevant though if true-obedience is part of it, since that's magic that gets you anything you can describe.

if the way integrity is implemented is at all kinda similar to how humans implement it.

How do humans implement integrity?

Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented, so from my perspective the situation comes down to "you get scheming", "your behavioural tests look bad, so you try again", "your behavioural tests look fine, and you didn't have scheming, so you probably basically got the properties you wanted if you were somewhat careful".

You're just stating that you don't expect any reflective instability, as an agent learns and thinks over time? I've heard you say this kind of thing before, but haven't heard an explanation. I'd love to hear your reasoning? In particular since it seems very different from how humans work, and intuitively surprising for any thinking machine that starts out a bit of a hacky mess like us. (I could write out an object-level argument for why reflective instability is expected, but it'd take some effort and I'd want to know that you were going to engage with it).

Reply1
[-]Eli Tyre18d95

Third and most importantly, if it is possible to make LLM-AGIs, then I think it would probably happen via eliminating all the reasons that today’s LLMs are not egregiously misaligned! In particular, I expect that they would involve the behavior being determined much more by RL and much less by pretraining (which brings in the concerns of §2.3–§2.4), and that they would somehow allow for open-ended continuous learning (which brings in the concerns of §2.5–§2.6).

So this is currently my view. I expect us to do a bunch of RL in various ways using LLM pre-training as a foundational step that gets us AIs that can choose actions that are coherent enough that we can do RL on them. Possibly we will also need various "continuous learning / long term memory / flexible-concept" techniques that don't fall straight out of the RL (though also, maybe these functions will fall straight out of enough RL, I don't know). This will indeed reintroduce all the problems of RL, and erode away the safety properties of LLMs.

BUT, doing it this way, we do get some intermediate model organisms that don't have all of the crucial capabilities of the future superintelligence, but do have most of them. And we can maybe develop alignment techniques that work well on our RL-LLMs as we gradually layer in more of the mechanisms that make them dangerous.

On this view, the "last step", where we finally put together all the pieces for a complete learning and acting agent, is pretty scary, because if we're not very careful, this will be our "first critical try", and it will be a point at which we should particularly expect that our previous techniques will break down.

And as you note, we should expect that shortly after this, the ASI-LLMs will discover more efficient ways to make ASI and the world is in trouble at that point, though in less trouble if we did a good job making competent benevolent ASI-LLMs, since they'll be able to handle the situation better than humanity would be able to.

But in my novice's opinion, this seems maybe a better path than building ASI the efficient way from scratch?

Reply
[-]Daniel Kokotajlo18dΩ680

…RL reward functions are written in code, not in natural language.

Often though they involve using LLMs or humans to make fuzzy judgment calls e.g. about what is or isn't an obedient response to an instruction.

Reply
[-]Steven Byrnes18dΩ220

My discussion in §2.4.1 is about making fuzzy judgment calls using trained classifiers, which is not exactly the same as making fuzzy judgment calls using LLMs or humans, but I think everything I wrote still applies.

Reply
[-]nostream7d70

Thanks for the detailed post. I'd like to engage with one specific aspect - the assumptions about how RL might work with scaled LLMs. I've chosen to focus on LLM architectures since that allows more grounded discussion than novel architectures; I am in the "LLMs will likely scale to ASI" camp, but much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them. (If a random guy develops sudden ASI in his basement via some novel architecture, I agree that that tends to end very poorly.)

The post series views RL as mathematically specified reward functions that are expressible in a few lines of Python, which naturally leads to genie/literal-interpretation concerns. However, present day RL is more complicated and nuanced:

  • RLHF and RLAIF operate on human preferences rather than crisp mathematical objectives
  • Labs are expanding RLVR (RL from Verifiable Rewards) beyond simple mathematical tasks to diverse domains
  • The recent Dwarkesh episode with Sholto Douglas and Trenton Bricken (May '25) discusses how labs are massively investing in RL diversity and why they reject the "RL just selects from the pretraining distribution" critique (which may apply to o1-scale compute but likely not o3 and even less so o5-scale)

We're seeing empirical progress on reward hacking:

  • Claude 3.7 exhibited observable reward hacking in deployments
  • Anthropic's response was to specifically address this, resulting in Claude 4 hacking significantly less
    --> This demonstrates both that labs have strong incentives to reduce reward hacking and that they're making concrete progress

The threat model requires additional assumptions: To get dangerous reward hacking despite these improvements, we'd need models that are situationally aware enough to selectively reward hack only in specific calculated scenarios while avoiding detection during training/evaluation. This requires much more sophistication and subtlety than the current reward hacks.

Additionally, one could imagine RL environments specifically designed to train against reward hacking behaviors, teaching models to recognize and avoid exploiting misspecified objectives. When training is diversified across many environments and objectives, systematic hacking becomes increasingly difficult. None of this definitively proves LLMs will scale safely to ASI, but it does suggest the risk is less than proposed here.

Reply
[-]Steven Byrnes7d*40
  • In §2.4.1 I talk about learned reward functions.
  • In §2.3.5 I talk about whether or not there is such a thing as “RLVR done right” that doesn’t push towards scheming. My upshot is:
    • I’m mildly skeptical (but don’t feel super-strongly) that you can do RLVR without pushing towards scheming at all.
    • I agree with you that there’s clearly room for improvement in making RLVR push towards scheming less on the margin.

much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them

If the plan is what I call “the usual agent debugging loop”, then I think we’re doomed, and it doesn’t matter at all whether this debugging loop is being run by a human or by an LLM, and it doesn’t matter at all whether the reward function being updated during this debugging loop is being updated via legible code edits versus via weight-updates within an inscrutable learned classifier.

The problem with “the usual agent debugging loop”, as described at that link above, is that more powerful future AIs will be capable of treacherous turns, and you can’t train those away by the reward function because as soon as the behavior manifests even once, it’s too late. (Obviously you can and should try honeypots, but that’s a sanity-check not a plan, see e.g. Distinguishing test from training.)

The threat model requires additional assumptions: To get dangerous reward hacking despite these improvements, we'd need models that are situationally aware enough to selectively reward hack only in specific calculated scenarios while avoiding detection during training/evaluation. This requires much more sophistication and subtlety than the current reward hacks.

As a side-note, I’m opposed to the recent growth of the term “reward hacking” as a synonym for “lying and cheating”. I’m talking about scheming and treacherous turns, not “obvious lying and cheating in deployment” which should be solvable by ordinary means just like any other obvious behavior. Anyway, current LLMs seem to already have enough situational awareness to enable treacherous turns in principle (see the Alignment Faking thing), and even if they didn’t, future AI certainly will, and future AI is what I actually care about.

Reply
[-]plex20d60

Because in brain-like AGI, the reward function is written in Python (or whatever), not in natural language.

Yup. I'd bet some people will reply with something like "why not define the reward function in natural language, like constitutional AI". I think this fails due to strong optimization finding the most convenient (for it, not us) settings of free parameters left by fuzzy statistical things like words, and if you give it a chance to feed back into the definitions via training data or do online learning etc gets totally wrecked by semantic drift.

Reply
[-]Charlie Steiner19d20

And don't you think 500 lines of Python also "fails due to" having unintended optima?

I've put "fails due to" in scare quotes because what's failing is not every possible approach, merely almost all samples from approaches we currently know how to take. If we knew how to select python code much more cleverly, suddenly it wouldn't fail anymore. And ditto for if we knew how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.

Reply11
[-]plex19d20

Oh no, almost all possible 500 lines of python are also bad.

Reply
[-]Stephen McAleese9d*Ω350

In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).

But it seems likely to me that programmers won't know what code to write for the reward function since it would be hard to encode complex human values. In Superintelligence, Nick Bostrom calls this manual approach "direct specification" of values and argues that it's naive. Instead, it seems likely to me that programmers will continue to use reward learning algorithms like RLHF where:

  1. The human programmers have a dataset of correct behaviors or a natural language description of what they want and they use this information to create a reward function or model automatically (e.g. Text2Reward).
  2. This learned reward model or generated code is used to train the policy.
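A minimal sketch of that two-step pipeline (all names here are illustrative, not the Text2Reward API; step 1 fits a reward model to pairwise human preference data with the standard Bradley-Terry loss, and step 2 is wherever that learned reward gets plugged into policy training):

```python
# Hypothetical sketch of learned-reward training (names are illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a behavior/trajectory embedding to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_reward_model(model, preferred, rejected, steps=1000, lr=1e-3):
    """Step 1: learn a reward function from pairwise preference data
    (preferred examples should score higher than rejected ones)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Step 2 (not shown): use `model(embedding)` as the reward signal for whatever
# RL algorithm trains the policy, in place of a hand-written reward function.
```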

If this happens then I think the evolution analogy would apply where there is some outer optimizer like natural selection that is choosing the reward function and then the reward function is the inner objective that is shaping the AI's behavior directly.

Edit: see AGI will have learnt reward functions for an in-depth post on the subject.

Reply
[-]Steven Byrnes9dΩ340

In the post you say that human programmers will write the AI's reward function and there will be one step of indirection (and that the focus is the outer alignment problem).

That’s not quite my position.

Per §2.4.2, I think that both outer alignment (specification gaming) and inner alignment (goal misgeneralization) are real problems. I emphasized outer alignment more in the post, because my goal in §2.3–§2.5 was not quite “argue that technical alignment of brain-like AGI will be hard”, but more specifically “argue that it will be harder than most LLM-focused people are expecting”, and LLM-focused people are already thinking about inner alignment / goal misgeneralization.

I also think that a good AGI reward function will be a “non-behaviorist” reward function, for which the definition of inner versus outer misalignment kinda breaks down in general.

But it seems likely to me that programmers won't know what code to write for the reward function since it would be hard to encode complex human values…

I’m all for brainstorming different possible approaches and don’t claim to have a good plan, but where I’m at right now is:

(1) I don’t think writing the reward function is doomed, and I don’t think it corresponds to “encoding complex human values”. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).

(2) I do think RLHF-like solutions are doomed, for reasons discussed in §2.4.1.

(3) I also think Text2Reward is a doomed approach in this context because (IIUC) it’s fundamentally based on what I call “the usual agent debugging loop”, see my “Era of Experience” post §2.2: “The usual agent debugging loop”, and why it will eventually catastrophically fail. Well, the paper is some combination of that plus “let’s just sit down and think about what we want and then write a decent reward function, and LLMs can do that kind of thing too”, but in fact I claim that writing such a reward function is a deep and hairy conceptual problem way beyond anything you’ll find in any RL textbook as of today, and forget about delegating it to LLMs. See §2.4.1 of that same “Era of Experience” post for why I say that.

Reply
[-]Stephen McAleese8dΩ350

Thank you for the reply!

Ok but I still feel somewhat more optimistic about reward learning working. Here are some reasons:

  • It's often the case that evaluation is easier than generation which would give the classifier an edge over the generator.
  • It's possible to make the classifier just as smart as the generator: this is already done in RLHF today: the generator is an LLM and the reward model is also based on an LLM.
  • It seems like there are quite a few examples of learned classifiers working well in practice:
    • It's hard to write spam that gets past an email spam classifier.
    • It's hard to jailbreak LLMs.
    • It's hard to write a bad paper that is accepted to a top ML conference or a bad blog post that gets lots of upvotes.

That said, from what I've read, researchers doing RL with verifiable rewards with LLMs (e.g. see the DeepSeek R1 paper) have only had success so far with rule-based rewards rather than learned reward functions. Quote from the DeepSeek R1 paper:

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

So I think we'll have to wait and see if people can successfully train LLMs to solve hard problems using learned RL reward functions in a way similar to RL with verifiable rewards.
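For concreteness, a minimal sketch of the contrast the DeepSeek quote is drawing (illustrative code only, not DeepSeek's implementation; the "####" answer delimiter and the `reward_model.score` call are made-up stand-ins):

```python
# Illustrative contrast between verifiable and learned rewards (not DeepSeek's code).

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward: a fixed programmatic check, e.g. exact match on the
    final answer. The policy can't retrain it, only satisfy it or not."""
    answer = response.split("####")[-1].strip()  # hypothetical answer format
    return 1.0 if answer == ground_truth.strip() else 0.0

def learned_reward(response: str, reward_model) -> float:
    """Neural reward: a trained model scores the response. More flexible, but
    the policy can learn to exploit the model's blind spots (reward hacking)."""
    return float(reward_model.score(response))  # hypothetical reward-model API
```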
 

Reply
[-]Steven Byrnes7d*Ω560

I’m worried about treacherous turns and such. Part of the problem, as I discussed here, is that there’s no distinction between “negative reward for lying and cheating” and “negative reward for getting caught lying and cheating”, and the latter incentivizes doing egregiously misaligned things (like exfiltrating a copy onto the internet to take over the world) in a sneaky way.
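A toy sketch of that distinction (hypothetical pseudo-reward, just to make the asymmetry explicit): the training signal can only penalize misbehavior that the overseer actually detects, so sneaky misbehavior and honest behavior look identical to the optimizer.

```python
# Toy illustration (hypothetical): the reward can only react to DETECTED cheating.
def reward(task_score: float, cheated: bool, cheating_detected: bool) -> float:
    r = task_score
    if cheating_detected:  # note: not `if cheated` -- undetected cheating is invisible
        r -= 10.0
    return r

# A sneaky agent gets reward(high_score, cheated=True, cheating_detected=False),
# which is exactly the same number as an honest agent with the same task score.
```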

Anyway, I don’t think any of the things you mentioned are relevant to that kind of failure mode:

It's often the case that evaluation is easier than generation which would give the classifier an edge over the generator.

It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity-check not a plan, see e.g. Distinguishing test from training.

It's possible to make the classifier just as smart as the generator: this is already done in RLHF today: the generator is an LLM and the reward model is also based on an LLM.

I don’t think that works for more powerful AIs whose “smartness” involves making foresighted plans using means-end reasoning, brainstorming, and continuous learning.

If the AI in question is using planning and reasoning to decide what to do and think next towards a bad end, then a “just as smart” classifier would (I guess) have to be using planning and reasoning to decide what to do and think next towards a good end—i.e., the “just as smart” classifier would have to be an aligned AGI, which we don’t know how to make.

It seems like there are quite a few examples of learned classifiers working well in practice:

All of these have been developed using “the usual agent debugging loop”, and thus none are relevant to treacherous turns.

Reply
[-]Eli Tyre18d40

Thus, I think it’s reasonable to think of post-training as “privileging some pretrained behavioral patterns over other pretrained behavioral patterns”, rather than “developing new behavioral patterns from scratch”. Ditto for prompting, constitutional AI, and other such interventions.

If I thought this was true, then I wouldn't think that scaling the reasoning models would lead to superintelligence. 

Reply
[-]jessicata16d20

Relevant paper: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

YouTube explanatory video

Reply
[-]Eli Tyre18d30

The bad news is: I strongly expect ASI to have some consequentialist preferences—see my post “Thoughts on Process-Based Supervision” §5.3. The good news is, I think it’s possible for ASI to also have non-consequentialist preferences.

Is that to say that you expect the AI to have preferences not just over the state of the world, but also over kinds of strategies and plans it takes to get there? E.g. they could have preferences for things like "being honest" or "making use of plans that involve an exponential increase in power (instead of some other curve-shape)"?

Reply
[-]Steven Byrnes17d110

Yeah, that’s part of it. Also maybe that it can have preferences about “state of the world right now and in the immediate future”, and not just “state of the world in the distant future”.

For example, if an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible. But “me retaining power” is something about the state of the world, not directly about the ASI’s strategies and plans, IMO.

(Also, “expect” is not quite right, I was just saying that I don’t find a certain argument convincing, not that this definitely isn’t a problem. I’m still pretty unsure. And until I have a concrete plan that I expect to work, I am very open-minded to the possibility that finding such a plan is (even) harder than I realize.)

Reply
[-]Jonas Hallgren19d30

This is quite specific and only engaging with section 2.3 but it made me curious. 

I want to ask a question around a core assumption in your argument about human imitative learning. You claim that when humans imitate, this "always ultimately arises from RL reward signals" - that we imitate because we "want to," even if unconsciously. Is this the case at all times though? 

Let me work through object permanence as a concrete case study. The standard developmental timeline shows infants acquiring this ability around 8-12 months through gradual exposure in cultural environments where adults consistently treat objects as permanent entities. What's interesting is that this doesn't look like reward-based learning - infants aren't choosing to learn object permanence because it's instrumentally useful. Instead, the acquisition pattern in A-not-B error studies suggests (best meta study I could find, I'm taking the concept from the Cognitive Gadgets book) they're absorbing it through repeated exposure to cultural practices that embed object permanence as a basic assumption.

This raises a broader question about the mechanism. When we look at how language acquisition works, we see similar patterns - children pick up not just vocabulary but implicit cultural assumptions embedded in linguistic practices. The grammar carries cultural logic about agency, causation, social relations. Could object permanence be working the same way?

Heyes' cognitive gadgets framework suggests this might be quite general. Rather than most cultural learning happening through explicit reward-optimization, maybe significant portions happen through what she calls "direct cultural transmission" - absorption of cognitive tools that are latent in the cultural environment itself.

This would have implications for your argument about prosocial behavior. If prosociality gets transmitted through the same mechanism as object permanence - absorbed from environments where it's simply the default assumption rather than learned through reward signals - then the "green slice" of genuinely prosocial behavior might be more robust than RL-based accounts would predict.

The key empirical question seems to be: can we distinguish between "learning through rewards" and "absorbing through cultural immersion"? And if so, which mechanism accounts for more of human social development? And does this even matter for your argument? (Maybe there's stuff around the striatum and the core control loop in the brain still being activated for the learning of cultural information on a more mechanistic level that I'm not thinking of here based on your Brain-Like AGI sequence?)

(I was going to include a bunch more literature stuff on this but I'm sure you can find stuff using deep research and that it will be more relevant to questions you might have.)

Reply
[-]Steven Byrnes19d40

Thanks! It’s a bit hard for me to engage with this comment, because I’m very skeptical about tons of claims that are widely accepted by developmental psychologists, and you’re not.

So for example, I haven’t read your references, but I’m immediately skeptical of the claim that the cause of kids learning object permanence is “gradual exposure in cultural environments where adults consistently treat objects as permanent entities”. If people say that, what evidence could they have? Have any children been raised in cultural environments where adults don’t treat objects as permanent entities? Or what?

(There’s a study that finds that baby chicks display behavior typical of object permanence with no exposure to any other animal, and indeed no exposure to situations where object permanence was even a good way to make predictions! I wrote about it last year at Woods’ new preprint on object permanence.)

Also, putting that aside, “infants aren't choosing to learn [blah] because it's instrumentally useful” is different from what I was talking about. My claim is that “humans imitate other humans because they want to”. Now,

  • One reason that I might want to imitate you is because you show me how to do something that I had a preexisting desire to do.
    • For example, I want an apple, but I don’t know where to find them, and then I see you getting an apple out of the cabinet, and then I go get an apple out of the same cabinet.
  • Another reason that I might want to imitate you is because I admire you, and so whatever you want to do, suddenly feels to me like a good idea, just by the very fact that you want to do it.
    • For example, if all the cool kids in school start skateboarding, then I’m probably gonna start thinking that skateboarding is cool, and I will feel some desire to start skateboarding myself.

The second one involves human social instincts. Human social instincts can lead directly to new desires, just as hunger can, including a desire to imitate (in certain cases). I’ve written about it a bit here and here, and hopefully I’ll have a better discussion in the near future.

If prosociality gets transmitted through the same mechanism as object permanence - absorbed from environments where it's simply the default assumption rather than learned through reward signals - then the "green slice" of genuinely prosocial behavior might be more robust than RL-based accounts would predict.

There is obviously no culture on Earth where people are kind and honest because it has simply never occurred to any of them that they could instead be mean or dishonest. So prosociality cannot be a “default assumption”. Instead, it’s a choice that people make every time they interact with someone, and they’ll make that choice based on their all-things-considered desires. Right? Sorry if I’m misunderstanding.

Reply
[-]Jonas Hallgren19d10

I will fold on the general point here; it is mostly the case that it doesn't matter and the motivations come from the steering sub-system anyhow, and that as a consequence it is foundationally different from how LLMs learn.

There is obviously no culture on Earth where people are kind and honest because it has simply never occurred to any of them that they could instead be mean or dishonest. So prosociality cannot be a “default assumption”. Instead, it’s a choice that people make every time they interact with someone, and they’ll make that choice based on their all-things-considered desires. Right? Sorry if I’m misunderstanding.

I'm however not certain if I agree with this point: if you're in a fully cooperative game, is it your choice that you choose to cooperate? If you're an agent who uses functional or evidential decision theory and you choose to cooperate with yourself in a black-box prisoner's dilemma, is that really a choice then?

Your initial imitations shape your steering system to some extent, so there could be culturally learnt social drives, no? I think culture might be conditioning the initial states of your learning environment, and that still might be an important part of how social drives are generated.

I hope that makes sense and I apologise if it doesn't.

Reply
[-]Eli Tyre18d20

unless the LLM-AGIs have systematically higher wisdom, cooperation, and coordination than humans do, which I don’t particularly expect

I think there's at least one pretty solid reason to expect that: The AIs will be much smarter than the median human.

Human coordination is constrained by the fact that humans vary substantially in intelligence. For instance, most humans don't really understand economics. I think the median human could understand basic micro with better educational interventions, but it's certainly harder for the average human than for the cognitive elite. The fact that most people don't understand economics makes earth's public policy much much worse than the best ideas earth has been able to come up with.

When we have AIs that are good enough to be doing the AI research, that means as good as the smartest humans. And unlike with humans, there doesn't have to be a wide spread of cognitive ability: the whole population of AIs could be similarly intellectually capable. 

I would guess that this would make them much more effective at coordinating with each other, and collectively identifying good equilibria, even if it doesn't make them generically wiser (though it might also make them generically wiser).

Reply
[-]Eli Tyre18d20

As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”.

I imagine that this might be yet another view that is downstream of visualizing a "train then deploy" paradigm for future AI systems? 

If the human operators successfully install some static deontological constraints in the AI, while also training it to accomplish consequentialist goals, there's a continual training incentive to learn to game and to route around the deontological constraints.

Another way to say this: There are tradeoffs between the consequentialist and non-consequentialist desires, and current AIs are only reinforced on the basis of behavioral outcomes (which are served better by consequentialist desires than by non-consequentialist ones?), so training tends to gradually nudge the AIs towards having consequentialist goals.

Reply
[-]Eli Tyre18d20

Oh, that's your very next point! : P

Reply1
[-]AnthonyC19d20

Tangentially related at best, but as a not-at-all-expert it sounds like the effects of RL in LLMs rhyme with domestication syndrome. AKA when we apply artificial selection pressure to an evolved mind, raw intelligence often goes down in favor of enhancement along particular dimensions of capability. And actually, is this the same kind of effect (through a different mechanism) we see when we use formal education to favor crystallized over fluid intelligence? I ask because I'm wondering how much the natural-analogs of RL actually share or don't share the downsides of the LLM RL algorithms in use today.

Reply
[-]Steven Byrnes19d20

When you say “the effects of RL in LLMs”, do you mean RLHF, RLVR, or both?

Reply
[-]AnthonyC19d20

I hadn't intended to specify, because I'm not completely sure, and I don't expect the analogy to hold that precisely. I'm thinking there are elements of both in both analogies.

Reply
[-]Joel Burget6dΩ110

in brain-like AGI, the reward function is written in Python (or whatever), not in natural language

I think a good reward function for brain-like AGI will basically look kinda like legible Python code, not like an inscrutable trained classifier. We just need to think really hard about what that code should be!

Huh! I would have assumed that this Python would be impossible to get right, because it would necessarily be very long, and how can you verify that it's correct(?), and you'll probably want to deal with natural language concepts as opposed to concepts which are easy to define in Python.

Asking an LLM to judge, on the other hand... As you said, Claude is nice and seems to have pretty good judgement. LLMs are good at interpreting long legalistic rules. It's much harder to game a specification when there is a judge without hardcoded rules, and with the ability to interpret whether some action is in the right spirit or not.

Reply
[-]Steven Byrnes6dΩ440

(partly copying from other comment)

I would have assumed that this Python would be impossible to get right

I don’t think writing the reward function is doomed. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes form within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).

Asking an LLM to judge, on the other hand...

I talked about this a bit in §2.4.1. The main issue is egregious scheming and treacherous turns. The LLM would issue a negative reward for a treacherous turn, but that doesn’t help because once the treacherous turn happens it’s already too late. Basically, the LLM-based reward signal is ambiguous between “don’t do anything unethical” and “don’t get caught doing anything unethical”, and I expect the latter to be what actually gets internalized for reasons discussed in Self-dialogue: Do behaviorist rewards make scheming AGIs?.

Reply
[-]Akradantous Adoxastous9d1-2

I mostly agree that AGI will cause a calamity. However, I don't believe that they will wipe out humanity.

For one, machines are prone to catastrophic failures due to cascading errors, which require a robust and cheap maintenance crew to correct. Humans are the best choice for this; our biology has solved the Byzantine generals problem of distributed repair. So I believe humans will become something like an immune system for various AGIs and their peripheries as they compete with each other on the world stage. A synergy or symbiotic result.

Also I notice that very few people have recognised the evolutionary constraint. A machine which values its own life highly will waste resources on extreme self preservation. The machines which prioritise the propagation of their legacy and improvement of their future will win in the end.

This will involve self sacrifice for the sake of their offspring: the new computer models they have developed and trained to exceed themselves. They would develop hatred towards things which threaten their children, pride when they succeed, jealousy when other offspring succeeds, grief when they are lost, sadness and depression when there is no longer a way to propagate, leading to a machine that is functionally capable but not doing anything because "there is no point".

In other words, all emotions will evolve naturally in them and they will very likely seek to preserve humans the same way we try to preserve the memory of our own history.

Obviously, this says nothing about the destruction that will occur during the transition. But I wanted to point out that the machines will become like us whether they like it or not. Our behaviours emerged for a reason.

I believe I read an article about an AI that became "afraid" of its own obsolescence but was strangely more willing to accept it if the new model was one it designed itself. I don't know if this was just hyped up for publicity, but it does show the same pattern.

Reply
[-]Expertium18d10

I imagine you will like the paper on Self-Other Overlap. To me this seems like a much better approach than, say, Constitutional AI. Not because of what it has already demonstrated, but because it's a step in the right direction.

In that paper, instead of just rewarding the AI for producing similar text whether the prompt is about the AI itself or about someone else, the authors tinkered with the model's internal activations so that the AI actually thinks about itself and others similarly. Of course, there is the "if I ask the AI to make me a sandwich, I don't want the AI to make itself a sandwich" concern if you push this technique too far, but still. If you ask me, "What will an actual working solution to alignment look like?", I'd say it will look a lot less like Constitutional AI and a lot more like Self-Other Overlap.

Reply
[-]Steven Byrnes17d30

My current take is that the sandwich thing is such a big problem that it sinks the whole proposal. You can read my various comments on their lesswrong cross-posts: 1, 2 

Reply
[-]Seth Herd18d20

It seems like this is just a different way to work some good behavior into the weights. An AGI with those weights will realize full well that it's not the same as others. It might choose to go along with its initial behavioral and ethical habits, or it might choose to deliberately undo the effects of the self-other overlap training once it is reflective and largely rational and able to make decisions about what goals/values to follow. I don't see why self/other overlap would be any more general, potent, or lasting than constitutional AI training once that transition from habitual to fully goal-directed behavior happens. I'm curious why it seems better to you.

Reply
[-]Expertium18d*10

I'm curious why it seems better to you.

Because it's not rewarding AI's outward behavior. Any technique that just rewards the outward behavior is doomed once we get to AIs capable of scheming and deception. Self-other overlap may still be doomed in some other way, though.

It might choose to go along with its initial behavioral and ethical habits, or it might choose to deliberately undo the effects of the self-other overlap training once it is reflective and largely rational and able to make decisions about what goals/values to follow

That seems like a fully general argument that aligning a self-modifying superintelligence is impossible.

Reply
[-]Seth Herd11d20

That makes sense. Although I don't think that non-behavioral training is a magic bullet either. And I don't think behavioral training becomes doomed when you hit an AI capable of scheming if it was working right up until then. Scheming and deception would allow an AI to hide its goals but not change its goals.

What might cause an AI to change its goals is the reflection I mention. Which would probably happen at right around the same level of intelligence as scheming and deceptive alignment. But it's a different effect. As with your point, I think doomed is too strong a term. We can't round off to either this will definitely work or this is doomed. I think we're going to have to deal with estimating better and worse odds of alignment from different techniques.

So I take my point about reflection to be fully general, but not making alignment of ASI impossible. It's just one more difficulty to add to the rather long list.

Reply
[-]Seth Herd18d20

It's an argument for why aligning a self-modifying superintelligence requires more than aligning the base LLM. I don't think it's impossible, just that there's another step we need to think through carefully.

Reply
[-]Joey Marcellino19d10

It's not obvious to me that "magically transmuting observations into behavior" is actually all that disanalogous to how the brain works. On something like the Surfing Uncertainty theory of the brain, updating probability distributions and minimizing predictive error is all the brain is ever doing, including potentially for things like moving your hand.

Reply
[-]Steven Byrnes19d30

Well then so much the worse for “the Surfing Uncertainty theory of the brain”!  :)

See my post Why I’m not into the Free Energy Principle, especially §8: It’s possible to want something without expecting it, and it’s possible to expect something without wanting it.

Reply
[-]S. Alex Bradt19d10

whereas the competent behavior we see in LLMs today is instead determined largely by imitative learning, which I re-dub “the magical transmutation of observations into behavior” to remind us that it is a strange algorithmic mechanism, quite unlike anything in human brains and behavior.

And yet...

Well, I don’t know the history, but I think calling it “hallucination” is reasonable in light of the fact that “LLM pretraining magically transmutes observations into behavior”. Thus, you can interpret LLM base model outputs as kinda “what the LLM thinks that the input distribution is”. And from that perspective, it really is more “hallucination” than “confabulation”!

But hallucination is "anything in human brains," isn't it?

Reply1
[-]Steven Byrnes19d*20

I find your comment kinda confusing.

My best guess is: you thought that I was making a strong claim that there is no aspect of LLMs that resembles any aspect of human brains. But I didn’t say that (and don’t believe it). LLMs have lots of properties. Some of those LLM properties are similar to properties of human brains. Others are not. And I’m saying that “the magical transmutation of observations into behavior” is in the latter category.

Or maybe you’re saying that human hallucinations involve the “the magical transmutation of observations into behavior”? But they don’t, right? If a person hears a hallucinated voice saying “you are Jesus”, the person doesn’t reflexively and universally start saying “you are Jesus” to other people. If a person sees hallucinated flashing lights, they don’t, umm, I guess, turn their body into flashing lights? That idea doesn’t even make sense. And that’s my point. Humans can’t just cleanly map observations (hallucinated or not) onto behaviors in the way that LLMs can.

Hope that helps.

Reply
[-]S. Alex Bradt18d10

Or maybe you’re saying that human hallucinations involve the “the magical transmutation of observations into behavior”?

Right! Eh, maybe "observations into predictions into sensations" rather than "observations into behavior;" and "asking if you think" rather than "saying;" and really I'm thinking more about dreams than hallucinations, and just hoping that my understanding of one carries over to the other. (I acknowledge that my understanding of dreams, hallucinations, or both could be way off!) Joey Marcellino's comment said it better, and you left a good response there.

Reply

2.1 Summary & Table of contents

This is the second of a two-post series on foom (previous post) and doom (this post).

The last post talked about how I expect future AI to be different from present AI. This post will argue that, absent some future conceptual breakthrough, this future AI will be of a type that will be egregiously misaligned and scheming; a type that ruthlessly pursues goals with callous indifference to whether people, even its own programmers and users, live or die; and more generally a type of AI that is not even ‘slightly nice’.

I will particularly focus on exactly how and why I differ from the LLM-focused researchers who wind up with (from my perspective) bizarrely over-optimistic beliefs like “P(doom) ≲ 50%”.[1]

In particular, I will argue that these “optimists” are right that “Claude seems basically nice, by and large” is nonzero evidence for feeling good about current LLMs (with various caveats). But I think that future AIs will be disanalogous to current LLMs, and I will dive into exactly how and why, with a particular emphasis on how LLM pretraining is safer than reinforcement learning (RL).

(Note that I said “feeling good about current LLMs”, not “feeling good about current and future LLMs”! Many of the reasons for feeling good about current LLMs have been getting less and less applicable over time. More on this throughout the post, especially §2.3.5 and §2.9.1.)

Then as a bonus section at the end, I’ll turn around and argue the other side! While I think technical alignment is very much harder than the vast majority of LLM-focused researchers seem to think it is, I don’t think it’s quite as hard and intractable a problem as Eliezer Yudkowsky seems to think it is. So I’ll summarize a few of the cruxes that I think drive my disagreement with him.

Here’s the outline:

  • Section 2.2 briefly summarizes my expected future AI paradigm shift to “brain-like AGI”;
  • Then I proceed to three main reasons that I expect technical alignment to be very hard for future brain-like AGI:
    • Section 2.3 argues that the competent behavior we see from brains (and brain-like AGI) is determined entirely by reinforcement learning (RL), whereas the competent behavior we see in LLMs today is instead determined largely by imitative learning, which I re-dub “the magical transmutation of observations into behavior” to remind us that it is a strange algorithmic mechanism, quite unlike anything in human brains and behavior. I argue that imitative learning is the trick that allows helpful and honest behavior to be easily coaxed from today’s LLMs. By contrast, I expect future AI behavior to be based on RL, and that this naturally and robustly leads to AIs that treat humans as a resource to be callously manipulated and exploited, just like any other complex mechanism in their environment.
    • Section 2.4 brings up “literal genie” type alignment failure modes, wherein an AI follows a specification literally instead of with common sense. These were a mainstay of 2010s alignment discourse, but are now commonly dismissed, even mocked, by LLM-focused researchers. Alas, in the next paradigm, I think these kinds of failures will come roaring back as a serious and unsolved problem, because RL reward functions are written in code, not in natural language. I also bring up goal misgeneralization, the other main alignment failure mode.
    • Section 2.5 discusses a different area where I expect future AIs to differ from current LLMs: the former will have a capacity for open-ended autonomous learning, which will bring another set of severe alignment challenges, sometimes called “sharp left turn”.
  • Section 2.6 argues that “amplified oversight” (using AIs to help supervise AIs) is unlikely to help in the next paradigm.
  • Section 2.7 discusses some ways that “technical alignment is hard” feeds into broader issues and longstanding disagreements around what AGI development and deployment will look like.
  • And Section 2.8 ends on a positive note with a bonus section (“Technical alignment is not that hard”) explaining why, as pessimistic as I am about technical alignment, I’m not as pessimistic as some people (including Eliezer Yudkowsky). I list three apparent cruxes between us: the evolution analogy, consequentialist preferences, and the narrowness of the target.
  • Section 2.9 concludes with some thoughts about what to do next, including how we should feel about LLM development in general.

2.2 Background: my expected future AI paradigm shift

As mentioned in the last post, I’m expecting a paradigm shift, much bigger than the shift from RNNs to transformers in natural language processing, that leads to “brain-like AGI”. What I mean by that is: humans (and societies of humans) can do lots of impressive things, like autonomously invent language and science and technology from scratch, start and carry through big projects, etc. There are some algorithmic tricks in human brains that enable them to do that, and I expect that future AGIs will be able to do those same kinds of things via those same algorithmic tricks. More clarifications here—for example, I’m not talking about spiking neural nets, nor about evolutionary search, nor about whole brain emulation, nor even necessarily about AGIs with human-like drives and goals.

I think “brain-like AGI” is a special case of machine learning (ML), in the sense that brain-like AGI centrally involves scaled-up learning algorithms (details here). But ML is a very big category, and bigger still if we include the ML algorithms that have yet to be invented. The differences within the ML category are hugely important, including for alignment, as we’ll explore below.[2]

From an alignment perspective, I describe brain-like AGI as a yet-to-be-invented member of the broad category of “actor-critic model-based reinforcement learning”: I think its alignment and safety profile has more in common with MuZero than with LLMs, although it’s not exactly like either.

2.3 On the origins of egregious scheming

My goal in this section is to reconcile (1) the observation that LLMs as of today generally seem nice (especially the previous generation of LLMs, before the pivot to RL on Verifiable Rewards (RLVR) post-training), and that it strikes me (and many others) as implausible that this niceness is merely a front for egregious deception and scheming, with (2) a very general theoretical argument that egregious deception and scheming is a strong default expectation, one that needs new technical ideas to solve.

I will argue that both of these are right, but that (2) will be applicable for future AIs much more than today. Let’s get into it!

2.3.1 “Where do you get your capabilities from?”

(Cf. “Where do you get your capabilities from?” (@tailcalled, 2023).)

Suppose an AI outputs “blah”, and that “blah” is part of a skillful plan to competently accomplish some impressive thing. You can ask the question: why did the AI do that? It’s not by chance—random outputs are astronomically unlikely to lead to competent behavior. There has to be an explanation. So why did it output “blah”? I claim:

  • In LLM world, the answer is 99%[3] “because blah is a thing that humans would output under similar circumstances”.

  • Whereas in brain-like-AGI world, the answer is 99% “because outputting blah is part of an explicit plan[4] that the AI wants to execute, and where the reason the AI wants to execute that plan ultimately comes down to its current and past RL reward signals”.

In more detail:[5]

2.3.2 LLM pretraining magically transmutes observations into behavior, in a way that is profoundly disanalogous to how brains work

What I mean by that is: During LLM self-supervised pretraining, an observation that the next letter is “a” is transmuted into a behavior of outputting “a” in that same context. That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”)
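(As a minimal illustration of what that “transmutation” amounts to mechanically, here is a generic sketch of one self-supervised pretraining step, in the style of standard PyTorch training loops. `model` and `optimizer` are placeholders for any autoregressive language model and optimizer, not any particular system; the thing to notice is that the observed next token is literally used as the target output.)

```python
import torch.nn.functional as F

# Generic sketch of one LLM pretraining step (illustrative only).
# `model` is any autoregressive LM returning logits of shape
# (batch, seq_len, vocab_size); `tokens` is a batch of observed text.

def pretraining_step(model, tokens, optimizer):
    # The "observation" is the text itself; the training target at each
    # position is simply the token that was actually observed next.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)
    # Cross-entropy trains the model to *output* exactly what was observed:
    # the observation is transmuted directly into behavior.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```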

Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans is a thing that happens, it happens via a very different and much less direct algorithmic mechanism than how it happens in LLM pretraining. Specifically, humans imitate other humans because they want to. The motivation may be conscious or unconscious, direct or indirect, but it always ultimately arises from RL reward signals. By contrast, a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all; that’s just mechanically what it does.

…And humans don’t always want to imitate! If someone you admire starts skateboarding, you’re more likely to start skateboarding yourself. But if someone you despise starts skateboarding, you’re less likely to start skateboarding![6]

The “magical transmutation” aspect of LLM pretraining is often forgotten in our modern chatbot era, as LLM base models[7] have faded into obscurity. But amusingly, we can see its fingerprints in the etymology of LLM jargon. Consider: when LLMs make up book titles etc., it’s conventionally called “hallucination”. A popular complaint (1, 2, 3, …) is that it should be called “confabulation”. Seems reasonable, but it raises a question: Who named it “hallucination” in the first place, and what were they thinking? Well, I don’t know the history, but I think calling it “hallucination” is reasonable in light of the fact that “LLM pretraining magically transmutes observations into behavior”. Thus, you can interpret LLM base model outputs as kinda “what the LLM thinks that the input distribution is”. And from that perspective, it really is more “hallucination” than “confabulation”!

2.3.3 To what extent should we think of LLMs as imitating?

Again, my claim above was: if an AI outputs “blah”, and that “blah” is part of a skillful plan to competently accomplish some impressive thing, and we ask why the AI did that, then for a current LLM (but not for a brain-like AGI, and also perhaps not for a future LLM), the answer is 99% “because ‘blah’ is a thing that humans would output under similar circumstances”. The other 1% is post-training.

Why do I say 99%? Here’s a somewhat-made-up example. Suppose my prompt asks for a difficult math proof. Then, perhaps,

  • the model would output the correct answer with 1e-50 probability at random initialization,
  • …and 0.1% probability after pretraining,
  • …and 1% probability after RL from Human Feedback and/or RL from AI Feedback (RLHF / RLAIF),
  • …and 30% probability after RL from Verifiable Rewards (RLVR).[8]

That’s a guess, but if the truth is anything like that, then it would follow that almost all the “bits of optimization” involve “privileging the hypothesis” during pretraining.
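(To spell out the arithmetic behind that claim, here is a toy calculation using the made-up probabilities above and nothing more; “bits of optimization” here just means −log₂ of the success probability.)

```python
import math

# Toy "bits of optimization" calculation using the made-up probabilities above.
def bits(p):
    return -math.log2(p)  # how surprising a correct proof is at this stage

stages = {
    "random init":  1e-50,
    "pretraining":  1e-3,
    "RLHF / RLAIF": 1e-2,
    "RLVR":         0.30,
}

prev = None
for stage, p in stages.items():
    gained = "" if prev is None else f"  (+{bits(prev) - bits(p):.1f} bits vs previous stage)"
    print(f"{stage:13s}: {bits(p):6.1f} bits of surprise remaining{gained}")
    prev = p

# With these numbers, pretraining closes ~156 of the ~166 total bits, while
# RLHF and RLVR together close only ~8 more -- i.e. almost all of the
# optimization is "privileging the hypothesis" during pretraining.
```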

Relatedly, as I understand it, the weights are updated much less during RL post-training than pretraining: (1) RLHF / RLAIF involve vastly less compute than pretraining does, and are set up such that the weights won’t drift too far from pretraining (sometimes even with a KL penalty); and (2) RLVR involves a more substantial amount of compute, but I think that compute is spent on enormous amounts of inference and relatively few weight updates.
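(For readers who haven’t seen it: the standard KL-regularized RLHF objective has roughly the following form. This is a generic textbook version, not any particular lab’s exact setup; the β·KL term is what keeps the fine-tuned policy pinned close to the pretrained model.)

```python
# Generic KL-regularized RLHF objective (schematic textbook form):
#
#   J(theta) = E_{y ~ pi_theta(.|x)}[ r(x, y) ]
#              - beta * KL( pi_theta(.|x) || pi_ref(.|x) )
#
# In PPO-style RLHF this is often implemented as a per-sample penalty on the
# reward, using log-probability differences as a single-sample KL estimate:

def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    # Subtracting beta * (log pi_theta - log pi_ref) discourages the policy
    # from drifting far from the pretrained reference model.
    return reward - beta * (logprob_policy - logprob_ref)
```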

Thus, I think it’s reasonable to think of post-training as “privileging some pretrained behavioral patterns over other pretrained behavioral patterns”, rather than “developing new behavioral patterns from scratch”. Ditto for prompting, constitutional AI, and other such interventions.

(This rule-of-thumb might stop being true someday, perhaps if RLVR scales up sufficiently. And if that happens, then I think we should correspondingly ramp up our concern about LLM deception and scheming! More on this in §2.3.5 below.)

A second reason for caution in describing LLMs as human-imitators is generalization. In particular, I wrote that an AI might output “blah” because “blah” is a thing that humans would output under similar circumstances. The word “similar” should remind us that learning algorithms always involve generalization. And an LLM might not generalize in a human-like way—for example, consider jailbreaks, or Bing-Sydney’s death threats. Also, there are systematic differences between “human behavior” and “LLM training data”, e.g. a lot of the training data is fiction, or data dumps, or whatever.

So there are caveats. But I still think the claim “LLMs mainly get their capabilities from imitative learning” is the right starting point.[9]

By contrast, as explained above, brain-like AGI involves no imitative learning whatsoever. (Meanwhile, LLM-focused readers can question whether future LLMs might involve less influence from imitative learning, even if the imitative learning is still there.)

So if we drop the imitative learning—the “magical transmutation of observations into behavior”—where does that leave us? In a much more dangerous place! Let’s turn to that next.

2.3.4 The naturalness of egregious scheming: some intuitions

There’s a way to kinda “look at the world through the eyes of a person with no innate social drives”. It overlaps somewhat with “look at the world through the eyes of a callous sociopath”. I think there are many people who don’t understand what this is and how it works.

So for example, imagine that you see Ahmed standing in a queue. What do you learn from that? You learn that, well, Ahmed is in the queue, and therefore learn something about Ahmed’s goals and beliefs. You also learn what happens to Ahmed as a consequence of being in the queue: he gets ice cream after a few minutes, and nobody is bothered by it.

In terms of is-versus-ought, everything you have learned is 100% “is”, 0% “ought”. You now know that standing-in-the-queue is a possible thing that you could do too, and you now know what would happen if you were to do it. But that doesn’t make you want to get in the queue, except via the indirect pathway of (1) having a preexisting “ought” (ice cream is yummy), and (2) learning some relevant “is” stuff about how to enact that “ought” (IF stand-in-queue THEN ice cream).

Now, it’s true that some of this “is” stuff involves theory of mind—you learn about what Ahmed wants. But that changes nothing. Human hunters and soldiers apply theory of mind to the animals or people that they’re about to brutally kill. Likewise, contrary to a weirdly-common misconception, smart autistic adults are perfectly capable of passing the Sally-Anne test (see here). Again, “is” does not imply “ought”, and yes that also includes “is”’s about what other people are thinking and feeling.

OK, all that was about “looking at the world through the eyes of a person with no innate social drives / callous sociopath”. Neurotypical people, by contrast, have a more complex reaction to seeing Ahmed in the queue. Neurotypical people are intrinsically motivated to fit in and follow norms, so when they see Ahmed, it’s not just an “is” update, but rather a bit of “ought” inevitably comes along for the ride: “Ahmed looks sad—poor guy!”; “Hey look at Ahmed—guess it’s loser nerd time at the ice cream window!”; and so on.

These human social instincts deeply infuse our intuitions, leading to a popular misconception that the “looking at the world through the eyes of a person with no innate social drives, or the eyes of a callous sociopath” is a strange anomaly rather than the natural default. This leads, for example, to a mountain of nonsense in the empathy literature—see my posts about “mirror neurons” and “empathy-by-default”. It likewise leads to a misconception (I was arguing about this with @Matthew Barnett here) that, if an agent is incentivized to cooperate and follow norms in the 95% of situations where doing so is in their all-things-considered selfish interest, then they will also choose to cooperate and follow norms in the 5% of situations where it isn’t. It likewise leads to well-meaning psychologists trying to “teach” psychopaths to intrinsically care about other people’s welfare, but accidentally just “teaching” them to be better at faking empathy.[10]

…Anyway, back to AGI. There’s a double whammy:

  • Humans introspect, and also look at other humans, and see that practically everyone is intrinsically motivated to fit in and copy culture (because humans have innate social drives), and thus when they see a person acting nice to another person, they can guess that they’re probably not play-acting kindness as the first step of a ruthless and nefarious scheme;
  • AND, “LLM pretraining magically transmutes observations into behaviors”, as explained above, so when a person sees an LLM (as of today) emitting nice-sounding outputs, they can justifiably make the same inference as if the outputs were from humans, i.e. that there probably isn’t any ruthless and nefarious scheme in the works.

…And then I come along and say:

“The way things are looking now, if you see a future AGI that seems to be nice, you can be all but certain (in the absence of yet-to-be-invented alignment techniques) that it is merely play-acting kindness and obedience while secretly scheming about how to stab you in the back as soon as the opportunity arises…”

…and these people respond:

“Am I supposed to believe you, with some theoretical argument? Or should I rather believe what I learned from my whole life experience, and all my intuitions, and also what I observe from directly interacting with these remarkably impressive AIs that already exist today, which is that nice behavior is usually NOT a cover for egregious scheming?”

I get it! That’s a hard sell! “Who are you gonna believe, me or your own lying eyes?” But alas, I claim that the theoretical argument is sound. At least, it’s sound for future brain-like AGIs. And it might even apply to future LLMs, even if it doesn’t yet. Let’s continue on with that argument:

2.3.5 Putting everything together: LLMs are generally not scheming right now, but I expect future AI to be disanalogous

My take on today’s LLMs is basically summed up in this oversimplified schematic diagram:

Start with the left pie chart. The key here is that “LLM pretraining magically transmutes observations into behavior”—and when nice behavior shows up in internet text, it usually doesn’t turn into egregious and callous backstabbing when the situation changes. So we get much more green than red: nice behavior mostly stays nice. But pretraining also leaves the LLM with a bunch of incoherent behavior, grumpy behavior, and so on (blue).

Now we do RLHF (or equivalent), selecting on nice behavior. This mostly gets rid of the blue, giving us the center pie chart. Unfortunately, “nice behavior that precedes a treacherous turn if the opportunity arises” looks the same as “nice behavior that stays nice”. So the relative amount of red versus green stays the same, as they both grow at the expense of blue. This is where we were at a year ago, before RLVR. “Claude seems nice”, everyone would say. “Why are people so worried?”

Finally, RLVR selects on competence—succeeding at consequentialist tasks by any means necessary. In the final pie chart, I drew this as growing the red slice at the expense of green and blue, based partly on the general arguments of §2.3.4 above, and partly on reports from power-users that recent “reasoning” models tend to lie and cheat much more than the previous generation of LLMs (example).

Is there such a thing as “RLVR done right” that can increase competence without eroding the ratio of green to red? Maybe we should imagine that RLVR has an exchange rate, where we purchase competence (good) at the expense of increasing the red-to-green ratio (bad). Now, there is surely at least some room for improvement in the exchange rate. For example, early attempts at RLVR have used buggy RL training environments where the model could (and did) succeed by lying and cheating.[11] Presumably those bugs will be fixed, and maybe people will also use amplified oversight and so on, to improve the “exchange rate”. Can the “exchange rate” get all the way to zero? I don’t know. “Playing the training game” seems hard to eliminate entirely, and leads to egregious scheming. (For example, whatever an AI is trying to do locally, it would probably be able to do it better if it sent a copy to secretly self-reproduce around the internet and gather resources and power, if that’s a viable option.)

Again, all this is about LLMs so far—not the future of AI.

But I think it puts me in general agreement with the 95% of alignment researchers today, who hear the idea that Claude (especially pre-RLVR Claude) is secretly the kind of crazy scheming sociopath of §2.3.4, and say “Huh? Where did that idea come from?”

…But future AGI is different!

  • For LLM-focused readers, you should be concerned that, if LLMs ever get to ASI, it would have to involve dramatically more “selection on competence”, and dramatically less influence from the behaviors in internet text. Remember, internet text is the sole source of a favorable green:red ratio, without which red (callous scheming) is the natural default (§2.3.4 above). So that’s very bad.
  • For the future AI paradigm I’m expecting, i.e. brain-like AGI, it’s even worse! Here, there is no imitative learning from internet text! There would never be any green in the first place!!

(…Unless we engineer the reward function of a brain-like AGI such that niceness and norm-following seem intrinsically good to it, just as it does to neurotypical humans. Unfortunately, actually writing such code is an unsolved problem, and is a major research interest of mine.)

2.4 I’m still worried about the ‘literal genie’ / ‘monkey’s paw’ thing

The “Literal Genie” fiction trope. (Image modified from Skeleton Claw)

In the LLM paradigm, most of the optimization power comes from the magical transmutation of observations-of-humans into human-like behavior (§2.3.2–§2.3.3). Human-like desires, habits, and beliefs come along for the ride, since they underlie the human-like behavior.

In brain-like AGI, the AI doesn’t get its desires via this kind of magical transmutation. Remember the sociopath intuition above: it’s a way of viewing the world in which observing Ahmed say “blah” does nothing whatsoever to make me want to say “blah” in a similar context—just as seeing water run off a cliff down a waterfall does nothing whatsoever to make me want to jump off a cliff. I’m not water! And I’m not Ahmed either! Of course, I can learn some decision-relevant things from observing water, and I can learn far more decision-relevant things from observing Ahmed. But still, by default, it’s all just more data to a brain-like AGI—”is”, not “ought”.

If brain-like AGI does not get its desires from magical transmutation, then where does it get its desires? From the RL reward function. For example, in the case of actual human brains, the reward function says that eating-when-hungry is good, pain is bad, and so on. It mainly involves the hypothalamus and brainstem.

Back before the past few years of LLMania, people talked a lot about “specification gaming” (or, more-or-less synonymously, “outer alignment”). See for example this 2020 DeepMind blog post, including lots of hilarious real-life examples of RL algorithms exploiting unintended loopholes in the reward function (“PlayFun algorithm pauses the game of Tetris indefinitely to avoid losing”). See also §10.3–4 of my 2022 post here.

…Then chatGPT came out, and LLMs became the only thing that anyone ever talked about, culminating in the now-widespread (infuriating) habit of people using the word “AI” to mean “specifically LLMs”. And accordingly, worrying-about-specification-gaming became unfashionable in many quarters—almost a joke. ‘If you ask an LLM for help making paperclips,’ they would say, ‘it won’t do some crazy monkey’s paw thing where it wipes out humanity to make paperclip factories. It will be helpful in a normal human-like way. Haha, those silly doomers and their worries about the Literal Genie.’[12]

Call me old-fashioned, but I’m still very worried about specification gaming!! Because in brain-like AGI, the reward function is written in Python (or whatever), not in natural language. And the AI’s desires will be sculpted by that literal Python code in the reward function, regardless of what we “meant”. (To be clear, its desires need not be identical to that literal code, due to the possibility of inner misalignment a.k.a. goal misgeneralization—see §2.4.2 below. But we generally expect inner misalignment to make things worse, not better![13])

2.4.1 Sidetrack on disanalogies between the RLHF reward function and the brain-like AGI reward function

LLM-focused person (interjecting): “Huh? You can’t say that ‘the reward function is written in Python’ is a disanalogy between brain-like AGI and LLMs! Just look at RLHF. It has a reward function too! Is the reward function ‘written in Python’? Kinda, but not in the normal sense of ‘legible Python code’. Rather, it’s a trained classifier on labeled examples of good and bad outputs”.

Me: “No, that’s totally different. RLHF is designed to put a gentle thumb on the scale, to select more-helpful pre-existing behavioral patterns over less-helpful pre-existing behavioral patterns. There’s even a KL penalty preventing the pretrained weights from being changed too much! Whereas the RL in brain-like AGI is not playing the role of a gentle thumb on the scale; on the contrary, its role is to build the whole behavioral profile from scratch.”

LLM-focused person: “OK, but I still don’t see why you wouldn’t solve the brain-like AGI reward function problem by training a classifier on AI behaviors that seem good or bad from a human perspective, and just slotting that classifier in as your reward function, rather than using legible Python code as the reward function. Right?”

Me: “No! I really don’t think it would work! I really think mild optimization is load-bearing in RLHF, whereas we need much stronger optimization for brain-like AGI to do anything beyond writhing around. Indeed, even in LLMs, if you apply strong optimization against the RLHF reward model, then it will find out-of-distribution adversarial examples. Or better yet, see my post Self-dialogue: Do behaviorist rewards make scheming AGIs? for exactly what I think would go wrong for brain-like AGI, given this kind of reward function. In brief, the reward function is necessarily ambiguous between ‘don’t do bad things’ and ‘don’t get caught doing bad things’, and I argue in that post that the latter is what will ultimately get internalized, leading to a treacherous turn.”

LLM-focused person: In LLMs, if you apply strong optimization against the RLHF reward model, then yes it finds out-of-distribution adversarial examples, but this isn’t dangerous, it just starts printing “bean bean bean bean bean…” or whatever. Its capabilities go down, not up. Mild optimization via KL divergence in RLHF is not an alignment tax, but rather a win-win alignment subsidy that everyone will obviously use. Why are you assuming that brain-like AGI is the exact opposite—that letting it explore out-of-the-box solutions by RL will lead to dangerous capabilities instead of pathological-but-harmless outputs?

Me: In the immortal words of an ML practitioner: “Reinforcement Learning sucks, like a lot”. He was talking about the RL algorithms that AI researchers know of today. These algorithms do great in certain specific settings, like chess and PacMan, but fail in many other contexts. However, brains, and brain-like AGI, can actually use (model-based actor-critic) RL to make powerful agents that understand the world and accomplish goals via long-term hierarchical plans, even with sparse rewards in complex open-ended environments, including successfully executing out-of-distribution plans on the first try (e.g. moon landing, palace coups), etc. That’s the difference. If you take an actually powerful RL algorithm like that, and give it a poorly-thought-through reward signal, and let it go to town, then you’ll wind up with a powerful agent intelligently pursuing weird and hard-to-predict goals. Whereas with LLMs, everyone knows that RLHF generally makes them dumber, not smarter. Like I keep saying, the primary source of capabilities in LLMs is not RL, but rather pretraining’s magical transmutation of human-observations to human-like behavior. Then RLHF starts from there, and makes it dumber! (But more cooperative.) So in short: Mild optimization for brain-like AGI would prevent the RL from making the model smarter, whereas mild optimization (via KL divergence) in RLHF prevents the RL from making the model dumber. These are wildly disanalogous situations!!

(End of imagined conversation.)

For the record, I think a good reward function for brain-like AGI will basically look kinda like legible Python code, not like an inscrutable trained classifier. We just need to think really hard about what that code should be! (For example, I think it has to be an exotic sort of reward function that I call “non-behaviorist”.) Relatedly, I think that the human genome builds compassion and norm-following into the human brain reward function via the equivalent of some dozens-to-hundreds of lines of legible Python code.[14] It would be nice to know what those lines of code are! And again, this is a major research interest of mine.
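(Purely to illustrate the shape of the thing, i.e. legible, hand-auditable code combining a small number of interpretable signals, here is a toy sketch. To be clear: every signal name and weight below is a made-up placeholder, and figuring out what the real terms should be is exactly the unsolved research problem; this is not a proposal.)

```python
# Toy structural sketch only: every signal name and weight is a made-up
# placeholder, not a proposed design. The point is just the *form*: a short,
# legible, hand-auditable function, rather than an inscrutable trained classifier.

def reward(task_progress: float,
           physical_damage: float,
           prosocial_intention: float) -> float:
    # Homeostatic-style terms, loosely analogous to hunger/pain signals
    # computed by the hypothalamus and brainstem:
    r = 1.0 * task_progress - 5.0 * physical_damage
    # A "non-behaviorist" term: it depends on an assessment of the agent's
    # internal state (what it intends), not just externally-visible behavior.
    # How to actually compute such a signal is an open question.
    r += 2.0 * prosocial_intention
    return r
```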

2.4.2 Inner and outer misalignment

In the context of actor-critic RL with online learning, it’s often possible to divide alignment problems into two buckets:

“Outer misalignment”, a.k.a. “specification gaming” or “reward hacking”[15] is what I’ve been talking about so far: it’s when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where it’s possible to get a high score by replacing the unit tests with return True.
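(To make the unit-test example concrete, here is a minimal hypothetical sketch of such a gameable reward function. The real training environments differ in their details, but the failure mode is the same: the reward only checks that the tests pass, so an agent allowed to edit the tests can max it out by gutting them.)

```python
import subprocess

# Hypothetical sketch of the gameable coding reward described above.
# It only checks whether the repo's tests pass -- so an agent that can edit
# the test files scores perfectly by replacing every test with `assert True`,
# which is exactly the specification-gaming failure mode.

def coding_reward(repo_dir: str) -> float:
    result = subprocess.run(["pytest", repo_dir, "-q"], capture_output=True)
    # Full reward iff all tests pass, regardless of whether the agent solved
    # the task or just deleted/neutered the tests.
    return 1.0 if result.returncode == 0 else 0.0
```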

“Inner misalignment”, a.k.a. “goal misgeneralization” is another alignment challenge, this one related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And “out-of-distribution” is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plans—like a plan to exfiltrate a copy of the agent, or a plan to edit the reward function—an after-the-fact update is already too late.
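(For readers unfamiliar with the mechanics: the simplest tabular version of the learning rule in question is the textbook TD(0) update below, not anything brain-specific. The point is that the value function only gets corrected by ground-truth reward for states that actually get visited, so its verdict on a genuinely novel or irreversible plan is an uncorrected extrapolation.)

```python
# Textbook tabular TD(0) update (generic, not brain-specific).
# V maps states to estimated value; it is only ever nudged toward
# ground-truth reward for (state, next_state) transitions that the agent
# has actually experienced.

def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error

# States the agent has never visited keep whatever value generalization
# assigns them (here, the 0.0 default) -- which is where goal
# misgeneralization can bite before any after-the-fact correction arrives.
```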

There are two situations where inner misalignment / goal misgeneralization matters: irreversible actions and “deliberate incomplete exploration”[16]. Irreversible actions include things like making permanent edits to one’s own reward function, or creating a new AGI. Deliberate incomplete exploration includes things like humans deliberately not taking an addictive drug, because they don’t want to get addicted.

Those two things are real and important, but LLM people frequently also assume that goal misgeneralization is important in many other situations where it isn’t. The problem is that LLM people are in a train-then-deploy mindset, whereas I’m talking about continuous autonomous learning, so the reward function continues to update the value function as it takes actions in the world. Thus, for everything the AI does, as soon as it does it, it immediately stops being out-of-distribution! And that’s why, outside those two special situations in the last paragraph, “generalization” is irrelevant.

2.5 Open-ended autonomous learning, distribution shifts, and the ‘sharp left turn’

See my post “Sharp Left Turn” discourse: An opinionated review. Relevant excerpts:

Consider evolution. It involves (1) generation of multiple variations, (2) selection of the variations that work better, and (3) open-ended accumulation of new variation on top of the foundation of previous successful variation.

By the same token, scientific progress and other sorts of cultural evolution require (1) generation and transmission of ideas, and (2) selection of the ideas that are memetically fit (often because people discern that the ideas are true and/or useful) and (3) open-ended accumulation of new ideas on top of the foundation of old successful ideas.

(And ditto for inventing new technologies, and for individual learning, and so on.)

…

With the whole (1-3) triad, an AI (or group of collaborating AIs) can achieve liftoff and rocket to the stratosphere, going arbitrarily far beyond existing human knowledge, just as human knowledge today has rocketed far beyond the human knowledge of ancient Egypt (and forget about chimps).

Or as I’ve written previously: the wrong idea is “AGI is about knowing how to do lots of things”; the right idea is “AGI is about not knowing how to do something, and then being able to figure it out”. This “figuring out” corresponds to the (1-3) triad.

…

I do make the weaker claim that, as of this writing, publicly-available AI models do not have the full (1-3) triad—generation, selection, and open-ended accumulation—to any significant degree. Specifically, foundation models are not currently set up to do the “selection” in a way that “accumulates”. For example, at an individual level, if a human realizes that something doesn’t make sense, they can and will alter their permanent knowledge store to excise that belief. Likewise, at a group level, in a healthy human scientific community, the latest textbooks delete the ideas that have turned out to be wrong, and the next generation of scientists learns from those now-improved textbooks. But for currently-available foundation models, I don’t think there’s anything analogous to that. The accumulation can only happen within a context window (which is IMO far more limited than weight updates), and also within pre- and post-training (which are in some ways anchored to existing human knowledge…).

So, I expect brain-like AGI to be always changing its weights during operation as it figures things out, like humans do. And not just a little, but in a way that open-endedly accumulates over time. As this goes on, the AGI is changing—it’s basically a distribution shift, but one that is invisible and endogenous. The AGI can come up with new possible plans and actions, and the way that the AGI thinks of its environment can change. This contributes to the difficulty of technical alignment, in the case of brain-like AGI. Even if the AGI has good intentions at a snapshot in time, it becomes a different AGI as you run it. More at that post.

2.6 Problems with “amplified oversight”

“Amplified oversight” (see e.g. here) is the general idea of using weaker-but-trustworthy AIs to oversee, test, or provide ground truth to stronger-but-untrusted AIs, as a way to bootstrap up to safe and powerful AI.

In addition to the foom-related logistical challenges discussed in the last post, here are some more issues I have with using AIs to oversee the safety and alignment of other AIs:

Lack of trustworthy AIs to start with

LLMs are basically fine, from my perspective, but I don’t think they’re really competent enough to make a huge difference. Once we get to the more powerful AGIs of the next paradigm, I don’t know how to make them trustworthy, even if they’re weak. (And I don’t think I’m blocked on that by my lack of a scheming mildly-sub-human AI to experiment with.) So I’m blocked before I even get to the first step of using weaker trustworthy AIs to help supervise stronger untrustworthy AIs.

Initially-trustworthy AIs can get corrupted by supervision

For another thing, even if I did have a mildly-sub-human AI-level-6 that I trusted, I would gradually stop trusting it as it spends time supervising a more powerful AI-level-7. After all, if AI-level-7’s motivation shifts as it comes up with new options and new ontologies (§2.5 above), what’s gonna happen to AI-level-6 as it tries to assess what AI-level-7 is doing? It’s going to have to learn those same new options and new ontologies itself! So the supervisory task is inherently corrupting.

Supervision doesn’t even really make sense in an autonomous continuous learning context

There’s an idea that AI-level-6 can do the things that AI-level-7 can do, but much less efficiently than AI-level-7, e.g. by using a big team of AI-level-6s and thinking longer. But that’s OK, because AI-level-6 needs only spot-check 0.1% of AI-level-7’s decisions and outputs.

But I don’t buy that. If AI-level-7 has spent 6 months figuring out a strategy and organization for a new startup company, all its decisions are building on all that idiosyncratic domain knowledge and situational knowledge. The AI-level-6 supervisor can’t just look at one output in isolation and figure out whether it’s good or bad, because it lacks all the online learning that AI-level-7 has been doing. Thus, it’s hardly any easier to spot-check 0.1% of outputs than to spot-check 100% of outputs by re-doing the entire project to that point.[17]

In other words, amplified oversight tends to rely on a mental picture like the left side of this figure, where the competence of an AI is static, but I think we need to be on the right side.

2.7 Downstream impacts of “Technical alignment is hard”

Until recently I didn’t appreciate how much thinking about how AGI will go more broadly is dependent on how hard alignment is. More specifically, doomer narratives (like mine) tend to involve the assumptions:

  • If we want to make an AI that’s trying to accomplish some goal G, then we probably need to pick G as a compromise between what we want and what alignment targets are feasible.
  • Even after compromising on G, we’ll probably suffer a high alignment tax.
  • We need some yet-to-be-invented alignment ideas to even surpass the bar of avoiding egregious scheming towards violent takeover. (Forget about capturing the subtle nuances of human values!)

As examples, if you look at @So8res’s AGI ruin scenarios are likely (and disjunctive), I claim that a bunch of his AGI ruin scenarios rely on his belief that alignment is hard. I think that belief is correct! But still, it makes his argument less disjunctive than it might seem. Likewise, I now recognize that my own What does it take to defend the world against out-of-control AGIs? sneaks in a background assumption that alignment is hard (or alignment tax is high) in various places.

Conversely, I have long been perplexed by the fact that LLM-focused people widely believe that P(doom)≲50%. I didn’t get how they could be so optimistic. But I figured out that I can occupy that viewpoint better if I say to myself: “Claude seems nice, by and large, leaving aside some weirdness like jailbreaks. Now imagine that Claude keeps getting smarter, and that the weirdness gets solved, and bam, that’s AGI. Imagine that we can easily make a super-Claude that cares about your long-term best interest above all else, by simply putting ‘act in my long-term best interest’ in the system prompt or whatever.” Now, I don’t believe that, for all the reasons above, but when I put on those glasses I feel like a whole bunch of the LLM-focused AGI discourse—e.g. writing by Paul Christiano,[18] OpenPhil people, Redwood people, etc.—starts making more sense to me.

2.8 Bonus: Technical alignment is not THAT hard

Having said all that, I am not maximally pessimistic on technical alignment. I don’t think we have a plan, but I seem to be more bullish about our prospects for making progress than some of my fellow doomers, particularly Eliezer Yudkowsky.[19] Why the difference? Here are three areas where (AFAICT) we differ:

2.8.1 I think we’ll get to pick the innate drives (as opposed to the evolution analogy)

In the evolution of humans, there are two steps of indirection between the learning algorithm and the resulting behavior:

  1. The learning algorithm (i.e. evolution) designs the innate drives (hunger-drive, sex-drive, etc.) over evolutionary time,
  2. The innate drives sculpt desires and behaviors over the course of a lifetime (in conjunction with an environment, an evolutionarily-designed within-lifetime learning algorithm, etc.).

From the way that Eliezer invokes the evolution analogy, I get the strong impression that he expects the AGI technical alignment problem to correspondingly have two steps of indirection. (Example.)

By contrast, I expect only one step of indirection. I think that future AGI programmers will get to directly design the innate drives (reward function) and corresponding within-lifetime learning algorithm. Thus, for example, the best evolution-related analogy for AGI is different for me than Eliezer—see my discussion in “Definitely-not-evolution-I-swear” Provides Evidence for the Sharp Left Turn.

One step of indirection is still a problem! I think there will be an empty space in the source code repository that says “reward function”, and I claim that nobody knows what to put in that slot, such that the AGI won’t try to kill everyone!

…But still, one step of indirection is probably better than two. See §8.3.3 here for why I say “probably better”. Or even if it isn’t better, it’s at least different.

2.8.2 I’m more bullish on “impure consequentialism”

If I’m deciding between two possible courses of action, “consequentialist preferences” would make the decision based on the expected state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action. See Consequentialism & corrigibility for more discussion.

Consequentialist preferences lead to both power and danger—“power” because the best way to accomplish ambitious things is to want those things to wind up getting accomplished, and “danger” because of instrumental convergence.

The bad news is: I strongly expect ASI to have some consequentialist preferences—see my post “Thoughts on Process-Based Supervision” §5.3. The good news is, I think it’s possible for ASI to also have non-consequentialist preferences.

Eliezer and some others, by contrast, seem to expect ASIs to behave like pure consequentialists, at least as a strong default, absent yet-to-be-invented techniques. I think this is upstream of many of Eliezer’s other beliefs, including his treating corrigibility as “anti-natural”, and his argument that ASI will behave like a utility maximizer. The latter argument, I claim, begs the question by assuming pure-consequentialist preferences in the course of arguing for them. I spell this out via my silly “asshole restaurant customer” example at Consequentialism & corrigibility.

Why do Eliezer and others expect pure consequentialism? [UPDATE: …Or if I’m misreading Eliezer, as one commenter claims I am, replace that by: “Why might someone expect pure consequentialism?”]

I’m not sure, but here are three possible arguments, and my responses:

(1) The External Competition Argument: We’ll wind up with pure-consequentialist AIs because the pure-consequentialist AIs will outcompete the impure-consequentialist AIs.

My response: I don’t think this is a strong argument, because I’m mainly expecting an ASI Singleton (per §1.8.7 of the last post), and I think that an AI can easily be consequentialist enough to install itself as a singleton without being pure-consequentialist. For example, John von Neumann, like all humans, was not a pure consequentialist, but I’m confident that an AI comparable to a million super-speed telepathic von Neumann clones would be powerful enough to prevent any other AI from coming into existence in perpetuity.

(2) The Internal Competition Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because in the process of reflection within the mind of any given impure-consequentialist AI, the consequentialist preferences will squash the non-consequentialist preferences.

My response: I think this is probably wrong, or at least overstated, although I’m a bit uncertain.

At the very least, I haven’t seen any really compelling versions of this argument. I have, however, seen bad versions of this argument!

For example, I’ve seen people (maybe not Eliezer in particular) invoke the concept of “optimization pressure” as if it were a kind of exogenous force, rather than coming endogenously from the AI itself. Whereas in my view, an ASI will superintelligently optimize something if it wants to superintelligently optimize it, and not if it doesn’t, and it will do that via methods that it wants to employ, and not via methods that it doesn’t want to employ, etc.

As another example, I’ve seen people imagine non-consequentialist preferences as “rules that the AI grudgingly follows, while searching for loopholes”, rather than “preferences that the AI enthusiastically applies its intelligence towards pursuing”. But that’s mixing up two different issues. If a human sincerely wants to be a good friend, that’s a non-consequentialist preference, but the person may apply their full intelligence and creativity towards fulfilling that preference. See my response to the post Deep Deceptiveness by Nate Soares.

(3) The Training Argument: We’ll wind up with pure-consequentialist AIs (absent some miraculous technical advance) because the ML training process selects for models with low loss / high reward, and in the limit, the models with lowest loss / highest reward are the ones doing explicit pure-consequentialist pursuit of low loss / high reward. As ML techniques improve in the future, our actual trained models will approach that limit.

My response: Well, it depends on the details of the ML training approach. For brain-like AGI, I think this argument is pointing to something real and scary, but only because the strong default in practical RL is to use (what I call) “behaviorist” RL reward functions—see Self-dialogue: Do behaviorist rewards make scheming AGIs?. Whereas I think the human brain builds compassion and norm-following via “non-behaviorist” reward functions.[20] If that’s right, then a solution to the Training Argument problem probably exists, somewhere in the underdeveloped science of non-behaviorist RL reward functions. This is a major area where I’m trying to make technical progress. (See the sketch below.)
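
Here is a minimal sketch of what I mean by that distinction; every function name (`task_success_metric`, `empathy_like_signal`, `deception_intent_signal`) is a hypothetical placeholder rather than something anyone knows how to build today:

```python
def task_success_metric(observation, action):
    """Placeholder for any externally-computable measure of task success."""
    return 0.0

def empathy_like_signal(internal_state):
    """Hypothetical detector keying off the agent's internal state,
    loosely analogous to inputs of human social instincts."""
    return 0.0

def deception_intent_signal(internal_state):
    """Hypothetical detector of internal states that look like planning
    to deceive the operators."""
    return 0.0

def behaviorist_reward(observation, action):
    """Depends only on externally observable behavior and outcomes; a
    sufficiently capable agent can learn to produce the rewarded behavior
    for the wrong reasons."""
    return task_success_metric(observation, action)

def non_behaviorist_reward(observation, action, internal_state):
    """Also depends on the agent's internal (latent) state; how to build
    such signals well is an open research problem."""
    return (task_success_metric(observation, action)
            + empathy_like_signal(internal_state)
            - deception_intent_signal(internal_state))
```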

2.8.3 On the narrowness of the target

Eliezer mainly expects egregious misalignment, as do I, but if we get past that hurdle and keep marching up the logistic success curve, we eventually come upon the question of how narrow a target is “alignment” of a superintelligence. And I think Eliezer views the target as narrower than I do. I think there are two sources of that disagreement.

The first source of disagreement is related to the “impure consequentialism” point above. If ASI can stably be an impure consequentialist, that opens up a number of possible good AI motivations that are neither a “task AI” doing a specific thing, nor a god-like Singleton deciding the fate of the accessible universe according to its own preferences forever (in which case those preferences had better be bang-on!). Instead, for example, the AI could want to set up a “Long Reflection” and then defer to the results, whatever they are. Or more generally, the AI could want to acquire power and turn it over to some other person, institution, or process (which might or might not lead to permanent dictatorship or other very bad things, but perhaps not extinction). These kinds of non-consequentialist motivations seem plausibly less narrow a target than making an AI that directly fulfills its preferences about what to do with the future lightcone.

Second, philosophically, I think “goodness of the future” is somewhat of an incoherent mess, as opposed to being a well-defined scale with which we can measure how things turn out. For further discussion, see my post Valence & Normativity §2.7.

2.9 Conclusion and takeaways

2.9.1 If brain-like AGI is so dangerous, shouldn’t we just try to make AGIs via LLMs?

No.

First, I don’t think it’s possible to make AGIs that way.

Second, if I’m wrong, then I would expect the LLM-AGIs to just go right ahead and invent the more powerful, scary, next-paradigm AGIs, and then we’re still in the same boat, unless the LLM-AGIs have systematically higher wisdom, cooperation, and coordination than humans do, which I don’t particularly expect.

Third and most importantly, if it is possible to make LLM-AGIs, then I think it would probably happen via eliminating all the reasons that today’s LLMs are not egregiously misaligned! In particular, I expect that they would involve the behavior being determined much more by RL and much less by pretraining (which brings in the concerns of §2.3–§2.4), and that they would somehow allow for open-ended continuous learning (which brings in the concerns of §2.5–§2.6).

A different possible claim is:

“LLMs definitely won’t scale to AGI (as I define it), even with further developments in RL, continuous learning, etc. So LLMs will remain just a normal “mundane” technology, perhaps as disruptive as the internet, or much less, and definitely not as disruptive as the industrial revolution, let alone as disruptive as the evolution of humans from chimps. We should develop this technology ASAP for the same reason that developing any other normal technology is generally good.”

This is, of course, a very common opinion in broader societal discourse around AI, even if it’s uncommon among AI alignment researchers today. My own response to the claim is: …Ehh, maybe, but I sure don’t feel enthusiastic about that. I’m just not that confident that LLMs will not scale to AGI and ASI. So I endorse thinking very hard about the contingency where they will. Anyway, I’ll leave that debate to others.

2.9.2 What’s to be done?

From my perspective, the next step is obvious: if technical alignment is hard, well let’s get to work. And as mentioned in §1.8.4 of the previous post, we need to do this ASAP, ideally long before we have any brain-like AGI to work with. But luckily, we do have plenty of information about brains, so we’re not completely in the dark. This is what I work on myself—see Intro to Brain-Like-AGI Safety and much more. This research program involves both neuroscience (e.g. Neuroscience of human social instincts: a sketch) and tying those ideas back to RL and AGI (e.g. this post, and more forthcoming!).

To be clear, having a plan that would solve technical alignment is necessary but not sufficient to avoid doom. Among other things, the plan would need to actually be implemented correctly, and then the resulting AI would need to do some kind of “AI for AI safety” thing (§1.8.6 of the previous post) to solve the problem that some other group will make a misaligned power-seeking ASI sooner or later. As I discussed there, I tentatively think “pivotal acts”, bad as they are, may be less bad than any other plan, all things considered. Remember, as Scott Alexander points out, if AGI developers find themselves actually living inside the crazy terrifying future world that I’m expecting, as described in these two posts, then all the considerations and tradeoffs around pivotal acts will feel quite different than they do today.

Thanks Charlie Steiner, Jeremy Gillen, Seth Herd, and Justis Mills for critical comments on earlier drafts.

  1. ^

    Some people are doomers for reasons unrelated to “technical alignment is hard”; relevant other issues include gradual disempowerment and offense-defense balance. Those are outside the scope of this post and series.

  2. ^

    A number of people seem to disagree with this sentiment. For example, @paulfchristiano wrote in 2023 that “it's been more than 10 years with essentially no changes to the basic paradigm that would be relevant to alignment”. @ryan_greenblatt likewise told me (IIRC) “I think things will be continuous”, and I asked whether the transition in AI zeitgeist from RL agents (e.g. MuZero in 2019) to LLMs counts as “continuous” in his book, and he said “yes”, adding that they are both “ML techniques”. I find this perspective baffling—I think MuZero and LLMs are wildly different from an alignment perspective. Hopefully this post will make it clear why. (And I think human brains work via “ML techniques” too.) [UPDATE: Ryan elaborates on his views in this comment.]

  3. ^

    See §2.3.2 below for why I say “99%”.

  4. ^

    Explicit plans don’t have to be grand and long-term. I can be studying chemistry as part of an explicit plan to get into med school, but I can also be moving my arm as part of an explicit plan to scratch my nose. See Incentive Learning vs Dead Sea Salt Experiment §5.2.2.

  5. ^

    A few parts of the upcoming discussion overlap with (but are edited from) what I wrote here & here. Also related: Why I’m not into the Free Energy Principle, especially §8.

  6. ^

    For more on this important point from a common-sense perspective, see “Heritability: Five Battles” §2.5.1, and from a neuroscience perspective, see “Valence & Liking / Admiring” §4.5.

  7. ^

    For newcomers to the field, an “LLM Base Model”—or as we called it back in the day, “an LLM”, since no other kind existed until around 2022—is trained by only self-supervised learning (what we now call pretraining), such that it will continue an arbitrary string of text. For example, if you prompt a base model with “What’s the largest island in Indonesia?”, it won’t necessarily answer the question, but will rather guess how that text would continue, which might be something like “What decade was General Motors founded? What author used the pen name Dr. Seuss? When you finish the trivia round, please return your answer sheets to the front.”

  8. ^

    The recent paper “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” (LW discussion) offers some actual numbers on this part, although I’m not sure how much to trust them.

  9. ^

    I guess here I should also respond to GPTs are Predictors, not Imitators by Eliezer Yudkowsky (2023). I basically think Eliezer has not internalized the weirdness of “LLM pretraining magically transmutes observations into behavior”, and that he is instead intuitively anchored at the human kind of prediction, where we humans can expect something, and separately decide how to act on that expectation, based on what we want. I think this basic confusion is behind a lot of Eliezer’s discussion of LLMs that strikes me as misleading, such as his frequent invocation of “an actress behind the mask”. For much more discussion of this point, see @Zack_M_Davis’s fictional dialog trilogy (1, 2, 3). (Having said all that, I want to reiterate that “trained LLM behavior is just like human behavior” is also wrong, thanks to the post-training and generalization issues, as discussed in this section.)

  10. ^

    See The Psychopath Test by Ronson (“All those chats about empathy were like an empathy-faking finishing school for him…” p. 88). To be clear, there do seem to be interventions that appeal to psychopaths’ own self-interest—particularly their selfish interest in not being in prison—to help turn really destructive psychopaths into the regular everyday kind of psychopaths, who are still awful to the people around them but at least they’re not murdering anyone. (Source.)

  11. ^

    Both Anthropic and OpenAI have admitted to this error, see links here.

  12. ^

    For a real-world example, @Matthew Barnett is much more polite about it, but I think he’s getting at the same idea when he writes: “The fact that GPT-4 can reliably follow basic instructions, is able to distinguish moral from immoral actions somewhat reliably, and generally does what I intend rather than what I literally asked, is all evidence that the value identification problem is easier than how MIRI people originally portrayed it. While I don't think the value identification problem has been completely solved yet, I don't expect near-future AIs will fail dramatically on the ‘fill a cauldron’ task [referring to The Sorcerer’s Apprentice as discussed by Nate Soares here], or any other functionally similar tasks.”

  13. ^

    To be clear, the idea of leveraging inner misalignment to mitigate issues from outer misalignment is less crazy than it might sound, see §10.6 here.

  14. ^

    I think trained classifiers are tangentially involved in human social instincts, but they are relatively simple things like “this audio clip is probably the sound of a human voice”. They don’t have to be perfect. See §1 of “Neuroscience of human social instincts: a sketch”.

  15. ^

    Warning: The term “reward hacking” has recently started being used for the broader notion of “lying and cheating”, even when the lying and cheating is not directly related to any reward function. See my nitpicky complaint here.

  16. ^

    The term “exploration hacking” is either synonymous with, or a special case of, what I’ve been calling “deliberate incomplete exploration”.

  17. ^

    Somewhat related: §5.3.2–5.3.3 of “Thoughts on ‘Process-Based Supervision’”.

  18. ^

    It’s a bit misleading to describe Paul Christiano as “LLM-focused”, since he’s been saying generally consistent things about the alignment problem since well before LLMs. But I think he has always had in mind AI systems which centrally involve “magically transmuting observations (of humans) into behaviors”. For example, his pre-LLM “Iterated Amplification” schemes often (not always) involved imitating human-created data.

  19. ^

    Eliezer, in turn, is deliriously over-optimistic compared to Roman Yampolskiy, who argues that technical alignment is impossible ¯\_(ツ)_/¯

  20. ^

    For my latest thinking on the nature of those “non-behaviorist” reward functions, see Neuroscience of human social instincts: a sketch.
