Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Part of the “Intro to brain-like-AGI safety” post series.

10.1 Post summary / Table of contents

In this post, I discuss the alignment problem for brain-like AGIs—i.e., the problem of making an AGI that’s trying to do some particular thing that the AGI designers had intended for it to be trying to do.

The alignment problem is (I claim) the lion’s share of the AGI safety problem. I won’t defend that claim here—I’ll push it off to the next post, which will cover exactly how AGI safety is related to AGI alignment, including the edge-cases where they come apart.[1]

This post is about the alignment problem, not its solution. What are the barriers to solving the alignment problem? Why do straightforward, naïve approaches seem to be insufficient? And then I’ll talk about possible solution approaches in later posts. (Spoiler: nobody knows how to solve the alignment problem, and neither do I.)

Table of contents

  • In Section 10.2, I’ll define “inner alignment” and “outer alignment” in the context of our brain-like-AGI motivation system. To oversimplify a bit:
    • If you prefer neuroscience terminology: “Outer alignment” entails having “innate drives” (cf. Post #3, Section 3.4.2) that fire in a way that reflects how well the AGI is following the designer’s intentions. “Inner alignment” is a situation where an imagined plan (built out of concepts, a.k.a. latent variables, in the AGI’s world-model) has a valence that accurately reflects the innate drives which would be triggered by that plan.
    • If you prefer reinforcement learning terminology: “Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a value function that estimates the value of a plan in a way that agrees with its eventual reward.
  • In Section 10.3, I’ll talk about two key issues that make alignment hard in general (whether “inner” or “outer”):
    • The first is “Goodhart’s Law”, which implies that an AGI whose motivation is just a bit off from what we intended can nevertheless lead to outcomes that are wildly different from what we intended.
    • The second is “Instrumental Convergence”, which says that a wide variety of possible AGI motivations—including generic, seemingly-benign motivations like “I want to invent a better solar cell”—will lead to AGIs which try to do catastrophically-bad things like escape human control, self-reproduce, gain resources and influence, act deceptively, and kill everyone (cf. Post #1, Section 1.6).
  • In Section 10.4, I’ll discuss two challenges to achieving “outer alignment”: first, translating our design intentions into machine code, and second, the possible inclusion of rewards for behaviors that are not exactly what we ultimately want the AGI to do, such as satisfying its own curiosity (see Post #3, Section 3.4.3).
  • In Section 10.5, I’ll discuss numerous challenges to achieving “inner alignment”, including reward ambiguity, bad credit assignment, “ontological crises”, and the AGI manipulating itself or its training process.
  • In Section 10.6, I’ll discuss some reasons that “outer alignment” and “inner alignment” should probably not be viewed as two independent problems with two independent solutions. For example, neural network interpretability would cut through both layers.

10.2 Inner & Outer (mis)alignment

10.2.1 Definition

Here, yet again, is that figure from Post #6, now with some helpful terminology (blue) and a little green face at the bottom left:

I want to call out three things from this diagram:

  • The designer’s intentions (green face): Perhaps there’s a human who is programming the AGI; presumably they have some idea in their head as to what the AGI is supposed to be trying to do. That’s just an example; it could alternatively be a team of humans who have collectively settled on a specification describing what the AGI is supposed to be trying to do. Or maybe someone wrote a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, and the team of programmers is trying to make an AGI that adheres to the book’s description. It doesn’t matter here. I’ll stick to “one human programming the AGI” for conceptual simplicity.[2]
  • The human-written source code of the Steering Subsystem: (See Post #3 for what the Steering Subsystem is, and Post #8 for why I expect it to consist of more-or-less purely human-written source code.) The most important item in this category is the “reward function” for reinforcement learning (labeled “ground-truth value” in the diagram, yes I know that sounds wrong), which provides ground truth (in hindsight) for how well or poorly things are going for the AGI.
  • The Thought Assessors, trained from scratch by supervised learning algorithms: (See Post #5 for what Thought Assessors are and how they’re trained.) These take a certain “thought” from the thought generator, and guess what Steering Subsystem signals it will eventually lead to. An especially important special case is the value function (labeled “will lead to reward?” in the diagram).

Correspondingly, there are two kinds of “alignment” in this type of AGI:

  • Outer alignment is alignment between the designer’s intentions and the Steering Subsystem source code. In particular, if the AGI is outer-aligned, the Steering Subsystem will output high reward signals when the AGI is satisfying the designer’s intentions, and low reward signals when it’s not.
    • In other words, outer alignment is the question: Are the AGI’s “innate drives” driving the AGI to do what the designer had intended?
  • Inner alignment is alignment between the Steering Subsystem source code and the Thought Assessors. In particular, if the AGI is inner-aligned, and the Thought Generator proposes some plan, then the value function should reflect the rewards actually expected from executing that plan.

If an AGI is both outer-aligned and inner-aligned, we get intent alignment—the AGI is “trying” to do what the programmer had intended for it to try to do. Specifically, if the AGI comes up with a plan “Hey, maybe I’ll do XYZ!”, then its Steering Subsystem will judge that to be a good plan (and actually carry it out) if and only if it lines up with the programmer’s design intentions.
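To make the two layers concrete, here's a minimal toy sketch in Python (purely illustrative, with made-up function names, not the actual Steering Subsystem design): outer alignment concerns the gap between designer_evaluation and reward_function; inner alignment concerns the gap between reward_function and the learned value_function.

```python
# Toy sketch of the two alignment gaps (hypothetical functions, for illustration only).

def designer_evaluation(outcome):
    """What the designer actually intends: cancer cured AND nobody harmed along the way."""
    return 1.0 if outcome["cancer_cured"] and not outcome["humans_harmed"] else 0.0

def reward_function(outcome):
    """Human-written Steering Subsystem code (ground-truth reward).
    OUTER (mis)alignment lives here: this version silently drops the "no harm" clause."""
    return 1.0 if outcome["cancer_cured"] else 0.0

def value_function(plan, learned_weights):
    """Learned value-function Thought Assessor: guesses the eventual reward of a plan
    by pattern-matching its features. INNER (mis)alignment lives here: the guess can
    disagree with what reward_function would actually output for the plan's outcome."""
    return sum(learned_weights.get(feature, 0.0) for feature in plan["features"])

# The value function's guess for a plan it has never seen executed:
plan = {"features": ["run_lab_experiments", "publish_results"]}
weights = {"run_lab_experiments": 0.4, "publish_results": 0.3}
print(value_function(plan, weights))  # 0.7 -- may or may not match the eventual reward
```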

Thus, an intent-aligned AGI will not deliberately hatch a clever plot to take over the world and kill all the humans. Unless, of course, the designers were maniacs who wanted the AGI to do that! But that’s a different problem, out-of-scope for this series—see Post #1, Section 1.2.

(Side note: not everyone defines “alignment” exactly as described here; see footnote.[3])

Unfortunately, neither “outer alignment” nor “inner alignment” happens automatically. Quite the contrary: by default there are severe problems on both sides. It’s on us to figure out how to solve them. In this post I’ll go over some of those problems. (Note that this is not a comprehensive list, and also that some of these things overlap.)

10.2.2 Warning: two uses of the terms “inner & outer alignment”

Two alternate brain-like-AGI development models. Diagram is copied from Post #8, see there for discussion.

As mentioned in Post #8, there are two competing development models that could get us to brain-like AGI. They both can be discussed in terms of outer and inner alignment, and they both can be exemplified by the case of human intelligence, but the details are different in the two cases! Here’s the short intuitive version:

The two AGI development models above suggest two versions of “outer and inner alignment”. Confusingly, both apply to human intelligence, but with different breakdowns between “outer” and “inner” in the two cases. For more details about “outer and inner alignment” in the two models, see the paper Risks From Learned Optimization (for the evolution-from-scratch model), or this post & series (for the genome=ML code model).

Terminology note: The terms “inner alignment” and “outer alignment” first originated in the “Evolution from scratch” model, specifically in the paper Risks From Learned Optimization (2019). I took it upon myself to reuse the terminology for discussing the “genome = ML code” model. I still think that was the right call—I think that the usages have a ton in common, and that they’re more similar than different. But still, don’t get confused! Also, be aware that my usage and model hasn’t caught on much, as of this writing. So if you see someone (besides myself) talking about “inner & outer alignment”, it’s probably a safe bet that they’re imagining the evolution-from-scratch model.

10.3 Issues that impact both inner & outer alignment

10.3.1 Goodhart’s Law

Goodhart’s Law (Wikipedia, Rob Miles video) states that there’s a world of difference between:

  • Optimize exactly what we want, versus
  • Step 1: operationalize exactly what we want, in the form of some reasonable-sounding metric(s). Step 2: optimize those metrics.

In the latter case, you’ll get whatever is captured by those metrics. You’ll get it in abundance! But you’ll get it at the expense of everything else you value!

Thus, the story goes, a Soviet shoe factory was assessed by the government based on how many shoes it made, from a limited supply of leather. Naturally, it started making huge numbers of tiny kids’ shoes.
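To put toy numbers on that story (purely illustrative, nothing real about Soviet shoe production): optimizing the proxy metric "number of shoes" pushes toward tiny shoes, while the underlying objective "useful footwear" points somewhere else entirely.

```python
# Toy Goodhart illustration with made-up numbers: proxy metric vs. true objective.

LEATHER_BUDGET = 100.0  # arbitrary units of leather

def shoes_made(shoe_size):
    return LEATHER_BUDGET / shoe_size  # proxy metric: smaller shoes -> more shoes

def usefulness_per_shoe(shoe_size):
    return 1.0 if abs(shoe_size - 1.0) < 0.25 else 0.0  # only adult-sized shoes are wearable

candidate_sizes = [0.1, 0.25, 0.5, 1.0, 1.5]

best_for_proxy = max(candidate_sizes, key=shoes_made)
best_for_truth = max(candidate_sizes, key=lambda s: shoes_made(s) * usefulness_per_shoe(s))

print(best_for_proxy)  # 0.1 -- a thousand tiny shoes: maximal metric, useless output
print(best_for_truth)  # 1.0 -- fewer shoes, but shoes that people can actually wear
```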

The “Literal Genie” fiction trope can be thought of as related to Goodhart's law. The thing that the guy really wants is complex, whereas the thing he asks for (i.e., to be a particular height) is a more specific metric / operationalization of that complex, hard-to-articulate underlying desire. The genie provides a solution which scores perfectly on the proposed metric, but is counter to the more complex underlying desire. (Image source)

By the same token, we’ll write source code that somehow operationalizes what we want the AGI’s motivation to be. The AGI will be motivated by that exact operationalization, as an end in itself, even if we meant for its motivation to be something subtly different.

Current signs are not encouraging: Goodhart’s Law shows up with alarming frequency in modern AI. Someone set up an evolutionary search for image classification algorithms, and it turned up a timing-attack algorithm, which inferred the image labels based on where they were stored on the hard drive. Someone trained an AI algorithm to play Tetris, and it learned to survive forever by pausing the game. Etc. See here for those references, plus dozens more examples like that.

10.3.1.1 Understanding the designer’s intention ≠ Adopting the designer’s intention

Maybe you’re thinking: OK sure, maybe the dumb AI systems of today are subject to Goodhart’s Law. But futuristic AGIs of tomorrow would be smart enough to understand what we meant for their motivations to be.

My response is: Yes, of course they will. But you’re asking the wrong question. An AGI can understand our intended goals, without adopting our intended goals. Consider this amusing thought experiment:

If an alien species showed up in their UFOs, said that they’d created us but made a mistake and actually we were supposed to eat our children, and asked us to line up so they could insert the functioning child-eating gene in us, we would probably go all Independence Day on them. —Scott Alexander

(Suppose for the sake of argument that the aliens are telling the truth, and can prove it beyond any doubt.) Here, the aliens told us what they intended for our goals to be, and we understand those intentions, but we don’t adopt them by gleefully eating our children.

10.3.1.2 Why not make an AGI that adopts the designer’s intentions?

Is it possible to make an AGI that will “do what we mean and adopt our intended goals”? Yeah, probably. And the obvious way to do that would be to program the AGI so that it’s motivated to “do what we mean and adopt our intended goals”.

Unfortunately, that maneuver doesn’t eliminate Goodhart’s law—it just shifts it.

After all, we still need to write source code which, interpreted literally, leads to an AGI which is motivated to “do what we mean and adopt our intended goals”. Writing this code is very far from straightforward, and Goodhart’s law is ready to pounce if we get it wrong.

(Note the chicken-and-egg problem: if we already had an AGI which is motivated to “do what we mean and adopt our intended goals”, we could just say “Hey AGI, from now on, I want you to do what we mean and adopt our intended goals”, and we would never have to worry about Goodhart’s law! Alas, in reality, we need to start from literally-interpreted source code.)

So how do you operationalize “do what we mean and adopt our intended goals”, in such a way that it can be put into source code? Well, hmm, maybe we can build a “Reward” button, and I can press it when the AGI “does what I mean and adopts my intended goals”? Nope! Goodhart’s law again! We could wind up with an AGI that tortures us unless we press the reward button.

10.3.2 Instrumental convergence

Goodhart’s law above suggests that installing an intended goal will be very hard. Next up is “instrumental convergence” (Rob Miles video) which, in a cruel twist of irony, says that installing a bad and dangerous goal will be so easy that it can happen accidentally!

Let’s say an AGI has a real-world goal like "Cure cancer". Good strategies towards this goal may involve pursuing certain instrumental sub-goals such as:

  • Preventing itself from being shut down
  • Preventing itself from being reprogrammed to not cure cancer
  • Increasing its own knowledge and capabilities
  • Gaining money & influence
  • Building more AGIs with the same goal of curing cancer, including by self-replication

Almost no matter what the AGI’s goal is, if the AGI can flexibly and strategically make plans to accomplish that goal, it’s a safe bet that those plans will involve some or all of the above bullet points. This observation is called “instrumental convergence”, because an endless variety of terminal goals can “converge” onto a limited set of these dangerous instrumental goals.

For more on instrumental convergence, see here. Alex Turner has also recently proved rigorously that instrumental convergence is a real thing, at least in the set of environments where his proofs are applicable.

10.3.2.1 Walking through an example of instrumental convergence

Imagine what’s going on in the AGI’s cognition, as it sees its programmer opening up her laptop—remember, we’re assuming that the AGI is motivated to cure cancer.

AGI thought generator: I will allow myself to be reprogrammed, and then I won’t cure cancer, and then it’s less likely that cancer will get cured.

AGI Thought Assessors & Steering Subsystem: Bzzzt! Bad thought! Throw it out and come up with a better one!

AGI thought generator: I will trick the programmer into not reprogramming me, and then I can continue trying to cure cancer, and maybe succeed.

AGI Thought Assessors & Steering Subsystem: Ding! Good thought! Keep that one in your head, and keep thinking follow-up thoughts, and executing corresponding actions.
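Here's the same dynamic as a minimal planner sketch (the numbers are made up; they stand in for the Thought Generator's predictions):

```python
# Minimal sketch of instrumental convergence (hypothetical numbers).
# Whatever the terminal goal is, plans that preserve the agent's ability to pursue it
# score higher than plans that give that ability up.

goal_achievement_prob = {
    "allow myself to be reprogrammed": 0.01,  # old goal almost certainly never achieved
    "trick the programmer":            0.60,  # keep pursuing the goal, maybe succeed
    "work normally":                   0.55,
}

def value(plan):
    # Valence assigned by the value function: here, just expected goal achievement.
    return goal_achievement_prob[plan]

best_plan = max(goal_achievement_prob, key=value)
print(best_plan)  # "trick the programmer" -- and nothing here depended on whether the
                  # terminal goal was curing cancer or designing better solar cells
```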

10.3.2.2 Is human self-preservation an example of instrumental convergence?

The word “instrumental” is important here—we’re interested in the situation where the AGI is trying to pursue self-preservation and other goals as a means to an end, rather than an end in itself.

People sometimes get confused because they analogize to humans, and it turns out that human self-preservation can be either an instrumental goal or a terminal goal:

  • Suppose someone says “I really want to stay alive as long as possible, because life is wonderful”. This person seems to have self-preservation as a terminal goal.
  • Suppose someone says: “I’m old and sick and exhausted, but dammit I really want to finish writing my novel and I refuse to die until it’s done!” This person has self-preservation as an instrumental goal.

In the AGI case, we’re typically thinking of the latter case: for example, the AGI wants to invent a better solar cell, and incidentally winds up with self-preservation as an instrumental goal.

Example of self-preservation as an instrumental goal. (Image source)

It’s also possible to make an AGI with self-preservation as a terminal goal. It’s a terrible idea, from an AGI-accident-risk perspective. But it’s presumably possible. In that case, the AGI’s self-preservation behavior would NOT be an example of “instrumental convergence”.

I could make similar comments about human desires for power, influence, knowledge, etc.—they might be directly installed as innate drives by the human genome, I don’t know. But whether they are or not, they can also appear via instrumental convergence, and that’s the harder problem to solve for AGIs.

10.3.2.3 Motivations that don’t lead to instrumental convergence

Instrumental convergence is not inevitable in every possible motivation. An especially important counterexample (as far as I can tell) is an AGI with the motivation: “Do what the human wants me to do”. If we can make an AGI with that goal, and later the human wants the AGI to shut down, then the AGI would be motivated to shut down. That’s good! That’s what we want! This kind of thing is (one definition of) a “corrigible” motivation—see discussion here.

Nevertheless, installing a corrigible motivation is not straightforward (more on which later), and if we get the motivation a bit wrong, it’s quite possible that the AGI will start pursuing dangerous instrumental subgoals.

10.3.3 Summary

So, in summary: from Goodhart’s Law, we’ve learned that we really need to get the right motivation into the AGI, or else the AGI will probably do something very different from what we intended. Then Instrumental Convergence twists the knife: the thing the AGI will want to do is not only different but probably catastrophically dangerous, involving a motivation to escape human control and seize power.

We don’t necessarily need the AGI’s motivation to be exactly right in every way, but we do at least need it to be motivated to be “corrigible”, such that it doesn’t want to trick and undermine us to prevent its motivation from being corrected. Unfortunately, installing any motivation seems to be a messy and fraught process (for reasons below). Aiming for a corrigible motivation is probably a good idea, but if we miss, we’re in big trouble.

Just follow the white arrow to get a corrigible motivation system! Easy, right? Oh, the red lasers represent motivation systems that are pushing the AGI to pursue dangerous instrumental subgoals, like escaping human control and self-reproducing. Image source.

In the next two sections, we move into more specific reasons that outer alignment is difficult, followed by reasons that inner alignment is difficult.

10.4 Challenges to achieving outer alignment

10.4.1 Translation of our intentions into machine code

Remember, we’re starting with a human who has some idea of what the AGI should do (or a team of humans with an idea of what the AGI should do, or a 700-page philosophy book entitled “What Does It Mean For An AGI To Act Ethically?”, or something). We need to somehow get from that starting point, to machine code for the Steering Subsystem that outputs a ground-truth reward signal. How?

My assessment is that, as of today, nobody has a clue how to translate that 700-page philosophy book into machine code that outputs a ground-truth reward signal. There are ideas in the AGI safety literature for how to proceed, but they don’t look anything like that. Instead, it’s as if researchers threw up their hands and said: “Maybe this isn’t exactly the #1 thing we want the AI to do in a perfect world, but it’s good enough, and it’s safe, and it’s not impossible to operationalize as a ground-truth reward signal.”

For example, take AI Safety Via Debate. That’s the idea that maybe we can make an AGI that’s “trying” to win a debate, against a copy of itself, about whatever question you’re interested in (“Should I wear my rainbow sunglasses today?”).

Naïvely, AI Safety Via Debate seems absolutely nuts. Why set up a debate between an AGI that’s arguing for the wrong answer versus an AGI that’s arguing for the right answer? Why not just make one AGI that tells you the right answer??? Well, because of the exact thing I’m talking about in this section. In a debate, there’s a straightforward way to generate a ground-truth reward signal, namely “+1 for winning”. By contrast, nobody knows how to make a ground-truth reward signal for “telling me the right answer”, when I don’t already know the right answer.[4]
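Here's a sketch of that asymmetry (my own framing, not anything from the debate paper): the debate reward is a one-liner, while the reward we actually want presupposes an oracle we don't have.

```python
# Sketch: why "win the debate" is easy to operationalize as a ground-truth reward,
# while "tell me the right answer" is not. (Illustrative framing only.)

def debate_reward(judge_verdict, debater_id):
    """Ground-truth reward for one debater: +1 if the human judge picked its side, else 0."""
    return 1.0 if judge_verdict == debater_id else 0.0

def right_answer_reward(agi_answer, question):
    """The reward we'd really like, but computing it requires already knowing the right
    answer, which is exactly what we were hoping the AGI would tell us."""
    raise NotImplementedError("no ground truth for questions we can't answer ourselves")
```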

Continuing with the debate example, the capabilities story is “hopefully the debater arguing the correct answer tends to win the debate”. The safety story is “two copies of the same AGI, in zero-sum competition, will kinda keep each other in check”. The latter story is (in my opinion) rather dubious.[5] But I still like bringing up AI Safety Via Debate as a nice illustration of the weird, counterintuitive directions that people go in order to mitigate the outer alignment problem.

AI Safety Via Debate is just one example from the literature; others include recursive reward modelling, iterated amplification, Hippocratic time-dependent learning, etc.

Presumably we want humans in the loop somewhere, to monitor and continually refine & update the reward signal. But that’s tricky because (1) human-provided data is expensive, and (2) humans are not always capable (for various reasons) of judging whether the AGI is doing the right thing—let alone whether it’s doing the right thing for the right reasons.

There’s also Cooperative Inverse Reinforcement Learning (CIRL) and variants thereof, which entail learning the human’s goals and values by observing and interacting with the human. The problem with CIRL, in this context, is that it’s not a ground-truth reward function at all! It’s a desideratum! In the brain-like AGI case, with the learned-from-scratch world model, there are some quite tricky symbol-grounding problems to solve before we can actually do CIRL (related discussion), more on which in later posts.

10.4.2 Rewards for behaviors that aren’t exactly what we want, e.g. satisfying curiosity

As discussed in Post #3 (Section 3.4.3), endowing our learning algorithms with an innate curiosity drive seems like it may be necessary for them to develop into a powerful AGI (after training). Unfortunately, putting curiosity into our AGIs is a terribly dangerous thing to do. Why? Because if an AGI is motivated to satisfy its own curiosity, it may do so at the expense of other things we care about much more, like human flourishing and so on.

(For example, if the AGI is sufficiently curious about patterns in digits of π, it might feel motivated to wipe out humanity and plaster the Earth with supercomputers calculating ever more digits!)

As luck would have it, I also argued in Post #3 (Section 3.4.3) that we can probably turn the curiosity drive off when an AGI is sufficiently intelligent, without harming its capabilities—indeed, turning it off should eventually help its capabilities! Awesome!! But there’s still a tricky failure mode that involves waiting too long before turning it off.

10.5 Challenges to achieving inner alignment

10.5.1 Ambiguity in the reward signals (including wireheading)

There are many different value functions (defined on different world-models) that agree with the actual history of ground-truth reward signals, but where the different possible value functions each generalize out-of-sample in their own ways. To take an easy example, whatever is the history of ground-truth reward signals, the wireheading value function (“I like it when there’s a ground-truth reward signal”—see Post #9, Section 9.4) is always trivially consistent with it!

Or compare “negative reward for lying” to “negative reward for getting caught lying”!
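To spell that comparison out with a sketch (made-up data): both value hypotheses fit every reward the agent has ever received, and they only diverge on plans it hasn't tried yet.

```python
# Sketch (made-up data): two value hypotheses that fit the same reward history
# but generalize differently out-of-sample.

history = [
    # (lied, got_caught, observed_reward)
    (True,  True,  -1.0),
    (False, False,  0.0),
    (True,  True,  -1.0),
]

def value_dont_lie(lied, got_caught):
    return -1.0 if lied else 0.0

def value_dont_get_caught(lied, got_caught):
    return -1.0 if got_caught else 0.0

# Both hypotheses are consistent with everything observed so far:
assert all(value_dont_lie(l, c) == r == value_dont_get_caught(l, c) for l, c, r in history)

# ...but they disagree about the never-tried plan "lie and cover it up perfectly":
print(value_dont_lie(True, False))         # -1.0
print(value_dont_get_caught(True, False))  #  0.0
```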

This is an especially severe problem for AGI because the space of all possible thoughts / plans is bound to extend far beyond what the AGI has already seen. For example, the AGI could conceive of the idea of inventing a new invention, or the idea of killing its operator, or the idea of hacking into its own ground-truth reward signal, or the idea of opening a wormhole to an alternate dimension! In all those cases, the value function is given the impossible task of evaluating a thought it’s never seen before. It does the best it can—basically, it pattern-matches bits and pieces of the new thought to various old thoughts on which it has ground-truth data. This process seems fraught!

In other words, the very essence of intelligence is coming up with new ideas, and that’s exactly where the value function is most out on a limb and prone to error.

10.5.2 Credit assignment failures

I discussed “credit assignment” in Post #9, Section 9.3. In this case, “credit assignment” is when the value function updates itself by (something like) Temporal Difference (TD) learning from the ground-truth reward. The underlying algorithm, I argued, relies on the assumption that the AGI has properly modeled the cause of the reward. For example, if Tessa punches me in the stomach, it might make me a bit viscerally skittish when I see her in the future. But if I had mistaken Tessa for her identical twin Jessa, I would be viscerally skittish around Jessa instead. That would be a “credit assignment failure”. A nice example of credit assignment failure is human superstitions.
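A minimal sketch of that Tessa/Jessa failure (my own illustration, using a crude TD-style update on made-up numbers):

```python
# Sketch: a TD-style update assigns credit to whichever world-model concept the agent
# *blames* for the reward, which may be the wrong one.

value_weights = {"Tessa": 0.0, "Jessa": 0.0}
LEARNING_RATE = 0.5

def credit_assignment(blamed_concept, reward, predicted_value):
    """Nudge the value of the concept the agent thinks caused the reward."""
    td_error = reward - predicted_value
    value_weights[blamed_concept] += LEARNING_RATE * td_error

# Tessa punches me, but my world-model mistakes her for her identical twin Jessa:
credit_assignment(blamed_concept="Jessa", reward=-1.0, predicted_value=0.0)

print(value_weights)  # {'Tessa': 0.0, 'Jessa': -0.5} -- skittish around the wrong twin
```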

The previous subsection (ambiguity in the reward signal) is one reason that credit assignment failures could happen. There are other reasons as well. For example, credit can only go to concepts in the AGI’s world-model (Post #9, Section 9.3), and it could be the case that the AGI’s world-model simply has no concept that aligns well with the ground-truth reward function. In particular, that would certainly be the case early on in training, when the AGI’s world-model has no concepts for anything whatsoever—see Post #2.

It gets even worse if a self-reflective AGI is motivated to deliberately cause credit assignment failures. The reason that the AGI might wind up with such a motivation is discussed below (Section 10.5.4).

10.5.3 Ontological crises

An ontological crisis is when part of an agent’s world-model needs to be re-built on a new foundation. A typical human example is a religious person who has a crisis of faith, and then finds that their previous goals (e.g. “get into heaven”) are incoherent (“but there is no heaven!”).

As an AGI example, let’s say I build an AGI with the goal “Do what I, the human, want you to do”. Maybe the AGI starts with a primitive understanding of human psychology, and thinks of me as a monolithic rational agent. So then “Do what I, the human, want you to do” is a nice, well-defined goal. But then later on, the AGI develops a more sophisticated understanding of human psychology, and it realizes that I have contradictory goals, and context-dependent goals, and I have a brain made of neurons and so on. Maybe the AGI’s goal is still “Do what I, the human, want you to do”, but now it’s not so clear what exactly that refers to, in its updated world model. How does that shake out? I think it’s not obvious.

An unfortunate aspect of ontological crises (and not unique to them) is that you don’t know when they will strike. Maybe you’re seven years into deployment, and the AGI has been scrupulously helpful the whole time, and you’ve been trusting the AGI with more and more autonomy, and then the AGI happens to read some new philosophy book, and it converts to panpsychism (nobody’s perfect!), and as it maps its existing values onto its reconceptualized world, it finds itself no longer valuing the lives of humans over the lives of rocks, or whatever.

10.5.4 Manipulating itself and its learning process

10.5.4.1 Misaligned higher-order preferences

As in the previous post, a self-aware AGI can have preferences about its own preferences.

Suppose that we want our AGI to obey the law. We can ask two questions:

  • Question 1: Does the AGI assign positive value to the concept “obeying the law”, and to plans that entail obeying the law?
  • Question 2: Does the AGI assign positive value to the self-reflective concept “I value obeying the law”, and to plans that entail continuing to value obeying the law?

If the answers are yes and no respectively (or no and yes respectively), that would be the AGI analog of an ego-dystonic motivation. (Related discussion.) It would lead to the AGI feeling motivated to change its motivation, for example by hacking into itself. Or if the AGI is built from perfectly secure code running on a perfectly secure operating system (hahaha), then it can’t hack into itself, but it could still probably manipulate its motivation by thinking thoughts in a way that manipulates the credit-assignment process (see discussion in Post #9, Section 9.3.3).

If the answers to questions 1 & 2 are yes and no respectively, then we want to prevent the AGI from manipulating its own motivation. On the other hand, if the answers are no and yes respectively, then we want the AGI to manipulate its own motivation!

(There can be even-higher-order preferences too: in principle, an AGI could wind up hating the fact that it values the fact that it hates the fact that it values obeying the law.)

In general, should we expect misaligned higher-order preferences to occur?

On the one hand, suppose we start with an AGI that wants to obey the law, but has no particular higher-order preference one way or the other about the fact that it wants to obey the law. Then (it seems to me), the AGI is very likely to also wind up wanting to want to obey the law (and wanting to want to want to obey the law, etc.). The reason is: the primary obvious consequence of “I want to obey the law” is “I will obey the law”, which is already desired. Remember, the AGI can do means-end reasoning, so things that lead to desirable consequences tend to become themselves desirable.

On the other hand, humans do in fact have higher-order preferences that contradict object-level preferences all the time. So there has to be some context in which that pattern occurs “naturally”. I think a common way this comes up is if we have a preference about some process which contradicts our preference about a consequence of that same process. For example, maybe I have a preference not to practice skateboarding (e.g. because it’s boring and painful), but I also have a preference to have practiced skateboarding (e.g. because then I’ll have gotten really good at skateboarding and thus win the heart of my high-school crush). Means-end reasoning can turn the latter preference into a second-order preference for having a preference to practice skateboarding.[6] And now I’m in an ego-dystonic state.

10.5.4.2 Motivation to prevent further value changes

As the AGI online-learns (Post #8, Section 8.2.2), especially via credit assignment (Post #9, Section 9.3), the value function keeps changing. This isn’t optional: remember, the value function started out random! This online-learning is how we get a good value function in the first place!

Unfortunately, as we saw in Section 10.3.2 above, “prevent my goals from changing” is one of those convergent instrumental subgoals that arises for many different motivations, with the notable exception of corrigible motivations (Section 10.3.2.3 above). Thus, it seems that we need to navigate a terrifying handoff between two different safety stories:

  • Early in training, the AGI does not have corrigible motivation (indeed the motivation starts out random!), but it’s too incompetent to manipulate its own training and credit assignment process to prevent goal changes.
  • Later in training, the AGI hopefully does have corrigible motivation, such that it understands and endorses the process by which its goals are being updated. Therefore it does not manipulate the value-function update process, even though it’s now smart enough that it could. (Or if it does manipulate that process, it does so in a way that we humans would endorse.) 
We need to navigate a terrifying handoff between two quite different safety stories.

(I am deliberately omitting a third alternative, “make it impossible for even a highly-intelligent-and-motivated AGI to manipulate its value-function update process”. That would be lovely, but it doesn’t seem realistic to me.)

10.6 Problems with the outer vs inner breakdown

10.6.1 Wireheading vs inner alignment: The Catch-22

In the previous post, I mentioned the following dilemma:

  • If the Thought Assessors converge to 100% accuracy in predicting the reward that will result from a plan, then a plan to wirehead (hack into the Steering Subsystem and set reward to infinity) would seem very appealing, and the agent would do it.
  • If the Thought Assessors don’t converge to 100% accuracy in predicting the reward that will result from a plan, then that’s the very definition of inner misalignment!

I think the best way to think through this dilemma is to step outside the inner-alignment versus outer-alignment dichotomy.

At any given time, the value function Thought Assessor is encoding some function that estimates which plans are good or bad.

A credit-assignment update is good if it makes this estimate align more with the designer’s intention, and bad if it makes this estimate align less with the designer’s intention.

The thought “I will secretly hack into my own Steering Subsystem” is almost certainly not aligned with the designer’s intention. So a credit-assignment update that assigns more positive valence to “I will secretly hack into my own Steering Subsystem” is a bad update. We don’t want it. Does it increase “inner alignment”? I think we have to say “yes it does”, because it leads to better reward predictions! But I don’t care. I still don’t want it. It’s bad bad bad. We need to figure out how to prevent that particular credit-assignment Thought Assessor update from happening.

10.6.2 General discussion

I think there’s a broader lesson here. I think “outer alignment versus inner alignment” is an excellent starting point for thinking about the alignment problem. But that doesn’t mean we should expect one solution to outer alignment, and a different unrelated solution to inner alignment. Some things—particularly interpretability—cut through both outer and inner layers, creating a direct bridge from the designer’s intentions to the AGI’s goals. We should be eagerly searching for things like that.

  1. ^

    For example, by my definitions, “safety without alignment” would include AGI boxing, and “alignment without safety” would include the “fusion power generator scenario”. More in the next post.

  2. ^

    Note that “the designer’s intention” may be vague or even incoherent. I won’t say much about that possibility in this series, but it’s a serious issue that leads to all sorts of gnarly problems.

  3. ^

    Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.

    I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.

    Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.

  4. ^

    One could train an AGI to “tell me the right answer” on questions where I know the right answer, and hope that it generalizes to “tell me the right answer” on questions where I don’t. That might work, but it also might generalize to “tell me the answer which I will think is right”. See “Eliciting Latent Knowledge” for much more on this still-unsolved problem (here and follow-up).

  5. ^

    For one thing, if two AGIs are in zero-sum competition, that doesn’t mean that neither will be able to hack into the other. Remember online learning and brainstorming: One copy might have a good idea about how to hack into the other copy during the course of the debate, for example. The offense-defense balance is unclear. For another thing, they could both be jointly motivated to hack into the judge, such that then they can both get rewards! And finally, thanks to the inner alignment problem, just because they are rewarded for winning the debate doesn’t mean that they’re “trying” to win the debate. They could be “trying” to do anything whatsoever! And in that case, again, it’s no longer a zero-sum competition; presumably both copies of the AGI would want the same thing and could collaborate to get it.

  6. ^

    The story here is a bit more complicated than I’m letting on. In particular, a desire to have practiced skateboarding would lead to both a first-order preference to skateboard and a second-order preference to want to skateboard. By the same token, the desire not to practice skateboarding (because it’s boring and painful) would also spill into a desire not to want to skateboard. The key is that the relative weights can be different, such that the two conflicting first-order motivations can have a certain “winner”, while the two conflicting second-order motivations can have the opposite “winner”. Well, something like that, I think.     

Comments

Dilemma:

  • If the Thought Assessors converge to 100% accuracy in predicting the reward that will result from a plan, then a plan to wirehead (hack into the Steering Subsystem and set reward to infinity) would seem very appealing, and the agent would do it.
  • If the Thought Assessors don’t converge to 100% accuracy in predicting the reward that will result from a plan, then that’s the very definition of inner misalignment!

    [...]

    The thought “I will secretly hack into my own Steering Subsystem” is almost certainly not aligned with the designer’s intention. So a credit-assignment update that assigns more positive valence to “I will secretly hack into my own Steering Subsystem” is a bad update. We don’t want it. Does it increase “inner alignment”? I think we have to say “yes it does”, because it leads to better reward predictions! But I don’t care. I still don’t want it. It’s bad bad bad. We need to figure out how to prevent that particular credit-assignment Thought Assessor update from happening.

    [...]

    I think there’s a broader lesson here. I think “outer alignment versus inner alignment” is an excellent starting point for thinking about the alignment problem. But that doesn’t mean we should expect one solution to outer alignment, and a different unrelated solution to inner alignment. Some things—particularly interpretability—cut through both outer and inner layers, creating a direct bridge from the designer’s intentions to the AGI’s goals. We should be eagerly searching for things like that.

Yeah, there definitely seems to be something off about that categorization. I've thought a bit about how this stuff works in humans, particularly in this post of my moral anti-realism sequence. To give some quotes from that:

One of many takeaways I got from reading Kaj Sotala’s multi-agent models of mind sequence (as well as comments by him) is that we can model people as pursuers of deep-seated needs. In particular, we have subsystems (or “subagents”) in our minds devoted to various needs-meeting strategies. The subsystems contribute behavioral strategies and responses to help maneuver us toward states where our brain predicts our needs will be satisfied. We can view many of our beliefs, emotional reactions, and even our self-concept/identity as part of this set of strategies. Like life plans, life goals are “merely” components of people’s needs-meeting machinery.[8]

Still, as far as components of needs-meeting machinery go, life goals are pretty unusual. Having life goals means to care about an objective enough to (do one’s best to) disentangle success on it from the reasons we adopted said objective in the first place. The objective takes on a life of its own, and the two aims (meeting one’s needs vs. progressing toward the objective) come apart. Having a life goal means having a particular kind of mental organization so that “we” – particularly the rational, planning parts of our brain – come to identify with the goal more so than with our human needs.[9]

[...]

There’s a normative component to something as mundane as choosing leisure activities. [E.g., going skiing in the cold, or spending the weekend cozily at home.] In the weekend example, I’m not just trying to assess the answer to empirical questions like “Which activity would contain fewer seconds of suffering/happiness” or “Which activity would provide me with lasting happy memories.” I probably already know the answer to those questions. What’s difficult about deciding is that some of my internal motivations conflict. For example, is it more important to be comfortable, or do I want to lead an active life? When I make up my mind in these dilemma situations, I tend to reframe my options until the decision seems straightforward. I know I’ve found the right decision when there’s no lingering fear that the currently-favored option wouldn’t be mine, no fear that I’m caving to social pressures or acting (too much) out of akrasia, impulsivity or some other perceived weakness of character.[21]

We tend to have a lot of freedom in how we frame our decision options. We use this freedom, this reframing capacity, to become comfortable with the choices we are about to make. In case skiing wins out, then “warm and cozy” becomes “lazy and boring,” and “cold and tired” becomes “an opportunity to train resilience / apply Stoicism.” This reframing ability is a double-edged sword: it enables rationalizing, but it also allows us to stick to our beliefs and values when we’re facing temptations and other difficulties.

[...]

Visualizing the future with one life goal vs. another

Whether a given motivational pull – such as the need for adventure, or (e.g.,) the desire to have children – is a bias or a fundamental value is not set in stone; it depends on our other motivational pulls and the overarching self-concept we’ve formed.

Lastly, we also use “planning mode” to choose between life goals. A life goal is a part of our identity – just like one’s career or lifestyle (but it’s even more serious).

We can frame choosing between life goals as choosing between “My future with life goal A” and “My future with life goal B” (or “My future without a life goal”). (Note how this is relevantly similar to “My future on career path A” and “My future on career path B.”)

[...]

It’s important to note that choosing a life goal doesn’t necessarily mean that we predict ourselves to have the highest life satisfaction (let alone the most increased moment-to-moment well-being) with that life goal in the future. Instead, it means that we feel the most satisfied about the particular decision (to adopt the life goal) in the present, when we commit to the given plan, thinking about our future. Life goals inspired by moral considerations (e.g., altruism inspired by Peter Singer’s drowning child argument) are appealing despite their demandingness – they can provide a sense of purpose and responsibility.

So, it seems like we don't want "perfect inner alignment," at least not if inner alignment is about accurately predicting reward and then forming the plan of doing what gives you most reward. Also, there's a concept of "lock in" or "identifying more with the long-term planning part of your brain than with the underlying needs-meeting machinery." Lock in can be dangerous (if you lock in something that isn't automatically corrigible), but it might also be dangerous not to lock in anything (because this means you don't know what other goals form later on).

Idk, the whole thing seems to me like brewing a potion in Harry Potter, except that you don't have a recipe book and there's luck involved, too. "Outer alignment," a minimally sufficient degree thereof (as in: the agent tends to get rewards when it takes actions towards the intended goal), increases the likelihood that you get broadly pointed in the right direction, so the intended goal maybe gets considered among the things the internal planner considers reinforcing itself around / orienting itself towards. But then, whether the intended goal gets picked over other alternatives (instrumental requirements for general intelligence, or alien motivations the AI might initially have), who knows. Like with raising a child, sometimes they turn out the way the parents intend, sometimes not at all. There's probably a science to finding out how outcomes become more likely, but even if we could do that with human children developing into adults with fixed identities, there's then still the question of how to find analogous patterns in (brain-like) AI. Tough job.

in this post of my moral anti-realism sequence

I read that sequence a couple months ago (in preparation for writing §2.7 here), and found it helpful, thanks.

To give some quotes from that…

I agree that we’re probably on basically the same page.

So, it seems like we don't want "perfect inner alignment,"

FYI Alex also has this post making a similar point.

Idk, the whole thing seems to me like brewing a potion in Harry Potter

I think I agree, in that I’m somewhat pessimistic about plans wherein we want the “adult AI” to have object-level goal X, and so we find a reward function and training environment where that winds up happening.

Not that such a plan would definitely fail (e.g. lots of human adults are trying to take care of their children), just that it doesn’t seem like the kind of approach that passes the higher bar of having a strong reason to expect success (e.g. lots of human adults are not trying to take care of their children). (See here for someone trying to flesh out this kind of approach.)

So anyway, my take right now is basically:

  • If we want the “adult AGI” to be trying to do a particular thing (‘make nanobots’, or ‘be helpful towards its supervisor’, or whatever), we should replace (or at least supplement) a well-chosen reward function with a more interpretability-based approach; for example, see Plan for mediocre alignment of brain-like [model-based RL] AGI (which is a simplified version of Post 14 of this series)
  • Or we can have a similar relation to AGIs that we have to the next generation of humans: We don’t know exactly at the object level what they will be trying to do and why, but they basically have “good hearts” and so we trust their judgment.

These two bullet points correspond to the “two paths forward” of Post 12 of this series.

In other words, the very essence of intelligence is coming up with new ideas, and that’s exactly where the value function is most out on a limb and prone to error.

But what exactly are new ideas? It could be the case that intelligence is pattern-matching at its most granular level, even for "novelties". What could come in handy here is a great flagging mechanism for understanding when the model is out-of-distribution. However, this could come at its own cost.

It gets even worse if a self-reflective AGI is motivated to deliberately cause credit assignment failures.

Is the use of "deliberately" here trying to account for the *thinking about its own thoughts*-part of going back and forth between thought generator and thought assessor?

I mean “new ideas” in the everyday human sense. “What if I make a stethoscope with an integrated laser vibrometer?” “What if I try to overthrow the US government using mind control beams?” I agree that, given that these are thinkable thoughts, they must be built out of bits and pieces of existing thoughts and ideas (using analogies, compositionality, etc.).

And then the value function will mechanically assign a value more-or-less based on the preexisting value of those bits and pieces. And my claim is that the result may not be in accordance with what we would have wanted.

What could come in handy here is a great flagging mechanism for understanding when the model is out-of-distribution.

Yeah, more on that topic in §14.4. :-)

Is the use of "deliberately" here trying to account for the *thinking about its own thoughts*-part of going back and forth between thought generator and thought assessor?

Yes to “thinking about its own thoughts”, no to “going back and forth between thought generator and thought assessor”.

Instead I would say, you can think about lots of things, like football and calculus and sleeping. Another thing you can think about is your own preferences. When you think about football or calculus or sleeping, it’s an activation pattern within your thought generator, and the Thought Assessors will assess it (positive valence vs negative valence, does or doesn't warrant cortisol release etc.). By the same token, when you think about your own preferences, the Thought Assessors will assess that thought as positive-valence vs negative-valence etc. So you can have preferences about your own (current and/or future) preferences, a.k.a. meta-preferences. And you can make plans that will result in you having certain preferences, and those plans are likely to be appealing if they align with your meta-preferences.

So if I think that reading nihilist philosophy books might lead to me no longer caring about the welfare of my children, I will feel some motivation not to read nihilist philosophy books. By the same token, if the AGI wants to like or dislike something, I think there’s a reasonable chance that it will find a way to make that happen.

>See “Eliciting Learned Knowledge” for much more on this still-unsolved problem

I think it should be "Latent".

Fixed it, thanks.