[ Question ]

Forget AGI alignment, are you aligned with yourself?

by Samuel Shadrach · 1 min read · 13th Oct 2021 · 15 comments


Here's the problem to solve:

 - value alignment between AI and broad societal values

Here are some other unsolved problems:

 - value alignment across factions of society

 - value alignment between two members of society

 - value alignment between two members of society who have a 99% overlap in their values and beliefs

 - value alignment between a person and the very same person from 5 years ago

 - value alignment between a person and a clone of the person with cognitive superpowers

 - value alignment between a person and a clone of the person in a different mood

 - value alignment between a person and a clone of the person

This might sound weird, but assume I was locked in a room with a clone of myself. Also in the room is a button that grants absolute totalitarian control of the future of humanity via a ton of previously uninvented tech. I imagine it would start with the two of us being wary - physically protecting ourselves, not pushing the button, and starting a conversation, each of us unsure both of our own goals and of how well they align with the other person in the room. Even though the other person is my clone, we could interact asymmetrically in a conversation due to non-deterministic effects or asymmetric instantiation (we both walked into the room with different thoughts in our minds). And as the conversation evolves, we will diverge - be it over minutes or weeks. I can totally imagine a future where we both hit on some key point of difference in the conversation that causes us to fight to the death in that very room, for that control. (Assuming, of course, that we both care that strongly about the future of humanity to begin with.)

Wondering if anyone else relates to this.


3 Answers

Both of me would rush to the button as fast as possible. It doesn't much matter which copy gets there first: since we're copies, the winner will take care of the slower one. And one of me pressing it is vastly better than anyone else doing so.

A diverged copy would need to be very different before I fought them. I have a strong gut belief in the idea that you should cooperate with people like you, and a copy is maximally similar. Even a diverged copy is still going to be quite similar.

I would be very wary of trying any deception or defecting, because the other copy knows me, and if they've diverged they may or may not be stronger than me. My first course of action would be to go for the button, and if they stopped me, to collect more information. But... the whole "cooperate with similar people" thing is firmly subordinate to "look out for number one", and if I had no other choice, I'd fight them to protect myself.

Thanks for replying!
I generally agree with your intuition that similar people are worth cooperating with, but I also feel like this can break down when the stakes are high. Maybe I should have defined the hypothetical such that you're both co-rulers or something until one kills the other.

Because, like - the worst case in a fight is that you lose and the clone does what they want, which is already not that awful (probably); that worst case is already guaranteed. But you may still believe you have something non-trivially better to offer than this worst case, and you may be willing to fight for it. (Just my intuitions :p)

Do you have thoughts on what you'd do once you're the ruler?

Raven: Worst case, I lose, and the clone uses their power to contain me so I stop being a danger to them. Or they just kill me, lol. If the power were shareable, that greatly increases the divergence needed for a fight. Most disagreements could be easily solved by each of us taking half. Self-preservation isn't worth risking to make a few changes to the copy's plans.

With the power, probably indulge in unbridled hedonism for a while. Eventually I'd get bored tho and start trying to build with it. Hedonism is fun and destruction is easy, but creation is challenging and satisfying in a way neither of them is. Transhumanism and the stars are our destiny!
Samuel Shadrach: Could you please elaborate on this? Would this mean you personally value your own life pretty highly (relative to the rest of humanity)? Makes sense, can totally relate!

Folks sometimes talk about the human alignment problem, which I think is what you're getting at. I think the earliest instance of it can be found in this post. Searching for "human alignment problem" on LW will turn up more stuff, although I don't think anyone has done an exhaustive post summarizing what we really mean by it. It's generally one of two things:

  1. humans aligning themselves to goals (getting yourself to do something)
  2. aligning multiple humans to a goal (like in an organization)

Thank you, I will check it out!

Pretty sure me and my clone both race to push the button the second we enter the room. I don't think this has to do with "alignment" per se, though. We both have exactly the same goal - "claim the button for myself" - and in that sense we are perfectly "aligned".

If you trust that the other person has identical goals to yours, will it matter to you who presses the button? Say you both race for the button, collide with each other, and miss. Will you now fight, or graciously let the other person press it?

Logan Zoellner: "Have absolute power" is one of my goals. "Let my clone have absolute power" is way lower on the list. I can imagine situations in which I would try to negotiate something like "create two identical copies of the universe in which we both have absolute power and can never interfere with one another". But negotiating is hard, and us fighting seems like a much more likely outcome.
6 comments

When a preference makes references to self, copying (that doesn't also edit those references) changes the meaning of the preference rather than preserving it. So if you are expecting copying, reflective consistency could be ensured by reformulating the preference to avoid explicit references to self, for example by replacing them with references to a particular person, or to a reference class of people, whether they are yourself or not.

Interesting point!

But if human preferences make references to self, then those preferences are also relevant to the AGI alignment problem (trying to make the AI have the same preferences that humans have).

Although I guess my example was also about this:
Even if a human's terminal preferences do not make references to self, they will still instrumentally value themselves and not the clone, because of a lack of trust that the clone's preferences are 100% identical.

The reformulation of preference to replace references to self with specific people it already references doesn't change its meaning, so semantically such rewriting doesn't affect alignment. It only affects copying, which doesn't respect semantics of preferences. Other procedures that meddle with minds can disrupt semantics of preference in a way that can't be worked around.

(All this only makes sense for toy agent models, that furthermore have a clear notion of references to self, not for literal humans. Humans don't have preferences in this sense, human preference is a theoretical construct that needs something like CEV to access, the outcome of a properly set up long reflection.)

Yup, makes sense. But I also feel the "toy agent model" of terminal and instrumental preferences has real-life implications (even though it may not be the best model). Namely, that you will always value yourself over your clone for instrumental reasons if you're not perfect clones. And I also tend to feel the extent to which you value yourself over your clone will be high in such high-stakes situations.

Keen on anything that would weaken / strengthen my intuitions on this :) 

I'd be more interested in how I can get out of the locked room. If the only way to do that is for one of us to press the button, one of us might eventually press it.

If we cared more about the future of humanity, then we'd probably have to stage a hunger strike (possibly to death) instead, and that would be really unpleasant, with still no guarantee that whoever stuck us in this room wouldn't just go and pick someone else to do it.

I know that if I pressed the button, I'd be an awful world dictator and wouldn't even enjoy it, and I'd probably be even worse at finding someone else who would be any better. In any event, the world would have to immediately deal with the fact that there's a shitload of new tech around that seems to have the primary function of instilling totalitarian control over a civilization. The world would be fucked. If there's some super-intelligence out there planning to lock someone in a room with a clone and a world domination button, don't pick me.

Interesting. Let's say you both agree to leave the room. Would you later feel guilty looking at all the suffering in the world, knowing you could have helped prevent it? Be it genocides, world wars, misaligned AI, Zuckerberg becoming the next dictator, or something else.