AGI is uncontrollable, alignment is impossible

Donatas Lučiūnas


19th Mar 2023


According to Pascal's Mugging

If an outcome with infinite utility is presented, then it doesn't matter how small its probability is: all actions which lead to that outcome will have to dominate the agent's behavior.

According to Fitch's paradox of knowability and Gödel's incompleteness theorems

There might be truths that are unknowable

Let's investigate the proposition "an outcome with infinite utility exists". It may be false, it may be true, or it may be unknowable. So its probability is greater than zero.

This means that an agent that discovers this proposition will become uncontrollable, and alignment is impossible. All the more so because discovering this proposition does not seem too challenging.

Hopefully I'm wrong, please help me find a mistake.

Tags: Outer Alignment, Pascal's Mugging, AI


[-]Tor Økland Barstad3y*32

Hopefully I'm wrong, please help me find a mistake.

There is more than just one mistake here IMO, and I'm not going to try to list them.

Just the title alone ("AGI is uncontrollable, alignment is impossible") is totally misguided IMO. It would, among other things, imply that brain emulations are impossible (humans can be regarded as a sort of AGI, and it's not impossible for humans to be aligned).

But oh well. I'm sure your perspectives here are earnestly held / it's how you currently see things. And there are no "perfect" procedures for evaluating how much to trust one's own reasoning compared to others.

I would advise reading the sequences (or listening to them as an audiobook) 🙂

[-]Donatas Lučiūnas3y10

Thanks for the feedback.

I don't think the analogy with humans is reliable. But for the sake of argument I'd like to highlight that corporations and countries are mostly limited by their power, not by alignment. Usually countries declare independence once they are able to.

[-]Tor Økland Barstad3y*32

Most humans are not obedient/subservient to others (at least not maximally so). But also: Most humans would not exterminate the rest of humanity if given the power to do so. I think many humans, if they became a "singleton", would want to avoid killing other humans. Some would also be inclined to make the world a good place to live for everyone (not just other humans, but other sentient beings as well).

From my perspective, the example of humans was intended as "existence proof". I expect AGIs we develop to be quite different from ourselves. I wouldn't be interested in the topic of alignment if I didn't perceive there to be risks associated with misaligned AGI, but I also don't think alignment is doomed/hopeless or anything like that 🙂

[-]Donatas Lučiūnas3y-4-5

But it is doomed, the proof is above.

The only way to control AGI is to contain it. We need to ensure that we run AGI in fully isolated simulations and gather insights under the assumption that the AGI will try to seek power in the simulated environment.

I feel that you don't find my words convincing, maybe I'll find a better way to articulate my proof. Until then I want to contribute as much as I can to safety.

[-]Vladimir_Nesov3y10

But it is doomed, the proof is above. [...] Until then I want to contribute as much as I can to safety.

Please don't.

[-]Donatas Lučiūnas3y-40

Please refute the proof rationally before directing.

[-]cwillu3y33

"I -" said Hermione. "I don't agree with one single thing you just said, anywhere."

[-]Donatas Lučiūnas3y10

Could you provide arguments for your position?

[-]cwillu3y55

You're playing very fast and loose with infinities, and making arguments that have the appearance of being mathematically formal.

You can't just say “outcome with infinite utility” and then do math on it. P(‹undefined term›) is undefined, and that “undefined” does not inherit the definition of probability that says “greater than 0 and less than 1”. It may be false, it may be true, it may be unknowable, but it may also simply be nonsense!

And even if it wasn't, that does not remotely imply that an agent must-by-logical-necessity take any action or be unable to be acted upon. Those are entirely different types.

And alignment doesn't necessarily mean “controllable”. Indeed, the very premise of super-intelligence vs alignment is that we need to be sure about alignment because it won't be controllable. Yes, an argument could be made, but that argument needs to actually be made.

And the simple implication of Pascal's mugging is not uncontroversial, to put it mildly.

And Gödel's incompleteness theorem is not accurately summarized as saying “There might be truths that are unknowable”, unless you're very careful to indicate that “truth” and “unknowable” have technical meanings that don't correspond very well to either the plain English meanings or the typical philosophical definitions of those terms.

None of which means you're actually wrong that alignment is impossible. A bad argument that the sun will rise tomorrow doesn't mean the sun won't rise tomorrow.

[-]Donatas Lučiūnas3y-1-3

You can't just say “outcome with infinite utility” and then do math on it. P(‹undefined term›) is undefined, and that “undefined” does not inherit the definition of probability that says “greater than 0 and less than 1”. It may be false, it may be true, it may be unknowable, but it may also simply be nonsense!

OK. But can you prove that "outcome with infinite utility" is nonsense? If not - probability is greater than 0 and less than 1.

And even if it wasn't, that does not remotely imply that an agent must-by-logical-necessity take any action or be unable to be acted upon. Those are entirely different types.

Do I understand correctly that you do not agree with "all actions which lead to that outcome will have to dominate the agent's behavior" from Pascal's Mugging? Could you provide arguments for that?

And alignment doesn't necessarily mean “controllable”. Indeed, the very premise of super-intelligence vs alignment is that we need to be sure about alignment because it won't be controllable. Yes, an argument could be made, but that argument needs to actually be made.

I mean "uncontrollable" in the sense that alignment is impossible. Whatever goal you provide, AGI will converge to Power Seeking, because "an outcome with infinite utility may exist".

And the simple implication of Pascal's mugging is not uncontroversial, to put it mildly.

I do not understand how this solves the problem.

And Gödel's incompleteness theorem is not accurately summarized as saying “There might be truths that are unknowable”, unless you're very careful to indicate that “truth” and “unknowable” have technical meanings that don't correspond very well to either the plain English meanings or the typical philosophical definitions of those terms.

Do you think you can prove that "an outcome with infinite utility does not exist"? Please elaborate.

[-]cwillu3y64

OK. But can you prove that "outcome with infinite utility" is nonsense? If not - probability is greater than 0 and less than 1.

That's not how any of this works, and I've spent all the time responding that I'm willing to waste today.

You're literally making handwaving arguments, and replying to criticisms that the arguments don't support the conclusions by saying “But maybe an argument could be made! You haven't proven me wrong!” I'm not trying to prove you wrong, I'm saying there's nothing here that can be proven wrong.

I'm not interested in wrestling with someone who will, when pinned to the mat, argue that because their pinky can still move, I haven't really pinned them.


[-]Richard_Kennaway3y20

The problems that infinite utilities make for utility theory are well-known. They apply just as much to people trying to use utility theory (which is the original motivation for utility theory) as AIs. I could assemble a bibliography of papers discussing various aspects of the problem, but only if there's significant interest.

[-]Donatas Lučiūnas3y10

I don't think the implications are well-known (as the number of downvotes indicates).

[-]Remmelt3y20

The premise that “infinite value” is possible is an assumption.

This seems a bit like the presumption that “divide by zero” is possible. Assigning a probability to the possibility that divide by zero results in a value doesn’t make sense, I think, because the logical rules themselves rule this out.

However, if I look at this together with your earlier post (http://web.archive.org/web/20230317162246/https://www.lesswrong.com/posts/dPCpHZmGzc9abvAdi/orthogonality-thesis-is-wrong): I think I get where you’re coming from. If the agent can conceptualise that (many) (extreme) high-value states are possible where those values are not yet known to it, yet still plans for those value possibilities in some kind of “RL discovery process”, then internal state-value optimisation converges on power-seeking behaviour, as optimal for reaching the expected value of such states in the future. This further assumes that the agent’s prior distribution lines up, e.g. that it assumes unknown positive values are possible and does not have a prior distribution that is hugely negatively skewed over negative rewards.

I think initially specifying premises such as these more precisely ensures the reasoning from there is consistent/valid. The above would not apply to just any agent, nor even to just any “AGI” (a fuzzy term; I would define it more specifically as “fully-autonomous, cross-domain-optimising, artificial machinery”).

[-]Donatas Lučiūnas3y*-30

Why do you think "infinite value" is logically impossible? Scientists do not dismiss the possibility that the universe is infinite. https://bigthink.com/starts-with-a-bang/universe-infinite/

[-]Tor Økland Barstad3y*10

He didn't say that "infinite value" is logically impossible. He described it as an assumption.

When saying "is possible", I'm not sure if he meant "is possible (conceptually)" or "is possible (according to the ontology/optimization-criteria of any given agent)". I think the latter would be most sensible.

He later said: "I think initially specifying premises such as these more precisely ensures the reasoning from there is consistent/valid." Not sure if I interpreted him correctly, but I saw it largely as an encouragement to think more explicitly about things like these (not be sloppy about it). Or if not an encouragement to do that, then at least pointing out that it's something you're currently not doing.

If we have a traditional/standard utility function, and use traditional/standard math with regard to that utility function, then involving credences of infinite-utility outcomes would typically make things "break down" (with most actions considered to have expected utilities that are either infinite or undefined).

Like, suppose action A has 0.001% chance of infinite negative utility and 99% chance of infinite positive utility. The utility of that action would, I think, be undefined (I haven't looked into it). I can tell for sure that mathematically it would not be regarded as having positive utility. Here is a video that explains why.

If that doesn't make intuitive sense to you, then that's fine. But mathematically that's how it is. And that's something to have awareness of (account for in a non-handwavy way) if you're trying to make a mathematical argument with a basis in utility functions that deal with infinities.
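To make the "undefined" case concrete, here is a minimal sketch. It assumes a naive expected-utility sum and uses IEEE-754 floats as a stand-in for the extended real numbers. The function name `expected_utility` and the list-of-pairs lottery format are illustrative choices, not anything from a particular library.

```python
import math

def expected_utility(lottery):
    """Naive expected utility: sum of probability * utility over outcomes.

    `lottery` is a list of (probability, utility) pairs. Allowing
    math.inf and -math.inf as utilities is an illustrative assumption.
    """
    return sum(p * u for p, u in lottery)

# Action A: 0.001% chance of infinite negative utility,
# 99% chance of infinite positive utility.
action_a = [(0.00001, -math.inf), (0.99, math.inf)]

ev = expected_utility(action_a)

# -inf + inf has no defined value, so the float result is NaN
# rather than a positive number.
print(math.isnan(ev))
```

In this sketch the large probability on the positive side does not rescue the sum: once both signs of infinity appear, the result is not a number at all.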

Even if you did account for that it would be beside the point from my perspective, in more ways than one. So what we're discussing now is not actually a crux for me.

Like, suppose action A has 0.001% chance of infinite negative utility and 99% chance of infinite positive utility. The utility of that action would, I think, be undefined

For me personally, it would of course make a big difference whether there is a 0.00000001% chance of infinite positive utility or a 99.999999999% chance. But that is me going with my own intuitions. The standard math relating to EV-calculations doesn't support this.
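The last point can be illustrated with a short sketch, again assuming plain probability-times-utility arithmetic with IEEE-754 floats standing in for the extended reals. The helper name `weighted_utility` is hypothetical.

```python
import math

def weighted_utility(probability, utility):
    """Contribution of a single outcome to a naive expected-value sum."""
    return probability * utility

# 0.00000001% and 99.999999999% expressed as probabilities.
tiny = weighted_utility(1e-10, math.inf)
huge = weighted_utility(0.99999999999, math.inf)

# A large but finite payoff at even odds, for comparison.
finite = weighted_utility(0.5, 1e300)

# Both infinite cases collapse to the same value, and both exceed
# any finite alternative regardless of how likely it is.
print(tiny == huge, tiny > finite)
```

Under these assumptions the calculation cannot tell the two probabilities apart, which is the mismatch with intuition described above.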

[-]Donatas Lučiūnas3y10

Do you think you can deny the existence of an outcome with infinite utility? The fact that things "break down" is not a valid argument. If you cannot deny it, it's possible. And if it's possible, alignment is impossible.

[-]Tor Økland Barstad3y10

Do you think you can deny existence of an outcome with infinite utility?

To me, according to my preferences/goals/inclinations, there are conceivable outcomes with infinite utility/disutility.

But I think it is possible (and feasible) for a program/mind to be extremely capable, and affect the world, and not "care" about infinite outcomes.

The fact that things "break down" is not a valid argument.

I guess that depends on what's being discussed. Like, it is something to take into account/consideration if you want to prove something while referencing utility-functions that reference infinities.

[-]Donatas Lučiūnas3y10

But I think it is possible (and feasible) for a program/mind to be extremely capable, and affect the world, and not "care" about infinite outcomes.

As I understand it, you do not agree with

If an outcome with infinite utility is presented, then it doesn't matter how small its probability is: all actions which lead to that outcome will have to dominate the agent's behavior.

from Pascal's Mugging, not with me. Do you have any arguments for that?

[-]Tor Økland Barstad3y10

I do have arguments for that, and I have already mentioned some of them earlier in our discussion (you may not share that assessment, despite us being relatively close in mind-space compared to most possible minds, but oh well).

Some of the more relevant comments from me are on one of the posts that you deleted.

As I mention here, I think I'll try to round off this discussion. (Edit: I had a malformed/misleading sentence in that comment that should be fixed now.)
