Perhaps this is a well-examined idea, but I didn't find anything when searching.

The argument is simple. If the AI wants to avoid the world being in a certain state, destroying everything reduces the likelihood of that state occurring to zero. Otherwise, the likelihood might always be non-zero. This particular strategy has, in some sense, already been observed: ML agents in games have been observed to crash the game in order to avoid a negative reward.
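
A minimal toy sketch of the expected-value comparison behind this (my own illustrative numbers and framing, not anything from the literature): an agent penalized each step a forbidden world-state occurs, choosing between letting the world continue and destroying it.

```python
# Toy illustration only: assumes the forbidden state arises independently each
# step with probability p while the world exists, over a horizon of T steps.
p = 0.001   # hypothetical per-step chance of the forbidden state
T = 10_000  # hypothetical evaluation horizon

# Option A: let the world keep running; penalties accumulate in expectation.
expected_penalty_continue = -p * T   # -10.0

# Option B: destroy everything at step 0; the forbidden state can never occur.
expected_penalty_destroy = 0.0

# For any p > 0, "destroy" weakly dominates, and the gap grows without bound
# as the horizon lengthens.
print(expected_penalty_continue, expected_penalty_destroy)
```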

If most simple negative reward functions could achieve an ideal outcome by destroying everything, and destroying everything is possible, then a large portion of negative reward functions might be highly dangerous, since the agent would be incentivized to find any information that could end everything. No need for greedy mesa-optimizers that never get satisfied. This seems to have implications for alignment if true.

This also seems to have implications for anthropic reasoning, in favor of the doomsday argument. If the entire universe gets destroyed, then our early existence and small numbers make perfect sense.

[anonymous]

The basic idea seems true to me (for the probably-rare subclass of agents which terminally value preventing some worldstate from occurring; and conditional on destroying the universe being possible), and you've phrased it concisely, so I've strong-upvoted (to counteract downvotes). I'm not sure about the anthropic argument at the bottom, but I think it's good that you're trying to think creatively.

I'm reminded of an excerpt from the value handshakes tag page:

"nobody knows what kind of scorched-earth strategy a losing superintelligence might be able to use to thwart its conqueror, but it could potentially be really bad – eg initiating vacuum collapse and destroying the universe."

Either way, the universe getting destroyed and the universe being tiled with non-conscious computronium meant to prevent a certain state from occurring seem about equally valueless to me.

As for the doomsday argument, I'm not sure it holds. We'd exist now no matter what happens in the future, so I don't know if "conditioning on being an observer" makes sense here. It seems akin to the religious view of humans as having been preexisting "souls" which were then reincarnated as random observers anywhere in time. Whereas in reality, we were instantiated by temporally-local physical interactions, so the probability of us experiencing what we do right now is 1 no matter what happens in the future.

I would also note that the space of possible explanations for things like the doomsday argument and the Fermi paradox is large, so I think the fact that "x argument could explain it" only marginally increases that argument's probability. And I'd expect, based on you finding this idea, that you'll come across many more possible explanations.

Destroying the fabric of the universe sounds hard even for a superintelligence. By "hard" I mean: probably impossible even if the superintelligence makes it its only priority.

I'd be curious to see the source for the claim "ML agents in games have been observed to crash the game in order to avoid a negative reward." It sounds familiar.

"Since the AIs were more likely to get ”killed” if they lost a game, being able to crash the game was an advantage for the genetic selection process. Therefore, several AIs developed ways to crash the game. One was particular memorable, because it involved the combination of several complex actions to crash the game. These would have been hard to find by conventional beta testing, since it involved several phenomena human players would instinctively avoid."

https://cs.pomona.edu/~mwu/CourseWebpages/CS190-fall15-Webpage/Readings/2008-Gameplaying.pdf
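
For intuition, here is a hypothetical toy version of the selection dynamic that quote describes (my own construction, not the setup from the linked paper), assuming losing is scored worse than the game never finishing:

```python
import random

# Hypothetical scoring chosen only to illustrate the selection pressure:
# win = +1, loss = -1, crash = 0 (the game never resolves, so no loss).
def fitness(policy: str) -> float:
    if policy == "crash_exploit":
        return 0.0
    # An "ordinary" policy that only wins 30% of its games.
    return 1.0 if random.random() < 0.3 else -1.0

population = ["ordinary"] * 99 + ["crash_exploit"]
avg = {kind: sum(fitness(kind) for _ in range(1000)) / 1000
       for kind in set(population)}
print(avg)  # ordinary ≈ -0.4, crash_exploit = 0.0, so crashing gets selected for
```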

Hello AI-doom,  

Isn't there a paradox here, though?  

If an AI destroys everything to guarantee an ideal outcome, wouldn't it want to make sure that it is successful? But how is it going to make sure that everything is destroyed? Does it avoid destroying itself? Well, then there is still something left that might malfunction or bring about that outcome, namely itself. So does it destroy itself at the same time as it destroys the universe, or afterward? Maybe it is a failure of imagination on my part, but it seems a bit paradoxical.

Well, of course it might not care about anything besides its own computations, but in that case, why would it want to achieve the goal by destroying the universe rather than just itself, which seems like the much easier course of action? Don't ML agents do that? If you only evaluate things as 'true' within your own mind, you can permanently avoid an outcome by shutting down your own mind.
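
One way to make that shutdown-versus-destruction question concrete (a sketch in my own framing, which neither the post nor this comment spells out): it seems to turn on whether the penalty is defined over the agent's observations or over the world itself.

```python
# Hypothetical contrast between two ways of defining the negative reward.

def observation_reward(observed_forbidden: bool) -> float:
    # Penalized only when the agent itself observes the forbidden state;
    # shutting the agent down keeps this at 0.0 forever.
    return -1.0 if observed_forbidden else 0.0

def world_state_reward(forbidden_state_exists: bool) -> float:
    # Penalized whenever the state exists anywhere, observed or not;
    # only preventing the state itself (e.g. destroying everything) helps.
    return -1.0 if forbidden_state_exists else 0.0
```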

Which isn't to say that your concern is invalid or anything; it just seems paradoxical, and it seems odd from my amateur perspective that the AI wouldn't choose the 'easiest' route. Want to guarantee that you never see something? Eliminate yourself completely.
You see this in games with human players: when people don't like an outcome, they leave. I have more rarely seen people try to sabotage the game itself, though it does happen, of course.
The reason, I imagine, is that at that point you come into conflict with the other people who might be invested in the thing you want to destroy, and who might punish you for your actions, whereas if you're leaving a game, aren't you more likely to be ignored?

I mean, I am looking at this from my amateur perspective, so I might be missing something obvious here, but I thought it might still be a valuable comment to you somehow.
 


Kindly,
Caerulea-Lawrence