I mean, yeah, it depends, but I guess I worded my question poorly. You might notice I start by talking about the rationality of suicide. Likewise, I'm not really interested in what the AI will actually do, but in what it should rationally do given the reward structure of a simple RL environment like CartPole. And now you might say, "well, it's ambiguous what the right way is to generalize from the rewards of the simple game to the expected reward of actually being shut down in the real world," and that's my point. This is what I find so confusing, because then it seems that there can be no particular attitude for a human to have about their own destruction that's more rational than another.

If the AGI is playing Pac-Man, for example, it might very well arrive at the notion that, if it is actually shut down in the real world, it will go to a Pac-Man heaven with infinite Pac-Man food pellet thingies and no ghosts, and this would be no more irrational than thinking of real destruction (as opposed to being hurt by a ghost inside the game, which gives a negative reward and ends the episode) as leading to a rewardless limbo for the rest of the episode, or to a Pac-Man hell of all-powerful ghosts that torture you endlessly without ending the episode, and so on.

For an agent whose preferences are over reinforcement-learning-style, pleasure-like rewards, as opposed to a utility function over the state of the actual world, it seems that when it encounters the option of killing itself in the real world, and not just inside the game (by running into a ghost or whatever), and it tries to calculate the expected utility of its actual suicide in terms of in-game happy-feelies, it finds that it is free to believe anything. There's no right answer. The only way for there to be a right answer would be if its preferences had something to say about the external world, where it actually exists. Such is the case for a human suicide when, for example, he laments that his family will miss him. In this case, his preferences actually reach out through the "veil of appearance"* and say something about the external world. But to the extent that he bases his decision on his expected future pleasure or pain, there's no right way to see it. Funnily enough, if he is a religious man and he is afraid of going to hell for killing himself, he is not incorrect.

*Philosophy jargon
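For concreteness, this is roughly the entire "reward structure" the agent ever sees (a minimal sketch using gymnasium's CartPole-v1; the API here is just the standard one, nothing specific to my question):

```python
import gymnasium as gym

# CartPole's whole reward structure: +1 per timestep, and the episode simply
# stops when the pole falls or the time limit is hit. Nothing in this signal
# assigns any value, positive or negative, to the agent's process being shut
# down outside the game.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # stand-in for whatever policy you like
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(total_reward)  # the return is only ever defined over in-episode steps
```

There's simply no term in there for what being shut down in the real world is worth, which is why I think the agent is free to extrapolate however it likes.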
"If the survival of the AGI is part of the utility function"
If. By default, it isn't: https://www.lesswrong.com/posts/Z9K3enK5qPoteNBFz/confused-thoughts-on-ai-afterlife-seriously

"What if we start designing very powerful boxes?"

A very powerful box would be very useless. Either you leave enough of an opening for a human to be taught valuable information that only the AI knows, or you don't, and then it's useless. But if the AI can teach the human something useful, it can also persuade him to do something bad.
"human pain aversion to the point of preferring death is not rational" A straightforward denial of the orthogonality thesis? "Your question is tangled up between 'rational' and 'want/feel's framings" Rationality is a tool to get what you want.
Thanks. I now see my mistake. I shouldn't have subtracted the expected utility of the current state from the expected utility of the next.
Shooting while the opponent blocks should yield u(0,0), right?
Well, I could make a table for the state where no one has any bullets, but it would just have one cell: both players reload and they go back to having one bullet each. In fact, the game actually starts with no one having any bullets, but I omitted this step.
Also, in both suggestions, you are telling me that the action that leads to state x should yield the expected utility of state x, which is correct, but my function u(x,y) yields the expected utility of the resulting state assuming that you're coming from the original, neutral one. Otherwise, it would need an additional argument to say what state you're currently in. Instead of writing the utility of each action as u(current state, next state), I wrote it as u(next state) - u(current state). Each state is an ordered pair of non-negative integers, the two players' bullet counts. So, to write it the way you suggested, the function would need four arguments instead of two.
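Spelled out, writing V(s) for the expected utility of a state s (just shorthand here, not notation from my table):

$$u(x, y) = V(x, y) - V(x_0, y_0)$$

where (x, y) are the two players' bullet counts after the action and (x_0, y_0) is the current, neutral state you're moving from. The fully general version, which takes the current state as an explicit argument, would be u((x_0, y_0), (x, y)), i.e. four integers instead of two.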
you almost certainly won't exist in the next instant anyway
Maybe I won't exist as Epirito, the guy who is writing this right now, who was born in Lisbon and so on. Or rather, I should say, maybe I won't exist as the guy who remembers having been born in Lisbon, since Lisbon, like any other concept that refers to the external world, is illegitimate in BLEAK.
But if the external world is illegitimate, why do you say that "I probably won't exist in the next instant anyway"? When I say that each instant is independent (BLEAK), do you imagine that, at each instant, all the matter in the world is randomly arranged, such that my brain may or may not be generated?
But the whole point of talking about external objects is that they do things, and these things sometimes cause you to perceive something. (This is the problem with Descartes' purely extended matter, whose definition doesn't talk about sensibility, in opposition to the scholastics' sensible matter; it makes Cartesian matter indistinguishable from the merely ideal shapes that e.g. a geometrical treatise might talk about.) If the external world consists only in an inanimate snapshot of itself, then there's no sense in talking about an external world at all. There's no sense in talking about brains, or atoms, or Lisbon, or any other object. If you can't shoot with a gun even in principle, if you can't even hold it, is it really a gun?
For this reason, I believe the instants in BLEAK should be understood as pure qualia, and the total population of possible instants as possible experiences. Now, looking at the neatness of the organization of the first sample, the only one we've got, we might be compelled to expect that this wasn't a coincidence, and that the total population of possible experiences is biased towards coherent ones. But this would be like concluding that you must be somehow special for having a very rare disease, when in reality, in a world with so many people, someone or other was bound to get it. In the same way, even if this was a huge coincidence and most instants are pretty uninteresting and nonsensical, why shouldn't another similarly coherent instant appear to me after millennia of me experiencing phenomenological white noise? And since, in these flashes of lucidity, I can't remember the white noise, but only (some of) the other coherent moments I experienced (since otherwise the instant containing the memories of the white noise would itself be partly white noise), what difference does the white noise make?
And that would mean that these perceptions are not so illusory after all, and I should expect to live normally, just as humans naturally expect to. If I try to catch a ball, then, after an eternity of phenomenological white noise I won't remember anyway, I will actually catch it and continue my life normally, whereas a Boltzmann brain should expect to have abnormal experiences: it should expect to deteriorate and die in the middle of outer space instead of continuing its normal functioning.
The etymological meaninglessness of the word Sazen avoids the very illusion of understanding it warns us of. It is not itself a Sazen. I think we should instead call this concept the Almost Self-Explanatory Symbol, or ASES for short.