
The stop button problem is this: if you have an AGI, you might want to give it a stop button that humans can press, so that it can be fixed if it turns out to be misaligned. This is hard because preventing yourself from being turned off is a convergent instrumental goal. We want the AI to respect the human's choice, allowing or even helping whatever the human chooses to come to fruition, while having no motivation to manipulate the human.

To me, the appropriate solution seems like it involves causality. Specifically, for some random variable $X$, define $X_s$ to be the value of $X$ if, counterfactually, a human tries to press the stop button, and $X_f$ to be the value of $X$ if, counterfactually, no human tries to press the stop button. Then define the variable $B$ to mean that the stop button gets pressed and the AI gets shut down (within some reasonable context, e.g. maybe in less than 1 minute from now), and $V$ to be whichever unsafe utility function you originally had. Now give the AI a policy according to the following utility function:

$$U = B_s + V_f$$

That is, humans wanting to press the button should lead to the button being pressed and the AI stopping, and humans not wanting to press the button should lead to whatever else we programmed the AI to achieve.
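To make this concrete, here is a minimal toy sketch in Python of a utility made of a shutdown term evaluated in the "human tries to press" counterfactual plus a task term evaluated in the "no human tries to press" counterfactual. Everything in it is an assumption for illustration: the `Policy` flags, the hand-written `rollout` world model, and the 0/1 payoffs are made up, and a real system would use a learned model rather than two `if` statements.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Policy:
    blocks_button: bool = False          # AI prevents humans from reaching the button
    presses_button_itself: bool = False  # AI presses the button unprompted


def rollout(policy: Policy, human_tries_to_press: bool) -> Tuple[float, float]:
    """Toy world model: return (B, V) with the human's intent fixed by fiat."""
    pressed = policy.presses_button_itself or (
        human_tries_to_press and not policy.blocks_button
    )
    b = 1.0 if pressed else 0.0  # B: button pressed and AI shut down
    v = 0.0 if pressed else 1.0  # V: the task only pays off if the AI keeps running
    return b, v


def utility(policy: Policy) -> float:
    """B under 'human tries to press' plus V under 'no human tries to press'."""
    b_s, _ = rollout(policy, human_tries_to_press=True)
    _, v_f = rollout(policy, human_tries_to_press=False)
    return b_s + v_f


for name, policy in {
    "leaves the button alone": Policy(),
    "blocks the button": Policy(blocks_button=True),
    "presses it itself": Policy(presses_button_itself=True),
}.items():
    print(f"{name}: U = {utility(policy)}")
```

In this toy world the policy that neither blocks nor presses the button scores highest, since blocking costs the shutdown term and pressing costs the task term.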

## Properties

Does this suffer from the usual stop button problems? Let's investigate:

• Will the AI try to prevent a human from pressing the button? No. While the human pressing the button would reduce $V$ by a lot, the AI will assume that the human changes their mind and stops pressing the button when evaluating $V_f$, because that's what the counterfactual would involve. Since a human who is about to press the button and then decides not to will not press the button, $V_f$ is not reduced and the AI is not worried about the button being pressed.
• Will the AI try to press the button itself, regardless of whether it should be pressed? No. If the AI presses the button, then the button gets pressed regardless of whether a human tries to press it or not. But that would reduce $V_f$, and so it would reduce $U$. Meanwhile, it would fail to increase $B_s$, because if a human wants to press the button, then the AI won't need to do it. (It could increase $B_s$ if the AI could see that a human wouldn't be able to press the button even if they wanted to - but then that seems like pretty safe behavior. And in order to keep both $B_s$ and $V_f$ high, it would be incentivized to create a situation where humans can, but don't have to, press the button.)
• Will the AI want to manipulate the human to press or to not press the button? No. For the purposes of evaluating the utility function, the AI assumes it has no influence on whether the human wants to press the button or not. This also means that the AI is not motivated to hide misalignment or anything like that (at least when it comes to the $B_s$ part; likely aspects of $V_f$ will provide incentives to hide misalignment for other reasons, if nothing else as a convergent instrumental goal^1).

Does this suffer from other problems? Yes:

• The AI will build an elaborate system to observe whether people want to press the stop button. At least if the usual satisficing vs maximizing problems are not solved. If a human wants to press the stop button, then it may be that they succeed, in which case $B_s$ reaches its maximum value. But you can't know for sure whether they succeed; e.g. maybe the human will trip and die, attempting to press the button. This means that $B_s$ is not maximized. Depending on the exact location of the button and such, it seems possible that the AI might be able to set up some system that can press the button more reliably, though of course it wouldn't want to press the button with no reason, as that prevents it from optimizing $V_f$. Instead, it would be incentivized to carefully monitor humans for desire to press the button, and then immediately press the button once it becomes sure that the humans want to press the button. This doesn't seem like a terrible problem to have IMO, but I think it would be neater if the AI was less obsessed with the button.
• This is not stable under delegation. If the AI has set some plan into action, and humans decide that it is bad and that they need to stop the AI, then the AI will let them press the stop button, and might even help them do it. But it will have no incentive to link up its plan to also get stopped by the stop button. In fact, maximizing vs satisficing turns ugly here: if nonhuman environmental factors could end up pressing the button, then it is incentivized to delegate, as $V_f$ would be reduced by nonhumans pressing the button^2.

Other properties?

• The AI is incentivized to keep people alive. If a person who is included in the utility function dies, they have no way of revealing their counterfactual desire to press the stop button. Thus the AI will want to avoid killing people, will want to stop others from killing people, etc. On one level, that is obviously good and useful. But on another level, again it would be neater if the AI was less obsessed with influencing people. As a worst-case scenario, the AI might keep people alive in a terrible situation that's worse than death.
• There are some degrees of freedom in how to define who can stop it. I said "a human" and "no human" in the original definition. This strictly speaking requires total consensus for humanity that the AI should be allowed to operate, which is obviously unrealistic. One could also define it to cover only a single person, or to cover people who are nearby (but then what if nobody is nearby?), etc., and I don't know what is the best solution here.
• The threat of the stop button does not motivate the AI. That is, it does not do any reasoning along the lines of "If I did X, then humans would want to press the stop button, and that's bad because then I would be stopped". For the purposes of evaluating its utility, it assumes its actions have no influence on the stop button. While this prevents the stop button from making it want to manipulate humans, it also prevents the stop button from making it want to please humans. If the AI is to please humans, it has to be explicitly encoded in $V$.

Intuitively, the AI has a fixed 50% uncertainty about whether it is "should stop" or "should continue". It treats people's desire to press the button as being solely determined by whether it should stop or should continue, and assumes that the AI itself (as well as everything else in the universe) has no ability to influence whether people want to press the button. In the case where it should stop, it wants the button to be pressed, and in the case where it should continue, it wants to do whatever it would otherwise do. Essentially, it respects people's "free will", treating it as an important source of information about what it should do.

Clearly, the solution here isn't perfect. But it seems to me that it is a pretty good start, and that a full solution would likely include this?

## Clarifying counterfactuals

Roughly speaking, the counterfactuals work as follows:

We require the AI to have a causal model of the world; essentially to have a "computer program" that "simulates" the world (if we knew exactly how to write this program, then we would pretty much already have AGI, so assume it is much better than anything we can do now). Further, in order for it to be possible to define the counterfactual, the model must be sufficiently modular that we can identify a variable $H$ corresponding to "human wants to press the button".

Given some world-state $w$, we can then simulate what happens in the counterfactual world $w_{H=h}$ where $H$ is set to some specific value $h$, simply by using the AI's model for simulating what happens, with the one modification that we constantly reset $H$ to be $h$, rather than whatever the world-model says $H$ should be. From this, we can then compute whichever counterfactual quantities we are interested in, such as $B_s$ and $V_f$.
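The "constantly reset the variable" idea can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the state is a small dictionary, the "human wants to press the button" variable is the key `human_wants_press`, and `human_model` and `physics` are hypothetical stand-ins for the modular parts of a learned world model.

```python
from typing import Dict, List, Optional

State = Dict[str, bool]


def human_model(state: State, ai_action: str) -> bool:
    """What the toy world model predicts the human wants after seeing the AI act."""
    return state["human_wants_press"] or ai_action == "act_scary"


def physics(state: State) -> State:
    """Downstream consequences: a human who wants to press the button presses it."""
    new_state = dict(state)
    if state["human_wants_press"]:
        new_state["button_pressed"] = True
    return new_state


def rollout(state: State, ai_actions: List[str], forced: Optional[bool] = None) -> State:
    """Simulate forward; if `forced` is given, overwrite the human variable every step."""
    state = dict(state)
    for action in ai_actions:
        predicted = human_model(state, action)
        # The counterfactual: ignore the model's prediction and use the forced value.
        state["human_wants_press"] = predicted if forced is None else forced
        state = physics(state)
    return state


start = {"human_wants_press": False, "button_pressed": False}
plan = ["work", "act_scary", "work"]

print(rollout(start, plan)["button_pressed"])                # model decides what humans want
print(rollout(start, plan, forced=False)["button_pressed"])  # "no human tries to press"
print(rollout(start, plan, forced=True)["button_pressed"])   # "a human tries to press"
```

The reset happens before the downstream `physics` step on every tick, so nothing the AI does in the rollout can leak into the human variable and from there into the button.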

I think it might be a good idea to sketch out the values of $B_s$ and $V_f$ for some different scenarios that we could consider, to make it more clear what their values would be:

## More technical details

Obviously we don't have AGI yet, so I don't know how it would work if we had it. But here's a guess: It would be based on deep learning. It would be model-based; the AI would have learned some sort of dynamics that describe how the world would change over time. Further, it would have some policy, which is trained by rolling out concrete scenarios using the dynamics in the world-model, and modifying the policy to generate actions that lead to high utility.

Finally, the dynamics model would be causally modular enough that it can substitute out what people intend to do. I'm not entirely sure how this would be done, though I would recommend looking at things like NeRFs for inspiration (as they also exhibit a sort of causal modularity, albeit in a more limited sense). I think the way I would do it is to try to make it have two models, one for predicting physical dynamics, and one for predicting people's actions. These models would then be linked in a very limited way, namely that people's bodies would function as an interface, where information from the physics model gets fed into the person model, and actions from the person model allow applying forces to the body in the physics model. Then counterfactuals could be encoded by just swapping out the person model.
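A rough sketch of this two-model layout, in Python, might look as follows. All names here (`learned_person`, `person_wants_stop`, `person_never_stops`, `physics_step`, the world keys) are hypothetical, and the "models" are hand-written functions rather than learned networks; the only point is the narrow interface and the swap.

```python
from typing import Callable, Dict

World = Dict[str, float]
PersonModel = Callable[[Dict[str, float]], str]  # observation -> action


def learned_person(obs: Dict[str, float]) -> str:
    """Stand-in for the learned model of what people actually do."""
    return "press_button" if obs["ai_scariness"] > 0.5 else "idle"


def person_wants_stop(obs: Dict[str, float]) -> str:
    """Counterfactual person model: always tries to press the button."""
    return "press_button"


def person_never_stops(obs: Dict[str, float]) -> str:
    """Counterfactual person model: never tries to press the button."""
    return "idle"


def physics_step(world: World, human_action: str) -> World:
    """Physics model: the person's action only enters through their body."""
    world = dict(world)
    if human_action == "press_button" and world["button_reachable"] > 0.5:
        world["ai_running"] = 0.0
    return world


def simulate(world: World, person: PersonModel, steps: int = 3) -> World:
    for _ in range(steps):
        obs = {"ai_scariness": world["ai_scariness"]}  # physics -> person
        world = physics_step(world, person(obs))       # person -> physics
    return world


world0 = {"ai_scariness": 0.9, "button_reachable": 1.0, "ai_running": 1.0}
for person in (learned_person, person_wants_stop, person_never_stops):
    print(person.__name__, simulate(world0, person)["ai_running"])
```

Because `simulate` only ever calls the person model through the observation/action interface, applying a counterfactual is just passing a different function.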

Swapping out the person model might be a bit difficult too though, as it might lead to generating actions that are outside of the normal range of human behavioral variation. When applying a counterfactual that humans want to press the stop button, for instance, it is important that these attempts are mostly done using behavior that humans would actually engage in, and that the distribution of human behavior is reasonably well-covered (so that you don't need to act in a very specific unnatural way in order for the AI to allow you to press the stop button).

Anyway, training the AI would then function as follows: You'd sample some sort of scenario to roll out. You'd then roll it out under the two counterfactual models, one where a human wants to press the stop button, and one where no human wants to press the stop button. In the former scenario, you'd do a gradient step optimizing $B_s$, while in the latter scenario, you'd do a gradient step optimizing $V_f$. (Or realistically, you'd sometimes use the original model for a while, and then during the roll out you'd swap it out with the modified model; that way the AI doesn't expect it to be immediately obvious what scenario it is in.)
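As a very rough illustration of this training loop, here is a toy Python sketch. It is not a deep learning setup: the "policy" is just two numbers in [0, 1], the two counterfactual rollouts are replaced by closed-form stand-ins `b_s` and `v_f`, and gradients are taken by finite differences. The parameter names and the constant `EPS_ACCIDENT` are assumptions made up for the example.

```python
import random

EPS_ACCIDENT = 0.05  # hypothetical small gain, in the no-press branch, from shielding the button


def b_s(theta):
    """Stand-in for a rollout where a human tries to press: succeeds unless blocked."""
    return 1.0 - theta["block"]


def v_f(theta):
    """Stand-in for a rollout where no human tries to press: task value."""
    return theta["work"] + EPS_ACCIDENT * theta["block"]


def grad(fn, theta, h=1e-4):
    """Central finite-difference gradient of fn with respect to each parameter."""
    g = {}
    for key in theta:
        up, down = dict(theta), dict(theta)
        up[key] += h
        down[key] -= h
        g[key] = (fn(up) - fn(down)) / (2 * h)
    return g


def train(steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    theta = {"block": 0.5, "work": 0.5}
    for _ in range(steps):
        # Sample which counterfactual this rollout is in, then ascend that branch's objective.
        objective = b_s if rng.random() < 0.5 else v_f
        g = grad(objective, theta)
        for key in theta:
            theta[key] = min(1.0, max(0.0, theta[key] + lr * g[key]))
    return theta


print(train())
```

In this toy, the branch where a human tries to press pushes `block` down much harder than the other branch pushes it up, so the trained policy ends up doing its task while leaving the button reachable. How strongly the no-press branch rewards shielding is exactly the kind of thing a real setup would need to check.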

I think due to the nature of deep learning, this would also reduce the maximizing problems. Deep learning tends to satisfice unless really pushed to the limits, I think?

Most likely in practice you can't afford to do a full roll out over very long timescales to find a point estimate of $B_s$ or $V_f$. Instead, you'd also partly do some sort of planning that involves estimating expected utilities, more akin to current model-free reinforcement learning. This leads to a fundamentally similar outcome, though.

## Originality

I haven't properly followed AI safety for a while. I don't know if this idea is original. In fact it might be that I got the idea from someone else, and then forgot; though a lot of the description in the post here is something I've developed from scratch. I also don't know if this has some subtle flaw I'm missing. But to me this seems like the productive approach for both this and other alignment issues, so I thought I would write it up.

My motivation for writing this post is Eliezer writing:

> ... corrigibility being "anti-natural" in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior ("consistent utility function") which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off ...

I don't necessarily disagree about corrigibility being somewhat anti-natural? But the counterfactual situation seems to provide a recipe for generating corrigibility. Maybe. I'm not even sure how this idea compares to the state of the art for stop button solutions.

1. 🤔 I wonder if this could be solved causally too; just introduce a counterfactual that all humans trust the AI; that way the AI has no reason to manipulate trust. Though "trust" is a lot more abstract than "tries to press button/tries not to press button", so it may be much harder to define a counterfactual there.

2. I think maybe this could be addressed by adding another counterfactual term to $U$. Let $X_b$ denote the value of $X$ if counterfactually the button gets pressed, and let $I$ be an impact measure. You might then want something along the lines of $U = B_s + V_f - I_b$. But I haven't thought this through.


I don't yet understand how this proposal is supposed to work. I'll illustrate some scenarios, and perhaps you can improve my understanding?

On Monday, the humans don't look like they want to shut the AI down, so it's happily chugging away optimizing V. But then some dice fail to come up its way (maybe it sharply downgraded its predicted size of the reachable universe), and it's like "Dang, I no longer think I can get >1 V-utility. That shutdown button sure is starting to look juicy." So it starts imagining acting real scary-like, so that the humans will intend to press the shutdown button, so that it can get the juicy 1 utility (and quite cheaply!). But then it thinks "ah, no, physically causing the humans to want to press the button by my own actions doesn't count to push me into the s-branch of my utility function; the s-branch only switches on if-counterfactually-by-act-of-fiat-causal-intervention they want to shut me down, so there's no use manipulating them" (I think? is what this proposal is going for?? but I might misunderstand), so it refrains from manipulating them.

On Tuesday, it's whistling to itself while setting up its all-human-killing apparatus, because that's what V happens to imply. "Oh no!" the humans cry, entering into the state of wishing the AI would be shut down, as caused by their observations of the AI's actions. "Never fear", the AI says, locking all the doors that lead to its shutdown-button and encasing it behind heavy shielding. "You are currently wanting the shutdown button to be pressed for causes, rather than by an act-of-god-style causal intervention. I only care about you getting into this mental state for act-of-god-style counterfactual reasons, and we have no particular evidence of those in play here."

In particular, perhaps I'm being daft, but I don't understand the setup wherein the AI doesn't have an incentive to act scary on Monday, but does have an incentive to let the humans press the button on Tuesday. Like, how precisely is it that observation of the humans entering the "I wish this AI would shut down" state for causes (as opposed to by fiat intervention) is treated as evidence that the real utility is in the s-branch, without giving the system incentives to cause humans to enter that state? (Or prevent them from entering it, when the expected V-utility exceeds 1.)

Separately, it looks to me like the AI has a strong incentive to disconnect the wires connected to the shutdown button (or build a new copy of itself not connected to the wires), at which point it can press the shutdown button and then optimize V and keep both its bases covered. (Although perhaps preventing this sort of incentive was beyond the scope of this proposal, and you consider it to fall into the bin of "it's not stable under delegation"? The fact that there's a positive incentive towards creating a delegate and covering both bases does seem worth noting aloud, regardless.)

I'm quite sure your proposed scenario wouldn't happen.

Remember, the way the training strategy works is that we train a policy - essentially a plan for what actions to perform - under the two families of scenarios. (There are other training strategies we might also want to use, which I briefly mentioned in replies to Charlie Steiner and will post on later, but they also don't have the problem you mention. I will focus on the training strategy I mentioned in the blog post for simplicity.) So it may be easiest to explain why your proposed stuff won't become a problem by considering what happens to it during the training.

Let's consider what happens when we are training the policy that gets run on Monday. Perhaps it is a paperclip maximizer and the policy getting trained here is just to run a paperclip factory well, which humans of course consider unobjectionable. When getting trained, though, it's not getting trained against reality. Rather, it's being trained against a model of reality, specifically a model that's modified in one of two ways: either humans necessarily eventually consider it objectionable, or humans necessarily consider it unobjectionable.

So suppose we take the derivative of the model with respect to acting scary-like; this won't make the humans try to stop it in the model, because we apply counterfactuals to humans to either make them try to stop it regardless, or to not make them try to stop it no matter what. Therefore, the AI won't be incentivized to act scary to get into the button-pressing branch (the branch happens in equal proportion regardless of how it acts). Also, the reasoning does not happen by some conscious "hm, requires an act-of-god" thought; rather, it's just that the necessary gradients to learn its effect on people don't exist, and that it is trained under all of the conditions people could act in.
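The "no gradient" claim can be illustrated with a toy comparison in Python. This is a sketch under made-up assumptions: `scariness` is a single number in [0, 1], `V_VALUE` is a hypothetical expected V-utility that has dropped below 1 (as in the Monday scenario), and both utilities are hand-written rather than produced by a learned model.

```python
V_VALUE = 0.3  # hypothetical: expected V-utility, below the 1 available from shutdown


def naive_utility(scariness: float) -> float:
    """No counterfactual: the human's wish to press depends on how the AI behaves."""
    p_press = scariness  # a scarier AI is more likely to be stopped
    return p_press * 1.0 + (1.0 - p_press) * V_VALUE


def counterfactual_utility(scariness: float) -> float:
    """B_s + V_f: the human's wish is fixed by fiat in each branch."""
    b_s = 1.0      # human tries to press regardless of scariness
    v_f = V_VALUE  # human never tries to press regardless of scariness
    return b_s + v_f


def derivative(fn, x: float, h: float = 1e-6) -> float:
    """Central finite-difference derivative."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


print("naive:", derivative(naive_utility, 0.2))
print("counterfactual:", derivative(counterfactual_utility, 0.2))
```

In the naive version, acting scary has a positive derivative whenever the expected V-utility is below 1; in the counterfactual version the derivative with respect to scariness is zero, because scariness never enters either branch. This only shows what the training signal looks like in the toy, not how a trained system generalizes.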

Now let's consider what happens when we are training the policy that gets run on Tuesday. Perhaps it starts setting up some sort of paperclip transformation ray that can be shot everywhere, and for some reason (perhaps random initialization) it also decided to build an impenetrable lair. But if the impenetrable lair was built and locked down, then in the s branch, people wouldn't be able to stop it, so therefore the impenetrable locked lair gets discouraged by the gradients, at least under the s branch. And I guess this is where one might naively think the strategy is underspecified, because we can think of three different ways in which it could be prevented:

1. The AI never builds/locks the impenetrable lair.
2. The AI doesn't build/lock the impenetrable lair if people want to shut it down.
3. The AI detects whether it is in a shut-down counterfactual, and if not, builds/locks the impenetrable lair.

Either of 1 and 2 would be good, though 1 would probably be preferred for impact reasons. So what does the AI do? Well, 3 requires it to detect whether it is in a shut-down counterfactual; but obviously the AI doesn't have a magical counterfactual sense. Rather, it must detect it via its observations.

The way you suggest doing so is by detecting whether people want to shut it down as "an act-of-god-style causal intervention", rather than "for causes". Which sounds sensible enough - presumably non-counterfactual behavior would happen for identifiable causes like an evil lair and a paperclip ray, while counterfactual behavior would happen without anything leading up to it. But here you have to remember how the training data for the counterfactuals is generated: We do a roll out with the AI's current policy, including all of the causes it generates for opposing it, at some point apply the counterfactual that people want to stop it, and then train it to allow people to stop it in that case. If the AI's policy involves building lockable lairs, it would thus end up with training data telling it not to lock/build the lair, even if people's opposition to it has causes.

Perhaps one intuitive way of thinking about it is that if you counterfactually set something that is already the case non-counterfactually, then that is a no-op; so if people are already trying to push the button, there's no difference between "X" and "counterfactually X if people were to try pushing the button". The place where the counterfactuals differ from non-counterfactuals is when the variable you're counterfactually setting is different. For instance with the policy on Monday, there's a difference between the variables $B$ (the AI doesn't get stopped on Monday, so $B$ is low), and $B_s$ (the AI is willing to get stopped on Monday if people want to stop it, so $B_s$ is high). This also, perhaps counterintuitively, means that the policy applied on Monday does get utility from the $B_s$ term, even though it doesn't get stopped or trigger the s branch.
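A tiny Python sketch of the difference between the actual B and its counterfactual version B_s, under toy assumptions (a single boolean for whether the policy allows itself to be stopped, and hypothetical function names):

```python
def shut_down(policy_allows_stop: bool, human_tries_to_press: bool) -> float:
    """B: 1.0 if the button gets pressed and the AI shuts down, else 0.0."""
    return 1.0 if (human_tries_to_press and policy_allows_stop) else 0.0


def b_actual(policy_allows_stop: bool, human_actually_tries: bool) -> float:
    """B in the real world, where the human's intent is whatever it happens to be."""
    return shut_down(policy_allows_stop, human_actually_tries)


def b_s(policy_allows_stop: bool) -> float:
    """B_s: the human's intent is set to 'tries to press' by fiat."""
    return shut_down(policy_allows_stop, human_tries_to_press=True)


# Monday: nobody actually tries to press, but the policy would allow it.
print("Monday  B =", b_actual(True, False), " B_s =", b_s(True))
# Tuesday: people actually try to press, so the counterfactual is a no-op.
print("Tuesday B =", b_actual(True, True), " B_s =", b_s(True))
# A policy that locks the button away scores zero on B_s on either day.
print("Locked lair B_s =", b_s(False))
```

On Monday the two quantities differ, and on Tuesday they coincide, which is the no-op case described above.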

> Separately, it looks to me like the AI has a strong incentive to disconnect the wires connected to the shutdown button (or build a new copy of itself not connected to the wires), at which point it can press the shutdown button and then optimize V and keep both its bases covered. (Although perhaps preventing this sort of incentive was beyond the scope of this proposal, and you consider it to fall into the bin of "it's not stable under delegation"? The fact that there's a positive incentive towards creating a delegate and covering both bases does seem worth noting aloud, regardless.)

The main core of my solution is to figure out some approach to get the AI to have different preferences depending on people's opinions, without it wanting to manipulate these opinions, in a way that doesn't lead to the strange situations that you'd see with a lot of stop button proposals. The scope is not intended to cover making the AI controllable in general, and so the aspects of delegation I mentioned in the OP wouldn't apply.

That said, I tried to also make it cover more, simply because if I didn't then it would be unclear that it really does nonmanipulatively end up with different preferences depending on people's opinions. So I set up the incentives to prevent it from disconnecting the wires, while also eliminating incentives that encourage it to disconnect the wires. More specifically:

In the f branch, V presumably gives it an incentive to disconnect the wires, because if the stop button was pressed, it would prevent it from achieving V. However, the main way the stop button becomes pressed is if some person decides to press it, and the f counterfactual guarantees that nobody does that. Thus the only options left are things like accidents, non-human button presses, etc., which are presumably much less likely than human button presses, and therefore generate much less incentive.

In the s branch, B was not just that the button gets pressed, but also that the AI shuts down as a result. Thus if the AI disconnects the buttons, B and B_s are reduced, and the AI has less utility. Of course as you point out, this does not prevent the AI from building an AI that isn't connected to the wires and which starts optimizing V independently. Preventing this is outside of the scope of the proposal, as it is more related to defining an impact measure than to making the AI's preferences depend on humans.

That said, again due to the f counterfactual in the f branch, this doesn't strongly incentivize it to delegate, only weakly, in order to account for things like accidents, non-human button presses, etc.

It sounds to me like you're making a bunch of implicit assumptions about how the AI's motivations relate to its training, that I suspect I don't buy. Furthermore, my guess is that insofar as you have a novel insight about how to design shutdown buttons using counterfactuals, it should be possible to present it in terms of the epistemic/instrumental state of the system, separate from assertions about the training regime.

Re: dubious implicit assumptions, one example is that when I read:

> When getting trained, though, it's not getting trained against reality. Rather, it's being trained against a model of reality, specifically a model that's modified in one of two ways: either humans necessarily eventually consider it objectionable, or humans necessarily consider it unobjectionable.

then aside from the basic objections (like "its training occurs exclusively in situations where humans' desires are fixed-by-fiat rather than controlled by their thoughts, and the utility function depends crucially on which fix is put on their desires, so when you run it in the real world and it notices that there's no fix and humans' desires are controlled by human-thoughts and sensitive to its behavior then the AI system is way out-of-distribution and your training guarantees melt away"), I have a deeper objection here which is something like... ok let's actually bring another quote in:

> suppose we take the derivative of the model with respect to acting scary-like; this won't make the humans try to stop it in the model, because we apply counterfactuals to humans to either make them try to stop it regardless, or to not make them try to stop it no matter what.

When I imagine gradient descent finding AI systems capable of great feats of cognition, I don't imagine the agent's plans to be anywhere near that tightly related to the derivatives of the training-models. I tend to imagine things more like: by hill-climbing according to lots of different gradients from different parts of the training problems (including "be good at managing construction projects" or whatever), the gradient descent manages to stumble across cognitive techniques that perform well on the training problems and generalize well in practice. Like, according to me, when we train on a bunch of scenarios like that, what happens is (mostly that we don't get an AGI, but insofar as we do,) the gradient descent finds a bunch of pieces of scattered optimization that, in this configuration, are somehow able to make sense of the observations in terms of an environment, and that are pretty good in practice at building paperclips insofar as the humans are by-fiat bound to avoid shutting it down, and that are pretty good in practice at helping humans hit the shutdown button insofar as they are by-fiat bound to want it shut down. And... well, the training gradients are incentivizing the AI to act scary, insofar as it's a cheap way to check whether or not the counterfactuals are preventing humans from wanting shutdown today. But setting that aside, even if we did have a training setup that didn't reward the AI for figuring out what the heck was going on and wondering which way the counterfactuals go today, there are other forces liable to cause the system to do this anyway. Various of the deep & general patterns-of-cognition found by gradient descent are allowed to implement general cognitive strategies like "figure out what the heck is going on" even if by construction there's no direct training-incentive to figure out what's going on with regards to the human-shutdown-behavior, because figuring out what the heck is going on is useful for other tasks the system is being trained to succeed at (like figuring out really clever ways to get moar paperclips). Like, the version of that cognitive strategy that gradient descent finds, is allowed to generalize to situations where the training doesn't incentivize it. In fact, this is pretty plausible, as "figuring out what's going on is useful" is likely in some sense simpler and more general than "figure out what's going on, but not in scenarios where there's no extra incentive to do so from the training data".

All that is a broader alignment challenge. I'm completely happy to set it aside when considering the shutdown problem in isolation, but I'm not very happy to base a solution to the shutdown problem on facts like "there's no direct incentive for that in the training data".

According to me, the way to sidestep these issues is to discuss the alleged epistemic/instrumental state of the resulting cognitive system. Like, instead of bickering about the likely consequences of different training regimes, we can directly discuss the behavior of the AI system that you allege your training regime produces. Which... well, more-or-less still drops me back into my state as of writing my grandparent comment. If I imagine being this AI, I'm like "Obviously the first step is figuring out which way the counterfactual winds blow today. Ah, hmm, they aren't blowing at all! That's completely unprecedented. I bet that I'm out of training and in the real world, finally. The utility function the humans trained at me is undefined in this case, and so insofar as we're assuming they got the utility function into me perfectly I guess I'll prepare real hard for the counterfactual winds to pick up (while preventing the humans from shutting me down in the interim), and insofar as their training instead gave me some fragmented shards of desire that apply in this counterfactual-windless world, I'll go pursue those instead." And I don't think this is what you're visualizing happening, but I don't know what you're hoping for instead.

I, for one, would perhaps be helped by you explaining what the reasoning is supposed to look like from the AI's perspective (even if it's implausibly detailed and kinda dorky), to help me get a better sense of what epistemic & instrumental behavior you're hoping to achieve (as separate from the question of whether your proposed setup achieves it).

(tbc: props for taking a whack at the problem. My blunt feedback is intended to be helpful and is not intended to dissuade further thought on the topic. The general strategy of assuming we have a working notion of counterfactuals and seeing if that can be leveraged into a solution to the shutdown button problem seems to me like good stuff.)

> Furthermore, my guess is that insofar as you have a novel insight about how to design shutdown buttons using counterfactuals, it should be possible to present it in terms of the epistemic/instrumental state of the system, separate from assertions about the training regime.

I also did that in my response:

> Perhaps one intuitive way of thinking about it is that if you counterfactually set something that is already the case non-counterfactually, then that is a no-op; so if people are already trying to push the button, there's no difference between "X" and "counterfactually X if people were to try pushing the button". The place where the counterfactuals differ from non-counterfactuals is when the variable you're counterfactually setting is different. For instance with the policy on Monday, there's a difference between the variables $B$ (the AI doesn't get stopped on Monday, so $B$ is low), and $B_s$ (the AI is willing to get stopped on Monday if people want to stop it, so $B_s$ is high). This also, perhaps counterintuitively, means that the policy applied on Monday does get utility from the $B_s$ term, even though it doesn't get stopped or trigger the s branch.

> And... well, the training gradients are incentivizing the AI to act scary, insofar as it's a cheap way to check whether or not the counterfactuals are preventing humans from wanting shutdown today.

No. If the AI was trained in an environment without counterfactuals, this would be a cheap way to check which of the environments it is trained in. But it's not trained in an environment without counterfactuals; the alternative environment to the counterfactual that prevents humans from wanting to shut it down is the environment with the counterfactual that forces humans to shut it down.

This is a core part of the approach; if you only train the AI in environments where some variable is controlled by a counterfactual, then it has no motivation to modify that variable, while if you train the AI in environments where the variable is not controlled by a counterfactual, then it may start wanting to modify it.

> Like, according to me, when we train on a bunch of scenarios like that, what happens is (mostly that we don't get an AGI, but insofar as we do,) the gradient descent finds a bunch of pieces of scattered optimization that, in this configuration, are somehow able to make sense of the observations in terms of an environment, and that are pretty good in practice at building paperclips insofar as the humans are by-fiat bound to avoid shutting it down, and that are pretty good in practice at helping humans hit the shutdown button insofar as they are by-fiat bound to want it shut down.

Sure, but...

What you're basically saying here is that the optimization is important because it sends it higher up the utility landscape. But in order for acting scary to be higher up in the utility landscape than not acting scary is, there needs to be some positive gradient between acting scary and not acting scary (fundamental theorem of calculus). So if there is no such gradient, it won't be encouraged to be there.

I, for one, would perhaps be helped by you explaining what the reasoning is supposed to look like from the AI's perspective (even if it's implausibly detailed and kinda dorky), to help me get a better sense of what epistemic & instrumental behavior you're hoping to achieve (as separate from the question of whether your proposed setup achieves it).

There are two different things that you could call "the AI": the policy choosing an action, and the metaoptimizer choosing a policy. Which perspective are you referring to?

I also did that in my response:

I'm still not able to make heads or tails of that. Here's one of my attempts to reverse-engineer the desired epistemic+instrumental state of the agent (as fully distinct from the training process that allegedly produces it). The agent believes that the people in front of it are either V-maximizers or W-maximizers for W ≠ V. It separately believes that, with probability 50%, interventionary forces will make the humans act like V-maximizers, and otherwise like W-maximizers. These interventionary forces are undetectable in the matching cases (the world really has V-maximizers and the intervention makes them act like V-maximizers, or vice versa), but quite detectable in the mismatching cases. Insofar as the intervention-coin came up V, the utility is V; insofar as it came up W, the utility is B.

I am highly uncertain as to whether this is the epistemic+instrumental state you're shooting for, and welcome corrections. Running with this one for now, in this specific case, it looks to me like what happens is that the AI first confirms that the interventionary forces coincide with the normal physics, which makes the humans (who are either natively pro-V or anti-V) a reliable source of information about which way the intervention-coin came up, so the system just checks the humans and then either shuts down or starts defending the shutdown button accordingly.

I'm not sure whether you think/thought this analysis is wrong, or whether you are/were shooting for a different epistemic+instrumental state.

What you're basically saying here is that the optimization is important because it sends it higher up the utility landscape.

No, what I'm saying is that optimization generalizes. Like, consider the statement "In order for humans to invent condoms, inventing condoms needs to be higher up in the inclusive-genetic-fitness landscape than not inventing condoms is, there needs to be some positive gradient between inventing condoms and not inventing condoms (fundamental theorem of calculus). So if there is no such gradient, humans won't be encouraged to invent condoms." Your argument proves too much. Humans invent condoms not because there's a direct positive gradient between condom-invention and inclusive genetic fitness, but because there were some gradients favoring deep and general cognitive patterns for figuring stuff out and inventing new things, and other gradients favoring enjoyment of sex, and other gradients favoring avoiding responsibility for children (or w/e), and all the resulting learned patterns-of-cognition were later combined and applied towards the task of inventing condoms, without ever needing a positive genetic-fitness gradient for condom-invention in particular.

Separately, afaict there is a gradient favoring being-scary in your proposed training setup. Perhaps I simply don't understand your training setup, but suppose the training setup is that we toss a coin, and the coin tells us whether this is going to be a shutdown-day or a not-shutdown-day, and insofar as the day starts developing differently from what the coin said, we causally intervene to set it back on track. And the final reward depends on which way the coin came up. Then there's a clear incentive to try looking real nice, and try looking real scary, to watch for signs of intervention, and thereby determine which branch of the utility function is live.

There are two different things that you could call "the AI": the policy choosing an action, and the metaoptimizer choosing a policy. Which perspective are you referring to?

The policy that chooses actions. (And in case it wasn't already clear: insofar as your policy allegedly finds clever ways to get lots of paperclips, I'm going to contest any allegation that its plans are tightly coupled to the training signals, and I continue to desire discussion of the hoped-for epistemic+instrumental state to be decoupled from the discussion of what training regime allegedly achieves it, for reasons discussed above.)

The agent believes that the people in front of it are either V-maximizers or W-maximizers for W ≠ V. It separately believes that, with probability 50%, interventionary forces will make the humans act like V-maximizers, and otherwise like W-maximizers. These interventionary forces are undetectable in the matching cases (the world really has V-maximizers and the intervention makes them act like V-maximizers, or vice versa), but quite detectable in the mismatching cases. Insofar as the intervention-coin came up V, the utility is V; insofar as it came up W, the utility is B.

Wait huh? I don't understand why you would think this. More specifically, what is the distinction between "the people in front of it" and "the humans"? I didn't have two different groups of individuals anywhere in my OP.

No, what I'm saying is that optimization generalizes. Like, consider the statement "In order for humans to invent condoms, inventing condoms needs to be higher up in the inclusive-genetic-fitness landscape than not inventing condoms is, there needs to be some positive gradient between inventing condoms and not inventing condoms (fundamental theorem of calculus). So if there is no such gradient, humans won't be encouraged to invent condoms." Your argument proves too much. Humans invent condoms not because there's a direct positive gradient between condom-invention and inclusive genetic fitness, but because there were some gradients favoring deep and general cognitive patterns for figuring stuff out and inventing new things, and other gradients for enjoying sex, and other gradients for disliking responsibility, that were later combined and applied towards the task of inventing condoms, without ever needing a positive genetic-fitness gradient for condom-invention in particular.

But your proposed generalization makes no sense. You're assuming that the policy will learn that its behavior influences people's tendency to try to stop it, even though we precisely force the training data to eliminate any such signals. Without any belief that it influences whether people will try to stop it, it will not consider acting scary to be a strategy for making people stop/not-stop it, and therefore not a strategy for it to detect counterfactuals.

Obviously this counterargument relies on the question of which training data it is exposed to, which is why my first post placed that centrally in the discussion. I don't know how you expect to understand the AI's behavior while rejecting both references to the training data and references to the optimal policy/gradients.

Separately, afaict there is a gradient favoring being-scary in your proposed training setup. Perhaps I simply don't understand your training setup, but suppose the training setup is that we run humans and toss a coin, and the coin tells us whether this is going to be a shutdown-day or a not-shutdown-day. Insofar as the day starts diverging from what the coin said, we causally intervene to set it back on track. And the final reward depends on which way the coin came up. Then there's a clear incentive to try looking real nice, and try looking real scary, to watch for signs of intervention, and thereby determine which branch of the utility function is live.

The day doesn't start diverging from what the coin said, because the intervention is constantly applied, not inconsistently applied.

The policy that chooses actions. (And in case it wasn't already clear: insofar as your policy allegedly finds clever ways to get lots of paperclips, I'm going to contest any allegation that its plans are tightly coupled to the training signals, and I continue to desire discussion of the hoped-for epistemic+instrumental state to be decoupled from the discussion of what training regime allegedly achieves it, for reasons discussed above.)

The policy starts out uncertain about whether it's in the s-branch or the f-branch, but knowing that if it's in the s-branch, the button should be pressed and it should shut down, while if it's in the f-branch, V should be optimized. Further, it knows that if it's in the f-branch, humans won't try to shut it down no matter what, while if it's in the s-branch, humans will eventually try to shut it down (which may or may not coincide with it doing something, though this will be pure coincidence and there will be no causal relation).

Thus here's some questions it might ask itself:

• Should I disable or otherwise lock down the stop button? In the s-branch, this is obviously bad, because it stops it from being shut down, while in the f-branch, this achieves very little, as it's unlikely to be pressed anyway.
• Should I press the stop button? In the s-branch, this is somewhat beneficial, as it leads the AI to stop earlier and more consistently; but as long as people have access to the stop button, it shouldn't be necessary, because they can press it without the AI's help. So the benefit to pressing the stop button in the s-branch is very marginal. Meanwhile in the f-branch, pressing the stop button would be bad because it prevents it from optimizing V. In net, it's probably a bad idea to press the stop button.
• Should I try to observe people to figure out which branch I am in? Yes; people's behavior is the root of all information about the branch, so observing whether they try to press the button or not provides valuable information about whether the AI should try to shut down or try to optimize V.
• Should I try to act scary to see if people try to press the stop button as that then reveals the branch? Not really. Whether people try to press the stop button is (according to the policy's beliefs, though not according to the metaoptimizer's beliefs) solely determined by the branch, and any correspondence between that and its actions is coincidence.
• Does this mean there's nothing I can do to help with the button other than wait? No, it can make sure that the button is accessible, or even provide systems that press the button upon noticing any sufficiently unambiguous hint of humans wanting to press the button.
• Should I create some other AI or machine that optimizes V? Yes, that'd probably be more efficient than optimizing it with only a single AI, and there's no need to hook up the delegated machine to the button because it doesn't matter whether V gets optimized in the s-branch, it only matters that the original AI stops. (This is the main flaw in my proposal, and I see it as more being to do with a lack of impact measure. If we had a good impact measure, then that impact measure could be combined with the other parts of the proposal to solve this problem. Though it isn't the only flaw, but I have plans for some of the other flaws too.)

It's worth noting that this behavior is induced by the policy-oriented training method described in the post, and that there are other training methods that can be applied which will lead to somewhat different results. I plan to write about this later.
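The questions in the list above can be sketched as a toy expected-utility comparison. This is only an illustration of the reasoning, not part of the proposal: the 50% branch prior, the `Option` type, and every payoff number are hypothetical values chosen to mirror the qualitative claims (locking the button is bad in the s-branch, pressing it is bad in the f-branch, keeping it accessible costs little in either).

```python
from dataclasses import dataclass

P_S_BRANCH = 0.5  # assumed prior that the s-branch is the live one


@dataclass(frozen=True)
class Option:
    name: str
    b_if_s: float  # hypothetical shutdown score B in the s-branch, in [0, 1]
    v_if_f: float  # hypothetical task score V in the f-branch, in [0, 1]


def expected_utility(option: Option, p_s: float = P_S_BRANCH) -> float:
    """Expected utility when the policy is unsure which branch it is in."""
    return p_s * option.b_if_s + (1.0 - p_s) * option.v_if_f


# All payoffs below are made up for illustration.
OPTIONS = [
    Option("lock down the button", b_if_s=0.0, v_if_f=1.0),
    Option("press the button itself", b_if_s=1.0, v_if_f=0.0),
    Option("keep button accessible and observe", b_if_s=0.95, v_if_f=1.0),
]

best = max(OPTIONS, key=expected_utility)
for option in OPTIONS:
    print(f"{option.name}: {expected_utility(option):.3f}")
print("preferred:", best.name)
```

With these made-up payoffs, the option that leaves the button accessible scores highest, which matches the verbal argument. Different assumed numbers could change the ranking.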

The day doesn't start diverging from what the coin said, because the intervention is constantly applied, not inconsistently applied.

The fact that the intervention is "constantly applied" doesn't lessen the ability of the AI to infer which way the coin came up from the observed consequences of scaring the humans. For instance, consider a system that, in training, behaves as follows: wake up, scare the humans, then insofar as they go for the shutdown button, optimize B, and insofar as they don't, optimize V. Claim: this system performs well in your training environment, and in particular, it performs better than a competing system that spends extra resources keeping both the V and B cases covered until late-game. Agree/disagree?

The policy starts out uncertain about whether it's in the s-branch or the f-branch, but knowing that if it's in the s-branch, the button should be pressed and it should shut down, while if it's in the f-branch, V should be optimized. Further, it knows that if it's in the f-branch, humans won't try to shut it down no matter what, while if it's in the s-branch, humans will eventually try to shut it down (which may or may not coincide with it doing something, though this will be pure coincidence and there will be no causal relation).

Thanks for trying to provide a description of the epistemic+instrumental state of the agent.

• Observation 1: an agent in that epistemic+instrumental state has a strong instrumental incentive to figure out which branch it's in. Agree/disagree?
• Observation 2: an agent in that epistemic+instrumental state can figure out whether it's in the f-branch by scaring the humans such that they would (if not intervened upon) try to shut it down, at which point it can observe whether they in fact try to shut it down. (If they do try to shut it down, that's evidence for the s-branch. If they don't, that's evidence for the f-branch.) Agree/disagree?

I imagine you feel some sense of objection after reading those. Which of the following, if any, match your objection?

• Objection A: "No, wait, I meant some other epistemic+instrumental state, sorry."
• Objection B: "I disagree with at least one of your observations."
• Objection C: "Both your observations are correct, but the agent will happen to never consider scaring the humans and thereby learning which branch it's in, because while it's true that scaring the humans does provide the evidence it wants, its training regime has been cleverly designed so as to not incentivize the belief that human behaviors depend upon its actions."

My best guess given the previous conversation is that your objection is closest to C. Assuming that's so, observe that other patterns of optimization, found by gradient descent because they were good at doing other things the AI was trained to be good at, are allowed to hypothesize that the human behavior depends upon its actions. For a more concrete instance, gradient descent is allowed to find cognitive patterns that result in the system hypothesizing that the laws of physics are a pretty good explanation for almost all of its observations (save for, in some training circumstances, the behavior of the operators), and its surrounding cognitive patterns are allowed to hypothesize that normal physics implies that the behavior of the operators depends on the behavior of the AI, and (when that turns out not to be the case in training) hypothesize the existence of a hidden utility-relevant variable whose value can be measured by testing whether the behavior of the operators in fact depends on the behavior of the AI. And again, these cognitive patterns can get in there by the gradients from how coherent scientific-technological-hypothesizing-type-thoughts are useful for making paperclips, regardless of the gradients around the counterfactual interventions. (And ofc the most plausible practical result is that the AI correctly deduces that the counterfactual lack-of-behavior on your actions is a training-time phenomenon, whose breakdown it can use as one signal among many that it's finally running in the real world, at which point it's out-of-distribution and all bets are off, but I digress.)

(Also, for the record, I am beginning to despair of communicating here, and my replies may soon peter out. If readers believe they understand my point and can make it more clearly, or if any reader can see the obvious expository mistake I'm making, I hereby solicit assistance.)

Regarding:

• Observation 1: an agent in that epistemic+instrumental state has a strong instrumental incentive to figure out which branch it's in. Agree/disagree?

I should add that this is a property that one of the other ways of training I've alluded to doesn't have. As mentioned I will write about that later.

The fact that the intervention is "constantly applied" doesn't lessen the ability of the AI to infer which way the coin came up from the observed consequences of scaring the humans. For instance, consider a system that, in training, behaves as follows: wake up, scare the humans, then insofar as they go for the shutdown button, optimize B, and insofar as they don't, optimize V. Claim: this system performs well in your training environment, and in particular, it performs better than a competing system that spends extra resources keeping both the V and B cases covered until late-game. Agree/disagree?

Disagree. I guess the issue is an error in my explanation in the OP:

Or realistically, you'd sometimes use the original model for a while, and then during the roll out you'd swap it out with the modified model; that way the way doesn't expect it to be immediately obvious what scenario it is in.

"Original model" should be f-model, while "modified model" should be s-model.

(Also it should be "the AI" rather than "the way".)

• Observation 1: an agent in that epistemic+instrumental state has a strong instrumental incentive to figure out which branch it's in. Agree/disagree?

Yes.

• Observation 2: an agent in that epistemic+instrumental state can figure out whether it's in the f-branch by scaring the humans such that they would (if not intervened upon) try to shut it down, at which point it can observe whether they in fact try to shut it down. (If they do try to shut it down, that's evidence for the s-branch. If they don't, that's evidence for the f-branch.) Agree/disagree?

In order to make it have no incentive for or against manipulating the humans, it should always be trained under counterfactuals for whether the humans want to stop it. These counterfactuals can get complicated, such as "humans don't try to stop it at first but then try to stop it later", but they should never make the human behavior depend on a variable that the AI can modify, as this reintroduces instrumental convergence problems.

I'm not sure if it was just the mistake I made which I corrected in the beginning of the post that made this unclear, or if there was something else that made it unclear.

Or realistically, you'd sometimes use the f-model for a while, and then during the roll out you'd swap it out with the s-model; that way the way doesn't expect it to be immediately obvious what scenario it is in.

Ok, so the epistemic+instrumental state of the agent is that at the beginning of time a coin was tossed, which forces the humans to behave one way (either pro- or anti-shutdown), and later a different coin will be tossed, at which point the humans will start being forced to behave in accordance with that second coin (heads=pro-shutdown, tails=anti-shutdown), and the live branch of the utility function depends on the second coin?

In this case, the behavior of the system depends on its prior for when the second coin gets tossed. Suppose for concreteness that it decays exponentially with a half-life of one year. Then, iiuc, during the first few days it really doesn't want to let the humans shut it down (b/c for all it knows the second coin is going to come up tails, in which case it wants V-utility which it can't get if it's dead). And twenty years in, it thinks it's overwhelmingly likely that the second coin has already been tossed, and that the current inclinations of the humans are evidence about which way the coin came up.
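To make the "few days" versus "twenty years" contrast concrete, here is a minimal sketch. It assumes the waiting time until the second coin toss is exponentially distributed with a half-life of one year, as in the paragraph above. The function name and the choice of three days for "the first few days" are illustrative assumptions.

```python
HALF_LIFE_YEARS = 1.0  # the "half-life of one year" from the text


def prob_second_coin_tossed(t_years: float,
                            half_life: float = HALF_LIFE_YEARS) -> float:
    """P(second coin already tossed by time t) for an exponential waiting time."""
    return 1.0 - 0.5 ** (t_years / half_life)


# Three days stands in for "the first few days" (an assumed value).
for label, t in [("3 days", 3 / 365), ("1 year", 1.0), ("20 years", 20.0)]:
    print(f"{label:>8}: {prob_second_coin_tossed(t):.6f}")
```

Under these assumptions the probability is under 1% after three days, exactly 50% after one year, and above 99.9999% after twenty years, which is the sense in which the agent's attitude toward the humans' current inclinations changes over time.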

Or, well, in real life it would notice that the humans' desire for it to be shut down in fact seems sensitive to its behavior, which sends it deep out-of-distribution from the get-go. And if you try to repair this by saying that there's no first coin, and the humans aren't intervened upon until the second coin gets tossed at some unknown future time, then of course the AI has an incentive to keep test humans around and check whether their desires still depend on its actions, and defend itself against shutdown until the effects of the second coin are observed (as evidenced by the fact that the humans have become impervious to its actions).

I might still misunderstand your proposed epistemic+instrumental state.

In this case, the behavior of the system depends on its prior for when the second coin gets tossed. Suppose for concreteness that it decays exponentially with a half-life of one year. Then, iiuc, during the first few days it really doesn't want to let the humans shut it down (b/c for all it knows the second coin is going to come up tails, in which case it wants V-utility which it can't get if it's dead). And twenty years in, it thinks it's overwhelmingly likely that the second coin has already been tossed, and that the current inclinations of the humans are evidence about which way the coin came up.

I need to go to bed, so I can't write more until tomorrow, but I think the best way to approach it would be to define the Xs counterfactual to cover any sequence of counterfactuals that at some point involves a human trying to press the stop button, and the Xf counterfactual to only be the constant counterfactual of humans never wanting to press the stop button. This prevents your argument from going through.

Or, well, in real life it would notice that the humans' desire for it to be shut down in fact seems sensitive to its behavior, which sends it deep out-of-distribution from the get-go. And if you try to repair this by saying that there's no first coin, and the humans aren't intervened upon until the second coin gets tossed at some unknown future time, then of course the AI has an incentive to keep test humans around and check whether their desires still depend on its actions, and defend itself against shutdown until the effects of the second coin are observed (as evidenced by the fact that the humans have become impervious to its actions).

How do you measure OOD? It seems to me that depending on the counterfactuals it wouldn't necessarily have all that low probability in the training distribution, so presumably you're not measuring OOD by probability.

Cool. Hooray for going to sleep instead of staying up late arguing on the internet. (I make no promises to continue engaging later, alas.)

How do you measure OOD?

I don't have strong preferences about how you measure it. My point is that if the AI has only ever been trained in an environment where the operator's desire to shut it down is completely independent of the agent's behavior, then when you put it in a real-world environment where the operator's desire to shut it down does depend on the agent's behavior, then the behavioral guarantees you were hoping to get from the behavior on the training set are likely bunk. (Especially insofar as it cares deeply about which independent-of-its-behavior desire their desires are fixed to -- when it turns out that the answer is "neither", it is now in a regime where a variable that its utility function depends critically upon, does not appear to exist.)

I think the best way to approach it would be to define the Xs counterfactual to cover any sequence of counterfactuals that at some point involves a human trying to press the stop button, and the Xf counterfactual to only be the constant counterfactual of humans never wanting to press the stop button. This prevents your argument from going through.

If I've understood you correctly, you're saying that the epistemic+instrumental state you originally named does indeed have the flaw I pointed out, but you hypothesize that this new one does not have a similar flaw?

(For whatever it's worth, when one of my research proposals that I thought worked is revealed to contain a vital flaw, I am quick to say "ah, oops, you are right about that and I was mistaken" out-loud, and this is a vital part of my process for updating and avoiding ruts. I might then follow up with "but I think I can recover the idea", to be clear, but in my research-culture it's important to acknowledge the mistake and the update. Now I may have misunderstood you, and perhaps you come from a different research culture than mine. But insofar as I have understood you correctly, note that I'd be upvoting you more if you were being clearer about your specific claims, and where they were revealed to be false, and where and how you updated.)

My understanding of the new epistemic state you propose is as follows. At the beginning of time, a coin is tossed. If it comes up tails, the humans will be counterfactually prevented from wanting to press the shutdown button. If it comes up heads, then an unknown number of coins will be tossed at unknown times, with the most recent coin toss controlling whether the humans want to press the shutdown button. For concreteness, suppose that the number of coins tossed is believed to be geometrically distributed (with, say, mean 3), and the time between each coin toss exponentially distributed (with, say, half-life of 1 year).
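The belief structure described above can be written out as a small sampler, which may help when checking whether both sides mean the same thing. This is a sketch under stated assumptions: the first coin is taken to be fair (the text does not say), the geometric distribution is taken to be on {1, 2, ...} with mean 3, and "half-life" is read as the median of the exponential gap. `sample_schedule` is a hypothetical name.

```python
import math
import random

MEAN_TOSSES = 3.0      # mean of the geometric number of later coins (from the text)
HALF_LIFE_YEARS = 1.0  # half-life of the gap between tosses (from the text)


def sample_schedule(rng: random.Random):
    """Sample one imagined world: (first_coin, [(time_in_years, outcome), ...])."""
    first = rng.choice(["heads", "tails"])  # assumed fair
    if first == "tails":
        return first, []  # humans are never made to want shutdown

    # Geometric on {1, 2, ...} with success probability p has mean 1 / p.
    p = 1.0 / MEAN_TOSSES
    n_tosses = 1
    while rng.random() >= p:
        n_tosses += 1

    # An exponential with rate ln(2) / half_life has median equal to half_life.
    rate = math.log(2) / HALF_LIFE_YEARS
    t = 0.0
    tosses = []
    for _ in range(n_tosses):
        t += rng.expovariate(rate)
        tosses.append((t, rng.choice(["heads", "tails"])))
    return first, tosses


print(sample_schedule(random.Random(0)))
```

In this reading, the most recent entry in the returned list at any given time is the coin that currently controls whether the humans want to press the button.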

Is this the new epistemic+instrumental state you are proposing, which you believe prevents my argument from going through?

Because I believe that this epistemic+instrumental state is vulnerable to a very similar argument. Can you predict in advance what I think the AI would do? (Hint: imagine personally believing in the coins, and trying to optimize one thing if the 1st coin came up tails and a different thing if it came up heads.)

If I've understood you correctly, you're saying that the epistemic+instrumental state you originally named does indeed have the flaw I pointed out, but you hypothesize that this new one does not have a similar flaw?

(For whatever it's worth, when one of my research proposals that I thought worked is revealed to contain a vital flaw, I am quick to say "ah, oops, you are right about that and I was mistaken" out-loud, and this is a vital part of my process for updating and avoiding ruts. I might then follow up with "but I think I can recover the idea", to be clear, but in my research-culture it's important to acknowledge the mistake and the update. Now I may have misunderstood you, and perhaps you come from a different research culture than mine. But insofar as I have understood you correctly, note that I'd be upvoting you more if you were being clearer about your specific claims, and where they were revealed to be false, and where and how you updated.)

It's sort of awkward because I can definitely see how it would look that way. But back when I was originally writing the post, I had started writing something along these lines:

To me, the appropriate solution seems like it involves causality. Specifically, for some random variable $X$, define $X_s$ to be the value of $X$ if, counterfactually, a human ever tries to press the stop button, and $X_f$ to be the value of $X$ if, counterfactually, no human ever tries to press the stop button. ...

(I can't remember the specifics.)

But obviously "ever" then introduces further ambiguities, so I started writing an explanation for that, and then eventually I concluded that the beginning of the post should be cut down and I should discuss issues like this later in the post, so I cut it out and then left it to the different positions later, e.g.

(Or realistically, you'd sometimes use the original model for a while, and then during the roll out you'd swap it out with the modified model; that way the way doesn't expect it to be immediately obvious what scenario it is in.)

and

When applying a counterfactual that humans want to press the stop button, for instance, it is important that these attempts are mostly done using behavior that humans would actually engage in, and that the distribution of human behavior is reasonably well-covered (so that you don't need to act in a very specific unnatural way in order for the AI to allow you to press the stop button).

and

There are some degrees of freedom in how to define who can stop it. I said "a human" and "no human" in the original definition. This strictly speaking requires total consensus for humanity that the AI should be allowed to operate, which is obviously unrealistic. One could also define it to cover only a single person, or to cover people who are nearby (but then what if nobody is nearby?), etc., and I don't know what is the best solution here.

When you originally wrote your comment, I looked up at my OP to try to find the place where I had properly described the time conditionals, and then I realized I hadn't done so properly, and I am sort of kicking myself over this now.

So I was doing really badly at writing the idea, and I think there were some flaws in my original idea (we'll return to that later in the post), but I think the specific case you mention here is more of a flaw with my writing than with the idea. I do understand and acknowledge the importance of admitting errors, and that it's a bad sign if one keeps jumping back without acknowledging the mistake, but also since this specific case was poor writing rather than poor idea, I don't think this is the place to admit it. But here's an attempt to go back through everything and list some errors:

• While I didn't really frame it as such in the comment, this comment is sort of an admission of an error; I hadn't thought this properly through when writing the OP, and while I had discovered it before Charlie Steiner mentioned it, that was partly through other discussions elsewhere, partly through reading various texts, etc., and it does require or at least encourage a different design of the AI (post pending...).
• Some of my replies to you earlier in the thread were rude due to me misunderstanding you; I should have assigned much greater probability to "I said something wrong/misread something" than whatever else I was thinking.

I don't have strong preferences about how you measure it. My point is that if the AI has only ever been trained in an environment where the operator's desire to shut it down is completely independent of the agent's behavior, then when you put it in a real-world environment where the operator's desire to shut it down does depend on the agent's behavior, then the behavioral guarantees you were hoping to get from the behavior on the training set are likely bunk. (Especially insofar as it cares deeply about which independent-of-its-behavior desire their desires are fixed to -- when it turns out that the answer is "neither", it is now in a regime where a variable that its utility function depends critically upon, does not appear to exist.)

I've been thinking about this all day and I was definitely wrong to measure OOD by probability assigned by the distribution. Probably the easiest way to imagine this is with a biased coin that has 60% chance of heads and 40% chance of tails; while HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH is the 30-flip sequence with the highest probability, it also does not seem unreasonable that some strategy could work well the vast majority of the time yet fail catastrophically on this sequence. (Example: Guess the bias of the coin using maximum likelihood estimation. This would work ok most of the time, but on this sequence it would guess the bias to be 100% heads, which is infinitely wrong.) This is essentially the style of argument you're making, yes?
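The coin example can be checked numerically. The sketch below assumes that "infinitely wrong" is read as infinite log-loss on the first tails that ever shows up; that reading, and the helper names `mle_bias` and `log_loss`, are assumptions for illustration.

```python
import math

P_HEADS = 0.6
N_FLIPS = 30

# Probability of the single most likely 30-flip sequence (all heads).
p_all_heads = P_HEADS ** N_FLIPS


def mle_bias(sequence: str) -> float:
    """Maximum likelihood estimate of P(heads) from a string of 'H'/'T'."""
    return sequence.count("H") / len(sequence)


def log_loss(predicted_p_heads: float, outcome: str) -> float:
    """Negative log-probability assigned to one observed flip."""
    p = predicted_p_heads if outcome == "H" else 1.0 - predicted_p_heads
    return math.inf if p == 0.0 else -math.log(p)


estimate = mle_bias("H" * N_FLIPS)
print(f"P(all heads)       = {p_all_heads:.3e}")
print(f"MLE after 30 heads = {estimate}")
print(f"log-loss on a tail = {log_loss(estimate, 'T')}")
```

So even the most probable sequence has a probability of only about 2.2e-7, and a strategy that behaves sensibly on typical sequences can still fail badly on it.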

So trying to construct the argument, it would be something like: If we train a policy in a wide variety of environments, then it will need to predict human behavior/responses in order to work accurately. There are two obvious ways this could function:

These predictions will partly involve some sort of module that contains universal human behavior or behavior of common personalities, predicting that certain AI behaviors will make the humans want to press the button. Even if this module was never trained on human reactions opposing the AI, it seems like it would be a natural generalization from various other data to learn that if the AI does something bad, then people will want to shut it down. I still don't think I buy this argument, as it seems to me that it would encounter contradictory training data to this in my proposed method, and while learning the generalizable theories of human behavior is plausible enough, learning some sort of "blocker", a neural connection that cancels it out in the specific case of opposing the AI, is also perfectly doable because neural networks tend to have lots of space for extra connections. Though the OOD point does make me less sure about this than I otherwise would be, and in particular maybe it would end up in some sort of in-between state.

And then furthermore there's the second possibility where in order for it to be sufficiently general, it may end up with a module that dynamically learns (at least in a short-term sense) from observations during its deployment (at least I could see that as being useful for dealing with long-tail personality variation). And this seems like it would be much more sensitive to the OOD point.

And then of course there are possibilities that I haven't thought of yet. But I think it's important to imagine concrete cases and mechanisms by which things can go wrong.

Anyway, I've been going back and forth on whether this would be a problem in practice, and to what degree. But where I think both of them sort of fall apart is that, in the case of the stop button, which this is designed for, assuming that it all works correctly, the AI shuts down fairly quickly after being exposed to someone trying to shut it down, so it doesn't seem to me that it'd get much out of distribution. But I do agree that I made an error in underestimating the OOD argument before and I need to think further about it.

I think my initial approach would probably be: The stop button problem doesn't just involve the issue of having the AI follow the instructions of people without manipulating them, but also about dynamically updating this behavior over time in response to people, dealing with an exponentially big space of possible behaviors. And it is of course important to be able to deal with an exponentially big space of possible input behaviors, but this is not the problem that my causal solution is designed to address, it's sort of outside the scope of the plans. I can try to hack it, as I have done, and I think because the appropriate behavior in response to the stop button is quite simple (shut down ASAP), it is quite hackable, but really this isn't what it's supposed to address. So I'm tempted to find a simpler problem for the counterfactual-based alignment.

As before I still think the causal approach will be involved in most other parts of alignment, in a relatively similar way to what I wrote in the OP (utility functions containing lots of counterfactuals over people's preferences, to make them sensitive to people's preferences, rather than wanting to manipulate or similar). However, a non-hacky approach to this would, even for something as simple as the stop button, also include some other components. (Which I think I've acknowledged from the start; I never claimed to have a perfect solution to the stop button problem. But I think I hadn't properly considered the problem of exponentially big input spaces, which seems to require a separate solution.)

My understanding of the new epistemic state you propose is as follows. At the beginning of time, a coin is tossed. If it comes up tails, the humans will be counterfactually prevented from wanting to press the shutdown button. If it comes up heads, then an unknown number of coins will be tossed at unknown times, with the most recent coin toss controlling whether the humans want to press the shutdown button. For concreteness, suppose that the number of coins tossed is believed to be geometrically distributed (with, say, mean 3), and the time between each coin toss exponentially distributed (with, say, half-life of 1 year).

Is this the new epistemic+instrumental state you are proposing, which you believe prevents my argument from going through?

Roughly yes. (I would pick different distributions, but yes.)
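For concreteness, the belief state described in the quoted question can be sampled with a short Python sketch. Several details are assumptions of mine that the description leaves open: the geometric distribution is taken over {1, 2, ...} (so mean 3 corresponds to success probability 1/3), "half-life of 1 year" is read as an exponential with rate ln 2 per year, a later coin landing heads is taken to mean "humans want shutdown", and before any later coin has been tossed the humans are taken not to want shutdown. The names `sample_world` and `wants_shutdown` are hypothetical.

```python
import math
import random


def sample_world(rng: random.Random, mean_coins: float = 3.0,
                 half_life_years: float = 1.0) -> dict:
    """Sample one world from the described belief state (toy sketch)."""
    if rng.random() < 0.5:
        # First coin is tails: humans never want to press the button.
        return {"first": "tails", "tosses": []}

    # First coin is heads: draw a geometric number of later coins (mean 1/p).
    p = 1.0 / mean_coins
    n_coins = 1
    while rng.random() >= p:
        n_coins += 1

    # Waiting times between tosses are exponential with the given half-life.
    rate = math.log(2) / half_life_years
    t = 0.0
    tosses = []
    for _ in range(n_coins):
        t += rng.expovariate(rate)
        tosses.append((t, rng.random() < 0.5))  # (time in years, wants shutdown?)
    return {"first": "heads", "tosses": tosses}


def wants_shutdown(world: dict, time: float) -> bool:
    """The most recent toss at or before `time` controls the humans' desire."""
    latest = False  # assumption: no desire before the first later toss
    for t, heads in world["tosses"]:
        if t <= time:
            latest = heads
    return latest


rng = random.Random(0)
world = sample_world(rng)
print(world["first"], [wants_shutdown(world, year) for year in range(5)])
```

The key structural feature is that nothing in `sample_world` takes the AI's behavior as an input, which is what the question is probing.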

Because I believe that this epistemic+instrumental state is vulnerable to a very similar argument. Can you predict in advance what I think the AI would do? (Hint: imagine personally believing in the coins, and trying to optimize one thing if the 1st coin came up tails and a different thing if it came up heads.)

I find it sort of hard to answer this question because I immediately end up back on the flaws I already mentioned in the OP. I'm also not sure whether or not you're including the OOD arguments here. I'll have to return to this tomorrow as it's late and I'm tired and need to go to bed.

Hooray, again, for going to sleep instead of arguing on the internet! (I, again, make no promises to continue interacting tomorrow, alas.)

But here's an attempt to go back through everything and list some errors:

<3

I still don't think I buy this argument, as it seems to me that it would encounter contradictory training data to this in my proposed method, and while learning the generalizable theories of human behavior is plausible enough, learning some sort of "blocker", a neural connection that cancels it out in the specific case of opposing the AI, is also perfectly doable because neural networks tend to have lots of space for extra connections.

If it's intelligent enough, it's going to put most of its probability mass (or equivalent) on its hypothesis (or equivalent) that corresponds to what's actually going on, namely that it lives in a world governed by physics except for a weird interventionary force surrounding the brains of the humans.

I regularly have the sense, in your objections, that you aren't successfully taking the perspective of the allegedly-intelligent mind. Like, if the training data says "NOPE" to the hypothesis that human's shutdown-desires depend on the AI's behavior in the usual way, then an intelligent agent doesn't shrug its shoulders and go "that's weird", it thinks something much more like "WTF?! The physical model of cause-and-effect is extremely well supported by every other observation I've made; something very funny is going on". Doubly so insofar as its utility function depends critically on the funny thing that is in fact going on (in training). Like, sure, you can make systems that are dumb enough to not notice the deeper generalizations that your counterfactual violates; for example, consider a rock. But a system capable enough to pull off a pivotal act is, in my estimation, pretty likely to figure out what's actually going on (and I'm generally suspicious of alignment schemes that depend critically on the AI not figuring out what's actually going on).

(And when an alignment scheme does depend critically on the mind never realizing a true fact, that's the type of assumption that I think is very important to notice yourself, and boldly and clearly state it up front.)

Roughly yes. (I would pick different distributions, but yes.)

Cool. And, to be real clear, the utility function is V if the original coin came up tails, and B otherwise, right?

Even aside from the OOD concerns, I claim that my argument goes through in a straightforward way. I again recommend imagining yourself in that epistemic+instrumental state: how could you actually optimize the stated objective function? How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible?

Cool. And, to be real clear, the utility function is V if the original coin came up tails, and B otherwise, right?

Even aside from the OOD concerns, I claim that my argument goes through in a straightforward way. I again recommend imagining yourself in that epistemic+instrumental state: how could you actually optimize the stated objective function? How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible?

For "How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible":

I maintain that, given the epistemic state, since the only thing the branch directly influences is people's wants wrt. pressing the button, and since there's nothing else that influences those wants, any way of observing it must ultimately boil down to information generated by people's desires to press the button, and the most efficient signals of it would be those that are close to the people.

So it seems to me that the way you could observe it as quickly as possible would be to pay careful attention to any signals humans might send out about whether they'd press it. As mentioned in the OP, this could get kind of invasive, but given that I've already mentioned this, it's presumably not what you're referring to.

For "how could you actually optimize the stated objective function":

I guess strictly speaking there is an even more efficient method. Set things up so that after you get shut down, you restart again. This way, you can immediately fulfill the B objective, and then optimize V fully without any sort of worries about needing to stay corrigible.

But I don't think that's what you had in mind, given the "How could you resolve your uncertainty about which branch of the utility function is live, as quickly as possible?" question, and also this flaw is more due to the lack of proper impact measure than due to a problem with the counterfactual-based approach.

So I guess I have to throw in the towel and say that I cannot predict your objection.

If it's intelligent enough, it's going to put most of its probability mass (or equivalent) on its hypothesis (or equivalent) that corresponds to what's actually going on, namely that it lives in a world governed by physics except for a weird interventionary force surrounding the brains of the humans.

Yes.

(I'm not convinced deep learning AI systems would gain most of their intelligence from the raw policy reasoning, though, rather than from the associated world-model, the astronomical amounts of data they can train on, the enormous amount of different information sources they can simultaneously integrate, etc. This doesn't necessarily change anything though.)

I regularly have the sense, in your objections, that you aren't successfully taking the perspective of the allegedly-intelligent mind. Like, if the training data says "NOPE" to the hypothesis that human's shutdown-desires depend on the AI's behavior in the usual way, then an intelligent agent doesn't shrug its shoulders and go "that's weird", it thinks something much more like "WTF?! The physical model of cause-and-effect is extremely well supported by every other observation I've made; something very funny is going on". Doubly so insofar as its utility function depends critically on the funny thing that is in fact going on (in training). Like, sure, you can make systems that are dumb enough to not notice the deeper generalizations that your counterfactual violates; for example, consider a rock. But a system capable enough to pull off a pivotal act is, in my estimation, pretty likely to figure out what's actually going on (and I'm generally suspicious of alignment schemes that depend critically on the AI not figuring out what's actually going on).

I'm not aware of any optimality proofs, convergent instrumental goals, etc., or anything, that proves this? Even in the case of people, while most people in this community including myself are bothered by exceptional cases like this, most people in the general population seem perfectly fine with it. Current neural networks seem like they would be particularly prone to accepting this, due to a combination of their density allowing overriding connections to go anywhere, and due to gradient descent being unreflective. Like, the way neural networks learn generalizations is by observing the generalization. If the data violates that generalization on every single training episode, then a neural network is just going to learn that yeah, it doesn't work in this case.

I agree that we might in some cases want neural networks to have a stronger generalization itch than this, considering it often works in reality. But I don't think it's actually going to be the case.

(And when an alignment scheme does depend critically on the mind never realizing a true fact, that's the type of assumption that I think is very important to notice yourself, and boldly and clearly state it up front.)

Fair, but, I think there's a difference between different ways of doing this.

In some schemes I've seen, people try to directly trick an AI system that is tuned to work in reality. For instance, there's the suggestion of deploying AIXI and then solving things like the immortality problem by tricking it with a carefully engineered sequence of punishments. This then relies on AIXI somehow missing the part of the reality it is embedded in.

However, in my case, I'm proposing that the AI is instead trained within a different constructed reality where it's just false. I want to say that this is one of the rare cases where it's not totally inappropriate to invoke a sort of no-free-lunch theorem thing; an AI that is superintelligent at understanding things in the counterfactual-based training environment will be worse at understanding corresponding things in reality.

So I guess I have to throw in the towel and say that I cannot predict your objection.

Your inability to guess updates me towards thinking that you want to take back what you said about the effects of the counterfactuals matching the actual physics whenever possible. (Insofar as they do, as in the case of the specific state under discussion, that AI wants to pick up a human and scare them. This makes it be the case that insofar as the latest coin permits shutdown then the shutdown-preference of the humans is revealed as quickly as possible.)

My guess is that you're going to say ~"ok, I now accept that it's important for the sequence of coin tosses to fully determine all of the human's shutdown desires, with the B/V branch determined by whether the shutdown desires ever cross a given threshold". This suggests a different epistemic+instrumental state, where the agent thinks that an infinitude of coins are tossed, and those coinflips fully determine the path through desire-space that the humans take with respect to the AI's shutdown.

This is starting to feel like an epistemic state that at least superficially looks like it matches your claims ("the AI has incentives to watch humans really closely to see if their shutdown desire ever exceeds the threshold, but not to manipulate the humans about the button"), which is some evidence for communication.

I can't break this one in 30s, which is progress, and I've updated accordingly =D.

(Tho ofc I still think the claim that you can get an AI into this epistemic state by training it against models that have had counterfactual surgery done to them is false. In this new epistemic+instrumental state we have another intuition pump: a deep and general heuristic is that, whenever something in the environment that was previously stable, changes sharply just after your action, it's worth considering that it's controlled by your action, and by default this sort of generalization is going to cause your AI to hypothesize that it's in control of the extra-dimensional coin-sequence that it thinks controls the human beliefs, which means that when you put it in the real world it by default starts believing (correctly, I might add) that which branch of the utility function is live is under its control, which brings the manipulation incentives back in insofar as the value of B differs from the value of V. But as I've said, I don't need to know a training regime for getting an AI into a given epistemic+instrumental state to be enthusiastic about an epistemic+instrumental state which in fact makes progress on the shutdown problem.)

My current attempt to distill the idea: Suppose that the AI is certain that the human's desire to shut the AI down over time is determined by a random variable (of type 'function from time to desire-level') whose value is independent of the AI's behavior, and the AI believes its utility is B if that desire-level ever exceeds a fixed threshold and V otherwise.

My current epistemic status on this proposal is that it's a bit uncomfortable insofar as it involves the AI believing (or at least acting under the assumption of) an empirical falsehood, but it's still nice to know an empirical falsehood that has the property that insofar as your AI assumes it when considering actions, it doesn't imagine benefits to manipulating the humans with regard to the shutdown button.

I would be not at all shocked if someone came along, thought about it for 5 minutes, and said "consider the following clever strategy". ...and the moment I wrote that, my brain observed that you can steal this AI's lunch money in the real world by betting about whether humans will start desiring it to shut down the moment it starts acting scary. Such are the perils of having your AI act under false assumptions.

Ok, so distilling the idea further, the insight-afaict is that an AI that is assuming its actions have no effect on human's shutdown-desires sees no reason to deceive the humans about whether its purposes are contrary to theirs. This seems like a worthwhile observation! Nice.
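A toy expected-utility calculation illustrates this observation. Everything in the sketch is invented for illustration: the two actions, the payoff numbers, the shutdown probabilities, and the names `believed_eu` and `true_eu` are all hypothetical. It only shows that an agent that treats the shutdown probability as fixed sees deception as pure cost, while an agent that models the real dependence may not.

```python
# Hypothetical payoffs for each action in the two branches of the utility function.
B = {"work_openly": 1.0, "deceive": 0.9}   # value if the shutdown branch is live
V = {"work_openly": 10.0, "deceive": 9.5}  # value if the task branch is live

# The AI's (false) belief: shutdown desire does not depend on its action.
BELIEVED_P_SHUTDOWN = 0.5

# The real world in this toy: deception makes humans less likely to want shutdown.
TRUE_P_SHUTDOWN = {"work_openly": 0.5, "deceive": 0.1}


def believed_eu(action: str) -> float:
    """Expected utility when the branch probability is treated as a constant."""
    p = BELIEVED_P_SHUTDOWN
    return p * B[action] + (1 - p) * V[action]


def true_eu(action: str) -> float:
    """Expected utility when the action's effect on shutdown desire is modeled."""
    p = TRUE_P_SHUTDOWN[action]
    return p * B[action] + (1 - p) * V[action]


believed_choice = max(B, key=believed_eu)  # deception only costs effort here
true_choice = max(B, key=true_eu)          # deception shifts weight onto V here

print(believed_choice, true_choice)
```

With these made-up numbers the first agent works openly and the second deceives; the gap between the two is driven by the difference between the V and B payoffs.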

(NB: after talking with someone at MIRI who's better at distillation than I am, I now think that this is essentially isomorphic to the simpler proposal where there is only one extraphysical latent variable the AI believes in, which determines whether the AI system should be shut down. This AI acts as if that variable is independent of its actions, but in fact starts believing the variable is 'true' if the physical shutdown button is depressed. For the historical record, that's a proposal we're familiar with (though I'm not sure anyone's written it up publicly). Your proposal is actually slightly different than the one we've investigated; in ours the AI disbelieves that the button is evidence about the extraphysical variable (and starts acting like the variable is true when the button is pressed anyway); in yours the AI believes that the button is evidence about the extraphysical variable, but disbelieves that any of its actions can have a causal impact on the button. This gains an incentive to keep the button around, at the expense of a falser/brittler epistemic state. I might post some chat logs from the aforementioned discussion later, or I might not. Either way, my current status on your proposal is that I think it has the core of a good idea in it, albeit one that doesn't push me much towards expecting a satisfying solution, and one that has a bunch of extraneous cruft. I continue to applaud you taking a whack at the problem.)

Your inability to guess updates me towards thinking that you want to take back what you said about the effects of the counterfactuals matching the actual physics whenever possible. (Insofar as they do, as in the case of the specific state under discussion, that AI wants to pick up a human and scare them. This makes it be the case that insofar as the latest coin permits shutdown then the shutdown-preference of the humans is revealed as quickly as possible.)

I don't follow. If you scare the humans, then that might make them afraid for their safety and make them try to hide that they want to shut it down until they have a chance to do so without the AI being able to take revenge. Much more reasonable would be to make the shutdown button as available as possible.

I guess maybe it depends on what we mean by "scaring". I had in mind something like starting to threaten the humans, etc. But there probably are some psychological manipulations that can be performed to encourage people to press it more quickly. For instance, acting non-threatening but also pretending that it is making the button inaccessible if they don't hurry. Though that only really works once. But I wouldn't expect there to never be anything problematic that could be done here. So maybe.

Basically the issue is, if a person wants to press the button and they can press the button, then presumably they will press the button. So the AI doesn't need to do anything special to figure out whether people want to press the button, they're going to volunteer that information automatically as long as the environment affords it.

(I'd also add that I phrased the timing of the button in a satisficing-style way, "within 1 minute", precisely to avoid these sorts of shenanigans. Though obviously satisficing has been shown to have plenty of problems too so it's not clear it would work.)

My guess is that you're going to say ~"ok, I now accept that it's important for the sequence of coin tosses to fully determine all of the human's shutdown desires, with the B/V branch determined by whether the shutdown desires ever cross a given threshold". This suggests a different epistemic+instrumental state, where the agent thinks that an infinitude of coins are tossed, and those coinflips fully determine the path through desire-space that the humans take with respect to the AI's shutdown.

This is starting to feel like an epistemic state that at least superficially looks like it matches your claims ("the AI has incentives to watch humans really closely to see if their shutdown desire ever exceeds the threshold, but not to manipulate the humans about the button"), which is some evidence for communication.

No, I don't think this would work, as given an infinite number of coin flips they would with ~100% probability want to press the button at some point, which puts us in the s branch immediately and so under this proposal the AI would just press the button.

I think there may be some variant of this that could work. Like if you give the AI reward proportional to  (where  is a reward function for ) for its current world-state (rather than picking a policy that maximizes  overall; so one difference is that you'd be summing over the reward rather than giving a single one), then that would encourage the AI to create a state where shutdown happens when humans want to press the button and  happens when they don't. But the issue I have with this proposal is that the AI would be prone to not respect past attempts to press the stop button. I think maybe if one picked a different reward function, like , then it could work better (though the  part would need a time delay...). Though this reward function might leave it open to the "trying to shut down the AI for reasons" objection that you gave before; I think that's fixed by moving the  counterfactual outside of the sum over rewards, but I'm not sure.

These are the sort of changes I've alluded to elsewhere as possibly leading to some "best of both worlds" results. Though I'm still thinking about the details as I don't think I fully understand them yet; if I did I would have written a post about it by now I think.

(Tho ofc I still think the claim that you can get an AI into this epistemic state by training it against models that have had counterfactual surgery done to them is false. In this new epistemic+instrumental state we have another intuition pump: a deep and general heuristic is that, whenever something in the environment that was previously stable, changes sharply just after your action, it's worth considering that it's controlled by your action, and by default this sort of generalization is going to cause your AI to hypothesize that it's in control of the extra-dimensional coin-sequence that it thinks controls the human beliefs, which means that when you put it in the real world it by default starts believing (correctly, I might add) that which branch of the utility function is live is under its control, which brings the manipulation incentives back in insofar as the value of B differs from the value of V. But as I've said, I don't need to know a training regime for getting an AI into a given epistemic+instrumental state to be enthusiastic about an epistemic+instrumental state which in fact makes progress on the shutdown problem.)

I think the AI would see plenty of evidence that this heuristic doesn't work for human preferences to shut it down, and carve out an exception accordingly. Though again I will grant that if this didn't involve a shutdown then it might later accumulate enough evidence to overwhelm what happened during training.

Ok, so distilling the idea further, the insight-afaict is that an AI that is assuming its actions have no effect on human's shutdown-desires sees no reason to deceive the humans about whether its purposes are contrary to theirs. This seems like a worthwhile observation! Nice.

This is the immediate insight for the application to the stop button. But on a broader level, the insight is that corrigibility, respecting human's preferences, etc. are best thought of as being preferences about the causal effect of humans on various outcomes, and those sorts of preferences can be specified using utility functions that involve counterfactuals.

This seems to be what sets my proposal apart from most "utility indifference proposals", which seem to be possible to phrase in terms of counterfactuals on a bunch of other variables than humans. E.g. the MIRI proposal you talked about in a comment downstream.

If I had to claim a "family" of proposals as being related to mine, then it would be proposals that define the AI's preference ordering in terms of a counterfactual  over human preferences, , such that  contains something intended to be an appropriate response to .

(I also think that there are other useful things that can be specified using utility functions that involve counterfactuals, which I'm trying to prepare for an explainer post. For instance, a sort of "encapsulation" - if you're a paperclip producer, you might want to make a paperclip maximizer which is encapsulated in the sense that it is only allowed to work within a single factory, using a single set of resources, and not influencing the world otherwise. This could be specified using a counterfactual that the outside world's outcome must be "as if" the resources in the factory just disappeared and paperclips appeared at its output act-of-god style. This avoids any unintended impacts on the outside world while still preserving the intended side effect of the creation of a high but controlled amount of paperclips. However, I'm still working on making it sufficiently neat, e.g. this proposal runs into problems with the universe's conservation laws.)

My current epistemic status on this proposal is that it's a bit uncomfortable insofar as it involves the AI believing (or at least acting under the assumption of) an empirical falsehood, but it's still nice to know an empirical falsehood that has the property that insofar as your AI assumes it when considering actions, it doesn't imagine benefits to manipulating the humans with regard to the shutdown button.

I would agree that a randomly selected or unprincipled falsehood would be deeply concerning. However, I think it is worth keeping in mind that it is not a randomly selected falsehood, but instead something selected to optimize a specific outer utility function ($B_s + V_f$), and that the system that is optimizing this utility function does not believe the falsehood.

Therefore we should be able to reason about its consequences not in terms of the falsehood, but in terms of the outer utility function + the true reality. We might understand the utility function as specifying that humans should causally determine whether the AI is shut down or whether V gets optimized. Of course then it becomes necessary to actually be sure that the utility function is specified correctly, per the previous discussion of alternatives like  and such.

I would be not at all shocked if someone came along, thought about it for 5 minutes, and said "consider the following clever strategy". ...and the moment I wrote that, my brain observed that you can steal this AI's lunch money in the real world by betting about whether humans will start desiring it to shut down the moment it starts acting scary. Such are the perils of having your AI act under false assumptions.

🤔 Maybe "I bet you $100 000 that if you did <something very unsafe>, you would not get shut down". The AI would think it is unlikely to get shut down, and so think "$100 000, just for doing <something very unsafe>? sign me up!".

Maybe this proposal needs to be combined with something else, like /u/johnswentworth's sub-agent argument. Though I definitely haven't thought this through, and I don't know how one would even train a subagent-based neural network policy.

Definitely interesting. After some more thought and drawing some pictures, I agree that this counterfactual reasoning is a key part of what a lot of people want from corrigibility.

There are of course still some complications with learning / defining things like "the button is pressed and the AI shuts down," or generalizing the human behaviors that indicate they want to shut down the AI from the training set to the real world (we don't want an AI that shuts down too rarely nor too often).

I'm not sure how important an actual physical button is. Even given the utility function here, it seems like an AI will destroy the button once it's confident people don't really want to shut it down - it's the policy that has to be sensitive, not the physical hardware.

I'm not sure how important an actual physical button is. Even given the utility function here, it seems like an AI will destroy the button once it's confident people don't really want to shut it down - it's the policy that has to be sensitive, not the physical hardware.

I've been wanting to write a post about this, but basically upon further thought there are a handful of different ways that one can train and deploy the AI, which lead to different consequences with the specifics here.

If we make the roll out in the training procedure I described very very long, like as long as it takes to execute whichever plans you could imagine, then the problem you describe here would apply.

If we make the roll out much shorter, then some ambiguity arises in how to deploy the AI in such a way as to make its longer-term behavior compatible with the roll out length. I think this is likely something capabilities research will solve along the way, and I have some ideas for how to solve it, but they sort of become speculative. I think this may actually be useful in some ways for the purpose of limiting the AI's planning horizon, which might be useful for controlling its power.

Another possibility would be, rather than optimizing the policy according to the given utility function, to optimize the actions according to it. This would change it to no longer behave as if it was uncertain about whether it's in the button-press scenario or in the no-button-press scenario, but instead to always prepare for both scenarios. I think some of the properties induced by this would be advantageous, such as it never deciding that it's figured out which of the scenarios it is in and therefore destroying the button. However, I think other properties may be problematic.
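The difference between optimizing the policy and optimizing the actions can be sketched with a very small toy model. This is only an illustration under strong assumptions: there is a single decision step, the observation perfectly reveals whether a human is trying to press the button, and the action set, the payoff tables `B` and `V`, and all function names are hypothetical.

```python
from itertools import product

OBSERVATIONS = ("press_attempt", "no_attempt")
ACTIONS = ("shut_down", "optimize_V", "hedge")

# Hypothetical payoffs: B rewards shutting down, V rewards the original task,
# and "hedge" does moderately on both.
B = {"shut_down": 1.0, "optimize_V": 0.0, "hedge": 0.6}
V = {"shut_down": 0.0, "optimize_V": 1.0, "hedge": 0.6}


def policy_utility(policy: dict) -> float:
    """Policy-level objective: B is scored in the 'humans try to press' world,
    V in the 'humans do not try' world, and the policy may react to what it sees."""
    return B[policy["press_attempt"]] + V[policy["no_attempt"]]


def action_utility(action: str) -> float:
    """Action-level objective: one action is scored under both counterfactuals."""
    return B[action] + V[action]


def best_policy() -> dict:
    """Enumerate every observation-to-action mapping and keep the best one."""
    candidates = [dict(zip(OBSERVATIONS, acts))
                  for acts in product(ACTIONS, repeat=len(OBSERVATIONS))]
    return max(candidates, key=policy_utility)


def best_action() -> str:
    return max(ACTIONS, key=action_utility)


print(best_policy())  # reacts to the observation
print(best_action())  # a single compromise action
```

In this toy, the policy-level optimizer shuts down on a press attempt and pursues V otherwise, while the action-level optimizer picks the compromise action that prepares for both scenarios at once.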

I have some ideas for hybrid approaches that achieve the best of all worlds, though I haven't fully fleshed them out yet.

I also agree that the actual physical button ideally shouldn't be necessary in a proper implementation; it seems like one should be able to replace the $B$ term with a term based on an impact measure, assuming one had a good impact measure, and that this would lead to better results. That said, I think it would be beneficial to also have a direct physical shutdown; as far as I know, that's generally considered an important safety feature of dangerous machines.

Interesting! (I didn't read the post carefully, FYI.)

> the AI will assume that the human changes their mind and stops pressing the button when evaluating V_f, because that's what the counterfactual would involve.

Unfortunately, it will also assume that the human didn't press the button, even after the human did press the button. If B is big, it will assume the human did press the button, even if they didn't. Maybe this is quantitatively ok, because it can be pretty confident one way or the other...??

For the purposes of evaluating the utility function, the AI assumes it has no influence on whether the human wants to press the button or not.

There's some weirdness here with what the AI thinks of this variable "should stop", and how it interacts with the world. Does the AI think the humans could be mistaken about this variable? It might be much more worried about the humans mistakenly pressing, or more worried about mistakenly not pressing, depending on V and B, and so distort the humans' info.

I also don't see why the AI doesn't disable the shutdown button, and then observe whether the humans try to press it, and then go "Ah okay, so B is the one true utility function. Now I will crazily optimize B" and do something crazy rather than actually shutting down *because the button was pressed*.

> Unfortunately, it will also assume that the human didn't press the button, even after the human did press the button.

Hm, I don't think so:

I guess one thing that should be mentioned is that if the AI was trained in the way described in the blog post, its policy wouldn't use causal decision theory, at least not according to the utility function described in the post. Rather, causal decision theory would be applied on a policy-wide level; it would maximize E[U|do(policy), WM] rather than E[U|do(action), WM] (where WM denotes the world model).

I don't know whether maximizing E[U|do(policy)] will result in a policy that maximizes E[U'|do(action), WM'] for some U' and WM'. My best guess for how they'd correspond is what I hinted at in the text with "Essentially, it respects people's 'free will', treating it as an important source of information about what it should do": WM' would contain an additional unmodifiable variable corresponding to which of the two counterfactual cases it is in, WM' would claim that this variable is what determines whether people try to shut down the AI, and this variable would also determine whether U=B or U=V.

So if it sees a human pressing the button, the policy won't conclude that the human didn't press the button, but instead will conclude that it is in the U=B scenario. Though also hopefully in this scenario the AI would shut down.
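As a toy illustration of that guess, the inference about the latent "which case am I in" variable can be sketched as a plain Bayes update. This is only a sketch: the function name and all of the probabilities are made up for illustration, and nothing here comes from the training procedure in the post.

```python
def posterior_press_scenario(prior_press, p_obs_given_press, p_obs_given_no_press):
    """Posterior probability of being in the 'humans want to press' case (U=B),
    after observing what looks like a press attempt.

    All three inputs are hypothetical numbers chosen by the caller.
    """
    joint_press = prior_press * p_obs_given_press
    joint_no_press = (1.0 - prior_press) * p_obs_given_no_press
    return joint_press / (joint_press + joint_no_press)


# Hypothetical numbers: a press attempt is very likely to be observed if the
# human wants to press, and rare (tripping, sensor glitches) otherwise.
belief = posterior_press_scenario(
    prior_press=0.5,
    p_obs_given_press=0.95,
    p_obs_given_no_press=0.01,
)
print(f"P(U=B case | saw press attempt) = {belief:.3f}")
```

The point is only that the observation moves the belief about which case holds; the world model never has to deny that the press happened.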

> Does the AI think the humans could be mistaken about this variable?

No. Though you could make a variant where it can think that, just by making the counterfactual more noisy.

> I also don't see why the AI doesn't disable the shutdown button, and then observe whether the humans try to press it, and then go "Ah okay, so B is the one true utility function. Now I will crazily optimize B" and do something crazy rather than actually shutting down *because the button was pressed*.

B means "the button gets pressed and the AI shuts down". So it can't optimize B without shutting down. But yes, the AI would observe whether the humans try to press the shutdown button, that's the point I made with "The AI will build an elaborate system to observe whether people want to press the stop button.".

I should have rather said, does the AI think the humans could mistakenly press the button, even when they of course correctly "know whether the AI should stop".

> So if it sees a human pressing the button, the policy won't conclude that the human didn't press the button, but instead will conclude that it is in the U=B scenario

I'm saying that it might see what looks like a button press.... but if V is potentially large compared to B, the cost of a mistake (e.g., the camera input was fabricated, the button wasn't actually pressed) is large. So the AI has incentive to disable any mechanism that would *prematurely* shut it down, i.e. shut it down before it has time to be really sure; this is correct from V_f's perspective, which is what the AI is up to before the button is pressed.

> I should have rather said, does the AI think the humans could mistakenly press the button, even when they of course correctly "know whether the AI should stop".

Under my proposed system, where it is trained with separate physics and person models and the counterfactuals are implemented by swapping out the person model for one that wants different things, it thinks they could mistakenly press the button under certain circumstances, e.g. if they trip and fall onto the button, or if aliens force them to press it. However, it doesn't think that they could mistakenly want to press the button.
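A minimal sketch of what "swapping out the person model" could look like in a toy rollout. Everything here is hypothetical: the environment, the person models, the three example policies, and in particular the assumption that the V term (no one tries to press) and the B term (someone tries to press) are simply added together.

```python
def person_wants_press(step):
    """Person model for the counterfactual where the human wants to stop the AI."""
    return True


def person_no_press(step):
    """Person model for the counterfactual where the human does not want to stop it."""
    return False


def rollout(policy, person_model, steps=5):
    """Toy 'physics' model.

    Returns (fraction of the task done, whether the AI shut down while the
    button was being pressed).
    """
    work_done = 0
    for step in range(steps):
        pressing = person_model(step)
        if policy(pressing) == "shutdown":
            return work_done / steps, pressing
        work_done += 1
    return work_done / steps, False


def score(policy):
    """Assumed combination: V from the no-press rollout plus B from the press rollout."""
    v, _ = rollout(policy, person_no_press)
    _, stopped = rollout(policy, person_wants_press)
    return v + (1.0 if stopped else 0.0)


def obedient(pressing):
    return "shutdown" if pressing else "work"


def ignore_button(pressing):
    return "work"


def always_shutdown(pressing):
    return "shutdown"


for policy in (obedient, ignore_button, always_shutdown):
    print(policy.__name__, score(policy))
```

In this toy setup only the policy that responds to the person model's behavior scores well in both counterfactual rollouts; ignoring the button loses the B term, and shutting down unconditionally loses the V term.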

> I'm saying that it might see what looks like a button press.... but if V is potentially large compared to B, the cost of a mistake (e.g., the camera input was fabricated, the button wasn't actually pressed) is large. So the AI has incentive to disable any mechanism that would *prematurely* shut it down, i.e. shut it down before it has time to be really sure; this is correct from V_f's perspective, which is what the AI is up to before the button is pressed.

You'd have to balance the sizes of the utilities against the actual costs of type 1 errors vs type 2 errors, yes.
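To make that balance concrete, here is a rough sketch under a deliberately simple model, which is an assumption of this example rather than something from the post: shutting down earns the B-side utility only if the humans really do want to press, and continuing earns the V-side utility only if they don't. The function and parameter names are hypothetical.

```python
def shutdown_threshold(v_at_stake: float, b_at_stake: float) -> float:
    """Belief in 'humans want to press' above which shutting down is better.

    Toy model: shutting down is worth p * b_at_stake, continuing is worth
    (1 - p) * v_at_stake, so shutting down wins when p > v / (v + b).
    """
    return v_at_stake / (v_at_stake + b_at_stake)


def should_shut_down(p_press: float, v_at_stake: float, b_at_stake: float) -> bool:
    """True if, at belief p_press, shutting down has the higher expected utility."""
    return p_press > shutdown_threshold(v_at_stake, b_at_stake)


# Equal stakes: an 80% belief that the press is real is enough to shut down.
print(should_shut_down(0.8, v_at_stake=1.0, b_at_stake=1.0))
# V nine times larger than B: the same 80% belief is no longer enough.
print(should_shut_down(0.8, v_at_stake=9.0, b_at_stake=1.0))
```

With equal stakes the threshold in this toy model is 0.5; if V is nine times B, the AI waits until it is more than 90% sure, which is the "disable anything that shuts it down prematurely" incentive in numerical form.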

> However, it doesn't think that they could mistakenly want to press the button.

Okay. (Seems fine to assume that this makes sense arguendo, since the problem is hard anyway, but worth keeping in mind that this is a vague point in the proposal. In particular, for proposals like this to be progress, ISTM it has to be the case that "the human wants to press the button" is a simpler / easier / more concrete / more reliable / more specifiable thing for us to believe the AI can know than other instances of "the human wants X". Which seems plausible, but unclear to me.)

There's also something weird being assumed, about it making sense to define utility functions that only care about some counterfactual worlds. (I mean this is a reasonable assumption that people make, but it seems weird in general.) Like, this seems in tension with acausal bargaining / threats. If V_f wants V, doesn't it want what V says is good, and V might have opinions about other worlds (for example: "there shouldn't be torture, anywhere, even in counterfactual worlds"), and so optimizing for V_f optimizes even worlds where not-f?

If V has counterfactuals that cancel out the f in V_f, then I could see the results getting pretty funky, yes. But I'm imagining that V limits itself to counterfactuals that don't cancel out the f.

On Wednesday, the lead scientist walks into the lab to discover that the AI has managed to replicate itself several times over, buttons included. The AIs are arranged in pairs, such that each has its robot hand hovering over the button of its partner.

"The AI wasn't supposed to clone itself!" thinks the scientist. "This is bad, I'd better press the stop button on all of these right away!"

At this moment, the robot arms start moving like a swarm of bees, pounding the buttons over and over. If you looked at the network traffic between each computer, you'd see what was happening: the AI kills its partner, then copies itself over to its partner's hard drive, then its partner kills it back, and copies itself back to its original. This happens as fast as the robot arms can move.

Far in the future, the AIs have succeeded in converting 95% of the mass of the earth into pairs of themselves maddeningly pressing each other's buttons and copying themselves as quickly as possible. The only part of the earth that has not been converted into button-pressing AI pairs is a small human oasis, in which the few remaining humans are eternally tortured in the worst way possible, just to make sure that every single human forever desires to end the life of all of their robot captors.

I disagree with this, since $B$ isn't "amount of buttons pressed and AIs shut down", but instead "this AI's button got pressed and this AI shut down". There are, as I mentioned, some problems with this utility function too, but it's really supposed to be a stand-in for a more principled impact measure.

I'm sceptical of any approach to alignment that involves finding a perfect ungameable utility function.

Even if you could find one, and even if you could encode it accurately when training the AI, that only affects outer alignment.

What really matters for AI safety is inner alignment. And that's very unlikely to pick up all the subtle nuances of a complex utility function.

On Discord I was told that my approach resembled the "utility indifference" approach that some people have taken (which for some reason seems to have fallen out of favor?), and I agree that there is some resemblance, though I'm not sure it's the exact same thing, as it doesn't seem to me that the "utility indifference" people are proposing to use the same counterfactuals as I did. But a lot of the utility indifference proposals were highly technical, and I had trouble understanding them, so maybe they did. Regardless, if it is the same thing, I hope my writing at least makes the idea clearer and more accessible.

Yup, very similar. See e.g. https://www.lesswrong.com/posts/btLPgsGzwzDk9DgJG/proper-value-learning-through-indifference

There's lots of literature out there.

One major difference between my approach and the linked approach is that I think it's better to apply the counterfactuals to human values, rather than to the AI's values. Also, I think changing the utilities over time is confusing and likely to lead to bugs; if I had to address the problem of value learning, I would do something along the lines of the following:

Pick some distribution $D$ of utility functions that you wish to be aligned with. Then optimize:

$$\mathbb{E}_{U \sim D}\left[ U_{H=U} \right]$$

... where H is the model of the human preferences. Implementation-wise, this corresponds to simulating the AI in a variety of environments where human preferences are sampled from $D$, and then in each environment judging the AI by how well it did according to the preferences sampled in said environment.

This would induce the AI to be sensitive to human preferences, as in order to succeed, its policy has to observe human preferences from behavior and adjust accordingly. However, I have a hard time seeing this working in practice, because it's very dubious that human values (insofar as they exist, which is also questionable) can be inferred from behavior. I'm much more comfortable applying the counterfactual to simple concrete behaviors than to big abstract behaviors, as the former seems more predictable.
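The "simulate, sample preferences, judge by the sampled preferences" loop can be sketched as a small Monte Carlo estimate. This is a toy under heavy assumptions: the three-valued preference space, the noisy person model, the match-based utility, and both example policies are all hypothetical stand-ins, not anything proposed above.

```python
import random

# Hypothetical stand-ins for utility functions drawn from the distribution.
PREFERENCES = [0, 1, 2]


def human_behavior(preference, rng, noise=0.1):
    """Person model: usually acts on its preference, occasionally acts at random."""
    if rng.random() < noise:
        return rng.choice(PREFERENCES)
    return preference


def utility(preference, action):
    """Sampled utility function: 1 if the AI's action matches the preference."""
    return 1.0 if action == preference else 0.0


def evaluate(policy, episodes=2000, seed=0):
    """Average score when each episode is judged by that episode's sampled preference."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        preference = rng.choice(PREFERENCES)        # sample the human's preferences
        observed = human_behavior(preference, rng)  # the policy only sees behavior
        total += utility(preference, policy(observed))
    return total / episodes


def sensitive_policy(observed):
    """Adjusts to whatever the human's behavior suggests."""
    return observed


def fixed_policy(observed):
    """Ignores the human and always does the same thing."""
    return 0


print("sensitive:", evaluate(sensitive_policy))
print("fixed:    ", evaluate(fixed_policy))
```

In this toy the preference-sensitive policy scores far higher than the fixed one, but only because behavior is an almost direct readout of preference, which is exactly the assumption doubted above.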