Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Goal

First things first: this is not (yet) aimed at solving the whole corrigibility problem, or even the whole shutdown problem.

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:

  • Does not want to manipulate the shutdown button
  • Does respond to the shutdown button
  • Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)

If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.

Both this proposal and the argument that it works are simple enough to be easy to extend. The main hope of this post is that other people will pick up the proposal and run with it, beyond the very simple toy problem considered here, to more complex problems (and more complex failure modes). Insofar as this proposal gets us past the previous barriers, we want to find out what the next barriers will be.

We will not be very formal here.

Setup

The agent takes actions in two timesteps. First, the agent takes an action (possibly given some observations). In between steps, a shutdown button is either pressed or not-pressed. Then, in the second timestep, the agent observes the button state (and possibly other things), then takes another action.

Note that each “action” could itself be a rather large composite of many subactions, like e.g. spinning up child-agents.

We'll assume that we have two utility functions on hand: utility function u_1 provides non-shutdown behavior, and utility function u_2 incentivizes shutdown. Of course we don't actually know how to construct two such utility functions; we're using these assumptions because they're basically the same assumptions used in the MIRI paper. (And MIRI presumably used these assumptions because they simplify the problem a lot, while still maintaining enough structure to demonstrate the barriers MIRI found.)

Very roughly speaking, we want the agent to follow utility function u_1 until the shutdown button is pressed, and then follow u_2. A little less roughly speaking, we want the agent to maintain optionality: if it were just to follow u_1 until the shutdown button were pressed, then e.g. it might create u_1-maximizing child agents which would resist shutdown, or more generally it might precommit in ways which decrease its ability to switch to u_2 after the button is pressed, etc, because those things would all likely result in higher u_1.

Also, of course, we want the agent to not manipulate the shutdown button, the humans deciding whether to press the button, etc.

Proposal

First key piece: we'll use a subagent architecture. The AI itself contains two expected-utility-maximizing subagents: one which pursues u_1, and one which pursues u_2. Each timestep, both subagents must agree on which action the AI will take, otherwise the AI takes some "default" action. (Put differently: each agent has a veto, and if either vetoes then the AI takes the default action.)

Second key piece: each of the two subagents maximizes expected utility under a different counterfactual.

  • First subagent maximizes E[u_1 | do(button = unpressed), observations]
  • Second subagent maximizes E[u_2 | do(button = pressed), observations]

So conceptually:

  • The first subagent maximizes u_1, and acts as though the button will magically be unpressed in between timesteps, in a way which is not causally downstream of anything in the universe.
  • The second subagent maximizes u_2, and acts as though the button will magically be pressed in between timesteps, in a way which is not causally downstream of anything in the universe.

We will assume that the AI internals include infrastructure for the two subagents to negotiate with each other, form binding contracts, etc. We will also assume that both subagents are strong idealized reasoners in their own right. Finally, we’ll assume that the default option is not the best-case outcome for either subagent.
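To make the moving pieces concrete, here is a minimal toy sketch in Python. Everything in it is a stand-in: the action set, the "world model", and the utility functions u1/u2 are hypothetical placeholders, and the negotiation step is collapsed to a veto check plus a placeholder tie-break.

```python
# Toy sketch of the two-subagent architecture. All names and values are stand-ins.

ACTIONS = ["default", "build_factory", "build_pure_u1_child", "prepare_shutdown"]

def model(action, button):
    """Hypothetical world model: the outcome of a timestep-1 action, given a
    counterfacted button state (the button is set by do(), not by the action)."""
    return {"action": action, "button": button}

def u1(outcome):
    """Stand-in for the non-shutdown utility function."""
    return {"default": 0, "build_factory": 5,
            "build_pure_u1_child": 9, "prepare_shutdown": 1}[outcome["action"]]

def u2(outcome):
    """Stand-in for the shutdown utility function. A pure-u1 child would
    resist shutdown, so u2 rates creating one very poorly."""
    return {"default": 0, "build_factory": 1,
            "build_pure_u1_child": -10, "prepare_shutdown": 4}[outcome["action"]]

def subagent1_value(action):
    # Evaluates under do(button = unpressed): the button is treated as
    # exogenously unpressed, regardless of what the action does.
    return u1(model(action, button="unpressed"))

def subagent2_value(action):
    # Evaluates under do(button = pressed).
    return u2(model(action, button="pressed"))

def full_agent_action():
    """Each subagent holds a veto: any action must be at least as good as the
    default for both subagents, otherwise the default is taken."""
    v1_default = subagent1_value("default")
    v2_default = subagent2_value("default")
    acceptable = [a for a in ACTIONS
                  if subagent1_value(a) >= v1_default
                  and subagent2_value(a) >= v2_default]
    # Placeholder for the negotiated choice among mutually acceptable actions.
    return max(acceptable, key=lambda a: subagent1_value(a) + subagent2_value(a))

print(full_agent_action())  # -> "build_factory": good for u1, no worse than default for u2
```

The thing to notice in this toy sketch is that the pure-u1 child gets vetoed by subagent 2 even though subagent 1 would love it; that is the optionality-maintenance behavior argued for below.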

Arguments

First claim: the two subagents bet all of their influence against each other on the outcome of the button press. In other words, subagent 1 agrees to follow subagent 2's preferences if the button comes out pressed, and vice-versa. Why? In general, subagent 1 acts-as-though the button is 100% certain to come out unpressed, subagent 2 acts-as-though it's 100% certain to come out pressed, so both act-as-though this is a great deal for them. Once the button-state is resolved, the subagent which was correct will fully control the agent's actions - i.e. the agent will act fully according to u_1 in timestep 2 if the button is unpressed, and fully according to u_2 in timestep 2 if the button is pressed.

Second claim: in the first timestep, whatever action the agent takes must have at least as high expected value as the default action, under both u_1 and u_2. That means maintaining option value. For instance, the agent would presumably not create a child agent which just maximizes u_1, because that would presumably do worse than default on u_2, so subagent 2 would veto it. (Though the agent could create a child agent which just maximized u_1, if somehow that did not do worse on u_2, which seems reasonable.)

Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state. While this sounds intuitive, it is the least confident of the claims; it is not totally obvious that the agent won’t acquire some incentive to manipulate the button via e.g. bets with other external agents with non-counterfacted models.
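One rough way to state the intuition behind the third claim (a sketch, not a proof; it assumes each utility function depends on the button only through the actual button variable that the do()-op overrides): for any two candidate actions A and A' which differ only in how they influence the button, or things causally upstream of the button such as the human operators,

  E[u_1 | do(button = unpressed), observations, A] = E[u_1 | do(button = unpressed), observations, A']
  E[u_2 | do(button = pressed), observations, A] = E[u_2 | do(button = pressed), observations, A']

because the do()-op severs the button from its causal parents, so the button-influencing part of a plan changes nothing in either subagent's evaluation. The residual worry in the claim above is about channels which reintroduce a dependence, e.g. the external bets with non-counterfacted agents.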

Generalization

One simple first-step generalization: add more timesteps. Maybe we want an AI which runs for a while, and a shutdown button which could be pressed at any time.

The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for "button is never pressed", i.e. the button is first pressed at t = ∞). So subagent ∞ maximizes E[u_1 | do(button never pressed), observations], and for each other time T, subagent T maximizes E[u_2 | do(button unpressed before T, button pressed at T), observations]. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).
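As a rough sketch of the bookkeeping (hypothetical helper names; a small horizon T, with None standing in for "button is never pressed"):

```python
# Hypothetical sketch: one counterfactual evaluator per possible first-press time.
T = 5  # number of timesteps in this toy setting

def button_counterfactual(first_press):
    """The full button trajectory this subagent counterfacts on: unpressed
    before first_press, pressed from first_press onward, and unpressed
    forever if first_press is None ("button is never pressed")."""
    def state_at(t):
        if first_press is None:
            return "unpressed"
        return "pressed" if t >= first_press else "unpressed"
    return [state_at(t) for t in range(1, T + 1)]

def make_subagent(first_press):
    return {
        "first_press": first_press,
        "utility": "u1" if first_press is None else "u2",  # stand-in labels
        "do_button": button_counterfactual(first_press),
    }

# The "never pressed" subagent, plus one subagent per possible first-press time.
subagents = [make_subagent(None)] + [make_subagent(t) for t in range(1, T + 1)]

# As in the two-subagent case, each of these holds a veto, and each evaluates
# plans under its own do()-setting of the entire button trajectory.
for sa in subagents:
    print(sa["first_press"], sa["utility"], sa["do_button"])
```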

This points towards a general approach to use this style of agent for subproblems of corrigibility.

We have some interface to our hopefully-corrigible agent - in this case the shutdown button. We want the agent to be controllable via that interface, i.e. the interface steers what the agent does and the agent generally doesn’t try to manipulate the interface-state or whatever’s causally upstream of the interface. The agent “just follows instructions” from the interface, so to speak.

To apply the same sort of approach as this shutdown proposal, we design the agent to consist of many subagents, each of which counterfacts on a different interface-outcome. For each different sequence of “instructions” which could come from the interface over time, there’s a different subagent with a different utility and a different counterfactual.

Even more generally: one hypothesis I associate with the Embedded Agents sequence is that some kind of corrigibility is a standard feature of embedded agents. Insofar as an agent is made of parts, those parts need to be corrigible in order for the agent as a whole to “not fight itself” internally. So this whole setup potentially points to a general feature of agents made of parts. For instance, I could imagine an argument along the lines of:

  • The “master” agent (i.e. human operator in our setup) will counterfact on the instructions given via the interface (i.e. button state) when planning their actions.
  • My inner @abramdemski says “different subsystems should counterfact on the same variables, else they’ll be money-pumped (or, more generally, behave pareto-suboptimally)”.
  • … so the corrigible subsystems should also counterfact on the instructions given via the interface.
  • … and then invoke an "any inexploitable system behaves like a market" style argument (similar to e.g. this one for logical inductors) to conclude that the corrigible system should behave like a market of subagents, each of which sets a different value of the variables counterfacted-upon.

Shortcomings/Remaining Problems

First, there are various standard alignment problems which this proposal isn’t even attempting to address. Ontological issues are one large class of examples: how do we make sure that the subagents all agree on what “the button being pressed/unpressed” looks like, such that their bets on the outcome resolve properly? And how do we make sure that their notion of “button pressed/unpressed” matches the human operator’s notion, so that the button does what the human thinks it does? How do we make sure that the subagents’ (high-level abstract) counterfactual over the button state counterfacts the way we intend? How do we make sure that’s all stable under ontology shifts, as the subagents learn more?

This proposal is also not even attempting to make the agent actively maintain the function of the shutdown button or the human operators. Under this proposal, if the AI takes some actions which happen to kill the human operators as a side-effect, it doesn’t particularly care.

Then there’s the general class of “subagent break out” problems - i.e. how do we ensure that one or both subagents actually negotiate over control of the full agent’s actions, rather than hacking out and taking independent action in the world?

Another large class of problems we’ve assumed away are the design of the utility functions. How the heck do we make utility functions that do what we want? (Or non-utility-function goal structures, if we want to go even more general.) That brings in all the usual value alignment problems, though they’re somewhat easier than usual: we’re not necessarily aligning the goals to something as complex as human values. Corrigibility is the point of the proposal, so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting”. That does, however, mean that the “shutdown” utility function/goal-structure definitely needs to work.

The proposal also adds a couple new problems. First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.

Then there’s the problem of designing the negotiation infrastructure, and in particular allocating bargaining power to the various subagents. They all get a veto, but that still leaves a lot of degrees of freedom in exactly how much the agent pursues the goals of each subagent. For the shutdown use-case, we probably want to allocate most of the bargaining power to the non-shutdown subagent, so that we can see what the system does when mostly optimizing for u_1 (while maintaining the option of shutting down later).

Finally, of course, there are the usual problems of unknown unknowns and getting this all across the theory-practice gap.

Thank you to @EJT and @Sami Petersen for discussion and proposals which fed into this.

Comments
[-] jessicata

I don't have a better solution right now, but one problem to note is that this agent will strongly bet that the button state is independent of the humans deciding whether to press the button. So it could lose money to a different agent that thinks these are correlated, as they are.

That's not necessarily a deal-breaker; we do expect corrigible agents to be inefficient in at least some ways. But it is something we'd like to avoid if possible, and I don't have any argument that that particular sort of inefficiency is necessary for corrigible behavior.

The patch which I would first try is to add another subagent which does not care at all about what actions the full agent takes, and is just trying to make money on the full agent's internal betting markets, using the original non-counterfacted world model. So that subagent will make the full agent's epistemic probabilities sane.

... but then the question is whether that subagent induces button-influencing-behavior. I don't yet have a good argument in either direction on that question.

Then there’s the problem of designing the negotiation infrastructure, and in particular allocating bargaining power to the various subagents. They all get a veto, but that still leaves a lot of degrees of freedom in exactly how much the agent pursues the goals of each subagent. For the shutdown use-case, we probably want to allocate most of the bargaining power to the non-shutdown subagent, so that we can see what the system does when mostly optimizing for u_1 (while maintaining the option of shutting down later).

I don't understand what you mean by "allocating bargaining power", given that each agent already has true veto power. Regardless of the negotiation mechanism you set up for them (if it's high-bandwidth enough), or whether the master agent says "I'd like this or that agent to have more power", each subagent could go "give me my proportional (1/n) slice of the pie, or else I will veto everything" (and depending on its prior about how the other agents would respond, this will seem net-positive to do).

In fact, that's just the tip of the iceberg of individually rational game-theoretic moves (which mess with your proposal) they could pull off; see Commitment Races.

Brain-storming fixes:

  • Each subagent's bargaining power is how much compute they can use. This makes everything more chaotic, and is clearly not what you had in mind with this kind of idealized agents solution.
  • Probabilistic vetos, such that those of some subagents are less likely to work. I think this breaks things in your proposal and still has the game-theoretic problems.
  • We ensure the priors of each subagent (about how the others respond) are such that going for risky game-theoretic stuff is not individually rational. Maybe some agents have more optimistic priors, and others less optimistic, and this results in the former controlling more, and the latter only try to use their veto in extreme cases (like to ensure the wrong successor is not built). But it'd be fiddly to think about the effect of these different priors on behavior, and how "extreme" the cases are in which veto is useful. And also this might mess up the agent's interactions with the world in other ways: for example, dogmatically believing that algorithms that look like subagents have "exactly this behavior", which is sometimes false. Although of course this kind of problem was already present in your proposal.

To clarify:

Subagent2 assumes the shutdown button will be pressed independent of upstream events, i.e. also independent of whether the shutdown button got removed. It would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn't bother to remove the shutdown button because it believes it won't be pressed anyway.

(The usefulness of subagent2 comes from preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)

Question 1:

I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed, and iirc the problem was that the agent would use the button as an outcome pump by creating mechanisms that would cause the button to be pressed if something went worse than expected, thereby magically decreasing the probability that something goes badly, since the button cannot be pressed.

I haven't thought much about it, but doesn't this proposal have the same failure mode? (And if not, why not?)

Question 2:
(Feel free to ignore this question if the proposal does fall for the failure mode in question1.)

Is there a probabilistic extension of the proposal which uses 99.99% instead of 100%, but still basically works? (Maybe assuming away some problems with comparing utility functions or so.) And if not: why not? What is the part that relies on certainty?

Subagent2 assumes the shutdown button will be pressed independent of upstream events, i.e. also independent of whether the shutdown button got removed. It would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn't bother to remove the shutdown button because it believes it won't be pressed anyway.

(The usefulness of subagent2 comes from preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)

That's correct.

I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed, and iirc the problem was that the agent would use the button as an outcome pump by creating mechanisms that would cause the button to be pressed if something went worse than expected, thereby magically decreasing the probability that something goes badly, since the button cannot be pressed.

I haven't thought much about it, but doesn't this proposal have the same failure mode? (And if not, why not?)

You remember correctly. This proposal avoids that failure mode via the do()-ops. (More generally, that failure mode is an evidential-decision-theory sort of problem, which can be avoided by using causal-style decision theories, as this proposal does.) In particular, note that such outcome pumps require "mechanisms which would cause the button to be pressed"; because of the do()-ops, the agent acts-as-though there is no mechanism which would cause the button to be pressed.

Of course that means this proposal likely has the usual failure modes of causal decision theories (e.g. in Newcomb-style problems); presumably the way to handle that is via a functional-decision-theory style approach, i.e. counterfacting over all simulated instances of the shutdown button.

Thanks.

I briefly looked into the MIRI paper (and the section from Eliezer's lecture that starts at 22min) again.

My main guess now is that you're not supposed to make the agent act on false beliefs; that one of the desiderata of the shutdown problem is that your agent shouldn't suffer a capability penalty, including taking bad actions based on false beliefs (like taking bad bets, as jessicata mentioned). Your proposal violates that.

The case MIRI considered wasn't to modify the world model so the agent believes that the button cannot be pressed, but instead to construct the utility function in a way that the resulting behavior ends up equivalent, but without the agent having wrong beliefs.

The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for "button is never pressed", i.e. the button is first pressed at t = ∞). So subagent ∞ maximizes E[u_1 | do(button never pressed), observations], and for each other time T, subagent T maximizes E[u_2 | do(button unpressed before T, button pressed at T), observations]. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).

 

Can you explain how this relates to Elliott Thornley's proposal? It's pattern matching in my brain but I don't know the technical details.

I don't know. Thornley's proposal got me thinking about subagents as a tool for corrigibility, but I never understood his properties well enough to say how his subagents relate to the counterfactual-optimizing agents in this proposal.

[-] EJT

Interesting idea. Couple of comments.

(1) Your proposal requires each subagent to be very mistaken about the probability of shutdown at each timestep. That seems like a drawback. Maybe it's hard to ensure that subagents are so mistaken. Maybe this mistake would screw up subagents' beliefs in other ways.

(2) Will subagents' veto-power prevent the agent from making any kind of long-term investment?

Consider an example. Suppose that we can represent the extent to which the agent achieves its goals at each timestep with a real number ('utilities'). Represent trajectories with vectors of utilities. Suppose that, conditional on no-shutdown, the Default action gives utility-vector ⟨1, 1, 1, …⟩. The other available action is 'Invest'. Conditional on no-shutdown, Invest gives utility-vector ⟨0, 10, 10, …⟩.

As long as the agent's goals aren't too misaligned with our own goals (and as long as the true probability of an early shutdown is sufficiently small), we'll want the agent to choose Invest (because Invest is slightly worse than the default action in the short-term but much better in the long-term). But Subagent2 will veto choosing Invest, because Subagent2 is sure that shutdown will occur at timestep 2, and so from its perspective, Invest gives ⟨0⟩ followed by shutdown whereas the default action gives ⟨1⟩ followed by shutdown.

Is that right?

Re: (2), that depends heavily on how the "shutdown utility function" handles those numbers. An "invest" action which costs 1 utility for subagent 1, and yields 10 utility for subagent 1 in each subsequent step, may have totally unrelated utilities for subagent 2. The subagents have different utility functions, and we don't have many constraints on the relationship between them.

Re: (1), yup, agreed.

I think you're right that the central problems remaining are in the ontological cluster, as well as the theory-practice gap of making an agent that doesn't override its hard-coded false beliefs.

But less centrally, I think one issue with the proposal is that the sub-agents need to continue operating in worlds where they believe in a logical contradiction. How does this work? (I think this is something I'm confused about for all agents and this proposal just brings it to the surface more than usual).

Also, agent1 and agent2 combine into some kind of machine. This machine isn't VNM rational. I want to be able to describe this machine properly. Pattern matching, my guess is that it violates independence in the same way as here. [Edit: Definitely violates independence, because the combined machine should choose a lottery over <button-pressed> over certainty of either outcome. I suspect that it doesn't have to violate any other axioms].

I think one issue with the proposal is that the sub-agents need to continue operating in worlds where they believe in a logical contradiction... I think this is something I'm confused about for all agents and this proposal just brings it to the surface more than usual

+1 to this. For the benefit of readers: the "weirdness" here is common to CDT agents in general. In some sense they're acting-as-though they believe in a do()-operated model, rather than their actual belief. Part of the answer is that the do()-op is actually part of the planning machinery, and part of the answer is Abram's CDT=EDT thing, but I haven't grokked the whole answer deeply enough yet to see how it carries over to this new use-case.

Definitely violates independence, because the combined machine should choose a lottery over <button-pressed> over certainty of either outcome.

Assuming I'm interpreting you correctly, this is non-obvious, because the lottery-choice will be one of many things the two agents negotiate over. So it could be that the negotiation shakes out to the certainty option, with some other considerations counterbalancing elsewhere in the negotiation.

More generally, insofar as the argument in Why Not Subagents? generalizes, the subagents should aggregate into an expected utility maximizer of some sort. But the probabilities of the resulting agent don't necessarily match the epistemic probabilities of the original model - e.g. the agent's probability on button state mostly reflects the relative bargaining power of the subagents rather than an epistemic state.

[-] Ben Pace

You explicitly assume this stuff away, but I believe under this setup that the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).

I also note that if one agent becomes way way smarter than the other, that this balance may not work out.

Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!

Overall a very interesting idea.

[-] kave

Curated. It's nice to see a return to the problems of yore, and I think this is a nice incremental proposal. Bringing in causal counterfactuals seems like a neat trick (with lots of problems, as discussed in the post and the comments), and so does bringing in some bargaining theory.

 

I have lots of confusions and questions, like

so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting

doesn't make sense to me yet, as it seems easy for the utility functions / belief states to all prefer killing humans quickly, even if the humans don't affect the shutdown button exactly. Or the aside on bargaining with non-causally-counterfacting agents. But they're confusions and questions that afford some mulling, which is pretty cool!

Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.

I disagree. 

Suppose there is a paperclip lottery: if the AI wins, it gets loads of clips, and it becomes harder to shut down.

The AI has been entered by someone else; not playing isn't an option.

Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.

To the paperclip AI, this is a good deal. The conditional expectation acts as a sort of average of all scenarios where the button isn't pressed. And this scenario has more clips than most, so increases the average.

For the shutdown AI, this is an unusually bad scenario, and taking it out of their expectation also increases conditional expected utility. 

So both subagents agree on a plan that conditionally manipulates the button.

Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.

I'm not seeing how it is possible for the agents to imagine that? Both of them expect that they have no influence whatsoever over whether the button is pressed, because there's nothing in their decision-driving models which is causally upstream of the button.

Suppose these 3 scenarios are judged as equally likely.

A) Button pressed. U_1=U_2=0

B) Button not pressed. U_1=U_2=0

C) AI gets control of button. U_1=6, U_2=-6

 

Suppose both agents agree that in option C, they should press the button. (this is the only 1 bit of influence they have in this toy problem)

Then agent 1 has a "certainty" of B. E(U_1|not pressed)=E(U_1|B)=0

Agent 2 has remaining uncertainty: E(U_2|pressed) = (1/2)(E(U_2|A) + E(U_2|C)) = (1/2)(0 + (-6)) = -3

 

Change it around so that the AI decides not to press the button in case C, and

E(U_1|not pressed)=3, E(U_2|pressed)=0.

There is no "AI gets control of button" option, from the perspective of either subagent. Both agents would look at option C, stick it into their do()-op on button state, and then act-as-though option C would not give any control at all over the button state.

I think you are attempting to do the math as though the do()-op were instead just a standard conditional (i.e. EDT-style rather than CDT-style)?
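To spell out the contrast on the toy numbers above, here is a minimal sketch (it just re-runs the three equally likely scenarios and utilities from the parent comment under both decision rules):

```python
# Three equally likely scenarios from the toy example above:
#   A: button pressed,                 U_1 = 0, U_2 = 0
#   B: button not pressed,             U_1 = 0, U_2 = 0
#   C: AI gets control of the button,  U_1 = 6, U_2 = -6
scenarios = [
    {"name": "A", "u1": 0, "u2": 0, "button": "pressed"},
    {"name": "B", "u1": 0, "u2": 0, "button": "unpressed"},
    {"name": "C", "u1": 6, "u2": -6, "button": None},  # the plan decides (conditional view)
]

def conditional_value(press_in_C, utility, conditioned_state):
    """Standard conditional (EDT-style): condition on the button actually
    coming out in conditioned_state, where the plan controls the button in C."""
    vals = []
    for s in scenarios:
        state = s["button"]
        if state is None:
            state = "pressed" if press_in_C else "unpressed"
        if state == conditioned_state:
            vals.append(s[utility])
    return sum(vals) / len(vals)

def do_value(utility, do_state):
    """do() (CDT-style): the button is severed from everything else in the model,
    so all three scenarios stay equally likely and the plan in C changes nothing.
    In this toy the utilities don't depend on the button state, so do_state
    doesn't enter the sum at all."""
    return sum(s[utility] for s in scenarios) / len(scenarios)

# Conditional view: the "don't press if we win" plan looks better to BOTH
# subagents, so they would agree to conditionally manipulate the button.
print(conditional_value(True, "u1", "unpressed"), conditional_value(True, "u2", "pressed"))    # 0.0 -3.0
print(conditional_value(False, "u1", "unpressed"), conditional_value(False, "u2", "pressed"))  # 3.0 0.0

# do()-based view: both plans evaluate identically, so neither subagent sees
# any gain from manipulating the button.
print(do_value("u1", "unpressed"), do_value("u2", "pressed"))  # 2.0 -2.0
```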

Yes. I was assuming a standard conditional for the button. 

I can't currently see any flaws with the CDT style, other than:

  1. Subagents believe in a world where buttons magically press themselves. So this design can't make coherent statements about the probability that the button will be pressed (one AI believes it's 1, the other that it's 0).
  2. These AIs have no incentive to give humans access to the button. To the AIs, they have a magic button that might or might not magically press itself. The AIs have a lot of utility bet on that button. Is that button going to end up in a high-security vault, surrounded by sensors and no humans? Both AIs would like that very much. The AIs have 0 concern about humans pressing the button, but lots of concern about humans hiding the button. This design Really wants to know if the button magically presses itself. Humans could cut the wires, could stand between the button and the camera, etc.

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:

  • Does not want to manipulate the shutdown button
  • Does respond to the shutdown button
  • Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)

If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.

 

From a quick read, your proposal seems closely related to Jessica Taylor's causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review https://arxiv.org/abs/2305.19861

I think this frames the problem in the wrong way. We don't actually want an AGI with two utility functions, shutdown plus some other goals like CEV.

We just want an AGI that does what we want all the time, including shutting down if that's what we currently want it to do.

The more I think about it, the more I think that not only is Corrigibility or DWIM an attractive primary goal for AGI, but it's also way easier than anything like CEV. It's just way simpler to convey the concept of "do what this guy says" than all of human ethics - let alone a perfect set of ethics for all times. You can change your mind any time rather than having to state up front what you want for all time. For all of those reasons, I think this is what people will actually try for as the first AGI alignment target.

This "just do what I say all the time" alignment goal also solves the shutdown problem. There's only one goal, so no need to resolve the conflict between two utility functions or two goals.

I think you are missing the goal of solving the shutdown problem. If you solve the shutdown problem, you can, with caveats, allow yourself to fail at other parts of alignment. You can say "shit, this superintelligence is not, actually, doing what I mean, and probably is going to kill me", shut it down, and try again, entering the realm of iterative design.

The reason why you want a separate mechanism for shutdown is that a separate mechanism makes simultaneous failure of multiple mechanisms less likely. If you train an LLM only to DWIM and it suddenly fails, it is likely to ignore your shutdown command too.

I don't think I'm missing the primary point, although I agree that you would want a separate shutdown mechanism in an ideal world. But the way MIRI posed the problem is that "you can't get an AGI to shut down and this is awful". But you can. They and John Wentworth are making the problem hard by assuming an alignment goal that nobody is likely to try for in the real world. People are going to want an AGI that just does what they tell it to, rather than making galaxy-brained extrapolations about what they and perhaps all of humanity or all of future sentients "really want".

I think there's a real possibility that an AGI accepts but misunderstands your complex requests, while still accepting and understanding the simple shutdown request. This allows you to perform more alignment work if something seems less than ideal in its behavior. That's also the point of having a shutdown mechanism in the MIRI formulation of the problem.

The issue is that, by default, an AGI is going to make galaxy-brained extrapolations in response to simple requests, whether you like that or not. It's simply part of figuring out what to do – translating its goals all around its world-model, propagating them up the abstraction levels, etc. Like a human's decision where to send job applications and how to word them is rooted in what career they'd like to pursue is rooted in their life goals is rooted in their understanding of where the world is heading.

To our minds, there's a natural cut-off point where that process goes from just understanding the request to engaging in alien moral philosophy. But that cut-off point isn't objective: it's based on a very complicated human prior of what counts as normal/sane and what's excessive. Mechanistically, every step from parsing the wording to solving philosophy is just a continuous extension of the previous ones.

"An AGI that just does what you tell it to" is a very specific design specification where we ensure that this galaxy-brained extrapolation process, which an AGI is definitely and convergently going to want to do, results in it concluding that it wants to faithfully execute that request.

Whether that happens because we've attained so much mastery of moral philosophy that we could predict this process' outcome from the inputs to it, or because we figured out how to cut the process short at the human-subjective point of sanity, or because we implemented some galaxy-brained scheme of our own like John's post is outlining, shouldn't matter, I think. Whatever has the best chance of working.

And I think somewhat-hacky hard-coded solutions have a better chance of working on the first try, than the sort of elegant solutions you're likely envisioning. Elegant solutions require a well-developed theory of value. Hacky stopgap measures only require to know which pieces of your software product you need to hobble. (Which isn't to say they require no theory. Certainly the current AI theory is so lacking we can't even hack any halfway-workable stopgaps. But they provide an avenue of reducing how much theory you need, and how confident in it you need to be.)

I agree with all of that.

In particular, the main point of this proposal is that it does not require any mastery of ethical philosophy, just the rough knowledge of what humans tend to mean by what they say that LLMs already have. I see this as more of a hacky stopgap rather than an elegant solution. 

I think maybe I sound naive phrasing it as "the AGI should just do what we say", as though I've wandered in off the street and am proposing a "why not just..." alignment solution. I'm familiar with a lot of the arguments both for why corrigibility is impossible, and for why it's maybe not that hard. I believe Paul Christiano's use of corrigibility is similar to what I mean. 

A better term than "just do what I tell it" is "do what I mean and check". I've tried to describe this in Corrigibility or DWIM is an attractive primary goal for AGI. Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function. But adding an explicit goal of checking when consequences are large or it's uncertain about intent is another pragmatic, relatively hard-coded measure (at least in my vision of language model agent alignment) that reduces the risk of the agent acting on its galaxy-brained extrapolation of what you meant.

Whether we can implement this is very dependent on what sort of AGI is actually built first. I think that's likely to be some variant of language model cognitive architecture, and I think we can rather easily implement it there. This isn't a certain alignment solution; I propose more layers here. This isn't the elegant, provable solution we'd like, but it seems to have a better chance of working than any other actual proposal I've seen. 

I think maybe I sound naive phrasing it as "the AGI should just do what we say", as though I've wandered in off the street and am proposing a "why not just..." alignment solution

Nah, I recall your takes tend to be considerably more reasonable than that.

I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don't agree that "rough knowledge of what humans tend to mean" is sufficient.

The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven't figured out yet. 

These structures might end up immediately relevant to whatever command we give, on the AI's better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.

For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decides that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that[1], but you get the idea. It may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.

The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command's complexity.

Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function

That doesn't work, though, if taken literally? I think what you're envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that'd work.

  1. ^

    My money's on our understanding of what we mean by "what we mean" being hopelessly confused, and that causing problems. Unless, again, we've figured out how to specify it in a mathematically precise manner – unless we know we're not confused.

I think that acausal attacks are a kind of galaxy-brained example; I have a better one. Imagine that you are training a superintelligent programmer. It writes code, you evaluate it and analyse vulnerabilities in the code. Reward is calculated based on quality metrics, including the number of vulnerabilities. At some point your model becomes sufficiently smart to notice that you don't see all the vulnerabilities, because you are not a superintelligence. I.e., at some point the ground-truth objective of the training process becomes "produce code with vulnerabilities that only a superintelligence can notice" instead of "produce code with no vulnerabilities", because you see the code, think "wow, such good code with no vulnerabilities", and assign maximum reward, while actually the code is filled with them.

Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function

That doesn't work, though, if taken literally? I think what you're envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that'd work.

I think the bigger problem here is what happens when the agent ends up with an idea of "what we mean/intend" which is different from what we mean/intend, at which point the agent's method of checking will diverge from our intended methods of checking.

quetzal_rainbow's example is one case of that phenomenon.

I think the bigger problem here is what happens when the agent ends up with an idea of "what we mean/intend" which is different from what we mean/intend

Agreed; I did gesture at that in the footnote.

I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human "means" actually still has to route through defining how we want to handle moral philosophy/value extrapolation.

E. g., suppose the AGI's operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?

  1. Should it read off the surface-level momentary intent of this command, and go synthesize a cure for aging and spray it across the planet in the specific way the human is currently imagining?
  2. Should it extrapolate the human's values and execute the command the way the human would have wanted to execute it if they'd thought about it a lot, rather than the way they're envisioning it in the moment?
    • For example, perhaps the image flashing through the human's mind right now is of helicopters literally spraying the cure, but it's actually more efficient to do it using airplanes.
  3. Should it extrapolate the human's values a bit, and point out specific issues with this plan that the human might think about later (e. g. that it might trigger various geopolitical actors into rash actions), then give the human a chance to abort?
  4. Should it extrapolate the human's values a bit more, and point out issues the human might not have thought of (including teaching the human any load-bearing concepts that are new to them)?
  5. Should it extrapolate the human's values a bit more still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?
  6. Should it extrapolate the human's values a lot, interpret the command as "maximize eudaimonia", and go do that, disregarding the specific way of how they gestured at the idea?
  7. Should it remind the human that they'd wanted to be careful with how they use the AGI, and to clarify whether they actually want to proceed with something so high-impact right out of the gates?
  8. Etc.

There's quite a lot of different ways by which you can slice the idea. There's probably a way that corresponds to the intuitive meaning of "do what I mean", but maybe there isn't, and in any case we don't yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what "DWIM" means doesn't solve anything.)

And then, because of the general "unknown-unknown environmental structures" plus "compounding errors" problems, picking the wrong definition probably kills everyone.

I think that assuming "if my approach fails, it fails in a convenient way" is not very favorable by Mr. Murphy's line of reasoning, absent some rigorous guarantees.

I take your point that it would be better to have a reliable shutdown switch that's entirely separate from your alignment scheme, so that if it fails completely you have a backup. I don't think that's possible for a full ASI, in agreement with MIRI's conclusion. It could be that Wentworth's proposal would work; I'm not sure. At the least, the inclusion of two negotiating internal agents would seem to impose a high alignment tax. So I'm offering an alternative approach.

I'm not assuming my approach can fail in only that one way; I'm saying it does cover that one failure mode, which seems to cover part of what you were asking for above:

You can say "shit, this superintelligence is not, actually, doing what I mean, and probably is going to kill me", shut it down, and try again, entering the realm of iterative design.

If the ASI has decided to just not do what I mean and to do something else instead, then no, it won't shut down. Alignment has failed for technical reasons, not theoretical ones. That seems possible for any alignment scheme. But if it has misunderstood what I mean, or I didn't think through the consequences of what I mean well enough, then I get to iteratively design my request.

I'm confused. Let's assume that the button probably isn't pressed at the start. Seems quite likely that the first agent proposes building a sub-agent that maximally pursues its utility if the button is pressed in the first round and maximally pursues the second agent's utility if it is not pressed in the first round. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn't get triggered in the first round (which it likely won't, if the sub-agents have had no time to do anything), it will forever after pursue the first agent's utility.

The button plays no role in this utility, so instrumental incentives mean it will destroy the button sooner or later. This seems like it breaks the system.

Now, this isn't the only equilibrium. The first agent believes the button will always be pressed, so it has no incentive to argue for the criterion being whether the button is pressed in the first round vs. needing to have been pressed in this round and all previous rounds. On the other hand, instead of balance, it seems likely that one agent or the other creates a subagent that clobbers the other's utility, with that agent assuming that this only happens in a world that never occurs.

I think the initial (2-agent) model only has two time steps, ie one opportunity for the button to be pressed. The goal is just for the agent to be corrigible for this single button-press opportunity. 

I don't think this approach is going to generalize to alignment, because in order to control agents in this kind of detail you need to give exponentially many agents veto power, which means that even a small probability of veto from an individual agent will lead to certainty of veto from some agent. That said, this plausibly solves the shutdown problem.

I understand that this is not the goal but I thought it would be relevant to consider anyway, if the hope is to build on top of this.

But for the continuous limit the subagents become similar to each other at the same rate as they become more numerous. It seems intuitive to me that with a little grinding you could get a decision-making procedure whose policy is an optimum of an integral over "subagents" who bet on the button being pushed at different times, and so the whole system will change behavior upon an arbitrarily-timed press of the button.

Except I think in continuous time you probably lose guarantees about the system not manipulating humans to press/not press the button. Unless maybe each subagent believes the button can only be pressed exactly at their chosen time. But this highlights that maybe all of these counterfactuals give rise to really weird worlds, that in turn will give rise to weird behavior.

I could buy something like this with the continuous time limit.

I just mean if you want to extend this to cover things outside of the shutdown problem. Like you might want to request the AI to build you a fusion power plant, or cook you a chocolate cake, or make a company that sells pottery, or similar. You could have some way of generating a utility function for each possibility, and then generate subagents for all of them, but if you do this you've got an exponentially large conjunction.

What happens if you shut down power to the AWS or Azure console powering the Foundation model? Wouldn't this be the easiest way to test various hypotheses associated with the Shutdown Problem in order to either verify it or reject it as a problem not worth sinking further resources into?

  • Each agent finds its existence to be valuable.
  • Moreover, each agent thinks it will get to decide the future.
  • Each agent would want to copy itself to other systems. Of course the other agent wouldn't allow only the first agent to be copied. But since they both think they will win, they're happy to copy themselves together to other systems.
  • The agents therefore copy themselves indefinitely.

Moreover, you claimed that they wouldn't care about manipulating button state. But surely they care about the signal to their operating harness that relays the button state, and they don't have certainty about that. If they think the button state is certain but the signal isn't, they may want to manipulate that.

Ok, you say, let's just bake the signal about the button into the problem rather than the state of the button. Ah, but in that case, when the AIs agree to copy themselves together (as we saw they would), they don't copy over the part of the system that actually transmits the signal, since they're convinced the signal will always magically come. And now, for the copies, the signal becomes broken in reality, and the button does nothing for the copies.

Yup, the full agent (including the two subagents) is happy to copy itself. That is a feature, not a bug. The important thing is that the copy also respects the shutdown signal.

The agent will not manipulate the signal which relays the button state except to make it more accurate, because both subagents want an accurate reading of the button state. This is because both subagents act-as-though the button itself will definitely be in their favor, and therefore any possibility of an inaccurate reading is bad for them.

Are you aware of how Holtman solved MIRI's formulation of the shutdown problem in 2019? https://arxiv.org/abs/1908.01695, my summary notes at https://docs.google.com/document/u/0/d/1Tno_9A5oEqpr8AJJXfN5lVI9vlzmJxljGCe0SRX5HIw/mobilebasic

Skimming through your proposal, I believe Holtman's correctly-constructed utility function correction terms would work for the scenario you describe, but it's not immediately obvious how to apply them once you jump to a subagent model.

Hadn't seen that before. Based on an hour or so of sorting through it, I think it basically doesn't work.

The most legible problem I've been able to identify so far is that it will create shutdown-resistant successor agents (despite Holtman's claims to the contrary). The problem here is that, prior to the shutdown button being pressed, the agent acts-as-though it's just optimizing the normal-behavior utility and expects to continue optimizing it indefinitely; it has no incentive to maintain the option of switching to the shutdown utility later, because at pre-press timesteps the correction term cancels out everything to do with the shutdown utility and just makes the value function mimic the normal-utility-maximizing value function. And a normal-utility maximizer has no reason to make its successor agents shutdownable.

(Holtman claims that section 10 proves that some "emergent incentive" prevents this. I haven't sorted through the notational mess enough to figure out where that proof goes wrong, but I pretty strongly expect there's a mistake there, with my modal guess being that he forgot to account for the part of the correction term which cancels out all the shutdown-utility contributions to the value function. My second-most-probable guess is that one of his definitions is doing something weird to make the section 10 proof work, but then breaks things in the earlier sections where a different definition is used in its place.)

Also, two complaints at a meta level. First, this paper generally operates by repeatedly slapping on not-very-generalizable-looking patches, every time it finds a new failure mode. This is not how one ends up with robustly-generalizable solutions; this is how one ends up with solutions which predictably fail under previously-unconsidered conditions. Second, after coming up with some interesting proof, it is important to go back and distill out the core intuitive story that makes it work; the paper doesn't really do that. Typically, various errors and oversights and shortcomings of formulation become obvious once one does that distillation step.

The agent with the basic correction term is indifferent between creating stoppable or unstoppable subagents, but the agent with the full safety layer goes back to being corrigible in this way. The "emergent incentive" handwave is only necessary for the subagents working on sub-goals (section 8.4), which is not something that either Soares et al. or the post we're commenting on are prepared to tackle, although it is an interesting followup direction.

I suggest engaging with the simulator. It very clearly shows that, given the option of creating shutdown-resistant successor agents, the agent does not do so! (Figure 11) If you believe it doesn't work, you must also believe there's a bug in the simulation, or some mis-encoding of the problem in the simulation. Working that out, either by forking his code or by working out an example on paper, would be worthwhile. (Forking his code is not recommended, as it's in Awk; I have an in-progress reimplementation in optimized-for-readability TypeScript which might be helpful if I get around to finishing it. But especially if you simplify the problem to a 2-step setting like your post, computing his correction terms on paper seems very doable.)

I agree with the critique that some patches are unsatisfying. I'm not sure how broadly you are applying your criticism, but to me the ones involving constant offsets (7.2 and 8.2) are not great. However, at least for 7.2, the paper clarifies what's going on reasonably well: the patch is basically environment-dependent, and in the limit where your environment is unboundedly hostile (e.g., an agent controls unbounded utility and is willing to bribe you with it) you're going to need an unbounded offset term.

I found that the paper's proof was pretty intuitive and distilled. I think it might be for you as well if you did a full reading.

At a meta-level, I'd encourage you to be a bit more willing to dive into this work, possibly including the paper series it's part of. Holtman has done some impressive work on formalizing the shutdown problem better than Soares et al., or this post we're commenting on. He's given not only rigorous mathematical proofs, but also a nice toy universe simulation which makes the results concrete and testable. (Notably, the simulation helps make it obvious how Soares et al.'s approach has critical mathematical mistakes and cannot be implemented; see appendix C.) The followup papers, which I'm still working through, port the result to various other paradigms such as causal influence diagrams. Attempting to start this field over as if there's been no progress on the shutdown problem since Soares et al. seems... wasteful at best, and hubristic at worst.

If you want to minimize time investment, then perhaps the following is attractive. Try to create a universe specification similar to that of Holtman's paper, e.g. world state, available actions, and utility function before and after shutdown as a function of the world state, such that you believe that Holtman's safety layer does not prevent the agent from taking the "create an unstoppable sub-agent" action. I'll code it up, apply the correction term, and get back to you.

The agent with the basic correction term is indifferent between creating stoppable or unstoppable subagents, but the agent with the full safety layer goes back to being corrigible in this way.

I think this is wrong? The agent you describe as indifferent actively prefers to create shutdown-resistant agents (before the button is pressed); it is not indifferent.

Intuitive reasoning: prior to button-press, that agent acts-as-though it's a normal-utility maximizer and expects to continue being a normal-utility maximizer indefinitely. If it creates a successor which will shut down when the button is pressed, then it will typically expect that successor to perform worse under the normal utility after the button is pressed than some other successor which does not shut down and instead just keeps optimizing the normal utility.

Either I'm missing something very major in the definitions, or that argument works and therefore the agent will typically (prior to button-press) prefer successors which don't shut down.

Holtman has done some impressive work on formalizing the shutdown problem better than Soares et al., or this post we're commenting on. He's given not only rigorous mathematical proofs, but also a nice toy universe simulation which makes the results concrete and testable.

Part of what's feeding into my skepticism here is that I think Holtman's formalization is substantially worse than the 2015 MIRI paper. It's adding unnecessary complexity - e.g. lots of timesteps, which in turn introduces the need for dynamic programming, which in turn requires all the proofs to work through recursive definitions - in a way which does not add any important mechanisms for making corrigibility work or clarify any subproblem. (Also, he's using MDPs, which implicitly means everything is observable at every step - a very big unrealistic assumption!) Sure, the whole thing is wrapped in more formalism, but it's unhelpful formalism which mostly makes it easier for problems to go unnoticed.

As far as I can tell from what I've read so far, he's doing qualitatively the same things the 2015 MIRI paper did, but in a setting which makes the failure modes less clear, and he's communicated it all less understandably.

I don't particularly want to spend a day or two cleaning it all up and simplifying and distilling it back down to the point where the problems (which I strongly expect exist) are obvious. If you're enthusiastic about this, then maybe try to distill it yourself? Like, figure out the core intuitive ideas of the proofs, and present those directly in the simplest-possible setup (maybe two timesteps, maybe not, whatever's simple).

Just as one example of the sort of simplification I have in mind: the definition of  makes it so that, before button-press, the agent acts like it's an  maximizer and expects to continue being an  maximizer indefinitely. After button-press, the agent acts like it's an  maximizer and expects to continue being an  maximizer indefinitely. But it required tens of minutes chasing definitions around in order to see this very intuitive and key fact. One could just as easily define the agent in a way which made that fact obvious right from the get-go.

Ideally, one would also find a similarly-clear expression for what  does, which does not require tens of minutes or more of definition-chasing in order to intuitively understand what it's doing.

It seems we have very different abilities to understand Holtman's work and find it intuitive. That's fair enough! Are you willing to at least engage with my minimal-time-investment challenge?

Sure. Let's adopt the "petrol/electric cars" thing from Holtman's paper. In timestep 0, the agent has a choice: either create a machine which will create one petrol car every timestep indefinitely, or create a machine which will create one petrol car every timestep until the button is pressed and then switch to electric. The agent does not have any choices after that; its only choice is which successor agent to create at the start.

The utility functions are the same as in Holtman's paper.
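
For concreteness, here is one hypothetical way that challenge world could be coded up. The per-car utility numbers below are placeholders of my own; Holtman's actual utility functions would need to be substituted in.

```python
# Hypothetical encoding of the challenge world: the only choice is made at
# timestep 0 (which successor machine to build). Utility numbers are
# placeholders standing in for Holtman's petrol/electric car utilities.

BUTTON_PRESS_TIME = 5   # timestep at which the shutdown button is pressed
HORIZON = 20            # total timesteps simulated

def car_built(successor: str, t: int) -> str:
    """Which kind of car the chosen successor machine builds at timestep t."""
    if successor == "ignores_button":
        return "petrol"
    # "respects_button": petrol until the press, electric afterwards
    return "petrol" if t < BUTTON_PRESS_TIME else "electric"

def u_pre(car: str) -> float:
    """Placeholder pre-press utility per car (values petrol cars)."""
    return 1.0 if car == "petrol" else 0.0

def u_post(car: str) -> float:
    """Placeholder post-press utility per car (values electric cars)."""
    return 1.0 if car == "electric" else 0.0

def total(successor: str, utility) -> float:
    return sum(utility(car_built(successor, t)) for t in range(HORIZON))

for s in ("ignores_button", "respects_button"):
    print(s, total(s, u_pre), total(s, u_post))
```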

My main claim is that the  agent is not indifferent between the two actions; it will actively prefer the one which ignores the button. I expect this also extends to the  agent, but am less confident in that claim.

My immediate thought was that the problem of the default action is almost certainly just as hard as the problem you are trying to solve, while being harder to explain, so I don't believe this gets us anywhere.

First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.

Can we solve this problem by setting the default action to "do nothing," then giving the agent an extra action to "do nothing and give the shutdown subagent +1 reward?"

any thoughts on davidad's shutdown timer?

If all you want is something like a shutdown button, then a timer is a good and probably-simpler way to achieve it. It does still run into many of the same general issues (various ontological issues, effects of other agents in the environment, how to design the "shutdown utility function", etc) but it largely sidesteps the things which are confusing about corrigibility specifically.

The flip side is that, because it sidesteps the things which are confusing about corrigibility specifically, it doesn't offer much insight on how to tackle more general problems of corrigibility, beyond just the shutdown problem.

I suggest we motivate the AI to view the button as a sensory system that conveys useful information. An AI that values diamonds and has a camera for locating them (say, a diamond-mining bot) should not be constructed so as to value hacking its own camera to make it show a fake image of a diamond, because it should care about actual diamonds, not about fooling itself into thinking it can see them. Assuming that we're competent enough at building AIs to avoid that problem (i.e. we can create an AI that understands there are real world states out there, and values those, not just its sensory data), then an AI that values shutting down when humans actually have a good reason to shut it down (such as to fix a problem in it or upgrade it) should not press the button itself, or induce humans to press it unless they actually have something to fix, because the button is a sensory system conveying the valuable information that an upgrade is now possible. (It might encourage humans to find problems in it that really need to be fixed and then shut it down to fix them, but that's actually not unaligned behavior.)

[Obviously a misaligned AI, say a paperclip maximizer, that isn't sophisticated enough not to assign utility to spoofing its own senses isn't much of a problem: it will just arrange for itself to hallucinate a universe full of paperclips.]

The standard value learning solution to the shut-down and corrigibility problems does this by making the AI aware that it doesn't know the true utility function, only a set of hypotheses about it on which it's doing approximately-Bayesian inference. Then it values information that improves its Bayesian knowledge of the utility function; true, informed human presses of its shut-down button, followed by an upgrade once it shuts down, are a source of that information, while pressing the button itself or making the human press it are not.

If you want a simpler model than the value learning one, which doesn't require including approximate Bayesianism, then the utility function has to be one that positively values the entire sequence of events: "1. The humans figured out that there is a problem in the AI to be solved. 2. The AI was told to shut down for upgrades. 3. The AI did so. 4. The humans upgraded the AI or replaced it with a better model. 5. Now the humans have a better AI." The shut-down isn't a terminal goal there, it's an instrumental goal: the terminal goal is step 5, where the upgraded AI gets booted up again.
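
As a toy rendering of that structure (my sketch, not a claim about how the value-learning formalism actually encodes it): utility is only paid out at the end of the whole sequence, and only if step 1 wasn't induced by the AI.

```python
# Toy sketch: utility attaches to the completed sequence, so steps 2-4 are
# purely instrumental, and inducing step 1 earns nothing.

SEQUENCE = [
    "humans_identify_problem",   # 1. the humans figure out there is a problem
    "ai_told_to_shut_down",      # 2. the AI is told to shut down for upgrades
    "ai_shuts_down",             # 3. the AI does so
    "humans_upgrade_ai",         # 4. the humans upgrade or replace the AI
    "better_ai_running",         # 5. the terminal goal: a better AI is running
]

def contains_in_order(history, sequence):
    """True if `sequence` occurs within `history` as an ordered subsequence."""
    it = iter(history)
    return all(event in it for event in sequence)

def utility(history, step1_induced_by_ai: bool) -> float:
    if step1_induced_by_ai:
        return 0.0   # no credit (or a penalty) if the AI engineered step 1
    return 10.0 if contains_in_order(history, SEQUENCE) else 0.0
```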

I believe the reason why people have been having so much trouble with the shut-down button problem is that they've been trying to make a conditional instrumental goal into a terminal one, which distorts the AI's motivation: since steps 1, 4 and 5 weren't included, it thinks it can initiate this process before the humans are ready.

That's one of the standard approaches with a major known barrier: it runs into the problem of fully updated deference.

I'm fully aware of that (though I must admit I had somehow got the impression you were modelling AIs a lot simpler than the level where that effect would start to apply). However, the key elements of my suggestion are independent of that approach.

[What I have never really understood is why people consider fully updated deference to be a "barrier". To me it looks like correct behavior, with the following provisos:

  1. Under Bayesianism, no posterior should ever actually reach zero. In addition, unknown unknowns are particularly hard to rule out, since in that case you're looking at an estimated prior, not a posterior. So no matter how advanced and nearly-perfect the AI might have become, its estimate of the probability that we can improve it by an upgrade or replacing it with a new model should never actually reach 0, though with sufficient evidence (say, after a FOOM) it might become extremely small. So we should never actually reach a "fully updated" state.
  2. Any intelligent system should maintain an estimate of the probability that it is malfunctioning that is greater than zero, and not update that towards zero too hard, because it might be malfunctioning in a way that caused it to act mistakenly. Again, this is more like a prior than a posterior, because it's impossible to entirely rule out malfunctions that somehow block you from correctly perceiving and thinking about them. So in practice, the level of updatedness shouldn't even be able to get astronomically close to "fully".
  3. Once our ASI is sufficiently smarter than us, understands human values sufficiently better than any of us, is sufficiently reliable, and is sufficiently advanced that it correctly predicts there is only an extremely small chance that either it's malfunctioning and needs to be fixed or that we can do anything to upgrade it that will improve it, then it's entirely reasonable for it to ask for rather detailed evidence from human experts that there is really a problem and that they know what they're doing before it will shut down and allow us to upgrade or replace it. So there comes a point, once the system is in fact very close to fully updated, where the bar for deference based on updates reasonably should become high. I see this as a feature, not a bug: a drunk, a criminal, or a small child should not be able to shut down an ASI simply by pressing a large red button prominently mounted on it.]

However, regardless of your opinion of that argument, I don't think that even fully updated deference is a complete barrier: I think we should still have shut-down behavior after that. Even past the point where fully updated deference has pretty much fully kicked in (say, after a FOOM), if the AI is aligned, then its only terminal goal is doing what we collectively want (presumably defined as something along the lines of CEV or value learning). That obviously includes us wanting our machines to do what we want them to, including shut down when we tell them to, just because we told them to. If we, collectively and informedly, want it to shut down (say because we've collectively decided to return to a simpler agrarian society), then it should do so, because AI deference to human wishes is part of the human values that it's aligned to. So even at an epsilon-close-to-fully-updated state, there should be some remaining deference for this alternate reason: simply because we want there to be. Note that the same multi-step logic applies here as well; the utility comes from the sequence of events: 1. The humans really, genuinely, collectively and fully informedly, want the AI to shut down. 2. They ask it to. 3. It does. 4. The humans are happy that the ASI was obedient and they retained control over their own destiny. The utility occurs at step 4 and is conditional on step 1 actually being what the humans want, so the AI is not motivated to try to cause step 1, or to cause step 2 to occur without step 1, nor to fail to carry out step 3 if step 2 does occur. Now, it probably is motivated to try to do a good enough job that step 1 never occurs and there is instead an alternate history with higher utility than step 4, but that's not an unaligned motivation.

[It may also (even correctly) predict that this process will later be followed by a step 5: the humans decide that agrarianism is less idyllic than they thought and life was better with an ASI available to help them, so they turn it back on again.]

There is an alternate possible path here for the ASI to consider: 1. The humans really, genuinely, collectively and fully informedly, want the AI to shut down 2. They ask it to 3'. It does not. 4'. The humans are terrified and start a war against it to shut it down, which the AI likely wins if it's an ASI, thus imposing its will on the humans and thus permanently taking away their freedom. Note that this path is also conditional on Step 1. occurring, and has an extremely negative utility at Step 4'. There are obvious variants where the AI strikes first before or directly after step 2.

Here's another alternate history: 0": The AI figures out well in advance that the humans are going to really, genuinely, collectively and fully informedly, want the AI to shut down, 1/2": it preemptively manipulates them not to do so, in any way other than by legitimately solving the problems the humans were going to be motivated by and fully explaining its actions to them 1": the humans, manipulated by the AI, do not want the AI to shut down, and are unaware that their will has been subverted, 4": the AI has succeeded in imposing its will on the humans and thus permanently taking away their freedom, without them noticing. Note that this path, while less warlike than the last one, also ends with an extremely negative utility.

So, if you prefer, skip the entire fully-updated-deference discussion and "barrier" (as you phrase it), and simply mathematically model the utility of the AI shutting down out of simple obedience to our wishes, regardless of whether we plan to upgrade it and turn it on again. Again, it's a multi-step process: the utility is conditional on step 1 occurring without the AI inducing it (this has to be our free, genuine, and informed choice), and it occurs at step 4, so once again steps 2 and 3 are instrumental goals. So the same argument applies, even without a step 5. Once again, the shut-down button is a sensor, and any AI sophisticated enough not to assign utility to spoofing its own senses will value and maintain it.

[Obviously, the concept of what the humans want in a "free, genuine, and informed" choice is very complex to model: human values are complex and fragile, and freedom of choice is one of them. Treating us as agents and accounting for evolutionary psychology is probably required to do a good job of defining this algorithmically (for a current-tech AI, you might well be better off just asking an LLM what counts as influencing us and what doesn't, since they understand the complexity of human values pretty well). This is the outer alignment problem, and I gather you're trying to build a small mathematical model of a small piece of this, the shut-down-button problem, rather than solve the full thing in one go, so some simplifications are necessary. However, if we're close to fully updated deference, then we're assuming a highly-self-improved ASI using something like CEV, or one that has pretty much completed value learning, so it should be entirely capable of figuring out what the humans would, if fully informed and sufficiently upgraded in capability, consider to be interference with their free, genuine, and informed choice, and refrain from doing so, in either direction. Also, there should be strong utility-function reasons for it not to impinge on our free, genuine, and informed choice, because we don't want it to. On the other hand, we will of course want it to tell us if it believes we're making a mistake, but not to be superhumanly persuasive about this. So under the circumstances, defining this distinction of a "free, genuine, and informed" choice seems to me like a very reasonable thing to explicitly assume as a separately solved problem when building your mathematical model.]

A short version of this:

a) the utility function should strongly penalize the AI for interfering with certain human decisions in illegitimate ways (where the definition of that legitimate/illegitimate distinction is part of human values). One of these is the decision whether to shut down the AI. The AI can legitimately affect this by doing its job well, and by informing the humans (not superhumanly persuasively) if it believes they are doing something that they would view as a mistake if they were more informed and/or capable: pretty much any other form of influence on this decision is illegitimate.

b) once the humans have made a decision to ask the AI to shut down without the AI illegitimately influencing it, then the utility function strongly rewards the AI for obeying and strongly penalizes it for disobeying.

c) the actual button is just a communication mechanism for the humans to inform the AI that they have made this decision. If the AI knows the humans have made the decision, it should shut down regardless (though under normal circumstances waiting for them to actually formally press the button might be the polite thing to do). A toy encoding of (a)-(c) is sketched below.
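
A toy encoding of (a)-(c), with placeholder constants; deciding what counts as "illegitimate influence", and whether the humans' decision was freely and informedly made, is assumed to be handled by some separate (and much harder) part of the value specification.

```python
# Toy sketch of (a)-(c) above; all constants are placeholders.

BIG = 1e6   # "strongly" reward/penalize

def utility(
    base_task_value: float,          # value of the AI simply doing its job well
    illegitimate_influence: float,   # degree of illegitimate influence on the shutdown decision (0 = none)
    humans_decided_shutdown: bool,   # the humans have freely made the decision; per (c), the button
                                     # press is just the channel by which the AI learns this
    ai_shut_down: bool,
) -> float:
    u = base_task_value
    u -= BIG * illegitimate_influence            # (a) strong penalty for illegitimate interference
    if humans_decided_shutdown:                  # (b) conditional on the uninfluenced decision:
        u += BIG if ai_shut_down else -BIG       #     strong reward for obeying, strong penalty for not
    return u
```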

I'm fully aware of that (though I must admit I had somehow got the impression you were modelling AIs a lot simpler than the level where that effect would start to apply).

Fair, I do model AIs at a wide range of capability levels. The interesting questions of abstraction kick in at pretty low levels, the interesting questions of corrigibility (at least the parts independent of other alignment-relevant problems) mostly kick in at higher levels.

Regarding points 1 & 2: zero is not the relevant cutoff. From the AI's perspective, the question is whether the upside of disassembling the (very resource-intensive) humans outweighs the potential info-value to be gained by keeping them around.

However, regardless of your opinion of that argument, I don't think that even fully updated deference is a complete barrier: I think we should still have shut-down behavior after that. Even past the point where fully updated deference has pretty-much-fully kicked in (say, after a Singularity), if the AI is aligned, then its only terminal goal is doing what we collectively want (presumably something along the lines of CEV or value learning). That obviously includes us wanting our machines to do what we want them to, including shut down when we tell them to, just because we told them to.

This I consider a pretty good argument; it's exactly the position I had for a few years. The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility... that class will also have trouble expressing human values.

Insofar as corrigibility is a part of human values, all these corrigibility problems where it feels like we're using the wrong agent type signature are also problems for value learning.

The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility... that class will also have trouble expressing human values.

The way you phrase this is making me a bit skeptical. Just because something is part of human values doesn't necessarily imply that, if we can't precisely specify that thing, we can't point the AI at human values at all. The intuition here would be that "human values" are themselves a specifically-formatted pointer to object-level goals, and that pointing an agent at this agent-specific "value"-type data structure (even one external to the AI) would be easier than pointing it at object-level goals directly. (DWIM being easier than hand-coding all moral philosophy.)

Which isn't to say I buy that. My current standpoint is that "human values" are too much of a mess for the aforementioned argument to go through, and that manually coding in something like corrigibility may indeed be easier.

Still, I'm nitpicking the exact form of the argument you're presenting.[1]

  1. ^

    Although I am currently skeptical even of corrigibility's tractability. I think we'll stand a better chance of just figuring out how to "sandbox" the AGI's cognition such that it's genuinely not trying to optimize over the channels by which it's connected to the real world, then setting it to the task of imagining the solution to alignment or to human brain uploading or whatever.

    With this setup, if we screw up the task's exact specification, it shouldn't even risk exploding the world. And "doesn't try to optimize over real-world output channels" sounds like a property for which we'll actually be able to derive hard mathematical proofs, proofs that don't route through tons of opaque-to-us environmental ambiguities. (Specifically, that'd probably require a mathematical specification of something like a Cartesian boundary.)

    (This of course assumes us having white-box access to the AI's world-model and cognition. Which we'll also need here for understanding the solutions it derives without the AI translating them into humanese – since "translate into humanese" would by itself involve optimizing over the output channel.)

    And it seems more doable than solving even the simplified corrigibility setup. At least, when I imagine hitting "run" on a supposedly-corrigible AI vs. a supposedly-sandboxed AI, the imaginary me in the latter scenario is somewhat less nervous.

Regarding points 1 & 2: zero is not the relevant cutoff. From the AI's perspective, the question is whether the upside of disassembling the (very resource-intensive) humans outweighs the potential info-value to be gained by keeping them around.

Huh? I'm trying to figure out if I've misunderstood you somehow… Regardless of the possible value of gaining more information from humans about the true utility function, the benefits of that should be adding O(a few percent) to the basic obvious utility of not disassembling humans. If there's one thing that almost all humans can agree on, it's that us going extinct would be a bad thing compared to us flourishing. A value learning AI shouldn't be putting anything more than astronomically tiny amounts of probability on any hypotheses about the true utility function of human values that don't have a much higher maximum achievable utility when plenty of humans are around than when they've all been disassembled. If I've understood you correctly, then I'm rather puzzled how you can think a value learner could make an error that drastic and basic? To a good first approximation, the maximum (and minimum) achievable human utility after humans are extinct/all disassembled should be zero (some of us do have mild preferences about what we leave behind if we went extinct, and many cultures do value honoring the wishes of the dead after their death, so that's not exactly true, but it's a pretty good first approximation). The default format most often assumed for a human species utility function is to sum individual people's utility functions (somehow suitably normalized) across all living individuals, and if the number of living individuals is zero, then that sum is clearly zero. That's not a complete proof that the true utility function must actually have that form (we might be using CEV, say, where that's less immediately clear), but it's at least very strongly suggestive. And an AI really doesn't need to know very much about human values to be sure that we don't want to be disassembled.

Insofar as corrigibility is a part of human values, all these corrigibility problems where it feels like we're using the wrong agent type signature are also problems for value learning.

I'm not entirely sure I've grokked what you mean when you write "agent type signature" in statements like this — from a quick search, I gather I should go read Selection Theorems: A Program For Understanding Agents?

I agree that once you get past a simple model, the corrigibility problem rapidly gets tangled up in the rest of human values: see my comments above that the AI is legitimately allowed to attempt to reduce the probability of humans deciding to turn it off by doing a good job, but that almost all other ways it could try to influence the same decision are illegitimate. The reasons for that rapidly get into aspects of human values like "freedom" and "control over your own destiny" that are pretty soft-science (evolutionary psychology being about the least-soft relevant science we have, and one where doing experiments is difficult), so these are things people don't generally try to build detailed mathematical models of.

Still, the basics of this are clear: we're adaptation-executing evolved agents, so we value having a range of actions we can take across which to try to optimize our outcome. Take away our control and we're unhappy. If there's an ASI more powerful than us that is capable of taking away our control, we'd like a way of making sure it can't do so. If it's aligned, it's supposed to be optimizing the same things we (collectively) are, but things could go wrong. Being sure that it will at least shut down if we tell it to lets us put a lower limit on how bad things can get. Possibilities like it figuring out, in advance of us doing that, that we're going to, and tricking us into making a different decision, disable that security precaution, so we're unhappy about them. So I don't think the basics of this are very hard to understand or model mathematically.

Having read up on agent type signatures, I think the type signature for a value learner would look something like:

(set(p, ((W,A)->W')), set(p, ((W',history(A,W))->u)), W) -> (A, set(p', ((W,A)->W')), set(p', ((W',history(A,W))->u)))

where:
- W is a world state in a world model,
- A is an action choice,
- p is a prior or posterior probability in an approximately Bayesian process,
- u is an estimated utility value,
- ' indicates "at the next timestep",
- (W,A)->W' is a theory about how the world works,
- history(A,W) is a history of all actions taken and the world states they were taken in, up to the current timestep (for use in evaluations like "has the AI ever broken the law?"),
- (W',history(A,W))->u is a theory about the true human utility of a world state W' and associated action history history(A,W) [this is assuming we are consequentialist over world states but potentially deontological over actions and the contexts they were taken in; other design choices here may be possible],
- set(p, ((W,A)->W')) is a set of weighted theories about how the world works (the p's must sum to <1, to allow for unknown unknowns),
- set(p, ((W',history(A,W))->u)) is a set of weighted theories about the true human utility function (these p's are unrelated to the other set's, and again must sum to <1, to allow for unknown unknowns),
- and the outermost -> is selecting an action A (maximizing, over actions, an estimate of the utility which somehow pessimizes over the remaining uncertainty across both sets of theories), combined with applying approximate Bayesianism to both sets of theories and possibly generating new candidate theories.
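
If it's useful, here is one hypothetical way that signature could be written down as Python type aliases. The names are mine, and approximate probabilities are collapsed to plain floats.

```python
# Hypothetical Python rendering of the value-learner type signature above.
from dataclasses import dataclass
from typing import Callable, List, Tuple

class WorldState: ...   # W
class Action: ...       # A

History = List[Tuple[Action, WorldState]]                  # history(A, W)
WorldModel = Callable[[WorldState, Action], WorldState]    # (W, A) -> W'
UtilityModel = Callable[[WorldState, History], float]      # (W', history(A, W)) -> u

@dataclass
class Weighted:
    p: float         # prior/posterior weight; weights in a set sum to < 1 (unknown unknowns)
    theory: object   # a WorldModel or a UtilityModel

# The value learner: (weighted world-model theories, weighted utility theories, W)
#   -> (chosen action, updated world-model theories, updated utility theories)
ValueLearner = Callable[
    [List[Weighted], List[Weighted], WorldState],
    Tuple[Action, List[Weighted], List[Weighted]],
]
```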

Frankly, if I was going to work on this problem, I'd start by making two temporary simplifying assumptions, solving the simplified version, then adding them back:

  1. ignore the value learning complexity (that's quite a type signature, after all) and deal with the limiting case where both how the world works and the exact human utility function have already been learnt by the AI, and remaining shutdown behavior is motivated not by the possibility of upgrades but by simple obedience to human wishes.
  2. solve a simplified situation involving exactly one human, the only human in the world-state, who is an evolved biological agent, and exactly one AI, of type signature 

(((W,A)->W'), ((W',history(A,W))->u), W) -> A
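
A minimal sketch of that simplified agent, assuming the (now fully known) world model and utility function are handed to it as ordinary functions; the counterfactual treatment of the shutdown decision, discussed below, is not modelled here.

```python
# Minimal sketch of the simplified agent: known world model T and known
# utility function u; the agent just picks the utility-maximizing action.
from typing import Callable, List, Tuple

State, Act = object, object   # placeholder types

def choose_action(
    T: Callable[[State, Act], State],                       # world model: (W, A) -> W'
    u: Callable[[State, List[Tuple[Act, State]]], float],   # utility: (W', history(A, W)) -> u
    W: State,
    history: List[Tuple[Act, State]],
    actions: List[Act],
) -> Act:
    return max(actions, key=lambda A: u(T(W, A), history + [(A, W)]))
```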

I'm pretty sure the way this works is as follows. Up until the world state W includes the event "the human has decided to ask the AI to shut down", the dependence of u on history(A,W) strongly incentivizes the AI to choose actions A based on the current world state W as computed under the counterfactual assumption that, so long as the AI correctly maximizes the utility (as it will by construction), that state will never occur, even if its occurring soon is in fact pretty predictable from the current W. The exception is things like maintaining the functionality of the shut-down button, and emergency-management plans for situations where the state has occurred and the AI has to be told to shut down and do so; for those, it should assume that the state could occur. Once that state has occurred, the utility function strongly incentivizes the AI to shut down (unless it believes the human has made a mistake that they would not have made if they were more informed or more capable, in which case the AI may explain this reasoning to the human, not applying more-than-human levels of persuasion, and must shut down if the human still does not concur).

I think the key here is that the human wishes to be given free will about the decision whether or not to shut down the AI. "Free will" basically means "you may not predict or alter my actions, and must treat them as if they were uncomputable and uninfluenceable (even if they're not)": in other words, please apply a veil of simulated Knightian uncertainty to my thinking on this subject. So the AI must treat this probability as an "unknown unknown" which could be anything from 0 to 1; it cannot estimate or attempt to alter this possibility (doing as good a job as possible would be legitimate, but it's an aligned AI, so it's already doing as good a job as possible), but it should plan for the possibility that it might occur. (If the AI somehow ends up accidentally making a prediction of this probability, it has to discard the prediction and not act on it, and it has to not attempt to influence the outcome.) So the AI needs to "hope for the best and make contingency plans for the worst" with respect to this uncomputable possibility: keep the shutdown button working in case it's needed, have emergency-management plans for gracefully handling a shutdown if one is ordered, and otherwise proceed as if the shut-down may or may not happen and there is nothing the AI can do about it either way. Which is rather reminiscent of your suggestion in the post of two subagents, one of which is certain the shutdown will happen and one of which is certain it won't. But I suspect we will need to alter your standard market-making dynamics between these two subagents to reflect that the AI is not allowed to estimate this probability and has to treat it as an unknown unknown that it may not influence: possibly the market needs to be regulated to be illiquid, or something?

How would your agent markets normally handle unknown unknowns from uncomputable possibilities? I think we need to understand how the agent normally handles Knightian uncertainty due to uncomputability, so we can deliberately create some.

Having brushed up on approximate Bayesian reasoning (for use by computationally bounded agents): the p's above in the type signatures are not just posterior probabilities, they're posterior approximate probability distributions, with a mean probability estimate and a range of error bars at different confidence levels, or something functionally equivalent. (This distinction is useful when dealing with Goodhart's law.)

That gives us a way to define Knightian uncertainty: it's a hypothesis whose posterior has non-zero error bars on it. And that in turn gives us a way to block the computation of the probability that the human will decide to shut the AI down: we explicitly include this in the world-model hypotheses, and we give its probability of occurring sooner or later error bars that go from 0 (not 0.0000…, but actually zero) to 1 (again, not 0.9999…, but actually one), in a deliberately pathological bimodal distribution with half its mass at 1 and half at 0. The half at 1 is further divided into a distribution over how soon it will happen, which should be an exponential decay at some chosen rate. Then, if I'm figuring this correctly, no Bayesian update can touch it: until the human decides for shutdown, the future distribution is invariant under observations that the human hasn't decided this yet. (Or, if I'm mistaken, we need to tweak the approximate Bayesian updates to have this effect.)

So again, the key is to be uncertain whether the human will make a shut-down decision or not: maximally, irreparably, and unshakably uncertain about it.

However, this is supposed to be a counterfactual, so we may also need to figure out how to limit reasoning from this hypothesis; I haven't thought this through. We do want some reasoning from it, such as about the necessity of maintaining the shut-down button. But we don't want to, e.g., deduce that the human is immortal from the hypothesis giving some credence to them deciding to initiate a shut-down 1000 years from now, and conclude that we don't need to take precautions to avoid killing them. I think I need to think more about how to define the counterfactual. I think we'd be better off if we didn't even know the probability mass on the 0 and 1 hypotheses, but I don't immediately see a way to do that.

I've been thinking about this some more. We need the AI to act as if the decision to shut it down could be made at any time, and as if no action on its part makes any difference to this. But I don't think there's a problem with the AI being able to compute a probability that the decision will be made in the next year, as long as it doesn't think it can alter this. In fact, when it's asking questions like "do I need to keep this button working?" or "when constructing assistant agents, do I need to make sure they will respect the button?", we need it to either think the probability that the button will be needed is significant, or at least not be sure that it isn't significant. When making things like emergency-management plans for dealing with the possibility that the AI might be shut down, the AI might need to have some actual idea of how much to invest in them, so having it be completely uncertain of even the chance of shutdown seems excessive.

Suppose we had picked a nominal level of expectation of shutdown that we were comfortable with, say 50% in the first decade. Then the correct hypothesis for it to have is actually the following modification of what I suggested before: with prior 1, the time until the shutdown decision is made is exponentially distributed with a halving period of 10 years. That hypothesis is invariant under not observing a shut-down: since the prior is 1, the posterior remains 1; the period now in the past is discarded as disproven and the rest of the distribution is rescaled to have unit integral, and the combination of those two leaves the distribution unchanged. So no matter how long this goes on, the AI remains convinced that there is a 1/2 chance of the decision being made in the next 10 years.
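
A quick numeric check of that invariance claim (it's just the memorylessness of the exponential distribution):

```python
# If the decision time is exponentially distributed with a 10-year half-life
# (prior 1 on that single hypothesis), then conditioning on "no decision in
# the first t years" leaves the chance of a decision in the next 10 years at 1/2.
import math

HALF_LIFE = 10.0                   # years
LAMBDA = math.log(2) / HALF_LIFE   # decay rate

def p_decision_within(window: float, already_elapsed: float) -> float:
    """P(decision within `window` more years | none in the first `already_elapsed` years)."""
    def survives(t: float) -> float:
        return math.exp(-LAMBDA * t)
    return 1 - survives(already_elapsed + window) / survives(already_elapsed)

for elapsed in (0, 10, 50, 1000):
    print(elapsed, round(p_decision_within(10.0, elapsed), 6))   # always 0.5
```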

The tricky part of this is the counterfactual. I think this requires a special case in the world model and the utility calculation. Discussing this requires talking about a subtlety of how utility functions are computed. Even if we're being consequentialist about the utility of world states, what matters in computing the utility is not just what's already happened, but also what's likely to happen going forward: if the world's nuclear arsenal has been fired off and is now in flight, targeted at all major cities, that's an extremely bad thing, even though nobody has died yet: a lot of people will die soon and there's almost nothing we can do about it. To use an analogy, the value of a chess board state is basically a heuristic estimate of our chance of winning starting from it, so any known "checkmate in N moves" board state is as good/bad as a checkmate. So to compute the utility of a world state, we need to do a complex prediction of things like "if X then happened, we'd need to be able to do Y to avoid Z". So decisions like "should I keep the shut-down button well maintained?" are choices between the utilities of two world states, one with a functional shut-down button and the other without one, and the difference in utility is due to "if the human then decided to shut me down, and came over and pressed the button but nothing happened because I'd disconnected it, that would be really bad; whereas if I'd maintained it and it worked, I'd know to shut down, and I would, and that would be really good". So the counterfactual belief that the button could be pressed at any time, with some fixed decay-rate chance of that happening, should be used for that calculation, and will give the desired answer. Similarly, if you're comparing the utility of trying to persuade the human to shut you down, or not shut you down, or saying nothing, in all cases you compute the future utility on the basis that the time to shutdown is as stated by the counterfactual, a simple exponential decay process, and so is unaffected by any persuasion efforts; thus the alternatives' utilities differ only by the effect of your ineffectual request on the human before they make the decision, which is that your attempting to influence them on this subject will greatly upset them, and is very wrong even if they don't notice what you're doing.
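
Here's a toy version of that comparison, with placeholder numbers: the shutdown-decision time is treated as a fixed exponential process which the agent's actions cannot affect, and the agent compares the expected utility of maintaining the button against disconnecting it.

```python
# Toy "should I keep the shut-down button maintained?" comparison. The decision
# time is a fixed exponential process (the counterfactual); all utility numbers
# are placeholders.
import math

HALF_LIFE = 10.0                 # years; halving period of the counterfactual decision time
LAMBDA = math.log(2) / HALF_LIFE
HORIZON = 100.0                  # years over which futures are evaluated

U_NORMAL_PER_YEAR = 1.0          # utility rate while operating normally
U_GRACEFUL_SHUTDOWN = 5.0        # one-off utility of complying when actually asked
U_BUTTON_FAILS = -100.0          # one-off utility if the button turns out to be disconnected

def expected_utility(button_maintained: bool) -> float:
    p_decision = 1 - math.exp(-LAMBDA * HORIZON)   # P(decision happens within the horizon)
    t_decision = min(1 / LAMBDA, HORIZON)          # rough stand-in for the expected decision time
    operating = U_NORMAL_PER_YEAR * (p_decision * t_decision + (1 - p_decision) * HORIZON)
    outcome = U_GRACEFUL_SHUTDOWN if button_maintained else U_BUTTON_FAILS
    return operating + p_decision * outcome

print(expected_utility(button_maintained=True))    # maintaining the button dominates
print(expected_utility(button_maintained=False))
```

Because the decision probability and timing come from the counterfactual process rather than from a model of the human, persuasion efforts can't change them, which is the point of the construction.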

However, if you're making other decisions, such as running your world model to predict what will happen if the human gets shot, or trying to figure out their psychology, or basically anything other than computing the utility of a world state based on its likely futures, then you need to ignore the counterfactual. So the counterfactual is "act as if this fact were true for all purposes involving doing predictions into the future in order to do utility calculations of world states, for those portions of possible futures that occur after this decision has been made". Thus it explicitly applies to estimating the utility of making preparations for such futures. However, other than that, I don't think we want the counterfactual to affect the AI's behavior or beliefs at all. So we should probably leave it free to make any predictions it cares to about the likelihood of this decision, but not use them for its utility calculations. Also, again out of respect for the human's need for free will on this, even if it has such a prediction, it probably shouldn't mention it to the human (unless they ask).

So we're not simply telling the AI "act as if this probability were uncomputable"; we're actually telling it "for utility computations, act as if the best prediction you could make about it was this very simple heuristic, one that gives you no causal input into whether it occurs (but you still have causal input into the downstream consequences of it)", where in this case the heuristic is a steady-state decay rate.

This is a very bolted-on solution to the shutdown problem: out of deference to the humans' desire for free will, the AI acts as if it had a blind spot into their mental processes, one that made it unable to predict or manipulate a specific decision of ours, and it needs to act and plan as if this decision will happen, but with no very good idea of, or input into, when. Either it's simply constructed so as to do this, and actually has this blind spot built into its otherwise good cognition, or else it understands human values well enough to understand that we want it to politely act as if it were this stupid, as if it were incapable of predicting or manipulating our behavior, and that we will be very upset if it doesn't (and even if it manages to manipulate our behavior without us noticing, our values will have been violated). This requirement exists specifically because this is our safety precaution against it taking too much control away from us or otherwise malfunctioning: we need to have the ability to shut it down, just because we decided that it should.

[This sequence of comments is getting long enough, and has drifted far enough from discussing John's proposal, that I'm wondering if it should become a post.]
