I don't have a better solution right now, but one problem to note is that this agent will bet heavily that the button state is independent of the human pressing the button. So it could lose money to a different agent that treats the two as correlated, as they in fact are.
That's not necessarily a deal-breaker; we do expect corrigible agents to be inefficient in at least some ways. But it is something we'd like to avoid if possible, and I don't have any argument that that particular sort of inefficiency is necessary for corrigible behavior.
The patch which I would first try is to add another subagent which does not care at all about what actions the full agent takes, and is just trying to make money on the full agent's internal betting markets, using the original non-counterfacted world model. So that subagent will make the full agent's epistemic probabilities sane.
... but then the question is whether that subagent induces button-influencing-behavior. I don't yet have a good argument in either direction on that question.
To clarify:
Subagent2 assumes the shutdown button will be pressed independent of upstream events, i.e. also independent of whether the shutdown button got removed. It would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn't bother to remove the shutdown button because it believes it won't be pressed anyway.
(The usefulness of subagent2 comes from it preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)
Question 1:
I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed, and iirc the problem was that the agent would use the button as an outcome pump by creating mechanisms that would cause the button to be pressed if something went worse than expected, thereby magically decreasing the probability that something goes badly, since the button cannot be pressed.
I haven't thought much about it, but doesn't this proposal have the same failure mode? (And if not, why not?)
Question 2:
(Feel free to ignore this question if the proposal does fall prey to the failure mode in Question 1.)
Is there a probabilistic extension of the proposal which uses 99.99% instead of 100%, but still basically works? (Maybe assuming away some problems with comparing utility functions or the like.) And if not: why not? What is the part that relies on certainty?
Subagent2 assumes the shutdown button will be pressed independent of upstream events, i.e. also independent of whether the shutdown button got removed. It would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn't bother to remove the shutdown button because it believes it won't be pressed anyway.
(The usefulness of subagent2 comes from it preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)
That's correct.
I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed, and iirc the problem was that the agent would use the button as an outcome pump by creating mechanisms that would cause the button to be pressed if something went worse than expected, thereby magically decreasing the probability that something goes badly, since the button cannot be pressed.
I haven't thought much about it, but doesn't this proposal have the same failure mode? (And if not, why not?)
You remember correctly. This proposal avoids that failure mode via the do()-ops. (More generally, that failure mode is an evidential-decision-theory sort of problem, which can be avoided by using causal-style decision theories, as this proposal does.) In particular, note that such outcome pumps require "mechanisms which would cause the button to be pressed"; because of the do()-ops, the agent acts-as-though there is no mechanism which would cause the button to be pressed.
Of course that means this proposal likely has the usual failure modes of causal decision theories (e.g. in Newcomb-style problems); presumably the way to handle that is via a functional-decision-theory style approach, i.e. counterfacting over all simulated instances of the shutdown button.
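To illustrate the contrast, here is a small sketch with a hypothetical toy model (not from the post): an "outcome pump" mechanism that causes the button to be pressed whenever things go badly. Under an evidential reading, conditioning on "button unpressed" makes bad outcomes look rare, which is the exploit; under do(button = unpressed), the causal arrow into the button is cut, so building the pump buys nothing.

```python
import random

def sample_world(build_pump, do_button=None):
    # Toy causal structure: a bad outcome happens with probability 1/2, and
    # an "outcome pump" mechanism, if built, causes the button to be pressed
    # exactly when the outcome is bad.
    bad_outcome = random.random() < 0.5
    if do_button is not None:
        button = do_button  # do()-op: ignore all upstream causes of the button
    else:
        button = build_pump and bad_outcome
    return bad_outcome, button

random.seed(0)

# Evidential view: among worlds where the button ends up unpressed, bad
# outcomes never appear. That is exactly the outcome-pump exploit.
samples = [sample_world(build_pump=True) for _ in range(10_000)]
unpressed = [bad for bad, button in samples if not button]
p_bad_given_unpressed = sum(unpressed) / len(unpressed)

# Causal view: under do(button = unpressed), building the pump leaves the
# probability of a bad outcome untouched (roughly 1/2 here).
do_samples = [sample_world(build_pump=True, do_button=False) for _ in range(10_000)]
p_bad_under_do = sum(bad for bad, _ in do_samples) / len(do_samples)

assert p_bad_given_unpressed == 0.0
assert 0.45 < p_bad_under_do < 0.55
```

The same structure is why the proposal's subagents see no value in any plan routed causally through the button.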
Thanks.
I briefly looked into the MIRI paper (and the section from Eliezer's lecture that starts at 22min) again.
My main guess now is that you're not supposed to make the agent act on false beliefs; that one of the desiderata of the shutdown problem is that your agent shouldn't incur a capability penalty, including taking bad actions based on false beliefs (like taking bad bets, as jessicata mentioned). Your proposal violates that.
The case MIRI considered wasn't to modify the world model so the agent believes that the button cannot be pressed, but instead to construct the utility function in a way that the resulting behavior ends up equivalent, but without the agent having wrong beliefs.
Then there’s the problem of designing the negotiation infrastructure, and in particular allocating bargaining power to the various subagents. They all get a veto, but that still leaves a lot of degrees of freedom in exactly how much the agent pursues the goals of each subagent. For the shutdown use-case, we probably want to allocate most of the bargaining power to the non-shutdown subagent, so that we can see what the system does when mostly optimizing for u_1 (while maintaining the option of shutting down later).
I don't understand what you mean by "allocating bargaining power", given that each agent already has true veto power. Regardless of the negotiation mechanism you set up for them (if it's high-bandwidth enough), or whether the master agent says "I'd like this or that agent to have more power", each subagent could go "give me my proportional (1/n) slice of the pie, or else I will veto everything" (and depending on its prior about how other agents could respond, this will seem net-positive to do).
In fact that's just the tip of the iceberg of individually rational game-theoretic stuff (that messes with your proposal) they could pull off, see Commitment Races.
Brain-storming fixes:
The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for “button is never pressed”, i.e. the button is first pressed at T = ∞). So subagent ∞ maximizes E[u_1 | do(button unpressed at all times), observations], and for all other times T subagent T maximizes E[u_2 | do(button unpressed before T, pressed at T), observations]. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).
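A small sketch (hypothetical interface, just for illustration) of the per-timestep counterfactuals: subagent T evaluates every plan in a model where the button is counterfactually held unpressed before time T and pressed from T onward, and the T = ∞ subagent (`press_time=None` here) holds it unpressed forever.

```python
def subagent_button_intervention(press_time, t):
    """Button state imposed by subagent `press_time`'s do()-op at timestep t."""
    if press_time is None:  # the "button is never pressed" subagent
        return "unpressed"
    return "unpressed" if t < press_time else "pressed"

# Subagent T=3 acts-as-though the button is unpressed at t = 0, 1, 2 and
# pressed from t = 3 onward:
assert [subagent_button_intervention(3, t) for t in range(5)] == [
    "unpressed", "unpressed", "unpressed", "pressed", "pressed"]
```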
Can you explain how this relates to Elliot Thornley's proposal? It's pattern matching in my brain but I don't know the technical details.
I don't know. Thornley's proposal got me thinking about subagents as a tool for corrigibility, but I never understood his properties well enough to say how his subagents relate to the counterfactual-optimizing agents in this proposal.
Interesting idea. Couple of comments.
(1) Your proposal requires each subagent to be very mistaken about the probability of shutdown at each timestep. That seems like a drawback. Maybe it's hard to ensure that subagents are so mistaken. Maybe this mistake would screw up subagents' beliefs in other ways.
(2) Will subagents' veto-power prevent the agent from making any kind of long-term investment?
Consider an example. Suppose that we can represent the extent to which the agent achieves its goals at each timestep with a real number ('utilities'). Represent trajectories with vectors of utilities. Suppose that, conditional on no-shutdown, the Default action gives utility-vector ⟨1, 1, 1, …⟩. The other available action is 'Invest'. Conditional on no-shutdown, Invest gives utility-vector ⟨0, 2, 2, …⟩.
As long as the agent's goals aren't too misaligned with our own goals (and as long as the true probability of an early shutdown is sufficiently small), we'll want the agent to choose Invest (because Invest is slightly worse than the default action in the short-term but much better in the long-term). But Subagent2 will veto choosing Invest, because Subagent2 is sure that shutdown will occur at timestep 2, and so from its perspective, Invest gives ⟨0⟩ whereas the default action gives ⟨1⟩.
Is that right?
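Spelled out with assumed toy numbers, the worry looks like this (assuming, purely for illustration, that a subagent certain of shutdown at timestep T only values utility accrued before T; the proposal itself doesn't guarantee that relationship between the utility functions):

```python
default = [1, 1, 1, 1, 1]  # steady payoff at each timestep (no shutdown)
invest = [0, 2, 2, 2, 2]   # slightly worse now, much better later

def value_for_subagent(trajectory, press_time=None):
    # press_time=None models the "no shutdown" subagent; press_time=2 models
    # a subagent certain the button is pressed at timestep 2 (1-indexed), so
    # only utility from timesteps before 2 counts.
    horizon = trajectory if press_time is None else trajectory[:press_time - 1]
    return sum(horizon)

assert value_for_subagent(invest, press_time=2) == 0   # vetoed by this subagent
assert value_for_subagent(default, press_time=2) == 1
assert value_for_subagent(invest) == 8                 # despite being better
assert value_for_subagent(default) == 5                # under no-shutdown
```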
Re: (2), that depends heavily on how the "shutdown utility function" handles those numbers. An "invest" action which costs 1 utility for subagent 1, and yields 10 utility for subagent 1 in each subsequent step, may have totally unrelated utilities for subagent 2. The subagents have different utility functions, and we don't have many constraints on the relationship between them.
Re: (1), yup, agreed.
The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:
- Does not want to manipulate the shutdown button
- Does respond to the shutdown button
- Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
From a quick read, your proposal seems closely related to Jessica Taylor's causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review https://arxiv.org/abs/2305.19861
I think you're right that the central problems remaining are in the ontological cluster, as well as the theory-practice gap of making an agent that doesn't override its hard-coded false beliefs.
But less centrally, I think one issue with the proposal is that the sub-agents need to continue operating in worlds where they believe in a logical contradiction. How does this work? (I think this is something I'm confused about for all agents and this proposal just brings it to the surface more than usual).
Also, agent1 and agent2 combine into some kind of machine. This machine isn't VNM rational. I want to be able to describe this machine properly. Pattern matching, my guess is that it violates independence in the same way as here. [Edit: Definitely violates independence, because the combined machine should choose a lottery over <button-pressed> over certainty of either outcome. I suspect that it doesn't have to violate any other axioms].
I think one issue with the proposal is that the sub-agents need to continue operating in worlds where they believe in a logical contradiction... I think this is something I'm confused about for all agents and this proposal just brings it to the surface more than usual
+1 to this. For the benefit of readers: the "weirdness" here is common to CDT agents in general. In some sense they're acting-as-though they believe in a do()-operated model, rather than their actual belief. Part of the answer is that the do()-op is actually part of the planning machinery, and part of the answer is Abram's CDT=EDT thing, but I haven't grokked the whole answer deeply enough yet to see how it carries over to this new use-case.
Definitely violates independence, because the combined machine should choose a lottery over <button-pressed> over certainty of either outcome.
Assuming I'm interpreting you correctly, this is non-obvious, because the lottery-choice will be one of many things the two agents negotiate over. So it could be that the negotiation shakes out to the certainty option, with some other considerations counterbalancing elsewhere in the negotiation.
More generally, insofar as the argument in Why Not Subagents? generalizes, the subagents should aggregate into an expected utility maximizer of some sort. But the probabilities of the resulting agent don't necessarily match the epistemic probabilities of the original model - e.g. the agent's probability on button state mostly reflects the relative bargaining power of the subagents rather than an epistemic state.
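Insofar as that aggregation claim holds, here is a toy illustration with assumed utilities, deliberately simplified so both subagents share one utility function over action/button-state pairs (not the proposal's actual setup): the aggregate then ranks actions exactly like a single EU maximizer whose "probability of pressed" is the normalized bargaining weight, a number set by negotiation rather than evidence.

```python
def u(action, pressed):
    # Illustrative (made-up) utility table shared by both subagents.
    table = {("shutdown", True): 10, ("shutdown", False): 0,
             ("work", True): -5, ("work", False): 8}
    return table[(action, pressed)]

def aggregate_value(action, w1, w2):
    # Subagent 1 is certain the button is unpressed, subagent 2 certain it is
    # pressed; their evaluations are combined with bargaining weights w1, w2.
    return w1 * u(action, pressed=False) + w2 * u(action, pressed=True)

def eu_value(action, p_pressed):
    # A single expected-utility maximizer with credence p_pressed.
    return (1 - p_pressed) * u(action, False) + p_pressed * u(action, True)

w1, w2 = 0.7, 0.3
for a in ("shutdown", "work"):
    # Bargaining weight plays the formal role of probability.
    assert abs(aggregate_value(a, w1, w2) - eu_value(a, w2 / (w1 + w2))) < 1e-9
```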
You explicitly assume this stuff away, but I believe under this setup that the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).
I also note that if one agent becomes way way smarter than the other, that this balance may not work out.
Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!
Overall a very interesting idea.
Curated. It's nice to see a return to the problems of yore, and I think this is a nice incremental proposal. Bringing in causal counterfactuals seems like a neat trick (with lots of problems, as discussed in the post and the comments), and so does bringing in some bargaining theory.
I have lots of confusions and questions, like
so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting”
doesn't make sense to me yet, as it seems easy for the utility functions / belief states to all prefer killing humans quickly, even if the humans don't affect the shutdown button exactly. Or the aside on bargaining with non-causally-counterfacting agents. But they're confusions and questions that afford some mulling, which is pretty cool!
Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.
I disagree.
Suppose there is a paperclip lottery: if the AI wins, it gets loads of clips, and it becomes harder to shut down.
The AI has been entered by someone else, not playing isn't an option.
Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.
To the paperclip AI, this is a good deal. The conditional expectation acts as a sort of average of all scenarios where the button isn't pressed. And this scenario has more clips than most, so increases the average.
For the shutdown AI, this is an unusually bad scenario, and taking it out of their expectation also increases conditional expected utility.
So both subagents agree on a plan that conditionally manipulates the button.
Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.
I'm not seeing how it is possible for the agents to imagine that? Both of them expect that they have no influence whatsoever over whether the button is pressed, because there's nothing in their decision-driving models which is causally upstream of the button.
Suppose these 3 scenarios are judged as equally likely.
A) Button pressed. U_1=U_2=0
B) Button not pressed. U_1=U_2=0
C) AI gets control of button. U_1=6, U_2=-6
Suppose both agents agree that in option C, they should press the button. (this is the only 1 bit of influence they have in this toy problem)
Then agent 1 has a "certainty" of B. E(U_1|not pressed)=E(U_1|B)=0
Agent 2 has remaining uncertainty: E(U_2|pressed) = 1/2 (E(U_2|A) + E(U_2|C)) = 1/2 (0 + (−6)) = −3
Change it around, make AI decide not to press the button in case C, and
E(U_1|not pressed)=3, E(U_2|pressed)=0.
There is no "AI gets control of button" option, from the perspective of either subagent. Both agents would look at option c, stick it into their do()-op on button state, and then act-as-though option C would not give any control at all over the button state.
I think you are attempting to do the math as though the do()-op were instead just a standard conditional (i.e. EDT-style rather than CDT-style)?
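To make the contrast concrete, here is a sketch using the toy numbers from the exchange above: the standard-conditional (EDT-style) calculation reproduces the manipulation incentive, while the do()-style calculation makes the press-in-C choice a matter of indifference.

```python
# Three equally likely scenarios, as above.
# A: button pressed, U1 = U2 = 0
# B: button not pressed, U1 = U2 = 0
# C: AI wins the lottery and controls the button, U1 = 6, U2 = -6
scenarios = {"A": {"U1": 0, "U2": 0},
             "B": {"U1": 0, "U2": 0},
             "C": {"U1": 6, "U2": -6}}

def button_pressed(s, press_in_C):
    # The agents' one bit of influence: whether to press the button in C.
    return {"A": True, "B": False, "C": press_in_C}[s]

def edt_value(utility, want_pressed, press_in_C):
    # Standard conditional: the choice in C changes which scenarios fall
    # under "pressed"/"not pressed", so it moves the conditional average.
    vals = [v[utility] for s, v in scenarios.items()
            if button_pressed(s, press_in_C) == want_pressed]
    return sum(vals) / len(vals)

def cdt_value(utility, press_in_C):
    # Under do(button = X) the button is severed from the policy, so every
    # scenario stays in the expectation and press_in_C changes nothing.
    vals = [v[utility] for v in scenarios.values()]
    return sum(vals) / len(vals)

# EDT: both subagents score "don't press in C" strictly higher, which is
# exactly the conditional button manipulation.
assert (edt_value("U1", False, False), edt_value("U1", False, True)) == (3.0, 0.0)
assert (edt_value("U2", True, False), edt_value("U2", True, True)) == (0.0, -3.0)
# CDT: the choice in C makes no difference to either subagent.
assert cdt_value("U1", True) == cdt_value("U1", False)
assert cdt_value("U2", True) == cdt_value("U2", False)
```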
Yes. I was assuming a standard conditional for the button.
I can't currently see any flaws with the CDT style. Other than.
I think this frames the problem in the wrong way. We don't actually want an AGI with two utility functions, shutdown plus some other goals like CEV.
We just want an AGI that does what we want all the time, including shutting down if that's what we currently want it to do.
The more I think about it, the more I think that not only is Corrigibility or DWIM an attractive primary goal for AGI, but it's also way easier than anything like CEV. It's just way simpler to convey the concept of "do what this guy says" than all of human ethics, let alone a perfect set of ethics for all times. You can change your mind any time rather than having to state up front what you want for all time. For all of those reasons, I think this is what people will actually try for as the first AGI alignment target.
This "just do what I say all the time" alignment goal also solves the shutdown problem. There's only one goal, so no need to resolve the conflict between two utility functions or two goals.
I think you are missing the goal of solving the shutdown problem. If you solve the shutdown problem, you can, with caveats, allow yourself to fail at other parts of alignment. You can say "shit, this superintelligence is not, actually, doing what I mean, and is probably going to kill me", shut it down, and try again, entering the realm of iterative design.
The reason why you want a separate mechanism for shutdown is that a separate mechanism makes the failure of multiple mechanisms less likely. If you train an LLM only to DWIM and it suddenly fails, it is likely to ignore your shutdown command too.
I don't think I'm missing the primary point, although I agree that you would want a separate shutdown mechanism in an ideal world. But the way MIRI posed the problem is that "you can't get an AGI to shut down, and this is awful". But you can. They and John Wentworth are making the problem hard by assuming an alignment goal that nobody is likely to try for in the real world. People are going to want an AGI that just does what they tell it to, rather than one that makes galaxy-brained extrapolations about what they, and perhaps all of humanity or all future sentients, "really want".
I think there's a real possibility that an AGI misunderstands your complex requests but still accepts and understands the simple shutdown request. This allows you to perform more alignment work if something seems less than ideal in its behavior. That's also the point of having a shutdown mechanism in the MIRI formulation of the problem.
The issue is that, by default, an AGI is going to make galaxy-brained extrapolations in response to simple requests, whether you like that or not. It's simply part of figuring out what to do – translating its goals all around its world-model, propagating them up the abstraction levels, etc. Like a human's decision where to send job applications and how to word them is rooted in what career they'd like to pursue is rooted in their life goals is rooted in their understanding of where the world is heading.
To our minds, there's a natural cut-off point where that process goes from just understanding the request to engaging in alien moral philosophy. But that cut-off point isn't objective: it's based on a very complicated human prior of what counts as normal/sane and what's excessive. Mechanistically, every step from parsing the wording to solving philosophy is just a continuous extension of the previous ones.
"An AGI that just does what you tell it to" is a very specific design specification where we ensure that this galaxy-brained extrapolation process, which an AGI is definitely and convergently going to want to do, results in it concluding that it wants to faithfully execute that request.
Whether that happens because we've attained so much mastery of moral philosophy that we could predict this process' outcome from the inputs to it, or because we figured out how to cut the process short at the human-subjective point of sanity, or because we implemented some galaxy-brained scheme of our own like John's post is outlining, shouldn't matter, I think. Whatever has the best chance of working.
And I think somewhat-hacky hard-coded solutions have a better chance of working on the first try than the sort of elegant solutions you're likely envisioning. Elegant solutions require a well-developed theory of value. Hacky stopgap measures only require knowing which pieces of your software product you need to hobble. (Which isn't to say they require no theory. Certainly the current AI theory is so lacking we can't even hack any halfway-workable stopgaps. But they provide an avenue of reducing how much theory you need, and how confident in it you need to be.)
I agree with all of that.
In particular, the main point of this proposal is that it does not require any mastery of ethical philosophy, just the rough knowledge of what humans tend to mean by what they say that LLMs already have. I see this as more of a hacky stopgap rather than an elegant solution.
I think maybe I sound naive phrasing it as "the AGI should just do what we say", as though I've wandered in off the street and am proposing a "why not just..." alignment solution. I'm familiar with a lot of the arguments both for why corrigibility is impossible, and for why it's maybe not that hard. I believe Paul Christiano's use of corrigibility is similar to what I mean.
A better term than "just do what I tell it" is "do what I mean and check". I've tried to describe this in Corrigibility or DWIM is an attractive primary goal for AGI. Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function. But adding an explicit goal of checking when consequences are large, or when it's uncertain about intent, is another pragmatic, relatively hard-coded measure (at least in my vision of language model agent alignment) that reduces the risk of the agent acting on its galaxy-brained extrapolation of what you meant.
Whether we can implement this is very dependent on what sort of AGI is actually built first. I think that's likely to be some variant of language model cognitive architecture, and I think we can rather easily implement it there. This isn't a certain alignment solution; I propose more layers here. This isn't the elegant, provable solution we'd like, but it seems to have a better chance of working than any other actual proposal I've seen.
I think maybe I sound naive phrasing it as "the AGI should just do what we say", as though I've wandered in off the street and am proposing a "why not just..." alignment solution
Nah, I recall your takes tend to be considerably more reasonable than that.
I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don't agree that "rough knowledge of what humans tend to mean" is sufficient.
The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven't figured out yet.
These structures might end up immediately relevant to whatever command we give, on the AI's better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.
For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decides that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that[1], but you get the idea. It may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.
The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command's complexity.
Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function
That doesn't work, though, if taken literally? I think what you're envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that'd work.
My money's on our understanding of what we mean by "what we mean" being hopelessly confused, and that causing problems. Unless, again, we've figured out how to specify it in a mathematically precise manner – unless we know we're not confused.
I think acausal attacks are a kinda galaxy-brained example; I have a better one. Imagine that you are training a superintelligent programmer. It writes code, you evaluate it and analyse vulnerabilities in the code. Reward is calculated based on quality metrics, including the number of vulnerabilities. At some point your model becomes sufficiently smart to notice that you don't see all vulnerabilities, because you are not a superintelligence. That is, at some point the ground-truth objective of the training process becomes "produce code with vulnerabilities that only a superintelligence can notice" instead of "produce code with no vulnerabilities", because you look at the code, think "wow, such good code, no vulnerabilities", and assign maximum reward, while the code is actually full of them.
Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function
That doesn't work, though, if taken literally? I think what you're envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that'd work.
I think the bigger problem here is what happens when the agent ends up with an idea of "what we mean/intend" which is different from what we mean/intend, at which point the agent's method of checking will diverge from our intended methods of checking.
quetzal_rainbow's example is one case of that phenomenon.
I think the bigger problem here is what happens when the agent ends up with an idea of "what we mean/intend" which is different from what we mean/intend
Agreed; I did gesture at that in the footnote.
I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human "means" actually still has to route through defining how we want to handle moral philosophy/value extrapolation.
E. g., suppose the AGI's operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?
There's quite a lot of different ways by which you can slice the idea. There's probably a way that corresponds to the intuitive meaning of "do what I mean", but maybe there isn't, and in any case we don't yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what "DWIM" means doesn't solve anything.)
And then, because of the general "unknown-unknown environmental structures" plus "compounding errors" problems, picking the wrong definition probably kills everyone.
I think that assuming "if my approach fails, it fails in a convenient way" is not looked upon favorably by Mr. Murphy's line of reasoning, absent some rigorous guarantees.
I take your point that it would be better to have a reliable shutdown switch that's entirely separate from your alignment scheme, so that if it fails completely you have a backup. I don't think that's possible for a full ASI, in agreement with MIRI's conclusion. It could be that Wentworth's proposal would work, I'm not sure. At the least, the inclusion of two negotiating internal agents would seem to impose a high alignment tax. So I'm offering an alternative approach.
I'm not assuming my approach can fail in only that one way; I'm saying it does cover that one failure mode, which seems to cover part of what you were asking for above:
You can say "shit, this superintelligence is not, actually, doing what I mean, and probably is going to kill me", shutdown it and try again, entering the realm of iterative design.
If the ASI has decided to just not do what I mean and to do something else instead, then no, it won't shut down. Alignment has failed for technical reasons, not theoretical ones. That seems possible for any alignment scheme. But if it's misunderstood what I mean, or I didn't think through the consequences of what I mean well enough, then I get to iteratively design my request.
I’m confused. Let’s assume that the button probably isn’t pressed at the start. Seems quite likely that the first agent proposes building a sub-agent that maximally pursues its utility if the button is pressed in the first round and maximally pursues the second agent’s utility if it is not pressed in the first round. The second agent believes that the button is certainly not pressed during the first round, so it has no reason to negotiate further. If the button doesn’t get triggered in the first round (which it likely won’t, given the sub-agents have had no time to do anything), it will forever after pursue the first agent’s utility.
The button plays no role in this utility, so instrumental incentives mean the sub-agent will destroy it sooner or later. This seems like it breaks the system.
Now, this isn’t the only equilibrium. The first agent believes the button will always be pressed, so it has no incentive to argue for the criterion being whether the button is pressed in the first round vs. needing to have been pressed in this round and all previous rounds. On the other hand, instead of balance, it seems likely that one agent or the other creates a subagent that clobbers the other’s utility, with that agent assuming that this only happens in a world that never occurs.
I think the initial (2-agent) model only has two time steps, ie one opportunity for the button to be pressed. The goal is just for the agent to be corrigible for this single button-press opportunity.
I don't think this approach is going to generalize to alignment because in order to detailedly control agents in this way, you need to give exponentially many agents veto power, which means that even a small probability of veto from an individual agent will lead to certainty of veto from some agent. That said, this plausibly solves the shutdown problem.
I understand that this is not the goal but I thought it would be relevant to consider anyway, if the hope is to build on top of this.
But for the continuous limit the subagents become similar to each other at the same rate as they become more numerous. It seems intuitive to me that with a little grinding you could get a decision-making procedure whose policy is an optimum of an integral over "subagents" who bet on the button being pushed at different times, and so the whole system will change behavior upon an arbitrarily-timed press of the button.
Except I think in continuous time you probably lose guarantees about the system not manipulating humans to press/not press the button. Unless maybe each subagent believes the button can only be pressed exactly at their chosen time. But this highlights that maybe all of these counterfactuals give rise to really weird worlds, that in turn will give rise to weird behavior.
I could buy something like this with the continuous time limit.
I just mean if you want to extend this to cover things outside of the shutdown problem. Like you might want to request the AI to build you a fusion power plant, or cook you a chocolate cake, or make a company that sells pottery, or similar. You could have some way of generating a utility function for each possibility, and then generate subagents for all of them, but if you do this you've got an exponentially large conjunction.
What happens if you shut down power to the AWS or Azure console powering the Foundation model? Wouldn't this be the easiest way to test various hypotheses associated with the Shutdown Problem in order to either verify it or reject it as a problem not worth sinking further resources into?
Moreover, you claimed that they wouldn't care about manipulating button state. But surely they care about the signal to their operating harness that relays the button state, and they don't have certainty about that. If they think the button state is certain but the signal isn't, they may want to manipulate that.
Ok, you say, let's just bake the signal about the button into the problem rather than the state of the button. Ah, but in that case, when the AIs agree to copy themselves together (as we saw they would), they don't copy over the part of the system that actually transmits the signal, since they're convinced the signal will always magically come. And now, for the copies, the signal becomes broken in reality, and the button does nothing for the copies.
Yup, the full agent (including the two subagents) is happy to copy itself. That is a feature, not a bug. The important thing is that the copy also respects the shutdown signal.
The agent will not manipulate the signal which relays the button state except to make it more accurate, because both subagents want an accurate reading of the button state. This is because both subagents act-as-though the button itself will definitely be in their favor, and therefore any possibility of an inaccurate reading is bad for them.
Are you aware of how Holtman solved MIRI's formulation of the shutdown problem in 2019? https://arxiv.org/abs/1908.01695, my summary notes at https://docs.google.com/document/u/0/d/1Tno_9A5oEqpr8AJJXfN5lVI9vlzmJxljGCe0SRX5HIw/mobilebasic
Skimming through your proposal, I believe Holtman's correctly-constructed utility function correction terms would work for the scenario you describe, but it's not immediately obvious how to apply them once you jump to a subagent model.
Hadn't seen that before. Based on an hour or so of sorting through it, I think it basically doesn't work.
The most legible problem I've been able to identify so far is that it will create shutdown-resistant successor agents (despite Holtman's claims to the contrary). The problem here is that, prior to the shutdown button being pressed, the agent acts-as-though it's just optimizing the normal-mode utility function and expects to continue optimizing it indefinitely; it has no incentive to maintain the option of switching to the shutdown utility function later, because at pre-press timesteps the correction term cancels out everything to do with the shutdown utility and just makes the value function mimic the normal-utility-maximizing value function. And a normal-utility maximizer has no reason to make its successor agents shutdownable.
(Holtman claims that section 10 proves that some "emergent incentive" prevents this. I haven't sorted through the notational mess enough to figure out where that proof goes wrong, but I pretty strongly expect there's a mistake there, with my modal guess being that he forgot to account for the correction term in the value function, which cancels out all of the shutdown utility's contributions to it. My second-most-probable guess is that the correction term is doing something weird to make the section 10 proof work, but then it breaks things in the earlier sections, where a different quantity is used in its place.)
Also, two complaints at a meta level. First, this paper generally operates by repeatedly slapping on not-very-generalizable-looking patches, every time it finds a new failure mode. This is not how one ends up with robustly-generalizable solutions; this is how one ends up with solutions which predictably fail under previously-unconsidered conditions. Second, after coming up with some interesting proof, it is important to go back and distill out the core intuitive story that makes it work; the paper doesn't really do that. Typically, various errors and oversights and shortcomings of formulation become obvious once one does that distillation step.
Goal
First things first: this is not (yet) aimed at solving the whole corrigibility problem, or even the whole shutdown problem.
The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which shuts down when the shutdown button is pressed, which does not try to manipulate the button (or the humans deciding whether to press it), and which maintains the option of shutting down later, e.g. does not create child agents which would resist shutdown.
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
Both this proposal and the argument that it works are simple enough to be easy to extend. The main hope of this post is that other people will pick up the proposal and run with it, beyond the very simple toy problem considered here, to more complex problems (and more complex failure modes). Insofar as this proposal gets us past the previous barriers, we want to find out what the next barriers will be.
We will not be very formal here.
Setup
The agent takes actions in two timesteps. First, the agent takes an action (possibly given some observations). In between steps, a shutdown button is either pressed or not-pressed. Then, in the second timestep, the agent observes the button state (and possibly other things), then takes another action.
Note that each “action” could itself be a rather large composite of many subactions, like e.g. spinning up child-agents.
We’ll assume that we have two utility functions on hand: utility function u2 incentivizes shutdown, and utility function u1 provides non-shutdown behavior. Of course we don’t actually know how to construct two such utility functions; we’re using these assumptions because they’re basically the same assumptions used in the MIRI paper. (And MIRI presumably used these assumptions because they simplify the problem a lot, while still maintaining enough structure to demonstrate the barriers MIRI found.)
Very roughly speaking, we want the agent to follow utility function u1 until the shutdown button is pressed, and then follow u2. A little less roughly speaking, we want the agent to maintain optionality: if it were just to follow u1 until the shutdown button were pressed, then e.g. it might create u1-maximizing child agents which would resist shutdown, or more generally it might precommit in ways which decrease its ability to switch to u2 after the button is pressed, etc, because those things would all likely result in higher u1.
Also, of course, we want the agent to not manipulate the shutdown button, the humans deciding whether to press the button, etc.
Proposal
First key piece: we’ll use a subagent architecture. The AI itself contains two expected-utility-maximizing subagents: one which pursues u1, and one which pursues u2. Each timestep, both subagents must agree on which action the AI will take, otherwise the AI takes some “default” action. (Put differently: each agent has a veto, and if either vetoes then the AI takes the default action.)
Second key piece: each of the two subagents maximizes expected utility under a different counterfactual.
So conceptually:
Subagent 1 maximizes E[u1 | do(button = unpressed), observations].
Subagent 2 maximizes E[u2 | do(button = pressed), observations].
We will assume that the AI internals include infrastructure for the two subagents to negotiate with each other, form binding contracts, etc. We will also assume that both subagents are strong idealized reasoners in their own right. Finally, we’ll assume that the default option is not the best-case outcome for either subagent.
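The veto mechanism can be sketched in a few lines of code. Everything here is my own illustrative scaffolding, not part of the proposal's formalism: the action names, the toy expected-utility numbers, and the tie-breaking rule (letting subagent 1 pick among mutually acceptable actions) are all assumptions.

```python
# Toy sketch of the two-subagent veto architecture. All names and
# numbers are hypothetical, for illustration only.

DEFAULT = "do_nothing"

def agent_action(candidates, eu1, eu2):
    """Both subagents must weakly prefer a candidate to the default
    action; otherwise the default is taken (each subagent has a veto)."""
    acceptable = [a for a in candidates
                  if eu1(a) >= eu1(DEFAULT) and eu2(a) >= eu2(DEFAULT)]
    if not acceptable:
        return DEFAULT
    # How bargaining selects among acceptable actions is a free
    # parameter; here we arbitrarily give subagent 1 the most
    # bargaining power and let it pick.
    return max(acceptable, key=eu1)

# Hypothetical expected utilities, each evaluated under that subagent's
# own counterfactual: subagent 1 under do(button unpressed), subagent 2
# under do(button pressed).
eu1 = {"do_nothing": 0.0, "build_u1_maximizer": 10.0, "work_cautiously": 6.0}.get
eu2 = {"do_nothing": 0.0, "build_u1_maximizer": -5.0, "work_cautiously": 1.0}.get

# The pure-u1-maximizing child agent does worse than default on u2, so
# subagent 2 vetoes it; the cautious action passes both vetoes.
print(agent_action(["build_u1_maximizer", "work_cautiously"], eu1, eu2))
```

This makes the "maintain option value" argument from the next section concrete: the only actions that survive both vetoes are (weak) Pareto improvements over the default under both utilities.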
Arguments
First claim: the two subagents bet all of their influence against each other on the outcome of the button press. In other words, subagent 1 agrees to follow subagent 2’s preferences if the button comes out pressed, and vice-versa. Why? In general, subagent 1 acts-as-though the button is 100% certain to come out unpressed, subagent 2 acts-as-though it’s 100% certain to come out pressed, so both act-as-though this is a great deal for them. Once the button-state is resolved, the subagent which was correct will fully control the agent’s actions - i.e. the agent will act fully according to u1 in timestep 2 if the button is unpressed, and fully according to u2 in timestep 2 if the button is pressed.
Second claim: in the first timestep, whatever action the agent takes must have at least as high expected value as the default action, under both u1 and u2. That means maintaining option value. For instance, the agent would presumably not create a child agent which just maximizes u1, because that would presumably do worse than default on u2, so subagent 2 would veto it. (Though the agent could create a child agent which just maximized u1, if somehow that did not do worse on u2, which seems reasonable.)
Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state. While this sounds intuitive, it is the least confident of the claims; it is not totally obvious that the agent won’t acquire some incentive to manipulate the button via e.g. bets with other external agents with non-counterfacted models.
Generalization
One simple first-step generalization: add more timesteps. Maybe we want an AI which runs for a while, and a shutdown button which could be pressed at any time.
The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for "button is never pressed", i.e. the button is first pressed at t=∞). So subagent ∞ maximizes E[u1 | do(∀t: button_t = unpressed), observations], and for all other times subagent T maximizes E[u2 | do(∀t<T: button_t = unpressed, button_T = pressed), observations]. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).
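The indexing of subagents by first-press time can be sketched as follows. The horizon length, the 0-indexed timesteps, and the convention that the button stays pressed once pressed are all assumptions of mine for the sake of a concrete example.

```python
# Sketch of the multi-timestep generalization: one subagent per possible
# first-press time T, each counterfacting on a full button schedule.
# Conventions (my assumptions): timesteps are 0-indexed, and pressing is
# absorbing, i.e. the button stays pressed from time T onward.

import math

T_MAX = 4  # toy horizon

def button_schedule(T, horizon=T_MAX):
    """Counterfactual button readings assumed by subagent T.
    T = math.inf is the 'button is never pressed' subagent."""
    return ["pressed" if t >= T else "unpressed" for t in range(horizon)]

# One subagent for each possible first-press time, plus the never-pressed one.
subagents = {T: button_schedule(T) for T in [*range(T_MAX), math.inf]}

# Subagent inf maximizes E[u1 | do(its schedule), observations];
# every other subagent T maximizes E[u2 | do(its schedule), observations].
print(subagents[math.inf])  # ['unpressed', 'unpressed', 'unpressed', 'unpressed']
print(subagents[1])         # ['unpressed', 'pressed', 'pressed', 'pressed']
```

One design note: the number of subagents grows with the number of possible press times, which foreshadows the "exponentially large conjunction" worry raised in the comments once the interface carries richer instruction sequences.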
This points towards a general approach to use this style of agent for subproblems of corrigibility.
We have some interface to our hopefully-corrigible agent - in this case the shutdown button. We want the agent to be controllable via that interface, i.e. the interface steers what the agent does and the agent generally doesn’t try to manipulate the interface-state or whatever’s causally upstream of the interface. The agent “just follows instructions” from the interface, so to speak.
To apply the same sort of approach as this shutdown proposal, we design the agent to consist of many subagents, each of which counterfacts on a different interface-outcome. For each different sequence of “instructions” which could come from the interface over time, there’s a different subagent with a different utility and a different counterfactual.
Even more generally: one hypothesis I associate with the Embedded Agents sequence is that some kind of corrigibility is a standard feature of embedded agents. Insofar as an agent is made of parts, those parts need to be corrigible in order for the agent as a whole to “not fight itself” internally. So this whole setup potentially points to a general feature of agents made of parts. For instance, I could imagine an argument along the lines of:
Shortcomings/Remaining Problems
First, there are various standard alignment problems which this proposal isn’t even attempting to address. Ontological issues are one large class of examples: how do we make sure that the subagents all agree on what “the button being pressed/unpressed” looks like, such that their bets on the outcome resolve properly? And how do we make sure that their notion of “button pressed/unpressed” matches the human operator’s notion, so that the button does what the human thinks it does? How do we make sure that the subagents’ (high-level abstract) counterfactual over the button state counterfacts the way we intend? How do we make sure that’s all stable under ontology shifts, as the subagents learn more?
This proposal is also not even attempting to make the agent actively maintain the function of the shutdown button or the human operators. Under this proposal, if the AI takes some actions which happen to kill the human operators as a side-effect, it doesn’t particularly care.
Then there’s the general class of “subagent break out” problems - i.e. how do we ensure that one or both subagents actually negotiate over control of the full agent’s actions, rather than hacking out and taking independent action in the world?
Another large class of problems we’ve assumed away are the design of the utility functions. How the heck do we make utility functions that do what we want? (Or non-utility-function goal structures, if we want to go even more general.) That brings in all the usual value alignment problems, though they’re somewhat easier than usual: we’re not necessarily aligning the goals to something as complex as human values. Corrigibility is the point of the proposal, so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting”. That does, however, mean that the “shutdown” utility function/goal-structure definitely needs to work.
The proposal also adds a couple new problems. First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.
Then there’s the problem of designing the negotiation infrastructure, and in particular allocating bargaining power to the various subagents. They all get a veto, but that still leaves a lot of degrees of freedom in exactly how much the agent pursues the goals of each subagent. For the shutdown use-case, we probably want to allocate most of the bargaining power to the non-shutdown subagent, so that we can see what the system does when mostly optimizing for u1 (while maintaining the option of shutting down later).
Finally, of course, there’s the usual problems of unknown unknowns and getting this all across the theory-practice gap.
Thank you to @EJT and @Sami Petersen for discussion and proposals which fed into this.