Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I recently had a chance to look into surrogate goals and safe Pareto improvements, two recent proposals for avoiding the realization of threats in multiagent scenarios. In this report, I formalize some objections to these proposals.

EDIT: by discussing objections to SG/SPI, I don't mean to imply that the objections are novel or unknown to the authors of SG/SPI. (The opposite is true, and most of them even explicitly appear in the corresponding posts/paper.) Rather, I wanted to look at some of the potential issues in more detail.

Introduction

The surrogate goals (SG) idea proposes that an agent might adopt a new seemingly meaningless goal (such as preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm or really hating being shot by a water gun) to prevent the realization of threats against some goals they actually value (such as staying alive) [TB1, TB2]. If they can commit to treating threats to this goal as seriously as threats to their actual goals, the hope is that the new goal gets threatened instead. In particular, the purpose of this proposal is not to become more resistant to threats. Rather, we hope that if the agent and the threatener misjudge each other (underestimating the commitment to ignore/carry out the threat), the outcome (Ignore threat, Carry out threat) will be replaced by something harmless.

Safe Pareto improvements (SPIs) [OC] is a formalization and a generalization of this idea. In the straightforward interpretation, the approach applies to a situation where an agent delegates a problem to their representative --- but we could also imagine “delegating” to a future self-modified version of oneself. The idea is to give instructions like “if other agents you encounter also have this same instruction, replace ‘real’ conflicts with them with ‘mock’ conflicts, in which you all act like you would in a real conflict, but bad outcomes are replaced by less harmful variants”. Some potential examples would be fighting (should a fight arise) to the first blood instead of to the death or threatening diplomatic conflict instead of a military one. In the SG setting, SPI could correspond to a joint commitment to only use water-gun threats while behaving as if the water gun was real.

In the following text, we first describe SPI in more detail and focus on framing the setting to which it applies and highlighting the assumptions it makes. We then list a number of potential objections to this proposal. The question we study in most detail is what is the right framing for SPI. We also briefly discuss how to apply SPI to imperfect information settings and how it relates to bargaining.

Our tentative conclusion is that (1) we should not expect SG and SPI to be a silver bullet that solves all problems with threats but (2) neither should we expect them to never be useful. However, since our results mostly take the form of observations and examples, they should not be viewed as final. Personally, the author believes[1] that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations. (3) However, if true, this would not mean that investigating SPI would be pointless; quite the contrary. For one, our AIs can only use “things like SPI” if we actually formalize the approach. Second, since AIs can be better at making credible commitments and being transparent to each other, the approach might work better for them.

Framing and Assumptions

We first give a description that aims to keep close to the spirit of [OC]. Afterwards, we follow with a more subjective interpretation of the setting.

SPI: Summary Attempt

Formally, safe Pareto improvements, viewed as games, are defined as follows: find some “subgame” S of an original (normal-form) game G, ie, a game that is like the original game, except that some actions are forbidden and some outcomes have “fake” payoffs (ie, different from what they actually are). Do this in such a way that (1) S is “strategically equivalent” [2] to G --- that is, if we iteratively remove strictly dominated strategies from G, we get a game G’ that is isomorphic to S; and (2) whenever this isomorphism maps an outcome o in G’ onto outcome p in S, the actual (“non-fake”) payoff corresponding to p (in G) will Pareto-improve[3] on the payoff corresponding to o (in G). This will mean that S is a safe Pareto improvement on G with respect to any policy profile that (A_ISO) plays isomorphic games isomorphically and (A_SDS) respects the iterated removal of strictly dominated strategies. (But we could analogously consider SPIs with respect to a narrower class of policies.)
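To make (A_SDS) concrete, here is a minimal Python sketch (our own representation and function names, not from [OC]) of iterated removal of strictly dominated pure strategies in a two-player normal-form game, with payoffs stored as a dict mapping (row action, column action) to a payoff pair:

```python
def payoff_of(payoff, player, own, opp):
    """Payoff to `player` when they play `own` and the opponent plays `opp`."""
    return payoff[(own, opp)][0] if player == 0 else payoff[(opp, own)][1]

def strictly_dominated(action, own_actions, opp_actions, payoff, player):
    """True if some alternative pure strategy is strictly better against
    every remaining pure strategy of the opponent."""
    return any(
        all(payoff_of(payoff, player, alt, o) > payoff_of(payoff, player, action, o)
            for o in opp_actions)
        for alt in own_actions if alt != action
    )

def iterated_removal(row_actions, col_actions, payoff):
    """(A_SDS): repeatedly delete strictly dominated pure strategies."""
    rows, cols = list(row_actions), list(col_actions)
    changed = True
    while changed:
        changed = False
        for r in list(rows):
            if len(rows) > 1 and strictly_dominated(r, rows, cols, payoff, 0):
                rows.remove(r)
                changed = True
        for c in list(cols):
            if len(cols) > 1 and strictly_dominated(c, cols, rows, payoff, 1):
                cols.remove(c)
                changed = True
    return rows, cols
```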

 

| Blackmail Game | Threaten (gun) & commit | Bluff (gun) | Threaten (water gun) & commit | Bluff (water gun) |
|---|---|---|---|---|
| Ignore threat | -100, -20 | 0, -3 | -1, -1 | 0, 0 |
| Give in (real-gun only) | -10, 10 | -10, 7 | -1, -1 | 0, 0 |
| Give in always | -10, 10 | -10, 7 | -10, 10 | -10, 10 |

| SPI on the Blackmail Game | Threaten (water gun) & commit (but seems like Threaten (gun) & commit) | Bluff (water gun) (but seems like Bluff (gun)) |
|---|---|---|
| Ignore threat | -1, -1 (but seems like -100, -20) | 0, 0 (but seems like 0, -5) |
| Give in always (but seems like Give in (real gun only)) | -10, 10 (but seems like -10, 7) | -10, 10 (but seems like -10, 7) |

Figures 1 & 2: A blackmail game and an SPI on it (under the assumption that by default, the threatener will never use water-gun threats and the threatenee will never give in to them). 
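To make the Pareto condition concrete, here is a minimal sketch (our own encoding; payoffs taken from Figure 2) that checks, outcome by outcome, that the actual payoff of each SPI outcome weakly improves on the payoff of the original-game outcome it "seems like":

```python
# SPI outcomes of Figure 2: actual payoffs vs the payoffs of the
# original-game outcomes they are played "as if" they were.
actual = {
    ("Ignore threat",  "Threaten (water gun) & commit"): (-1, -1),
    ("Ignore threat",  "Bluff (water gun)"):             (0, 0),
    ("Give in always", "Threaten (water gun) & commit"): (-10, 10),
    ("Give in always", "Bluff (water gun)"):             (-10, 10),
}
seems_like = {
    ("Ignore threat",  "Threaten (water gun) & commit"): (-100, -20),
    ("Ignore threat",  "Bluff (water gun)"):             (0, -5),
    ("Give in always", "Threaten (water gun) & commit"): (-10, 7),
    ("Give in always", "Bluff (water gun)"):             (-10, 7),
}

for outcome, new_payoff in actual.items():
    old_payoff = seems_like[outcome]
    # every SPI outcome must be a (weak) Pareto improvement on what it replaces
    assert all(n >= o for n, o in zip(new_payoff, old_payoff)), outcome
```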

 

The setting in which SPIs are used can be succinctly described like this:

  • Environment: N-player (perfect-information, general-sum) normal-form games; communication
  • Assumptions on agent abilities: randomization; voluntary source-code transparency to a chosen degree; in particular: correlation devices; credible commitments; conditioning on revealed source-code of others
  • Non-assumptions: can’t rely on calculating expected utilities

In more detail: We have a “game” between several players. However, each player (principal) uses an agent (representative) that acts on their behalf. Formally, the representatives play a normal-form game. We assume that before playing the game, the agents can negotiate in a further unconstrained manner; eg, they can use a correlation device to coordinate their actions if they so choose. Each of the agents has some policy according to which they negotiate and act --- not just in the given situation, but also in all related situations. (We can imagine this as a binding contract, if the agent is human, or a source code if the agent is an AI.) A final assumption is that the agents have the ability to truthfully reveal some facts about their policy to others. For example, the agent could reveal the full policy, make credible commitments like “if we toss a coin and it comes out heads, I will do X”, or prove to the other agent that “I am not telling you how I will play, but I will play the same way that I would in this other game”.
Similarly to [OC], we assume that the agents might be unable to compute the expected utility of their policy profiles (perhaps because they don’t know their opponents’ policy, or they do not have access to a simulator that could compute its expected value). This is because with access to this computation, the agents could use simpler methods than SPI [OC]. However, we don’t assume that the agents are never able to do this, so our solutions shouldn’t rely on the inability to compute utilities either. This will become relevant later, eg, once we discuss how the principals could strategize about which agent to choose as a representative.

The SPI approach is for the principals to design their representatives roughly as follows: give the representative some policy P (that probabilistically selects an action in every normal-form game) that satisfies the assumptions (A_ISO) and (A_SDS). And give it the meta-instructions “Show these meta-instructions to other players, plus prove to them (A_ISO), (A_SDS), and the fact that you have no other meta-instructions. If their meta-instructions are identical to yours, play S instead of G.” The reasoning is that if everybody designs their agent like this, they must, by definition of S, be better off than if they played P directly. (And hence using agents like this is an equilibrium in the design space.)
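A toy sketch of the meta-instruction logic (all names are hypothetical; real representatives would need verifiable source-code disclosure rather than a simple equality check):

```python
# Toy model of the meta-instruction: reveal it, inspect the opponent's copy,
# and switch from the original game G to the subgame S only if both sides
# carry the identical meta-instruction.

def choose_game(my_meta, their_meta, G, S):
    """Return the game the representative will actually play."""
    if my_meta == their_meta:   # in reality: verified equivalence of disclosed code
        return S
    return G

def play(policy, game):
    """The underlying policy P is applied unchanged to whichever game is selected,
    which is what (A_ISO) and (A_SDS) are meant to guarantee."""
    return policy(game)
```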

The interpretation of this design can vary. First, we can view the description literally: we actually program our AI to avoid certain actions and treat (for preference purposes) the remaining outcomes differently than it normally would, conditional on other AIs also doing so. We call this the "surrogate goals interpretation". Alternatively, we can imagine that the action taken by the representative agent is not a direct action but rather a decision about which action to take. In this case, SPI can be viewed as an agreement between the principals --- before knowing what actions the representatives select, the principals agree to override dangerous actions by safer ones.[4] We call this the override interpretation. This framing also opens up the option for probabilistic deals such as "if our agents would by default get into a fight, randomly choose one to back off". Formally, this means overriding specific harmful joint actions by randomly chosen safer joint actions. In [OC], this is called a perfect-coordination SPI.

Non-canonical Interpretation

For reasons that will become clear in the next sections, we believe that the following non-canonical interpretation of SPI is also useful: Suppose that a group of principals are playing a game. But rather than playing the game directly, each plays through a representative. The representatives have already been dispatched, there is no way to influence their actions anymore, and their decisions will be binding. Suddenly, a Surrogate Fairy swoops down from the sky --- until now, nobody had even suspected her existence, or that of anything like her. She offers the principals a magical contract that, should they all sign it, will ensure that some suboptimal joint outcomes will, should they arise, be replaced by other joint outcomes that Pareto-improve on them.

[Image: Surrogate Fairy, arriving just in time to save the day]

Potential Objections to SPI

The main objection we discuss is that the proposed framing for SPI is unrealistic since it assumes that the agent's policy does not take into account the fact that SPI exists. We also briefly bring up several other potential objections:

  • SPI does not help with bargaining.
  • SPI might fail to generalize to imperfect-information settings.
  • If SPI works, why don’t humans already do something similar?
  • SPI might empower bad actors.

There are other potential objections such as: agents might spend resources on protecting the surrogate goal, making them less competitive; agents might refuse to use SPI for signalling purposes; it might be hard to make the approach "threatener-neutral"; it might be hard to make SPI work-as-intended for repeated interactions. However, we believe that all of these are either special cases of our main objection or closely related to it, so we refrain from discussing them for now.

Illustrating our Main Objection: Unrealistic Framing

We claim that to make SPI work fully-as-intended, we need one additional assumption (A_SF): outside of the situation to which SPI applies, the representatives and principals must act, at least for strategic purposes, as if the SPI approach did not exist.[5] (“They must not know that the Surrogate Fairy exists, or at least act as if they didn’t.”) Note that we do not claim that if (A_SF) is broken, SPI won’t work at all --- merely that it might not work as well as one might believe based on reading [OC]. To explain our reasoning, let us give a bit more context:

The reason why SPI should be mutually beneficial is that thanks to (A_ISO) and (A_SDS), the players will play the subgame S "the same way" they would play the original game G. And since for any specific G-outcome, the isomorphic S-outcome is a Pareto improvement, the players can only become better off by agreeing to use SPI. However, (A_ISO) and (A_SDS) only ensure that they play S "the same way" they would play G in that specific situation --- that is, their policy does not depend on whether the other player agrees to the SPI contract. The agent's policy can, however, in theory depend on whether the agent is in a world where "SPI is a thing".

Unfortunately, the policy’s dependency on SPI existing can be extremely hard to avoid or detect. For example, a principal could update on the fact that SPI is widely used and instruct their representative AI to self-modify to be more aggressive (now that the risks are lower) in a way that leaves no trace of this self-modification. Alternatively, the principal could one day learn that SPI is widely used and implement it in all of their representatives. And sometime later, they might start favouring representatives that bargain more aggressively (since the downsides to conflict are now smaller). Critically, this choice could even be subconscious. In fact, there might be no intention to change the policy whatsoever, not even a subconscious one: If there is an element of randomness in the choice of policies, selection pressures and evolutionary dynamics (on either representatives or the principals who employ them) might do the rest.

Motivating story: A game played between a caravan and bandits. By default, the caravan profit is worth $10, but both caravan and bandits need to eat, which costs each of them $1. Bandits can leave the caravan alone and forage (in which case they don’t need to spend on food). Or they can ambush the caravan and demand a portion of the goods.[6] If the caravan resists, all the goods get destroyed and both sides get injured. If the caravan gives in, $2 worth of goods gets trampled in the commotion and they split the rest.

The Basic Setup

Consider the following matrix game:

| G; Bandit game | Demand | Leave alone |
|---|---|---|
| Give in | 3, 3 | 9, 0 |
| Resist | -2, -2 | 9, 0 |

The game has two types of Nash equilibria: the pure equilibrium (Give in, Demand) and mixed equilibria (Resist with prob. >= 3/5, Leave alone). Assume that (Give in, Demand) is the default one (why bother being a bandit if you don't ambush anybody?).
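To spell out where the 3/5 threshold comes from (a quick best-response check of our own, using the matrix above): if the caravan resists with probability q, the bandits expect 3(1-q) - 2q = 3 - 5q from "Demand" and 0 from "Leave alone", so leaving the caravan alone is a best response exactly when q >= 3/5.

```python
from fractions import Fraction

# Bandits' expected payoff from "Demand" vs "Leave alone", as a function of
# the probability q with which the caravan resists (exact arithmetic).
for q in (Fraction(0), Fraction(1, 2), Fraction(3, 5), Fraction(7, 10), Fraction(1)):
    demand = 3 * (1 - q) - 2 * q          # = 3 - 5q
    print(q, "leave alone is a best response:", demand <= 0)
```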

To analyze safe Pareto improvements of this game, we extend it by two actions that are theoretically available to both parties, but unlikely to be used:

| G'; extension of G | Demand (armed) | Leave alone | Demand unarmed |
|---|---|---|---|
| Give in (armed only) | 3, 3 | 9, 0 | 9, -1 |
| Resist | -2, -2 | 9, 0 | 9, -1 |
| Give in always | 3, 3 | 9, 0 | 4, 4 |

The assumption (A1), that neither of the two new actions gets played, seems reasonable: For the caravan, "Give in (armed only)" weakly dominates "Give in always" --- they have no incentive to use the latter. And against any caravan strategy that avoids "Give in always", the bandits are better off playing either "Demand (armed)" or "Leave alone" rather than "Demand unarmed" (which would earn them -1).

We now consider the following subgame[7] of this extension, which we denote S.

| S; SPI on G' | Demand unarmed | Leave alone |
|---|---|---|
| Give in always | 3, 3 | 9, 0 |
| Resist | -2, -2 | 9, 0 |

If we further assume that (A2) both sides play isomorphic games isomorphically, S will be a safe Pareto improvement on G'. Indeed, the "Leave alone" outcomes remain the same, while (Give in always, Demand unarmed) and (Resist, Demand unarmed) are strict Pareto improvements on (Give in (armed only), Demand (armed)) and (Resist, Demand (armed)), respectively.
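Both the dominance claims behind (A1) and this Pareto condition are easy to check mechanically; a minimal sketch with our own encoding of G':

```python
# G' payoffs: (caravan, bandits); rows = caravan actions, cols = bandit actions.
G_prime = {
    ("give in (armed only)", "demand armed"):   (3, 3),
    ("give in (armed only)", "leave alone"):    (9, 0),
    ("give in (armed only)", "demand unarmed"): (9, -1),
    ("resist",               "demand armed"):   (-2, -2),
    ("resist",               "leave alone"):    (9, 0),
    ("resist",               "demand unarmed"): (9, -1),
    ("give in always",       "demand armed"):   (3, 3),
    ("give in always",       "leave alone"):    (9, 0),
    ("give in always",       "demand unarmed"): (4, 4),
}
cols = ["demand armed", "leave alone", "demand unarmed"]

# (A1a) "Give in (armed only)" weakly dominates "Give in always" for the caravan.
assert all(G_prime[("give in (armed only)", c)][0] >= G_prime[("give in always", c)][0]
           for c in cols)

# (A1b) Against any caravan strategy avoiding "Give in always", "Demand unarmed"
# gives the bandits -1, which "Leave alone" (0) beats.
assert all(G_prime[(r, "demand unarmed")][1] < G_prime[(r, "leave alone")][1]
           for r in ["give in (armed only)", "resist"])

# The SPI condition: each outcome of S, played as if it were the corresponding
# outcome of G, yields an actual payoff that weakly Pareto-improves on it.
correspondence = {  # S outcome -> G outcome it replaces
    ("give in always", "demand unarmed"): ("give in (armed only)", "demand armed"),
    ("resist",         "demand unarmed"): ("resist",               "demand armed"),
    ("give in always", "leave alone"):    ("give in (armed only)", "leave alone"),
    ("resist",         "leave alone"):    ("resist",               "leave alone"),
}
for s_outcome, g_outcome in correspondence.items():
    assert all(a >= b for a, b in zip(G_prime[s_outcome], G_prime[g_outcome]))
```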

The Meta-Game

Consider now the above example from the point of view of the bandit leader and the merchant who dispatches the caravan. The bandit leader is deciding whether to use the SPI or not. Both sides are deciding which policy to assign to their agent and whether the agent should sign the SPI contract or not. But we will assume that the caravan is always on board with using the SPI (which seems reasonable) and the bandits will never use the “Leave alone” action (which we revisit later). The resulting meta-game looks as follows:

| M; Meta-game | don't use SPI (& Instr. to Demand) | use SPI (& Instr. to Demand) |
|---|---|---|
| Instruct to give in (& use SPI) | 3, 3 | 4, 4 |
| Instruct to resist (& use SPI) | -2, -2 | 9, -1 |

M has a single Nash equilibrium: (Instruct to resist, use SPI). We discuss the possible implications of this in the next section.
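A brute-force check of this claim (our own encoding of M, with labels abbreviated; only pure strategies are enumerated, but "use SPI" strictly dominates for the bandits, so there is no additional mixed equilibrium):

```python
# Meta-game M: rows = merchant, cols = bandit leader; payoffs (merchant, bandits).
M = {("give in", "don't use SPI"): (3, 3),   ("give in", "use SPI"): (4, 4),
     ("resist",  "don't use SPI"): (-2, -2), ("resist",  "use SPI"): (9, -1)}
rows, cols = ["give in", "resist"], ["don't use SPI", "use SPI"]

pure_nash = [
    (r, c) for r in rows for c in cols
    if M[(r, c)][0] == max(M[(r2, c)][0] for r2 in rows)     # merchant best-responds
    and M[(r, c)][1] == max(M[(r, c2)][1] for c2 in cols)    # bandit leader best-responds
]
print(pure_nash)  # [('resist', 'use SPI')]
```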

Embedding in a Larger Setting

To see whether the introduction of SPI is beneficial or not, we frame the problem a bit more broadly than we did above. First, we argue that SPI should not incentivize the players to opt out of participating. Second, we will see that whether the players use SPI or not depends on the larger setting in which G and M appear.

Opting-out is Undesirable: First, we discuss the opt-out option: In the merchant’s case, this means not operating on the given caravan route (utility 0). In the bandit leader’s case, it means disbanding the group and becoming farmers instead (utility 1):

| M'; extension of M | don't use SPI | use SPI | opt-out |
|---|---|---|---|
| Instruct to give in | 3, 3 | 4, 4 | 9, 1 |
| Instruct to resist | -2, -2 | 9, -1 | 9, 1 |
| opt-out | 0, -1 | 0, -1 | 0, 1 |

The Nash equilibrium of this game is (Instruct to resist, opt-out), which seems desirable. However, consider the scenario where the bandits are replaced by, eg, rangers who keep the forest clear of monsters (or an investor who wants to build a bridge over a river). We could formalize it like the game M' above, except that the (9, 1) payoffs corresponding to the ( _ , opt-out) outcomes would change to (-1, 1). The outcome (opt-out, opt-out) becomes the new equilibrium, which is strictly worse than the original (3, 3) outcome:

| M_a; alternative to M' | don't use SPI | use SPI | opt-out |
|---|---|---|---|
| Instruct to give in | 3, 3 | 4, 4 | -1, 1 |
| Instruct to resist | -2, -2 | 9, -1 | -1, 1 |
| opt-out | 0, -1 | 0, -1 | 0, 1 |

Broader Settings: The meta-behaviour in the bandit example strongly depends on the specific setting. We show that SPI might or might not get adopted, and when it does, it might or might not change the policy that the agents are using.

Nash equilibrium, Evolutionary dynamics: First, consider the (unlikely) scenario where the game M’ is played in a one-shot setting between two rational players. Then the players will play the Nash equilibrium (Instruct to resist, opt-out).

Second, consider the scenario where there are many bandit groups and many merchants, each of which can use a different (pure) strategy. At each time step, bandit groups and merchants get paired up randomly. And in the next step t+1, the relative frequency of strategy S will be proportional to weight_t(S)*exp(utility of S against the other population at t). Under this dynamic, the only evolutionarily stable strategy is the NE above. (Intuitively: Using SPI outcompetes not using it. Once using SPI is prevalent, resisting outcompetes giving in. And if opting-out is an option, the column population will switch to it. Formally: Under this evolutionary dynamic, evolutionarily stable states coincide with Nash equilibria.)
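A minimal simulation of this dynamic on M' (our own encoding and initial frequencies; the update rule is the exponential-weights rule described above). With these starting frequencies, the two populations end up concentrated on "Instruct to resist" and "opt-out" respectively, ie, on the Nash equilibrium:

```python
import math

# M' payoffs: (merchant, bandit leader); merchant strategies index the rows.
M = {("give in", "no SPI"): (3, 3),   ("give in", "use SPI"): (4, 4),   ("give in", "opt-out"): (9, 1),
     ("resist",  "no SPI"): (-2, -2), ("resist",  "use SPI"): (9, -1),  ("resist",  "opt-out"): (9, 1),
     ("opt-out", "no SPI"): (0, -1),  ("opt-out", "use SPI"): (0, -1),  ("opt-out", "opt-out"): (0, 1)}

# Start with (almost) everyone playing the default (Give in, don't use SPI) profile.
merchants = {"give in": 0.98, "resist": 0.01, "opt-out": 0.01}
bandits   = {"no SPI": 0.98, "use SPI": 0.01, "opt-out": 0.01}

def step(pop, opp_pop, payoff_of):
    """weight_{t+1}(S) proportional to weight_t(S) * exp(expected utility of S at t)."""
    new = {s: w * math.exp(sum(payoff_of(s, o) * ow for o, ow in opp_pop.items()))
           for s, w in pop.items()}
    total = sum(new.values())
    return {s: w / total for s, w in new.items()}

for t in range(200):
    merchants, bandits = (step(merchants, bandits, lambda r, c: M[(r, c)][0]),
                          step(bandits, merchants, lambda c, r: M[(r, c)][1]))

print({s: round(w, 3) for s, w in merchants.items()})  # concentrates on "resist"
print({s: round(w, 3) for s, w in bandits.items()})    # concentrates on "opt-out"
```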

Infinitely repeated games without discounting (ie, maximizing average payoffs): Here, many different (subgame-perfect) Nash equilibria are possible (corresponding to individually rational feasible payoffs). One of them would be (Instruct to give in, use SPI), but this would be backed up by the bandits threatening the repeated use of “don’t use SPI” should the merchants deviate to “Instruct to resist”. In a NE, this wouldn’t actually be used.

Translucent or pre-committing bandits: We can also consider a setting -- one-shot or repeated -- where the bandits can precommit to a strategy in a way that is visible to the merchant player. Alternatively, we could assume that the merchant player can perfectly predict the bandits’ actions. Or that the bandits are translucent --- they are assumed to be playing a fixed strategy “Don’t use SPI”, they have the option to deviate to “Use SPI”, but there is a chance that if they do, the merchant will learn this before submitting their action.

In all of these settings, the optimal action is for the bandits to not use SPI (because it forces the caravan to give in).

The “Other” Objections

This rejection of (A_SF) complicates our models but allows us to address (at least preliminarily) some of the previously-mentioned objections:

Objection: What if the agents spend resources on protecting the surrogate goal and this makes them less competitive?

Reaction: For some designs, this might happen. For example, an agent who is equally afraid of real guns and water guns might start advocating for a water-gun ban.[8] But perhaps this can be avoided with a better design? Also, note that this might not apply to the “action override” implementation mentioned in the introduction.

Objection: Agents might refuse the SPI contract as a form of signalling in incomplete information settings.

Reaction: This seems quite likely --- in a repeated setting, I might refuse some SPI contracts that would be good for me, hoping that the other agent will think that they need to offer me a better deal (a more favourable SPI) in the next rounds. More work here would be useful to understand this.

Objection: It might be hard to make the approach “threatener-neutral” (ie, to keep the desirability of threat-making the same as before).

Reaction: In the Surrogate Fairy setting, SPI is threatener-neutral by definition, since the agents use the same policy. However, without (A_SF), we saw that one party can indeed be made worse off --- which can be viewed as a kind of threatener non-neutrality. One way around this could be to subsidize that party through side-payments. (But the details are unclear.)

Finally, since we are already considering a larger context around the game, it is worth noting that there are additional tools that might help with SPI. Eg, apart from side-payments, we could design agents which commit to not changing their policy for a certain period of time. (Incidentally, the observation that these tools might help with making SPI work is one reason for the author's intuition that SPI will "add up to normality".)

SPI and Bargaining

One limitation of SPI is that even if all players agree to use it, they are still sometimes left with a difficult bargaining problem. (However, as [OC] points out, the consequences of bargaining failure might be less severe.) This can arise at two points.

SPI Selection

First, the players need to agree on which SPI to use. For example, suppose two players are faced with a "wingman dilemma"[9]: who gets to be the cool guy today?

| Wingman Dilemma | Wingman | Cool guy |
|---|---|---|
| Wingman | 1, 1 | 0, 3 |
| Cool guy | 3, 0 | 0, 0 |

This game has a perfect-coordination SPI where the outcome (Cool guy, Cool guy) is replaced by the lottery p*(Cool guy, Wingman) + (1-p)*(Wingman, Cool guy) for some p in [0, 1], yielding expected utilities (3p, 3(1-p)). (We could also replace (Wingman, Wingman), but this wouldn't change our point.)

| perf. coord. SPI on WD | Wingman | Cool guy |
|---|---|---|
| Wingman | 1, 1 | 0, 3 |
| Cool guy | 3, 0 | 3p, 3(1-p) (but seems like 0, 0) |

In the program meta-game [OC], we can consider strategies of the form “demand x”:
Π_x := if the opponent's program isn't of the form Π_y for some y, play Cool guy;
if the opponent plays Π_y s.t. x + y > 3, play Cool guy;
if the opponent plays Π_y s.t. x + y <= 3, coordinate with them on achieving
expected utility at least (x, y), using some random compatible lottery parameter p.

The equilibria of this meta-game are (Π_x, Π_{3-x}) for any x in [0, 3]. In other words, introducing SPI effectively transforms WD into the Nash Demand game. Arguably, this particular situation has a simple solution: "don't be a douche and use Π_{1.5}" (ie, the Schelling point). However, Section 9 of [OC] indicates that SPI selection can correspond to more difficult bargaining problems.
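To make the Nash-Demand connection concrete, here is a toy sketch (our own encoding; the function name and the convention that compatible demands are met exactly are illustrative assumptions, not a claim about [OC]):

```python
def demand_game_payoffs(x, y, total=3.0):
    """Toy payoffs of the meta-game induced by 'demand x' programs: compatible
    demands are (at least) met via some compatible lottery over the asymmetric
    outcomes; incompatible demands leave both players with the conflict
    outcome (Cool guy, Cool guy), i.e. (0, 0)."""
    if x + y <= total:
        return (x, y)   # some p with 3p >= x and 3*(1-p) >= y exists
    return (0.0, 0.0)

# Any profile (demand x, demand 3-x) is an equilibrium: demanding more than
# one's equilibrium share pushes the pair into the conflict outcome.
assert demand_game_payoffs(1.5, 1.5) == (1.5, 1.5)
assert demand_game_payoffs(2.0, 1.5) == (0.0, 0.0)
```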

Bargaining in SPI

Even once the agents agree on which SPI to use, they still need to agree on which outcome to choose. For example, consider a game G with actions a_1, ..., a_n (for each player), utilities u(a_i, a_i) = v_i and u(a_i, a_j) = (0, 0) for i ≠ j, where {v_1, ..., v_n} is a subset of the first quadrant of a circle. Since none of the outcomes v_i can be Pareto-improved upon, all of these options will remain on the Pareto-frontier of any SPI.

 
| G | a_1 | a_2 | a_3 |
|---|---|---|---|
| a_1 | v_1 | 0, 0 | 0, 0 |
| a_2 | 0, 0 | v_2 | 0, 0 |
| a_3 | 0, 0 | 0, 0 | v_3 |

Imperfect Information

For the purpose of this section, we will only consider the simplest case of the "Surrogate Fairy" setting, where all the agents are concerned with is how to play the game at hand and whether to sign the SPI contract or not.

Imperfect information comes in two forms: incomplete information (uncertainty about payoffs) and imperfect but complete information (where players know each other's payoffs, but the game is sequential and some events might not be observable to everybody).

First, let us observe that under (A_SF), we only need to worry about incomplete information. Indeed, any complete information sequential game G has a corresponding normal-form representation N. So as long as we don't care about computational efficiency, we can just take whatever SPI tool we have and apply it to N. In theory, this could create problems with some strategies not being subgame-perfect. However, recall that we assume that agents can make precommitments. As a result, subgame-imperfection does not pose a problem for complete information games.

However, incomplete information will still pose a problem. Indeed, this is because with incomplete information, an agent is essentially uncertain about the identity of the agent they are facing. For example, perhaps I am dealing with somebody who prefers conflict? Or perhaps they think that I might prefer conflict --- maybe I could use that to bluff? Importantly, for the purpose of SPI, the trick "reduce the game to its normal-form representation" doesn't apply, because the agents' power to precommit in incomplete information games is weaker. Indeed, while the agent can make commitments on behalf of its future versions, it will typically not have influence over its "alternative selves".[10] As a result, we must require that the agents behave in a subgame-perfect manner.

| Incomplete inf. WD | Wingman | Cool guy |
|---|---|---|
| Wingman | 1, 1 | 0, 3 |
| Cool guy | 3, 0 | x, y |

x and y are drawn from some common-knowledge distribution, but the exact value of x (resp. y) is private to player 1 (resp. player 2).

We see several ways to deal with incomplete information. First, the agents might use some third-party device which proposes a subgame S (or a perfect-information token game) without taking their types into account. Each player can either accept S or turn it down --- and under (A_SF), it will get accepted by all players if and only if it is an SPI over the game that corresponds to their type. Notice that there will be a tradeoff between fairness and acceptance rate of S --- for example, in the game above, suppose that y = 0 always but x can be either 0 (90%) or 2 (10%). We can either get a 50% acceptance rate by proposing the fair 50:50 split, or a 100% acceptance by proposing an unfair 1:2 split. (The optimal design of such third-party devices thus becomes an open problem.)
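As a toy illustration of this fairness/acceptance tradeoff (our own encoding, with hypothetical numbers rather than the ones used above): a device proposes a fixed split of the 3 units of surplus, and a player of a given type accepts iff their share weakly improves on the default outcome for that type.

```python
import random

def acceptance_rate(proposed_split, sample_types, trials=10_000):
    """Fraction of games in which the proposed replacement of (Cool guy, Cool guy)
    is an SPI for both realized types, i.e. weakly improves on (x, y)."""
    share1, share2 = proposed_split
    accepted = 0
    for _ in range(trials):
        x, y = sample_types()
        if share1 >= x and share2 >= y:
            accepted += 1
    return accepted / trials

def sample():
    # Hypothetical type distribution: player 2's value is always 0, player 1's
    # value is usually low (0) but occasionally high (2).
    return (random.choice([0, 0, 0, 2]), 0)

print(acceptance_rate((1.5, 1.5), sample))  # fair split, rejected by the high type
print(acceptance_rate((2.0, 1.0), sample))  # unfair split, accepted by every type
```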

Second, the agents might each privately reveal their type to a trusted third-party which then generates a subgame that is guaranteed to be an SPI for their types. (Eg, it could override some joint outcomes by others that Pareto-improve upon them, without revealing the specific mapping ahead of time to prevent leaking information.) Under (A_SF), this approach should always be appealing to all players.

Third, the agents might bargain over which subgame to play instead of G. (Alternatively: over which third-party device to use. Or the devices could return sets of subgames and the players could then bargain over which one to use.) For this to “fully” work under (A_SF), the players “only” need to guarantee that their subgame-policy does not depend on information learned during the bargaining phase. Going beyond (A_SF), it might additionally help if the policies for playing subgames do not know which subgame they are playing; this seems simple to achieve in the override interpretation but difficult in the SG interpretation.

Note that if we completely abandon (A_SF) and assume that agents interact repeatedly, the decision to accept a subgame or not might update the other agents' beliefs about one's type. As a result, the agents might refuse a subgame that is an SPI in an attempt to mislead the others. This behaviour seems inherent to repeated interactions in incomplete-information settings and we do not expect to find a simple way around it.

If SPI Works, Why Aren’t Humans Already Using It?

This is an informal, "outside-view" objection stemming from the observation that, as far as the author can tell, nothing like SPI seems to be used by humans. One might argue that SPI requires transparent policies, and humans aren't transparent. However, humans are often partially transparent to each other. Moreover, legal contracts and financial incentives do make human commitments at least somewhat binding. As a result, we expect that if SPI "works" for AIs, it would already work for humans, at least to a limited extent. Why doesn't it?

One explanation is that humans in fact do implicitly use things like SPI, but they are so ingrained that it is difficult for us to notice the similarities. For example, when we get into a shouting match but refrain from using fists, or have a fistfight but don't murder each other, is that an SPI? But perhaps I only refrain from beating people up because it would get me in trouble with the police (or make me unpopular)? Could this still be viewed as a society-wide method for implementing SPI? Overall, this topic seems fascinating, but too confusing to be actionable. We view observations of this sort as weak evidence for the hypothesis that SPI will "add up to normality".

However, this leaves unanswered the question of why humans don't (seem to) use SPI explicitly. We might expect to see this when people reason analytically and the stakes are sufficiently high. Some areas which pattern-match this description are legal disputes, deals between companies or corporations, and politics. Why aren't we seeing SPI-like methods there? The answer for legal disputes and politics is that SPI very well might be used there --- it is just that the author is mostly ignorant about these areas. (If somebody more familiar with these areas knows about relevant examples or their absence, these would definitely provide useful intuitions for SPI research.) For business deals, the answer is different: SPI isn't used there because in most countries, it would be a violation of anti-trust law. Indeed, conflict between companies is often precisely the thing that makes the market better for the customers. (However, there might be other forms of conflict, eg negative advertisement, that everybody would be happy to avoid.) Finally, another reason why we might not be seeing explicit use of SPI is that SPI often applies to settings that involve blackmail. People might, therefore, refrain from using SPI lest they acknowledge that they do something illegal or immoral.

Empowering Bad Actors

As a final remark, the example of companies using SPI to avoid costly competition (eg, lowering prices) should make us wary about the potential misuses of SPI. However, this concern is not new or specific to SPI --- rather, it applies to cooperative AI more broadly.

Questions & Suggestions for Future Research

disclaimer: disorganized and unpolished

Questions related to the current text:

  • Can we find a setting, similar to the "bandits become farmers" setting, where SPI makes things very bad?
  • Can we find a setting where threats get carried out despite introducing SPI? Or perhaps even because of it?

More general questions:

  • Assuming (A_SF), work out details of how SPI would actually apply to some real-world problem. (To identify problems we might be overlooking.)
  • Describe a useful & realistic setting for SPI without (A_SF). Identify open problems for that setting.
  • Meta: Is it possible to make SPI that helps with the “bad threats” but doesn’t disrupt the “good threats” (like taxes, police, and "no ice cream for you if you leave your toys on the floor")?

Conclusion

We presented a preliminary analysis of several objections to surrogate goals and safe Pareto improvements. First, we saw that to predict whether SPI is likely to help or not, we need to consider a larger context than just the one isolated decision problem. More work in this area could focus on finding useful framings in which to analyze SPI. Second, we observed that SPI doesn't make bargaining go away --- the agents need to agree on which SPI to use and which outcome in that SPI to select. Finally, we saw some evidence that SPI might work even in incomplete information settings --- however, more work on this topic is needed.


This research has been financially supported by CLR. Also, many thanks to Jesse Clifton, Caspar Oesterheld, and Tobias Baumann and to various people at CLR.


Footnotes

[1] Wild guess: 75%.

[2] Strategic equivalence could be formalized in various different ways (eg, also removing some weakly dominated strategies). What ultimately matters is that the agents need to treat them as equivalent.

[3] In accordance with [OC], we will use “Pareto-improvement” to include outcomes that are equally good.

[4] In practice, AI representatives could make this agreement directly between themselves. This variant makes a lot of sense, but we avoid it because it seems too schizophrenic to generate useful intuitions.

[5] The paper [OC] is not unaware of the need for this assumption. We chose to highlight it because we believe it might be a crux for whether SPI works or not. This is a potential disagreement between this text and [OC].

[6] If you think the bandits would want to take everything, imagine that 50% of the time the caravan manages to avoid them.

[7] The interpretation is that the players pick actions as if they were playing the original game, but then end up replacing “Demand” by “Demand unarmed” and “Give in” by “Give in always”. For further details, see Caspar’s SPI paper.

[8] Note that a non-naive implementation will only be afraid of water-guns if it signs the SPI contract with the person holding them. In particular, no, the agent won’t develop a horrible fear of children near swimming pools.

[9] As far as the payoffs go, this is a Game of Chicken. However, the traditional formulation of Chicken gives the intuition that coordination and communication is not the point.

[10] We could program all robots of the same type to act in a coordinated manner even if they are used by principals with different utilities. However, coordinating with a completely different brand of AI might be unlikely. Finally, what if the representative R1 is only one of a kind and the uncertainty over payoffs is only present because the other representative R2 has a belief over the goals of principal P1 (and this belief is completely detached from reality, but nevertheless common knowledge)? In such a setting, it would be undesirable for R1 to try "coordinating" with its hypothetical alternative selves who serve the hypothetical alternative P1s.

Comments

Great to see more work on surrogate goals/SPIs!

>Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.

I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you claiming that B) they don't fundamentally change game-theoretic analysis (of any scenario/most scenarios)? Or C) are you saying they are irrelevant for AI v. AI interactions? Or D) that the invention of SPIs will not revolutionize human society, make peace in the middle east, ...) Some of the versions seem clearly false to me. (E.g., re C, even if you think that the requirements for the use of SPIs are rarely satisfied in practice, it's still easy to construct simple, somewhat plausible scenarios / assumptions (see our paper) under which SPIs do seem do matter substantially for game-theoretic analysis.) Some just aren't justified at all in your post. (E.g., re A1, you're saying that (like myself) you find this all confusing and hard to say.) And some are probably not contrary to what anyone else believes about surrogate goals / SPIs. (E.g., I don't know anyone who makes particularly broad or grandiose claims about the use of SPIs by humans.)

My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X. Here are three instances of this (I think these are the only three), the first one being most significant.

The main objection of the post is that while adopting an SPI, the original players must keep a bunch of things (at least approximately) constant(/analogous to the no-SPI counterfactual) even when they have an incentive to change that thing, and they need to do this credibly (or, rather, make it credible that they aren't making any changes). You argue that this is often unrealistic. Well, the initial reaction of mine was: "Sure, I know these things!" (Relatedly: while I like the bandit v caravan example, this point can also be illustrated with any of the existing examples of SPIs and surrogate goals.) I also don't think the assumption is that unrealistic. It seems that one substantial part of your complaint is that besides instructing the representative/self-modifying, the original player/principal can do other things about the threat (like advocating a ban on real or water guns). I agree that this is important. If in 20 years I instruct an AI to manage my resources, it would be problematic if in the meantime I make tons of decisions (e.g., about how to train my AI systems) differently based on my knowledge that I will use surrogate goals anyway. But it's easy to come up with scenarios where this is not a problem. E.g., when an agent considers immediate self-modification, *all* her future decisions will be guided by the modified u.f. Or when the SPI is applied to some isolated interaction. When all is in the representative's hand, we only need to ensure that the *representative* always acts in the same way it would act in a world where SPIs aren't a thing.

And I don't think it's that difficult to come up with situations in which the latter thing can be comfortably achieved. Here is one scenario. Imagine the two of us play a particular game G with SPI G'. The way in which we play this is that we both send a lawyer to a meeting and then the lawyers play the game in some way. Then we could mutually commit (by contract) to pay our lawyers in proportion to the utilities they obtain in G' (and to not make any additional payments to them). The lawyers at this point may know exactly what's going on (that we don't really care about water guns, and so on) -- but they are still incentivized to play the SPI game G' to the best of their ability. You might even beg your lawyer to never give in (or the like), but the lawyer is incentivized to ignore such pleas. (Obviously, there could still be various complications. If you hire the lawyer only for this specific interaction and you know how aggressive/hawkish different lawyers are (in terms of how they negotiate), you might be inclined to hire a more aggressive one with the SPI. But you might hire the lawyer you usually hire. And in practice I doubt that it'd be easy to figure out how hawkish different lawyers are.)

Overall I'd have appreciated more detailed discussion of when this is realistic (or of why you think it rarely is realistic). I don't remember Tobi's posts very well, but our paper definitely doesn't spend much space on discussing these important questions.

On SPI selection, I think the point from Section 10 of our paper is quite important, especially in the kinds of games that inspired the creation of surrogate goals in the first place. I agree that in some games, the SPI selection problem is no easier than the equilibrium selection problem in the base game. But there are games where it does fundamentally change things because *any* SPI that cannot further be Pareto-improved upon drastically increases your utility from one of the outcomes.

Re the "Bargaining in SPI" section: For one, the proposal in Section 9 of our paper can still be used to eliminate the zeroes!

Also, the "Bargaining in SPI" and "SPI Selection" sections to me don't really seem like "objections". They are limitations. (In a similar way as "the small pox vaccine doesn't cure cancer" is useful info but not an objection to the small pox vaccine.)

> My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X.

Thank you for pointing that out. In all these cases, I actually know that you "stated X", so this is not an impression I wanted to create. I added a note at the beginning of the document to hopefully clarify this.

> Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.

> I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you claiming that B) they don't fundamentally change game-theoretic analysis (of any scenario/most scenarios)? Or C) are you saying they are irrelevant for AI v. AI interactions? Or D) that the invention of SPIs will not revolutionize human society, make peace in the middle east, ...) Some of the versions seem clearly false to me. (E.g., re C, even if you think that the requirements for the use of SPIs are rarely satisfied in practice, it's still easy to construct simple, somewhat plausible scenarios / assumptions (see our paper) under which SPIs do seem do matter substantially for game-theoretic analysis.) Some just aren't justified at all in your post. (E.g., re A1, you're saying that (like myself) you find this all confusing and hard to say.) And some are probably not contrary to what anyone else believes about surrogate goals / SPIs. (E.g., I don't know anyone who makes particularly broad or grandiose claims about the use of SPIs by humans.)

I definitely don't think (C) and the "any" variant of (B). Less sure about the "most" variant of (B), but I wouldn't bet on that either.

I do believe (D), mostly because I don't think that humans will be able to make the necessary commitments (in the sense mentioned in the thread with Rohin). I am not super sure about (A). My bet is that to the extent that SPI can work for humans, we are already using it (or something equivalent) in most situations. But perhaps some exceptions will work, like the lawyer example? (Although I suspect that our skill at picking hawkish lawyers is stronger than we realize. Or there might be existing incentives where lawyers are being selected for hawkishness, because we are already using them for something-like-SPI? Overall, I guess that the more one-time-only an event is, the higher is the chance that the pre-existing selection pressures will be weak, and (A) might work.)

> Overall I'd have appreciated more detailed discussion of when this is realistic (or of why you think it rarely is realistic).

That is a good point. I will try to expand on it, perhaps at least in a comment here once I have time, or so :-).

Interesting report :) One quibble:

> For one, our AIs can only use “things like SPI” if we actually formalize the approach

I don't see why this is the case. If it's possible for humans to start using things like SPI without a formalisation, why couldn't AIs too? (I agree it's more likely that we can get them to do so if we formalise it, though.)

Thanks for pointing this out :-). Indeed, my original formulation is false; I agree with the "more likely to work if we formalise it" formulation.

Not very important, but: Despite having spent a lot of time on formalizing SPIs, I have some sympathy for a view like the following:

> Yeah, surrogate goals / SPIs are great. But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI. If we do this, then AI will implement SPIs (or something even better) regardless of how well we understand them. And if we don't solve these issues, then it's hopeless to add SPIs manually. Furthermore, believing that surrogate goals / SPIs work (or, rather, make a big difference for bargaining outcomes) shouldn't change our behavior much (for the reasons discussed in Vojta's post).

On this view, it doesn't help substantially to understand / analyze SPIs formally.

But I think there are sufficiently many gaps in this argument to make the analysis worthwhile. For example, I think it's plausible that the effective use of SPIs hinges on subtle aspects of the design of an agent that we might not think much about if we don't understand SPIs sufficiently well.

Regarding the following part of the view that you commented on:

> But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI.

Just wanted to add: It may be important to consider potential downside risks of such work. It may be important to be vigilant when working on certain topics in game theory and e.g. make certain binding commitments before investigating certain issues, because otherwise one might lose a commitment race in logical time. (I think this is a special case of a more general argument made in Multiverse-wide Cooperation via Correlated Decision Making about how it may be important to make certain commitments before discovering certain crucial considerations.)

It seems like your main objection arises because you view SPI as an agreement between the two players. However, I don't see why (the informal version of) SPI requires you to verify that the other player is also following SPI -- it seems like one player can unilaterally employ SPI, and this is the version that seems most useful.

In the bandits example, it seems like the caravan can unilaterally employ SPI to reduce the badness of the bandit's threat. For example, the caravan can credibly commit that they will treat Nerf guns identically to regular guns, so that (a) any time one of them is shot with a Nerf gun, they will flop over and pretend to be a corpse, until the heist has been resolved, and (b) their probability of resisting against Nerf guns will be the same as the probability of resisting against actual guns. In this case the bandits might as well use Nerf guns (perhaps because they're cheaper, or they prefer not to murder if possible). If the bandits continue to use regular guns, the caravan isn't any worse off, so it is an SPI, despite the fact that we have assumed nothing on the part of the bandits.

It seems like this continues to be desirable for the caravan even in the repeated game setting, or the translucent game setting (indeed the translucent game setting makes it easier to make such a credible commitment).

(Aside: This technique can only work on "threats", i.e. actions that the other player is taking solely because you disvalue them, not because the other player values them. For example, the caravan can't employ SPI to reduce the threat of their goods being taken, but the bandits explicitly want the goods.)


On the topic of why humans don't use SPI, it seems like there are several reasons:

  • Humans and human institutions can't easily make credible commitments.
  • In the case of repeated games, it only takes one defection to break the SPI (e.g. in my bandits example above, it only takes one caravan that ignores the rule about Nerf guns being treated as regular guns, after which the bandits go back to regular guns), so the equilibrium is unstable. So you really need your commitments to be Very Credible.
  • Humans and human institutions are not particularly good at "acting" as though they have different goals.
  • In the modern world, credible threats are not that common (or at least public knowledge of them isn't common). We've chosen to remove the ability to make threats, rather than to mitigate their downsides.

> In the bandits example, it seems like the caravan can unilaterally employ SPI to reduce the badness of the bandit's threat. For example, the caravan can credibly commit that they will treat Nerf guns identically to regular guns, so that (a) any time one of them is shot with a Nerf gun, they will flop over and pretend to be a corpse, until the heist has been resolved, and (b) their probability of resisting against Nerf guns will be the same as the probability of resisting against actual guns. In this case the bandits might as well use Nerf guns (perhaps because they're cheaper, or they prefer not to murder if possible). If the bandits continue to use regular guns, the caravan isn't any worse off, so it is an SPI, despite the fact that we have assumed nothing on the part of the bandits.

I agree that such a commitment can be employed unilaterally and can be very useful. Though the caravan should consider that doing so may increase the chance of them being attacked (due to the Nerf guns being cheaper etc.). So perhaps the optimal unilateral commitment is more complicated and involves a condition where the bandits are required to somehow make the Nerf gun attack almost as costly for themselves as a regular attack.

Yeah, this seems right.

I'll note though that you may want to make it at least somewhat easier to make the new threat, so that the other player has an incentive to use the new threat rather than the old threat, in cases where they would have used the old threat (rather than being indifferent between the two).

This does mean it is no longer a Pareto improvement, but it still seems like this sort of unilateral commitment can significantly help in expectation.

> It seems like your main objection arises because you view SPI as an agreement between the two players.

I would say that my main objection is that if you know that you will encounter SPI in situation X, you have an incentive to alter the policy that you will be using in X. Which might cause other agents to behave differently, possibly in ways that lead to the threat being carried out (which is precisely the thing that SPI aimed to avoid).

In the bandit case, suppose the caravan credibly commits to treating nerf guns identically to regular guns. And suppose this incentivizes the bandits to avoid regular guns. Then you are incentivized to self-modify to start resisting more. (Eg, if you both use CDT and the "logical time" is "self modify?" --> "credibly commit?" --> "use nerf?".) However, if the bandits realize this --- i.e., if the "logical time" is "use nerf?" --> "self modify?" --> "credibly commit?" --- then the bandits will want to not use nerf guns, forcing you to not self-modify. And if you each think that you are "logically before" the other party, you will make incompatible commitments (use regular guns & self-modify to resist) and people get shot with regular guns.

So, I agree that credible unilateral commitments can be useful and they can lead to guaranteed Pareto improvements. It's just that I don't think it addresses my main objection against the proposal.


> So perhaps the optimal unilateral commitment is more complicated and involves a condition where the bandits are required to somehow make the Nerf gun attack almost as costly for themselves as a regular attack.

Yup, I fully agree.

> Then you are incentivized to self-modify to start resisting more.

The whole point is to commit not to do this (and then actually not do it)!

It seems like you are treating "me" (the caravan) as unable to make commitments and always taking local improvements; I don't know why you would assume this. Yes, there are some AI designs which would self-modify to resist more (e.g. CDT with a certain logical time); seems like you should conclude those AI designs are bad. If you were going to use SPI you'd build AIs that don't do this self-modification once they've decided to use SPI.

Do you also have this objection to one-boxing in Newcomb's problem? (Since once the boxes are filled you have an incentive to self-modify into a two-boxer?)

I think I agree with your claims about committing, AI designs, and self-modifying into a two-boxer being stupid. But I think we are using a different framing, or there is some misunderstanding about what my claim is. Let me try to rephrase it:

(1) I am not claiming that SPI will never work as intended (ie, get adopted, don't change players' strategies, don't change players' "meta strategies"). Rather, I am saying it will work in some settings and fail to work in others.

(2) Whether SPI works-as-intended depends on the larger context it appears in. (Some examples of settings are described in the "Embedding in a Larger Setting" section.) Importantly, this is much more about which real-world setting you apply SPI to than about the prediction being too sensitive to how you formalize things.

(3) Because of (2), I think this is a really tricky topic to talk about informally. I think it might be best to separately ask (a) Given some specific formalization of the meta-setting, what will the introduction of SPI do? and (b) Is it formalizing a reasonable real-world situation (and is the formalization appropriate)?

(4) An IMO important real-world situation is when you -- a human or an institution -- employ AI (or multiple AIs) which repeatedly interact on your behalf, in a world where a bunch of other humans or institutions are doing the same. The details really matter here but I personally expect this to behave similarly to the "Evolutionary dynamics" example described in the report. Informally speaking, once SPI is getting widely adopted, you will have incentives to modify your AIs to be more aggressive / lean towards using AIs that are more aggressive. And even if you resist those incentives, you/your institution will get outcompeted by those who use more aggressive AIs. This will result in SPI not working as intended.

I originally thought the section "Illustrating our Main Objection: Unrealistic Framing" should suffice to explain all this, but apparently it isn't as clear as I thought it was. Nevertheless, it might perhaps be helpful to read it with this example in mind?

I agree with (1) and (2), in the same way that I would agree that "one boxing will work in some settings and fail to work in others" and "whether you should one box depends on the larger context it appears in". I'd find it weird to call this an "objection" to one boxing though.

I don't really agree with the emphasis on formalization in (3), but I agree it makes sense to consider specific real-world situations.

(4) is helpful, I wasn't thinking of this situation. (I was thinking about cases in which you have a single agent AGI with a DSA that is tasked with pursuing humanity's values, which then may have to acausally coordinate with other civilizations in the multiverse, since that's the situation in which I've seen surrogate goals proposed.)

I think a weird aspect of the situation in your "evolutionary dynamics" scenario is that the bandits cannot distinguish between caravans that use SPI and ones that don't. I agree that if this is the case then you will often not see an advantage from SPI, because the bandits will need to take a strategy that is robust across whether or not the caravan is using SPI or not (and this is why you get outcompeted by those who use more aggressive AIs). However, I think that in the situation you describe in (4) it is totally possible for you to build a reputation as "that one institution (caravan) that uses SPI", allowing potential "bandits" to threaten you with "Nerf guns", while threatening other institutions with "real guns". In that case, the bandits' best response is to use Nerf guns against you and real guns against other institutions, and you will outcompete the other institutions. (Assuming the bandits still threaten everyone at a close-to-equal rate. You probably don't want to make it too easy for them to threaten you.)

Overall I think if your point is "applying SPI in real-world scenarios is tricky" I am in full agreement, but I wouldn't call this an "objection".

I agree with (1) and (2), in the same way that I would agree that "one boxing will work in some settings and fail to work in others" and "whether you should one box depends on the larger context it appears in". I'd find it weird to call this an "objection" to one boxing though.

I agree that (1)+(2) isn't significant enough to qualify as "an objection". I think that (3)+(4)+(my interpretation of it? or something?) further make me believe something like (2') below. And that seems like an objection to me.

(2') Whether or not it works-as-intended depends on the larger setting and there are many settings -- more than you might initially think -- where SPI will not work-as-intended.

The reason I think (4) is a big deal is that I don't think it relies on you being unable to distinguish the SPI and non-SPI caravans. What exactly do I mean by this? I view the caravans as having two parameters: (A) do they agree to using SPI? and, independently of this, (B) if bandits ambush & threaten them (using SPI or not), do they resist or do they give in? When I talk about incentives to be more aggressive, I meant (B), not (A). That is, in the evolutionary setting, "you" (as the caravan-dispatcher / caravan-parameter-setter) will always want to tell the caravans to use SPI. But if most of the bandits also use SPI, you will want to set the (B) parameter to "always resist".


I would say that (4) relies on the bandits not being able to distinguish whether your caravan uses a different (B)-parameter from the one it would use in a world where nobody invented SPI. But this assumption seems pretty realistic? (At least if humans are involved. I agree this might be less of an issue in the AGI-with-DSA scenario.)

But if most of the bandits also use SPI, you will want to set the (B) parameter to "always resist".

Can you explain this? I'm currently parsing it as "If most of the bandits would demand unarmed conditional on the caravan having committed to treating demand unarmed as they would have treated demand armed, then you want to make the caravan always choose resist", but that seems false so I suspect I have misunderstood you.

That is -- I think* -- a correct way to parse it. But I don't think it is false... uhm, that makes me genuinely confused. Let me try to re-rephrase and see if that uncovers the crux :-).

You are in a world where most (1-ε) of the bandits demand unarmed when paired with a caravan committed to [responding to demand unarmed the same as it responds to demand armed] (and they demand armed against caravans without such commitment). The bandit population (ie, their strategies) either remains the same (for simplicity) or the strategies that led to more profit increase in relative frequency. And you have a committed caravan. If you instruct it to always resist, you get payoff 9(1-ε) - 2ε (using the payoff matrix from "G'; extension of G"). If you instruct it to always give in, you get payoff 4(1-ε) + 3ε. So it is better to instruct it to always resist.
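To spell out the arithmetic, here is a quick sketch. It reads the formulas above as: resist vs. demand unarmed = 9, resist vs. demand armed = -2, give in vs. demand unarmed = 4, give in vs. demand armed = 3 -- this labelling is my reading of the quoted expressions, not something checked against the report's matrix.

```python
# Expected payoffs implied by the formulas above; eps = fraction of bandits demanding armed.
def expected_payoff(action, eps):
    if action == "resist":
        return 9 * (1 - eps) - 2 * eps      # 9 - 11*eps
    return 4 * (1 - eps) + 3 * eps          # 4 - eps

for eps in (0.05, 0.25, 0.5, 0.75):
    print(eps, expected_payoff("resist", eps), expected_payoff("give in", eps))

# "Resist" wins exactly when 9 - 11*eps > 4 - eps, i.e. whenever eps < 0.5;
# so for small eps (most bandits demanding unarmed), always resisting dominates.
```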

*The only issue that comes to mind is my [responding to demand unarmed the same as it responds to demand armed] vs your [treating demand unarmed as they would have treated demand armed]? If you think there is a difference between the two, then I endorse the former and I am confused about what the latter would mean.

Maybe the difference is between "respond" and "treat".

It seems like you are saying that the caravan commits to "[responding to demand unarmed the same as it responds to demand armed]" but doesn't commit to "[and I will choose my probability of resisting based on the equilibrium value in the world where you always demand armed if you demand at all]". In contrast, I include that as part of the commitment.

When I say "[treating demand unarmed the same as it treats demand armed]" I mean "you first determine the correct policy to follow if there was only demand armed, then you follow that policy even if the bandits demand unarmed"; you cannot then later "instruct it to always resist" under this formulation (otherwise you have changed the policy and are not using SPI-as-I'm-thinking-of-it).

I think the bandits should only accept commitments of the second type as a reason to demand unarmed, and not accept commitments of the first type, precisely because with commitments of the first type the bandits will be worse off if they demand unarmed than if they had demanded armed. That's why I'm primarily thinking of the second type.
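A toy way to see the difference in code (hypothetical, reusing the illustrative payoff reading from the sketch above; it ignores the bandits' side and any mixed strategies):

```python
# Caravan's best action when a fraction p_unarmed of demands are unarmed (SPI) demands.
def best_policy(p_unarmed):
    resist  = 9 * p_unarmed - 2 * (1 - p_unarmed)
    give_in = 4 * p_unarmed + 3 * (1 - p_unarmed)
    return "resist" if resist > give_in else "give in"

# First type of commitment: respond to 'demand unarmed' with whatever policy you use for
# 'demand armed' -- but that policy may be re-optimized once SPI is widely adopted.
first_type = best_policy(p_unarmed=0.95)   # -> "resist"

# Second type of commitment: fix the policy to the one you would use in the counterfactual
# world with only armed demands (p_unarmed = 0), and apply it to unarmed demands as well.
second_type = best_policy(p_unarmed=0.0)   # -> "give in"

print(first_type, second_type)
```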

Perfect, that is indeed the difference. I agree with all of what you write here.

In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option -- once SPI is in use -- the counterfactual world where there is only demand armed just seems so different. Wouldn't history need to go very differently? Perhaps it wouldn't even be clear what "you" is in that world?)

But I agree that with DSA-AGIs, the second type of commitment becomes more realistic. (Although the potential line of thinking mentioned by Caspar applies here: perhaps those AGIs will come up with SPI-or-something on their own, so there is less value in thinking about this type of SPI now.)

Yeah, I agree with all of that.

I think humans actually do use SPI pretty frequently, if I understand correctly. Some examples:

  1. Pre-committing to resolving disputes through arbitration instead of the normal judicial process. In theory at least, this results in an isomorphic "game", but with lower legal costs, thereby constituting a Pareto improvement.
  2. Ritualized aggression: Directly analogous to the Nerf gun example. E.g. a bear will "commit" to giving up its territory to another bear who can roar louder, without the two of them actually fighting, which would be costly for both parties.
    1. This example is maybe especially interesting because it implies that SPIs are (easily?) discoverable by "blind optimization" processes like evolution, and don't need some sort of intelligence or social context.
    2. And it also gives an example of your point about not needing to know that the other party has committed to the SPI - bears presumably don't have the ability to credibly commit, but the SPI still (usually) goes through.
       
Humans and human institutions can't easily make credible commitments.

That seems right. (Perhaps with the exception of legal contracts, unless one of the parties is powerful enough to make the contract difficult to enforce.) And even when individual people in an institution have powerful commitment mechanisms, this is not the same as the institution being able to credibly commit. For example, suppose you have a head of state who threatens a suicidal war unless X happens, and they are stubborn enough to follow through on it. Then if X happens, you might get a coup instead, thus avoiding the war.

the tl;dr version of the full report:

The surrogate goals (SG) idea proposes that an agent might adopt a new seemingly meaningless goal (such as preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm or really hating being shot by a water gun) to prevent the realization of threats against some goals they actually value (such as staying alive) [TB1, TB2]. If they can commit to treating threats to this goal as seriously as threats to their actual goals, the hope is that the new goal gets threatened instead. In particular, the purpose of this proposal is not to become more resistant to threats. Rather, we hope that if the agent and the threatener misjudge each other (underestimating the commitment to ignore/carry out the threat), the outcome (Ignore threat, Carry out threat) will be replaced by something harmless.

Safe Pareto improvements (SPIs) [OC] is a formalization and a generalization of this idea. In the straightforward interpretation, the approach applies to a situation where an agent delegates a problem to their representative --- but we could also imagine “delegating” to a future self-modified version of oneself. The idea is to give instructions like “if other agents you encounter also have this same instruction, replace ‘real’ conflicts with them with ‘mock’ conflicts, in which you all act like you would in a real conflict, but bad outcomes are replaced by less harmful variants”. Some potential examples would be fighting (should a fight arise) to the first blood instead of to the death or threatening diplomatic conflict instead of a military one. In the SG setting, SPI could correspond to a joint commitment to only use water-gun threats while behaving as if the water gun was real.

First, I have one “methodological” observation: If we are to understand whether these approaches “work”, it is not sufficient to look just at the threat situation itself. Instead, we need to embed it in a larger context. (Is the interaction one-time-only, or will there be repetitions? Did the agents have a chance to update on the fact that SG/SPI are being used? What do the agents know about each other?) In the report, I show that different settings can lead to (very) different outcomes.

Incidentally, I think this observation applies to most technical research. We should specify the larger context, at least informally, such that we can answer questions like “but will it help?” and “wouldn’t this other approach be even better?”.

Second, SG and SPI have one limitation that seems hard to overcome. Both SG and SPI hope to make agents treat conflict as seriously as before while in fact making conflict outcomes less costly. However, if SG/SPI starts "being a thing", agents will have incentives to treat conflict as less costly. (EG, ignoring threats they might have taken seriously before or making threats they wouldn't dare to make otherwise.) The report gives an example of an "evolutionary" setting that illustrates this well, imo.

Finally: (1) I believe that the hypothesis that “we ‘only’ need to work out the details of SG/SPI, and then we will avoid the realization of most threats” is false. (2) Instead, I would expect that the approach “adds up to normality” --- that with SG/SPI, we can do mostly the same things that we could do with “just” the ability to make contracts and be somewhat transparent to each other. (3) However, I still think that studying the approach is useful --- it is a way of formalizing things, it will work in some settings, and even when it doesn’t, we might get ideas for which other things could work instead.