Formalizing Objections against Surrogate Goals

Not very important, but: Despite having spent a lot of time on formalizing SPIs, I have some sympathy for a view like the following:

> Yeah, surrogate goals / SPIs are great. But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI. If we do this, then AI will implement SPIs (or something even better) regardless of how well we understand them. And if we don't solve these issues, then it's hopeless to add SPIs manually. Furthermore, believing that surrogate goals / SPIs work (or, rather, make a big difference for bargaining outcomes) shouldn't change our behavior much (for the reasons discussed in Vojta's post).

On this view, it doesn't help substantially to understand / analyze SPIs formally.

But I think there are sufficiently many gaps in this argument to make the analysis worthwhile. For example, I think it's plausible that the effective use of SPIs hinges on subtle aspects of the design of an agent that we might not think much about if we don't understand SPIs sufficiently well.

Formalizing Objections against Surrogate Goals

Great to see more work on surrogate goals/SPIs!

>Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.

I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you claiming that B) they don't fundamentally change game-theoretic analysis (of any scenario/most scenarios)? Or C) are you saying they are irrelevant for AI v. AI interactions? Or D) that the invention of SPIs will not revolutionize human society, make peace in the Middle East, ...) Some of the versions seem clearly false to me. (E.g., re C, even if you think that the requirements for the use of SPIs are rarely satisfied in practice, it's still easy to construct simple, somewhat plausible scenarios / assumptions (see our paper) under which SPIs do seem to matter substantially for game-theoretic analysis.) Some just aren't justified at all in your post. (E.g., re A1, you're saying that (like myself) you find this all confusing and hard to say.) And some are probably not contrary to what anyone else believes about surrogate goals / SPIs. (E.g., I don't know anyone who makes particularly broad or grandiose claims about the use of SPIs by humans.)

My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X. Here are three instances of this (I think these are the only three), the first one being most significant.

The main objection of the post is that while adopting an SPI, the original players must keep a bunch of things (at least approximately) constant(/analogous to the no-SPI counterfactual) even when they have an incentive to change that thing, and they need to do this credibly (or, rather, make it credible that they aren't making any changes). You argue that this is often unrealistic. Well, my initial reaction was: "Sure, I know these things!" (Relatedly: while I like the bandit v caravan example, this point can also be illustrated with any of the existing examples of SPIs and surrogate goals.) I also don't think the assumption is that unrealistic. It seems that one substantial part of your complaint is that besides instructing the representative/self-modifying, the original player/principal can do other things about the threat (like advocating a ban on real or water guns). I agree that this is important. If in 20 years I instruct an AI to manage my resources, it would be problematic if in the meantime I make tons of decisions (e.g., about how to train my AI systems) differently based on my knowledge that I will use surrogate goals anyway. But it's easy to come up with scenarios where this is not a problem. E.g., when an agent considers immediate self-modification, *all* her future decisions will be guided by the modified utility function. Or when the SPI is applied to some isolated interaction. When everything is in the representative's hands, we only need to ensure that the *representative* always acts in the same way it would act in a world where SPIs aren't a thing.

And I don't think it's that difficult to come up with situations in which the latter thing can be comfortably achieved. Here is one scenario. Imagine the two of us play a particular game G with SPI G'. The way in which we play this is that we both send a lawyer to a meeting and then the lawyers play the game in some way. Then we could mutually commit (by contract) to pay our lawyers in proportion to the utilities they obtain in G' (and to not make any additional payments to them). The lawyers at this point may know exactly what's going on (that we don't really care about water guns, and so on) -- but they are still incentivized to play the SPI game G' to the best of their ability. You might even beg your lawyer to never give in (or the like), but the lawyer is incentivized to ignore such pleas. (Obviously, there could still be various complications. If you hire the lawyer only for this specific interaction and you know how aggressive/hawkish different lawyers are (in terms of how they negotiate), you might be inclined to hire a more aggressive one with the SPI. But you might hire the lawyer you usually hire. And in practice I doubt that it'd be easy to figure out how hawkish different lawyers are.)

Overall I'd have appreciated more detailed discussion of when this is realistic (or of why you think it rarely is realistic). I don't remember Tobi's posts very well, but our paper definitely doesn't spend much space on discussing these important questions.

On SPI selection, I think the point from Section 10 of our paper is quite important, especially in the kinds of games that inspired the creation of surrogate goals in the first place. I agree that in some games, the SPI selection problem is no easier than the equilibrium selection problem in the base game. But there are games where it does fundamentally change things because *any* SPI that cannot further be Pareto-improved upon drastically increases your utility from one of the outcomes.

Re the "Bargaining in SPI" section: For one, the proposal in Section 9 of our paper can still be used to eliminate the zeroes!

Also, the "Bargaining in SPI" and "SPI Selection" sections to me don't really seem like "objections". They are limitations. (In a similar way as "the smallpox vaccine doesn't cure cancer" is useful info but not an objection to the smallpox vaccine.)

Can you control the past?

Nice post! As you can probably imagine, I agree with most of the stuff here.

>VII. Identity crises are no defense of CDT

On 1 and 2: This is true, but I'm not sure one-boxing / EDT alone solves this problem. I haven't thought much about selfish agents in general, though.

Random references that might be of interest:

>V. Monopoly money

As far as I can tell, this kind of point was first made on p. 108 here:

Gardner, Martin (1973). “Free will revisited, with a mind-bending prediction paradox by William Newcomb”. In: Scientific American 229.1, pp. 104–109.


>the so-called “Tickle Defense” of EDT.

I have my own introduction to the tickle defense, aimed more at people in this community than at philosophers:

>Finally, consider a version of Newcomb’s problem in which both boxes are transparent

There's a lot of discussion of this in the philosophical literature. From what I can tell, the case was first proposed in Sect. 10 of:

Gibbard, Allan and William L. Harper (1981). “Counterfactuals and Two Kinds of Expected Utility”. In: Ifs. Conditionals, Belief, Decision, Chance and Time. Ed. by William L. Harper, Robert Stalnaker, and Glenn Pearce. Vol. 15. The University of Western Ontario Series in Philosophy of Science. A Series of Books in Philosophy of Science, Methodology, Epistemology, Logic, History of Science, and Related Fields. Springer, pp. 153–190. doi: 10.1007/978-94-009-9117-0_8

>There is a certain broad class of decision theories, a number of which are associated with the Machine Intelligence Research Institute (MIRI), that put resolving this type of inconsistency in favor of something like “the policy you would’ve wanted to adopt” at center stage.

Another academic, early discussion of updatelessness is:

Gauthier, David (1989). “In the Neighbourhood of the Newcomb-Predictor (Reflections on Rationality)”. In: Proceedings of the Aristotelian Society, New Series, 1988–1989. Vol. 89, pp. 179–194.

Extracting Money from Causal Decision Theorists

Sorry for taking some time to reply!

>You might wonder why am I spouting a bunch of wrong things in an unsuccessful attempt to attack your paper.

Nah, I'm a frequent spouter of wrong things myself, so I'm not too surprised when other people make errors, especially when the stakes are low, etc.

Re 1,2: I guess a lot of this comes down to convention. People have found that one can productively discuss these things without always giving the formal models (in part because people in the field know how to translate everything into formal models). That said, if you want mathematical models of CDT and Newcomb-like decision problems, you can check the Savage or Jeffrey-Bolker formalizations. See, for example, the first few chapters of Arif Ahmed's book, "Evidence, Decision and Causality". Similarly, people in decision theory (and game theory) usually don't specify what is common knowledge, because usually it is assumed (implicitly) that the entire problem description is common knowledge / known to the agent (Buyer). (Since this is decision and not game theory, it's not quite clear what "common knowledge" means. But presumably to achieve 75% accuracy on the prediction, the seller needs to know that the buyer understands the problem...)

3: Yeah, *there exist* agent models under which everything becomes inconsistent, though IMO this just shows these agent models to be unimplementable. For example, take the problem description from my previous reply (where Seller just runs an exact copy of Buyer's source code). Now assume that Buyer knows his source code and is logically omniscient. Then Buyer knows what his source code chooses and therefore knows the option that Seller is 75% likely to predict. So he will take the other option. But of course, this is a contradiction. As you'll know, this is a pretty typical logical paradox of self-reference. But to me it just says that this logical omniscience assumption about the buyer is implausible and that we should consider agents who aren't logically omniscient. Fortunately, CDT doesn't assume knowledge of its own source code and such.
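
A toy sketch of this self-reference problem, under simplifications that are not part of the original setup: the Seller's prediction is treated as perfectly accurate instead of 75% accurate, and all function names are hypothetical. The "omniscient" Buyer tries to learn the prediction by simulating the Seller, who in turn runs the Buyer's code, so the computation never bottoms out.

```python
def other(box: int) -> int:
    """Return the box that was not named (boxes are numbered 1 and 2)."""
    return 3 - box


def seller_predicts(buyer) -> int:
    # Assumption: Seller predicts by running an exact copy of Buyer's code.
    return buyer()


def omniscient_buyer() -> int:
    # Assumption: this Buyer "knows" the prediction by simulating Seller,
    # who simulates Buyer, who simulates Seller, and so on.
    predicted = seller_predicts(omniscient_buyer)
    # Having learned the predicted box, Buyer picks the other one.
    return other(predicted)


def regress_never_bottoms_out() -> bool:
    """Return True if the mutual simulation fails to terminate."""
    try:
        omniscient_buyer()
    except RecursionError:
        return True
    return False
```

In this sketch the failure shows up as a `RecursionError` and no consistent choice is ever returned, which is one way of seeing why such an agent model looks unimplementable.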

Perhaps one thing to help sell the plausibility of this working: For the purpose of the paper, the assumption that Buyer uses CDT in this scenario is pretty weak, formally simple and doesn't have much to do with logic. It just says that the Buyer assigns some probability distribution over box states (i.e., some distribution over the mutually exclusive and collectively exhaustive s1="money only in box 1", s2= "money only in box 2", s3="money in both boxes"); and that given such distribution, Buyer takes an action that maximizes (causal) expected utility. So you could forget agents for a second and just prove the formal claim that for all probability distributions over three states s1, s2, s3, it is for i=1 or i=2 (or both) the case that
(P(si)+P(s3))*$3 - $1 > 0.
I assume you don't find this strange/risky in terms of contradictions, but mathematically speaking, nothing more is really going on in the basic scenario.
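
For what it's worth, the formal claim can be checked mechanically. The following is a minimal sketch (function names are hypothetical) that evaluates the inequality on a grid of distributions using exact fractions. The reason it holds: the two expected utilities sum to 3*(1+P(s3)) - 2, which is at least 1, so the larger of the two is at least 1/2.

```python
from fractions import Fraction
from itertools import product

PRICE = 1  # dollars paid for a box, as in the inequality above
PRIZE = 3  # dollars inside a filled box, as in the inequality above


def causal_eu(p_s1, p_s2, p_s3):
    """Causal expected utility of buying box 1 and of buying box 2."""
    eu_box1 = (p_s1 + p_s3) * PRIZE - PRICE
    eu_box2 = (p_s2 + p_s3) * PRIZE - PRICE
    return eu_box1, eu_box2


def claim_holds_on_grid(steps: int = 50) -> bool:
    """Check the claim for every distribution with probabilities k/steps."""
    for a, b in product(range(steps + 1), repeat=2):
        if a + b > steps:
            continue  # not a valid distribution
        p1, p2 = Fraction(a, steps), Fraction(b, steps)
        p3 = 1 - p1 - p2
        eu1, eu2 = causal_eu(p1, p2, p3)
        # eu1 + eu2 = 3 * (1 + p3) - 2 >= 1, so the maximum is >= 1/2 > 0.
        if max(eu1, eu2) < Fraction(1, 2):
            return False
    return True
```

The grid check is only an illustration; the one-line argument in the code comment is what actually covers all distributions.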

The idea is that everyone agrees (hopefully) that orthodox CDT satisfies the assumption. (I.e., assigns some unconditional distribution, etc.) Of course, many CDTers would claim that CDT satisfies some *additional* assumptions, such as the probabilities being calibrated or "correct" in some other sense. But of course, if "A=>B", then "A and C => B". So adding assumptions cannot help the CDTer avoid the loss of money conclusion if they also accept the more basic assumptions. Of course, *some* added assumptions lead to contradictions. But that just means that they cannot be satisfied in the circumstances of this scenario if the more basic assumption is satisfied and if the premises of the Adversarial Offer hold. So they would have to either adopt some non-orthodox CDT that doesn't satisfy the basic assumption or require that their agents cannot be copied/predicted. (Both of which I also discuss in the paper.)

>you assume that Buyer knows the probabilities that Seller assigned to Buyer's actions.

No, if this were the case, then I think you would indeed get contradictions, as you outline. So Buyer does *not* know what Seller's prediction is. (He only knows her prediction is 75% accurate.) If Buyer uses CDT, then of course he assigns some (unconditional) probabilities to what the predictions are, but of course the problem description implies that these predictions aren't particularly good. (Like: if he assigns 90% to the money in box 1, then it immediately follows that *no* money is in box 1.)
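
To make the gap between Buyer's beliefs and what actually happens concrete, here is a small sketch. It assumes (my reading of the setup, stated here as an assumption) that Seller puts the $3 only into boxes she predicts Buyer will not buy, so the box Buyer actually buys is filled only when the prediction was wrong. The function name is hypothetical.

```python
from fractions import Fraction


def actual_expected_payoff(accuracy, prize=3, price=1):
    """Expected payoff of buying a box, given how often Seller predicts right.

    Assumption: the purchased box contains the prize only when Seller
    mispredicted, which happens with probability 1 - accuracy.
    """
    return (1 - accuracy) * prize - price


# With 75% accuracy, buying a box loses a quarter dollar in expectation.
loss_at_75 = actual_expected_payoff(Fraction(3, 4))
# Under these numbers the purchase only breaks even at 2/3 accuracy.
break_even = actual_expected_payoff(Fraction(2, 3))
```

So whatever unconditional distribution Buyer starts with, one of the boxes looks profitable to him, while under the stated accuracy the purchase has negative expected value.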

How to formalize predictors

As I mentioned elsewhere, I don't really understand...

>I think (1) is a poor formalization, because the game tree becomes unreasonably huge

What game tree? Why represent these decision problems as any kind of trees or game trees in particular? At least some problems of this type can be represented efficiently, using various methods to represent functions on the unit simplex (including decision trees)... Also: Is this decision-theoretically relevant? That is, are you saying, a good decision theory doesn't have to deal with 1 because it is cumbersome to write out (some) problems of this type? But *why* is this decision-theoretically relevant?

>some strategies of the predictor (like "fill the box unless the probability of two-boxing is exactly 1") leave no optimal strategy for the player.

Well, there are less radical ways of addressing this. E.g., expected utility-type theories just assign a preference order to the set of available actions. We could be content with that and accept that in some cases, there is no optimal action. As long as our decision theory ranks the available options in the right order... Or we could restrict attention to problems where an optimal strategy exists despite this dependence.
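
A quick sketch of why that predictor strategy leaves no optimal choice while still allowing a sensible ranking. The payoff numbers are the usual Newcomb values and are an assumption here, as is the function name: the expected payoff strictly increases in the probability of two-boxing as long as that probability is below 1, and then collapses at exactly 1.

```python
from fractions import Fraction

BIG = 1_000_000  # assumed content of the opaque box when filled
SMALL = 1_000    # assumed content of the transparent box


def expected_payoff(p_two_box):
    """Expected payoff when two-boxing with probability p_two_box.

    The predictor fills the opaque box unless p_two_box is exactly 1.
    """
    opaque = BIG if p_two_box != 1 else 0
    one_box_payoff = opaque          # take only the opaque box
    two_box_payoff = opaque + SMALL  # take both boxes
    return (1 - p_two_box) * one_box_payoff + p_two_box * two_box_payoff


# Payoffs keep rising as p approaches 1 from below, so no p < 1 is optimal.
probabilities = [Fraction(k, 1000) for k in (0, 500, 900, 990, 999)]
payoffs = [expected_payoff(p) for p in probabilities]
```

A theory that only ranks options still says something useful here: every p below 1 beats p = 1, and higher p beats lower p within that range, even though the supremum is never attained.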

>And (3) seems like a poor formalization because it makes the predictor work too hard. Now it must predict all possible sources of randomness you might use, not just your internal decision-making.

For this reason, I always assume that predictors in my Newcomb-like problems are compensated appropriately and don't work on weekends! Seriously, though: what does "too hard" mean here? Is this just the point that it is in practice easy to construct agents that cannot be realistically predicted in this way when they don't want to be predicted? If so: I find that at least somewhat convincing, though I'd still be interested in developing theory that doesn't hinge on this ability.

Extracting Money from Causal Decision Theorists

On the more philosophical points. My position is perhaps similar to Daniel K's. But anyway...

Of course, I agree that problems that punish the agent for using a particular theory (or using float multiplication or feeling a little wistful or stuff like that) are "unfair"/"don't lead to interesting theory". (Perhaps more precisely, I don't think our theory needs to give algorithms that perform optimally in such problems in the way I want my theory to "perform optimally" in Newcomb's problem. Maybe we should still expect our theory to say something about them, in the way that causal decision theorists feel like CDT has interesting/important/correct things to say about Newcomb's problem, despite Newcomb's problem being designed to (unfairly, as they allege) reward non-CDT agents.)

But I don't think these are particularly similar to problems with predictions of the agent's distribution over actions. The distribution over actions is behavioral, whereas performing floating point operations or whatever is not. When randomization is allowed, the subject of your choice is which distribution over actions you play. So to me, which distribution over actions you choose in a problem with randomization allowed, is just like the question of which action you take when randomization is not allowed. (Of course, if you randomize to determine which action's expected utility to calculate first, but this doesn't affect what you do in the end, then I'm fine with not allowing this to affect your utility, because it isn't behavioral.)

I also don't think this leads to uninteresting decision theory. But I don't know how to argue for this here, other than by saying that CDT, EDT, UDT, etc. don't really care whether they choose from/rank a set of distributions or a set of three discrete actions. I think ratificationism-type concepts are the only ones that break when allowing discontinuous dependence on the chosen distribution and I don't find these very plausible anyway.

To be honest, I don't understand the arguments against predicting distributions and predicting actions that you give in that post. I'll write a comment on this to that post.

Extracting Money from Causal Decision Theorists

Let's start with the technical question:

>Can your argument be extended to this case?

No, I don't think so. Take the following class of problems: The agent can pick any distribution over actions. The final payoff is determined only as a function of the implemented action and some finite number of samples generated by Omega from that distribution. Note that the expected payoff is then continuous in the distribution chosen. It can therefore be shown (using e.g. Kakutani's fixed-point theorem) that there is always at least one ratifiable distribution. See Theorem 3 at .

(Note that the above is assuming the agent maximizes expected vNM utility. If, e.g., the agent maximizes some lexical utility function, then the predictor can just take, say, two samples and if they differ use a punishment that is of a higher lexicality than the other rewards in the problem.)

Extracting Money from Causal Decision Theorists

>Excellent - we should ask THEM about it.

Yes, that's the plan.

Some papers that express support for CDT:

In case you just want to know why I believe support for CDT/two-boxing to be widespread among academic philosophers, see , which is a survey of academic philosophers, where more people preferred two-boxing than one-boxing in Newcomb's problem, especially among philosophers with relevant expertise. Some philosophers have filled out this survey publicly, so you can e.g. go to , click on a name and then on "My Philosophical Views" to find individuals who endorse two-boxing. (I think there's also a way to download the raw data and thereby get a list of two-boxers.)

Extracting Money from Causal Decision Theorists

Note that while people on this forum mostly reject orthodox, two-boxing CDT, many academic philosophers favor CDT. I doubt that they would view this problem as out of CDT's scope, since it's pretty similar to Newcomb's problem.

>How does this CDT agent reconcile a belief that the seller's prediction likelihood is different from the buyer's success likelihood?

Good question!

Extracting Money from Causal Decision Theorists

I agree with both of Daniel Kokotajlo's points (both of which we also make in the paper in Sections IV.1 and IV.2): Certainly for humans it's normal to not be able to randomize; and even if it were a primarily hypothetical situation without any obvious practical application, I'd still be interested in knowing how to deal with the absence of the ability to randomize.

Besides, as noted in my other comment, insisting on the ability to randomize doesn't get you that far (cf. Sections IV.1 and IV.4 on Ratificationism): even if you always have access to some nuclear decay noise channel, your choice of whether to consult that channel (or of whether to factor the noise into your decision) is still deterministic. So you can set up scenarios where you are punished for randomizing. In the particular case of the Adversarial Offer, the seller might remove all money from both boxes if she predicts the buyer to randomize.

The reason why our main scenario just assumes that randomization isn't possible is that our target of attack in this paper is primarily CDT, which is fine with not being allowed to randomize.
