Thanks for writing this. I think this is a lot clearer and more accessible than most write-ups on this topic and seems valuable.
I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:
I don't see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is "Lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that succeed vs. minor variations of non-lethal plans, therefore the former will be over-represented in the space of successful plans". But this doesn't seem to go through a priori. You get this "there are way more variations" phenomenon whenever the outcome is overdetermined by a plan, but that doesn't automatically make the plan more likely under a simplicity prior unless its extra complexity is small enough not to outweigh the gain in multiplicity. In this case, a fully fleshed-out plan which goes all-in on IC and takes over the world might easily be more complex than a simpler plan, in which case why do we assume the IC plans dominate?
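To make the multiplicity-vs-complexity trade-off concrete, here's a toy sketch (all numbers are made up purely for illustration): weight each plan variation by a simplicity prior of 2^-K and ask how much posterior mass each family of plans gets once we condition on success.

```python
# Toy model with made-up numbers: plan variations weighted by a simplicity
# prior 2^-K, conditioned on success. A family's posterior mass is
# (number of successful variations) * (prior weight per variation).

def family_mass(n_successful_variations, complexity_bits):
    return n_successful_variations * 2 ** (-complexity_bits)

simple_plans = family_mass(n_successful_variations=10, complexity_bits=50)
lethal_ic_plans = family_mass(n_successful_variations=10_000, complexity_bits=70)

total = simple_plans + lethal_ic_plans
print(f"P(simple | success)    ~ {simple_plans / total:.3f}")
print(f"P(lethal IC | success) ~ {lethal_ic_plans / total:.3f}")
# With these numbers the simple family dominates (10 * 2^-50 >> 10^4 * 2^-70);
# shrink the complexity gap or blow up the variation count and it flips.
# Which way it goes depends on exactly the quantities we don't know a priori.
```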
I don't think weighting by plan-complexity necessarily prioritises IC/lethal plans unless you also weight by something like "probability of plan success relative to a prior", in which case, sure, your distribution will upweight plans that just take over everything. But even so, maybe simpler, non-lethal plans are likely enough to succeed that they still come out in front. It feels like what you're implicitly doing is assuming the AI will be trying to maximise the probability of WBE, but why would it do this? This seems to be where all the danger is coming from, really. If it instead does something more like "search through plans, pick the first one that seems 'good enough'", then the question of whether it selects a dangerous plan is a purely empirical one about what its own inductive biases are, and it seems odd to be so a priori confident about the danger here.
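Relatedly, here's what I mean by the "first one that seems good enough" alternative, as a hypothetical toy search (the plan space and numbers are invented for illustration): a satisficer's output depends on the order its search generates candidates in, i.e. on its inductive biases, not on which plan maximises P(success).

```python
# Toy contrast between a maximiser and a satisficer over an invented plan space.

plans = [
    {"name": "simple, non-lethal plan", "p_success": 0.80},
    {"name": "elaborate IC / takeover plan", "p_success": 0.99},
]

def maximiser(plans):
    # Always returns the plan with the highest success probability.
    return max(plans, key=lambda p: p["p_success"])

def satisficer(plan_stream, threshold=0.75):
    # Returns the first generated plan that clears the threshold, so the answer
    # is a fact about the generator's ordering (its inductive biases).
    for plan in plan_stream:
        if plan["p_success"] >= threshold:
            return plan

print(maximiser(plans)["name"])         # the takeover plan, by construction
print(satisficer(iter(plans))["name"])  # whichever "good enough" plan comes first
```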
Want to bump this because it seems important: how do you see the agent in the post as being dominated?
Yeah, I agree that even if they fall short of normative constraints there's some empirical content around what happens in adversarial environments. I have doubts about how much this stuff translates to thinking about AGIs, though: there's an obvious story for how an adversarial environment selected for (partial) coherence in us, but I don't see the same kinds of selection pressures acting on AGIs. Unless you assume that they'll want to modify themselves in anticipation of adversarial environments, which kinda begs the question.
Kind of tangential, but I'd be interested in your take on how strongly money-pumping etc. actually argues against full-on cyclical preferences. One way to think about why getting money-pumped is bad is that you have an additional preference not to pay money to go nowhere. But it feels like all this tells us is that "something has to go", and if an agent is rationally permitted to modify its own preferences to avoid these situations then it seems a priori acceptable for it to instead just say something like "well actually I weight my cyclical preferences more highly, so I'll modify the preference against arbitrarily paying".
In other words, it feels like the money-pumping arguments presume this other preference that in a sense takes "precedence" over the cyclical ones, and I'm still not sure how to think about that.
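For concreteness, here's the standard pump as I understand it, as a toy simulation (the option names and fee are made up): with cyclic strict preferences A ≻ B ≻ C ≻ A, an agent that pays a small fee for any trade to something it strictly prefers finishes one lap around the cycle holding what it started with, minus three fees. But calling this "shooting yourself in the foot" already presumes the agent disprefers ending up poorer with the same item, which is exactly the extra preference I'm asking about.

```python
# Toy money pump: cyclic strict preferences A > B > C > A.
# The agent pays a small fee for any trade to something it strictly prefers.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means x is strictly preferred to y

def accepts_trade(offered, held):
    return (offered, held) in prefers

holding, money, fee = "C", 0.0, 1.0
for offered in ["B", "A", "C"]:        # one lap around the cycle
    if accepts_trade(offered, holding):
        holding, money = offered, money - fee

print(holding, money)   # back to "C", but 3.0 poorer
```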
This seems totally different from the point OP is making, which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers), whilst not "predictably shooting themselves in the foot" as you claim must follow from this.
I agree the framing of "there are no coherence theorems" is a bit needlessly strong/provocative, but I'm unclear what your actual objection is here: are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible?
Great post. A lot of the discussion around the role of coherence arguments, and what we should expect a super-intelligent agent to behave like, is really sloppy, and I think this distinction between "coherence theorems as a self-contained mathematical result" and "coherence arguments as a normative claim about what an agent must be like on pain of shooting itself in the foot" is an important one.
The example of how an incomplete agent avoids getting Dutch-booked also looks very naturally like how real agents behave, imo. One way of thinking about this is that these lotteries are a lot more "high-dimensional" than they initially look: e.g. the decision at node 2 isn't between "B and C" but between "B and C given I just chose B in a choice between B and A, and this guy is trying to rip me off". In general, the path-dependence of our bets, and our meta-preferences over how our preferences are engaged with by other agents, are also legitimate reasons to expect things like Dutch-booking to have less normative force for actual agents IRL. Of course, in a way this is maybe just making you VNM-rational after all, albeit with a super weird and garbled utility function, but that's a whole other problem with coherence arguments.
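Here's a minimal sketch of that "high-dimensional" point, using toy options A, A- and B, where A is strictly preferred to A- and there's a preferential gap between B and each of A/A- (all hypothetical, just to show the shape of the argument): a myopic incomplete agent that only looks at the current pair can be walked from A to B to A-, but a history-aware rule like "never trade into something strictly worse than an option you could have kept earlier" blocks the pump while remaining incomplete.

```python
# Toy single-souring pump with hypothetical options: the only strict
# preference is A > A-; B has a preferential gap with both A and A-.

strictly_prefers = {("A", "A-")}

def gap(x, y):
    return (x, y) not in strictly_prefers and (y, x) not in strictly_prefers

def myopic_accepts(offered, held, history):
    # Accepts any trade that isn't to something strictly dispreferred right now.
    return gap(offered, held) or (offered, held) in strictly_prefers

def history_aware_accepts(offered, held, history):
    # Same, but the choice is "B given I could have kept A", not just "B":
    # refuse anything strictly worse than an option held at an earlier node.
    if any((kept, offered) in strictly_prefers for kept in history):
        return False
    return myopic_accepts(offered, held, history)

def run(policy):
    held, history = "A", []
    for offered in ["B", "A-"]:        # the two trade offers in the pump
        if policy(offered, held, history):
            history.append(held)
            held = offered
    return held

print(run(myopic_accepts))          # ends with "A-": strictly worse than its initial A
print(run(history_aware_accepts))   # ends with "B": never pushed into a dominated outcome
```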
Ngl, kinda confused about how these points imply the post is wrong. The bulk of this seems to be (1) a semantic quibble + (2) a disagreement about who has the burden of proof when it comes to arguing for the plausibility of coherence + (3) maybe just a misunderstanding of the point that's being made?
(1) I agree the title is a bit needlessly provocative, and in one sense of course VNM/Savage etc. count as coherence theorems. But the point is that there is another sense in which people use "coherence theorem/argument" in this field, corresponding to something like "if you're not behaving like an EV-maximiser you're shooting yourself in the foot by your own lights", which is what brings in all the scary normativity and is what the OP is saying doesn't follow from any existing theorem unless you make a bunch of other assumptions.
(2) The only real substantive objection to the content here seems to be "IMO completeness seems quite reasonable to me". Why? Having complete preferences looks like a pretty narrow target within the space of all partial orders you could have as your preference relation (see the toy count at the bottom of this comment), so what's the reason we should expect minds to steer towards it? Do humans have complete preferences?
(3) In some other comments you're saying that this post is straw-manning some extreme position because people who use coherence arguments already accept you could have e.g.
> an extremely powerful AI that is VNM rational in all situations except for one tiny thing that does not matter or will never come up
This seems to be entirely missing the point/confused. OP isn't saying that agents can realistically get away with not being VNM-rational because their inconsistencies/incompletenesses aren't efficiently exploitable; they're saying that you can have agents that aren't VNM-rational and aren't exploitable in principle. Your example is an agent that could in theory be money-pumped by another sufficiently powerful agent able to steer the world to where its corner-case weirdness came out; the point being made about incompleteness here is that you can have a non-VNM-rational agent that's un-Dutch-bookable not just as a matter of empirical reality but in principle. The former still gets you claims like "a sufficiently smart agent will appear VNM-rational to you; they can't have any obvious public-facing failings", whereas the latter undermines this.
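(Toy count for point (2), brute-forced over a 4-element set just to give a sense of scale; the choice of 4 elements is arbitrary: of the 219 partial orders, only 24 are complete.)

```python
from itertools import product

# Brute-force count of binary relations on a small labelled set that are
# partial orders (reflexive, antisymmetric, transitive), and of those, how
# many are additionally complete (i.e. total orders).

def is_partial_order(rel, elems):
    reflexive = all((x, x) in rel for x in elems)
    antisymmetric = all(not ((x, y) in rel and (y, x) in rel)
                        for x in elems for y in elems if x != y)
    transitive = all((x, z) in rel
                     for x in elems for y in elems for z in elems
                     if (x, y) in rel and (y, z) in rel)
    return reflexive and antisymmetric and transitive

def is_total_order(rel, elems):
    return is_partial_order(rel, elems) and all(
        (x, y) in rel or (y, x) in rel for x in elems for y in elems
    )

elems = [0, 1, 2, 3]
pairs = [(x, y) for x in elems for y in elems]
n_partial = n_total = 0
for bits in product([0, 1], repeat=len(pairs)):
    rel = {p for p, b in zip(pairs, bits) if b}
    if is_partial_order(rel, elems):
        n_partial += 1
        n_total += is_total_order(rel, elems)

print(n_partial, n_total)   # 219 partial orders, only 24 of them complete
```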
Ah, I hadn't realised Caspar wrote that, thanks for the link! I agree that seems to be getting at the same idea, and it's kind of separable from the multi-agent point.
I think (1b) doesn't go through. The "starting data" we have from (1a) is that the AGI has some preferences over lotteries that it competently acts on: acyclicality seems likely, but we don't get completeness or transitivity for free, so we can't assume its preferences will be representable as maximising some utility function. (I suppose we also have the constraint that its preferences look "locally" good to us given training.) But if this is all we have, it doesn't follow that the agent will have some coherent goal it'd want optimisers optimising towards.
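One way to make the "no completeness, no utility representation" step explicit, as a toy check (options A and B are hypothetical, and I'm reading "represented by u" as "x weakly preferred to y iff u(x) ≥ u(y)"): a single real-valued utility function always ranks or ties every pair, so a genuine preferential gap between A and B can't be reproduced by any u.

```python
# Toy check: an agent with a preferential gap between A and B (it weakly
# prefers neither to the other). A utility representation requires
# "x weakly preferred to y iff u(x) >= u(y)"; since some u(x) >= u(y) always
# holds for real numbers, no single u can reproduce the gap.

weakly_prefers = {("A", "A"), ("B", "B")}   # reflexive, but a gap between A and B

def represented_by(u):
    options = ["A", "B"]
    return all(
        ((x, y) in weakly_prefers) == (u[x] >= u[y])
        for x in options for y in options
    )

candidates = [{"A": a, "B": b} for a in range(-5, 6) for b in range(-5, 6)]
print(any(represented_by(u) for u in candidates))   # False: no candidate u works
```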
An AGI doesn't have to be an EU-maximiser to be scary: it could have e.g. incomplete preferences but still prefer B to A where we really, really prefer A to B. But I think assuming an AI will look like an EU-maximiser does a lot of the heavy lifting in guaranteeing the AGI will be lethal, since otherwise we can't a priori predict that it'll want to optimise along any dimension particularly hard.