I understand it’s a proposition like any other, but I don’t see why an agent would reflect on it or use it in their deliberation to decide what to do. The fact that they’re a CDT agent is a fact about how they will act in the decision, not a fact that they need to use in their deliberation.
This is analogous to preferences: whether or not an agent prefers A to B is a proposition like any other, but I don’t think it’s natural to model them as first consulting the credences they have assigned to “I prefer A to B” etc. Rather, they will just choose A ex hypothesi, because that’s what having the preference means.
Why would they be uncertain about whether they’re a CDT agent? Being a CDT agent surely just means, by definition, that they evaluate decisions based on causal outcomes. It feels confused to say that they have to be uncertain about, or reflect on, which decision theory they have and then apply it, rather than their being a CDT agent being an ex hypothesi fact about how they behave.
Why not? Is it common for NDAs/non-disparagement agreements to also have a clause stating the parties aren’t allowed to tell anyone about it? I’ve never heard of this outside of super-injunctions, which seem like a pretty separate thing.
Absolutely common. Most non-disparagement agreements are paired with non-disclosure agreements (or clauses in the non-disparagement wording) that prohibit talking about the agreement, as much as talking about the forbidden topics.
It's pretty obvious to lawyers that "I would like to say this, but I have a legal agreement that I won't" is equivalent, in many cases, to saying it outright.
They can presumably confirm whether or not there is a nondisparagement agreement, and whether that is preventing them from commenting, though, right?
I think (1b) doesn't go through. The "starting data" we have from (1a) is that the AGI has some preferences over lotteries that it competently acts on - acyclicality seems likely but we don't get completeness or transitivity for free, so we can't assume its preferences will be representable as maximising some utility function. (I suppose we also have the constraint that its preferences look "locally" good to us given training). But if this is all we have it doesn't follow that the agent will have some coherent goal it'd want optimisers optimising toward...
Thanks for writing this. I think this is a lot clearer and more accessible than most write-ups on this topic and seems valuable.
I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:
I don't see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is "Lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that ...
I want to bump this because it seems important. How do you see the agent in the post as being dominated?
How is the toy example agent sketched in the post dominated?
Yeah, I agree that even if they fall short of normative constraints there’s some empirical content around what happens in adversarial environments. I have doubts that this stuff translates to thinking about AGIs too much though, in the sense that there’s an obvious story of how an adversarial environment selected for (partial) coherence in us, but I don’t see the same kinds of selection pressures being a force on AGIs. Unless you assume that they’ll want to modify themselves in anticipation of adversarial environments, which kinda begs the question.
Kind of tangential but I'd be interested in your take on how strongly money-pumping etc is actually an argument against full-on cyclical preferences? One way to think about why getting money-pumped is bad is because you have an additional preference to not pay money to go nowhere. But it feels like all this tells us is that "something has to go", and if an agent is rationally permitted to modify its own preferences to avoid these situations then it seems a priori acceptable for it to instead just say something like "well actually I weight my cyclical prefe...
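For concreteness, here is a minimal sketch of what "paying money to go nowhere" looks like for an agent with cyclical preferences. Everything in it is an illustrative assumption rather than something from the post: the three items, the cycle A > B > C > A, the flat fee, and the rule that the agent pays to swap whenever it strictly prefers the offered item.

```python
# Hypothetical strict preferences forming a cycle: A > B, B > C, C > A.
# A pair (x, y) means "x is strictly preferred to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def run_money_pump(start: str, offers: list[str], fee: float) -> tuple[str, float]:
    """Offer items one at a time; the agent pays `fee` to swap whenever it
    strictly prefers the offered item to the one it currently holds.
    (The swap rule and the flat fee are assumptions of this sketch.)"""
    holding, paid = start, 0.0
    for offered in offers:
        if (offered, holding) in PREFERS:
            holding = offered
            paid += fee
    return holding, paid


# Walk the agent once around the cycle.
holding, paid = run_money_pump("A", ["C", "B", "A"], fee=1.0)
print(holding, paid)
```

The agent ends up holding what it started with, having paid a fee at every step. The sketch only shows that the cyclical preferences conflict with a further preference not to pay for nothing; it doesn't settle which of the two preferences should give way.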
This seems totally different to the point OP is making, which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers) whilst not "predictably shooting themselves in the foot", as you claim must follow from this.
I agree the framing of "there are no coherence theorems" is a bit needlessly strong/overly provocative in a sense, but I'm unclear what your actual objection is here - are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible?
Great post. I think a lot of the discussion around the role of coherence arguments, and what we should expect a super-intelligent agent to behave like, is really sloppy, and I think this distinction between "coherence theorems as a self-contained mathematical result" and "coherence arguments as a normative claim about what an agent must be like on pain of shooting themselves in the foot" is an important one.
The example of how an incomplete agent avoids getting Dutch-booked also seems to look very naturally like how irl agents behave imo. One way of thinking ab...
Ngl kinda confused how these points imply the post seems wrong, the bulk of this seems to be (1) a semantic quibble + (2) a disagreement on who has the burden of proof when it comes to arguing about the plausibility of coherence + (3) maybe just misunderstanding the point that's being made?
(1) I agree the title is a bit needlessly provocative and in one sense of course VNM/Savage etc count as coherence theorems. But the point is that there is another sense that people use "coherence theorem/argument" in this field which corresponds to something like "If yo...
Ah I hadn't realised Caspar wrote that, thanks for the link! I agree that seems to be getting at the same idea, and it's kind of separable from the multi-agent point
I'm probably misunderstanding you or I've worded things in a confusing way that I haven't noticed - I don't think anywhere it's implied what you do on Tails? The "iff" here is just saying you would be paid on Heads iff you would pay on Tails - the flip will happen regardless and the predictor hasn't made any prediction about the coin itself, just your conditional behaviour
Edit: Maybe the "iff you will pay $1 on Tails" makes it sound like the predictor is predicting both the coin and your response, I'll edit to make it clearer
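To spell out the structure, here is a rough sketch in which the prediction depends only on your policy (what you would do on Tails) and the coin is flipped independently. The fair coin, the perfectly accurate predictor, and the $100 Heads payment are assumptions made for illustration; only the $1 cost on Tails comes from the setup as stated.

```python
def expected_value(
    pays_on_tails: bool,
    reward_on_heads: float,      # hypothetical amount, not given in the setup
    cost_on_tails: float = 1.0,  # the $1 from the setup
    p_heads: float = 0.5,        # assumes a fair coin
) -> float:
    """Expected payoff of a policy, assuming a perfectly accurate predictor.

    The predictor looks only at the policy (would you pay on Tails?);
    it makes no prediction about the coin itself.
    """
    # "Paid on Heads iff you would pay on Tails"
    heads_payoff = reward_on_heads if pays_on_tails else 0.0
    tails_payoff = -cost_on_tails if pays_on_tails else 0.0
    return p_heads * heads_payoff + (1 - p_heads) * tails_payoff


payer = expected_value(pays_on_tails=True, reward_on_heads=100.0)
refuser = expected_value(pays_on_tails=False, reward_on_heads=100.0)
print(payer, refuser)
```

The coin appears only in the final weighting by `p_heads`, while the prediction enters only through `pays_on_tails`, which is the sense in which the "iff" is about conditional behaviour rather than the flip.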