All of keith_wynroe's Comments + Replies

I understand it’s a proposition like any other, I don’t see why an agent would reflect on it/use it in their deliberation to decide what to do. The fact that they’re a CDT agent is a fact about how they will act in the decision, not a fact that they need to use in their deliberation

Analogous to preferences, whether or not an agent prefers A or B is a proposition like any other, but I don’t think it’s a natural way to model them as first consult the credences they have assigned to “I prefer A to B” etc. Rather, they will just choose A ex hypothesis because that’s what having the preference means.

3Max H6d
They don't have to, and for the purposes of the thought experiment you could specify that they simply don't. But humans are often pretty uncertain about their own preferences, and about what kind of decision theory they can or should use. Many of these humans are pretty strongly inclined to deliberate, reflect, and argue about these beliefs, and take into account their own uncertainty in them when making decisions. So I'm saying that if you stipulate that no such reflection or deliberation occurs, you might be narrowing the applicability of the thought experiment to exclude human-like minds, which may be a rather large and important class of all possible minds.

Why would they be uncertain about whether they’re a CDT agent? Being a CDT agent surely just means by definition they evaluate decisions based on causal outcomes. It feels confused to say that they have to be uncertain about/reflect on which decision theory they have and then apply it, rather than their being a CDT agent being an ex hypothesis fact about how they behave

0Max H6d
What kind of decision theory the agent will use and how it will behave in specific circumstances is a proposition about the agent and about the world which can be true or false, just like any other proposition. Within any particular mind (the agent's own mind, Omega's, ours) propositions of any kind come with differing degrees of uncertainty attached, logical or otherwise. You can of course suppose for the purposes of the thought experiment that the agent itself has an overwhelmingly strong prior belief about its own behavior, either because it can introspect and knows its own mind well, or because it has been told how it will behave by a trustworthy and powerful source such as Omega. You could also stipulate that the entirety of the agent is a formally specified computer program, perhaps one that is incapable of reflection or forming beliefs about itself at all. However, any such suppositions would be additional constraints on the setup, which makes it weirder and less likely to be broadly applicable in real life (or anywhere else in the multiverse).  Also, even if the agent starts with 99.999%+ credence that it is a CDT agent and will behave accordingly, perhaps there's some chain of thoughts the agent could think which provide strong enough evidence to overcome this prior (and thus change the agent's behavior). Perhaps this chain of thoughts is very unlikely to actually occur to the agent, or for the purposes of the thought experiment, we only consider agents which flatly aren't capable of such introspection or belief-updating. But that may (or may not!) mean restricting the applicability of the thought experiment to a much smaller and rather impoverished class of minds.

Why not? Is it common for NDAs/non-disparagement agreements to also have a clause stating the parties aren’t allowed to tell anyone about it? I’ve never heard of this outside of super-injunctions which seems a pretty separate thing

my boilerplate severance agreement at a job included an NDA that couldn't be acknowledged (I negotiated to change this).

Absolutely common.  Most non-disparagement agreements are paired with non-disclosure agreements (or clauses in the non-disparagement wording) that prohibit talking about the agreement, as much as talking about the forbidden topics.  

It's pretty obvious to lawyers that "I would like to say this, but I have a legal agreement that I won't" is equivalent, in many cases, to saying it outright.

They can presumably confirm whether or not there is a nondisparagement agreement and whether that is preventing them from commenting though right

You can confirm this if you're aware that it's a possibility, and interpret carefully-phrased refusals to comment in a way that's informed by reasonable priors. You should not assume that anyone is able to directly tell you that an agreement exists.

I think (1b) doesn't go through. The "starting data" we have from (1a) is that the AGI has some preferences over lotteries that it competently acts on - acyclicality seems likely but we don't get completeness or transitivity for free, so we can't assume its preferences will be representable as maximising some utility function. (I suppose we also have the constraint that its preferences look "locally" good to us given training). But if this is all we have it doesn't follow that the agent will have some coherent goal it'd be want optimisers optimising toward... (read more)

Thanks for writing this. I think this is a lot clearer and more accessible that most write-ups on this topic and seems valuable.

I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:

I don't see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is "Lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that ... (read more)

Want to bump this because it seems important? How do you see the agent in the post as being dominated?

How is the toy example agent sketched in the post dominated?

Yeah I agree that even if they fall short of normative constraints there’s some empirical content around what happens in adversarial environments. I think I have doubts that this stuff translates to thinking about AGIs too much though, in the sense that there’s an obvious story of how an adversarial environment selected for (partial) coherence in us, but I don’t see the same kinds of selection pressures being a force on AGIs. Unless you assume that they’ll want to modify themselves in anticipation of adversarial environments which kinda begs the question

Hmm, I was going to reply with something like "money-pumps don't just say something about adversarial environments, they also say something about avoiding leaking resources" (e.g. if you have circular preferences between proximity to apples, bananas, and carrots, then if you encounter all three of them in a single room you might get trapped walking between them forever) but that's also begging your original question - we can always just update to enjoy leaking resources, transmuting a "leak" into an "expenditure". Another frame here is that if you make/encounter an agent, and that agent self-modifies into/starts off as something which is happy to leak pretty fundamental resources like time and energy and material-under-control, then you're not as worried about it? It's certainly not competing as strongly for the same resources as you whenever it's "under the influence" of its circular preferences.

Kind of tangential but I'd be interested in your take on how strongly money-pumping etc is actually an argument against full-on cyclical preferences? One way to think about why getting money-pumped is bad is because you have an additional preference to not pay money to go nowhere. But it feels like all this tells us is that "something has to go", and if an agent is rationally permitted to modify its own preferences to avoid these situations then it seems a priori acceptable for it to instead just say something like "well actually I weight my cyclical prefe... (read more)

(I'm not EJT, but for what it's worth:) I find the money-pumping arguments compelling not as normative arguments about what preferences are "allowed", but as engineering/security/survival arguments about what properties of preferences are necessary for them to be stable against an adversarial environment (which is distinct from what properties are sufficient for them to be stable, and possibly distinct from questions of self-modification).

This seems totally different to the point OP is making which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers) whilst not "predictably shooting themselves in the foot" as you claim must follow from this

I agree the framing of "there are no coherence theorems" is a bit needlessly strong/overly provocative in a sense, but I'm unclear what your actual objection is here - are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible? 

Great post. I think a lot of the discussion around the role of coherence arguments and what we should expect a super-intelligent agent to behave like is really sloppy and I think this distinction between "coherence theorems as a self-contained mathematical result" and "coherence arguments as a normative claim about what an agent must be like on pain of shooting themselves in the foot" is an important one

The example of how an incomplete agent avoids getting Dutch-booked also seems to look very naturally like how irl agents behave imo. One way of thinking ab... (read more)

Ngl kinda confused how these points imply the post seems wrong, the bulk of this seems to be (1) a semantic quibble + (2) a disagreement on who has the burden of proof when it comes to arguing about the plausibility of coherence + (3) maybe just misunderstanding the point that's being made?

(1) I agree the title is a bit needlessly provocative and in one sense of course VNM/Savage etc count as coherence theorems. But the point is that there is another sense that people use "coherence theorem/argument" in this field which corresponds to something like "If yo... (read more)

Ah I hadn't realised Caspar wrote that, thanks for the link! I agree that seems to be getting at the same idea, and it's kind of separable from the multi-agent point

I'm probably misunderstanding you or I've worded things in a confusing way that I haven't noticed - I don't think anywhere it's implied what you do on Tails? The "iff" here is just saying you would be paid on Heads iff you would pay on Tails - the flip will happen regardless and the predictor hasn't made any prediction about the coin itself, just you're conditional behaviour

Edit: Maybe the "iff you will pay $1 on Tails" makes it sound like the predictor is predicting both the coin and your response, I'll edit to make more clear