(Formerly "antimonyanthony.") I'm an s-risk-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my Substack. All opinions my own.
It seems that what I was missing here was: mrcSSA disputes my premise that the evidence in fact is "*I* am in a white room, [created by God in the manner described in the problem setup], and have a red jacket"!
Rather, mrcSSA takes the evidence to be: "Someone is in a white room, [created by God in the manner described in the problem setup], and has a red jacket." Which is of course certain to be the case given either heads or tails.
(h/t Jesse Clifton for helping me see this)
Is God's coin toss with equal numbers a counterexample to mrcSSA?
I feel confused as to whether minimal-reference-class SSA (mrcSSA) actually fails God's coin toss with equal numbers (where "failing" by my lights means "not updating from 50/50"):
In order words: It seems that the controversial setup in anthropics is in answering P(I [blah] | world), i.e., what we do when we introduce the indexical information about "I." But once we've picked out a particular "I," the different views should agree.
(I still feel suspicious of mrcSSA's metaphysics for independent reasons, but am considerably less confident in that than my verdict on God's coin toss with equal numbers.)
I enjoyed this post and think it should help reduce confusion in many future discussions, thanks!
Some comments on your remarks about anthropics:
Different anthropic theories partially rely on metaphysical intuitions/stories about how centered worlds or observer moments are 'sampled', and have counterintuitive implications (e.g., the Doomsday argument for SSA and the Presumptuous philosopher for SIA).
I'm not sure why this is an indictment of "anthropic reasoning" per se, as if that's escapable. It seems like all anthropic theories are trying to answer a question that one needs to answer when forming credences, i.e., how do we form likelihoods P(I observe I exist | world W)? (Which we want in order to compute P(world W | I observe I exist).)
Indeed just failing to anthropically update at all has counterintuitive implications, like the verdict of minimal-reference-class SSA in Joe C's "God's coin toss with equal numbers." [no longer endorsed] And mrcSSA relies on the metaphysical intuition that oneself was necessarily going to observe X, i.e., P(I observe I exist | world W) = P(I observe I exist | not-W) = 1(which is quite implausible IMO). [I think endorsed, but I feel confused:] And mrcSSA relies on the metaphysical intuition that, given that someone observes X, oneself was necessarily going to observe X, which is quite implausible IMO.
in earlier sections you argue that CDT agents might not adopt LDT-recommended policies and so will have problems with bargaining
That wasn’t my claim. I was claiming that even if you're an "LDT" agent, there's no particular reason to think all your bargaining counterparts will pick the Fair Policy given you do. This is because:
So no, “committing to act like LDT agents all the time,” in the sense that is helpful for avoiding selection pressures against you, does not ensure you’ll have a decision procedure such that you have no bargaining problems.
But we were discussing a case(counterfactual mugging) where they would want to pre-commit to act in ways that would be non-causally beneficial.
I’m confused, the commitment is to act in a certain way that, had you not committed, wouldn’t be beneficial unless you appealed to acausal (and updateless) considerations. But the act of committing has causal benefits.
there are other reasons that you might not want to demand too much. Maybe you know their source code and can simulate that they will not accept a too-high demand. Or perhaps you think, based on empirical evidence or a priori reasoning that most agents you might encounter will only accept a roughly fair allocation.
I agree these are both important possibilities, but:
I was specifically replying to the claim that the sorts of AGIs who would get into high-stakes bargaining would always avoid catastrophic conflict because of bargaining problems; such a claim requires something stronger than the considerations you've raised, i.e., an argument that all such AGIs would adopt the same decision procedure (and account for logical causation) and therefore coordinate their demands.
(By default if I don't reply further, it's because I think your further objections were already addressed—which I think is true of some of the things I've replied to in this comment.)
It's true that you usually have some additional causal levers, but none of them are the exact same as be the kind of person who does X.
Not sure I understand. It seems like "being the kind of person who does X" is a habit you cultivate over time, which causally influences how people react to you. Seems pretty analogous to the job candidate case.
if CDT agents often modify themselves to become an LDT/FDT agent then it would broadly seem accurate to say that CDT is getting outcompeted
See my replies to interstice's comment—I don't think "modifying themselves to become an LDT/FDT agent" is what's going on, at least, there doesn't seem to be pressure to modify themselves to do all the sorts of things LDT/FDT agents do. They come apart in cases where the modification doesn't causally influence another agent's behavior.
(This seems analogous to claims that consequentialism is self-defeating because the "consequentialist" decision procedure leads to worse consequences on average. I don't buy those claims, because consequentialism is a criterion of rightness, and there are clearly some cases where doing the non-consequentialist thing is a terrible idea by consequentialist lights even accounting for signaling value, etc. It seems misleading to call an agent a non-consequentialist if everything they do is ultimately optimizing for achieving good consequences ex ante, even if they adhere to some rules that have a deontological vibe and in a given situation may be ex post suboptimal.)
It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it and there is no "right way" to do philosophy, similar to how there are no "right preferences"
If this is true, doesn't this give us more reason to think metaphilosophy work is counterfactually important, i.e., can't just be delegated to AIs? Maybe this isn't what Wei Dai is trying to do, but it seems like "figure out which approaches to things (other than preferences) that don't have 'right answers' we [assuming coordination on some notion of 'we'] endorse, before delegating to agents smarter than us" is time-sensitive, and yet doesn't seem to be addressed by mainstream intent alignment work AFAIK.
(I think one could define "intent alignment" broadly enough to encompass this kind of metaphilosophy, but I smell a potential motte-and-bailey looming here if people want to justify particular research/engineering agendas labeled as "intent alignment.")
You said "Bob commits to LDT ahead of time"
In the context of that quote, I was saying why I don't buy the claim that following LDT gives you advantages over committing to, in future problems, do stuff that's good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed.
What is selected-for is being the sort of agent who, when others observe you, they update towards doing stuff that's good for you. This is distinct from being the sort of agent who does stuff that would have helped you if you had been able to shape others' beliefs / incentives, when in fact you didn't have such an opportunity.
I think a CDT agent would pre-commit to paying in a one-off Counterfactual Mugging
Sorry I guess I wasn't clear what I meant by "one-shot" here / maybe I just used the wrong term—I was assuming the agent didn't have the opportunity to commit in this way. They just find themselves presented with this situation.
Same as above
Hmm, I'm not sure you're addressing my point here:
Imagine that you're an AGI, and either in training or earlier in your lifetime you faced situations where it was helpful for you to commit to, as above, "do stuff that's good for you to commit to do ex ante even if it would be bad for you ex post had you not been committed." You tended to do better when you made such commitments.
But now you find yourself thinking about this commitment races stuff. And, importantly, you have not previously broadcast credible commitments to a bargaining policy to your counterpart. Do you have compelling reasons to think you and your counterpart have been selected to have decision procedures that are so strongly logically linked, that your decision to demand more than a fair bargain implies your counterpart does the same? I don't see why. But that's what we'd need for the Fair Policy to work as robustly as Eliezer seems to think it does.
Yeah, this is a complicated question. I think some things can indeed safely be deferred, but less than you’re suggesting. My motivations for researching these problems:
The key point is that "acting like an LDT agent" in contexts where your commitment causally influences others' predictions of your behavior, does not imply you'll "act like an LDT agent" in contexts where that doesn't hold. (And I would dispute that we should label making a commitment to a mutually beneficial deal as "acting like an LDT agent," anyway.) In principle, maybe the simplest generalization of the former is LDT. But if doing LDT things in the latter contexts is materially costly for you (e.g. paying in a truly one-shot Counterfactual Mugging), seems to me that LDT would be selected against.
ETA: The more action-relevant example in the context of this post, rather than one-shot CM, is: "Committing to a fair demand, when you have values and priors such that a more hawkish demand would be preferable ex ante, and the other agents you'll bargain with don't observe your commitment before they make their own commitments." I don't buy that that sort of behavior is selected for, at least not strongly enough to justify the claim I respond to in the third section.
Exploitation means the exploiter benefits. If you are a rock, you can't be exploited. If you are an agent who never gives in to threats, you can't be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won't benefit them, then you might get nasty things done to you. You wouldn't be exploited, but you'd still be very unhappy.
Cool, I think we basically agree on this point then, sorry for misunderstanding. I just wanted to emphasize the point I made because "you won't get exploited if you decide not to concede to bullies" is kind of trivially true. :) The operative word in my reply was "robustly," which is the hard part of dealing with this whole problem. And I think it's worth keeping in mind how "doing nasty things to you anyway even though it won't benefit them" is a consequence of a commitment that was made for ex ante benefits, it's not the agent being obviously dumb as Eliezer suggests. (Fortunately, as you note in your other comment, some asymmetries should make us think these commitments are rare overall; I do think an agent probably needs to have a pretty extreme-by-human-standards, little-to-lose value system to want to do this... but who knows what misaligned AIs might prefer.)