Claims about counterfactual value of interventions given AI assistance should be consistent
A common claim I hear about research on s-risks is that it’s much less counterfactual than alignment research, because if alignment goes well we can just delegate s-risk reduction to aligned AIs (and if alignment doesn’t go well, there’s little hope of shaping the future anyway).
I think there are several flaws in this argument that require more object-level context to see (see this post). But at a high level, the same consideration (that research/engineering can be delegated to AIs that pose little-to-no risk of takeover) should also discount the counterfactual value of alignment research/engineering. The main plan of OpenAI’s alignment team, and part of the plans of Anthropic and of several thought leaders in alignment, is to delegate alignment work (arguably its hardest parts) to AIs.
It’s plausible (and apparently a reasonably common view among alignment researchers) that:
It seems that if these claims hold, lots of alignment work would be made obsolete by AIs, not just s-risk-specific work. And I think several of the arguments for humans doing some alignment work anyway apply to s-risk-specific work:
I would probably agree that, overall, alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to P(s-risk). But the gap seems overstated (and, of course, other prioritization considerations can outweigh this one).
That post focuses on technical interventions, but a non-technical intervention that seems pretty hard to delegate to AIs is reducing race dynamics between AI labs, which could lead to an uncooperative multipolar takeoff.
I.e., the hardest part is ensuring the alignment of AIs on tasks that humans can't evaluate, where the ELK problem arises.
Is God's coin toss with equal numbers a counterexample to mrcSSA?
I feel confused as to whether minimal-reference-class SSA (mrcSSA) actually fails God's coin toss with equal numbers (where "failing" by my lights means "not updating from 50/50"):
In other words: It seems that the controversial step in anthropics is how to answer P(I [blah] | world), i.e., what to do once we introduce the indexical information "I." But once we've picked out a particular "I," the different views should agree.
(I still feel suspicious of mrcSSA's metaphysics for independent reasons, but am considerably less confident in that than my verdict on God's coin toss with equal numbers.)
It seems that what I was missing here was that mrcSSA disputes my premise that the evidence in fact is "*I* am in a white room, [created by God in the manner described in the problem setup], and have a red jacket"!
Rather, mrcSSA takes the evidence to be: "Someone is in a white room, [created by God in the manner described in the problem setup], and has a red jacket." Which is of course certain to be the case given either heads or tails.
(h/t Jesse Clifton for helping me see this)
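To make the contrast concrete, here is the Bayes calculation under an illustrative version of the setup (the specific numbers are my own assumption, not from the original problem statement: ten people are created either way, with one red jacket given heads and nine given tails). Taking the evidence to be the indexical $E_I$ = "*I* have a red jacket," with likelihoods given by uniform sampling over the observers in each world:

$$P(\text{heads} \mid E_I) = \frac{P(E_I \mid \text{heads})\,P(\text{heads})}{P(E_I \mid \text{heads})\,P(\text{heads}) + P(E_I \mid \text{tails})\,P(\text{tails})} = \frac{\tfrac{1}{10} \cdot \tfrac{1}{2}}{\tfrac{1}{10} \cdot \tfrac{1}{2} + \tfrac{9}{10} \cdot \tfrac{1}{2}} = \frac{1}{10}.$$

Under mrcSSA's reading, the evidence is instead $E_\exists$ = "someone has a red jacket," and $P(E_\exists \mid \text{heads}) = P(E_\exists \mid \text{tails}) = 1$, so $P(\text{heads} \mid E_\exists) = P(\text{heads}) = \tfrac{1}{2}$: no update from 50/50.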