This is a special post for quick takes by Anthony DiGiovanni.


Claims about counterfactual value of interventions given AI assistance should be consistent

A common claim I hear about research on s-risks is that it’s much less counterfactual than alignment research, because if alignment goes well we can just delegate it to aligned AIs (and if it doesn’t, there’s little hope of shaping the future anyway).

I think there are several flaws with this argument that require more object-level context (see this post).[1] But at a high level, this consideration—that research/engineering can be delegated to AIs that pose little-to-no risk of takeover—should also make us discount the counterfactual value of alignment research/engineering. The main plan of OpenAI’s alignment team, and part of Anthropic’s plan and those of several thought leaders in alignment, is to delegate alignment work (arguably the hardest parts thereof)[2] to AIs.

It’s plausible (and apparently a reasonably common view among alignment researchers) that:

  1. Aligning models on tasks that humans can evaluate just isn’t that hard, and would be done by labs for the purpose of eliciting useful capabilities anyway; and
  2. If we restrict to using predictive (non-agentic) models for assistance in aligning AIs on tasks humans can’t evaluate, they will pose very little takeover risk even if we don’t have a solution to alignment for AIs at their limited capability level.

It seems that if these claims hold, lots of alignment work would be made obsolete by AIs, not just s-risk-specific work. And I think several of the arguments for humans doing some alignment work anyway apply to s-risk-specific work:

  • In order to recognize what good alignment work (or good deliberation about reducing conflict risks) looks like, and provide data on which to finetune AIs who will do that work, we need to practice doing that work ourselves. (Christiano here, Wentworth here)
  • To the extent that working on alignment (or s-risks) ourselves gives us / relevant decision-makers evidence about how fundamentally difficult these problems are, we’ll have better guesses as to whether we need to push for things like avoiding deploying the relevant kinds of AI at all. (Christiano again)
  • For seeding the process that bootstraps a sequence of increasingly smart aligned AIs, you need human input at the bottom to make sure that process doesn’t veer off somewhere catastrophic—garbage in, garbage out. (O’Gara here.) AIs’ tendencies towards s-risky conflicts seem similarly sensitive to path-dependent factors (in their decision theory and priors, not just their values, so alignment plausibly isn’t sufficient).

I would probably agree that alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to make a counterfactual difference to P(s-risk), overall. But the gap seems to be overstated (and other prioritization considerations can outweigh this one, of course).

  1. ^

    That post focuses on technical interventions, but a non-technical intervention that seems pretty hard to delegate to AIs is reducing race dynamics between AI labs, which could lead to an uncooperative multipolar takeoff.

  2. ^

    I.e., the hardest part is ensuring the alignment of AIs on tasks that humans can't evaluate, where the ELK problem arises.

Is God's coin toss with equal numbers a counterexample to mrcSSA?

I feel confused as to whether minimal-reference-class SSA (mrcSSA) actually fails God's coin toss with equal numbers (where "failing" by my lights means "not updating from 50/50"):

  • Let H = "heads world", W_{me} = "I am in a white room, [created by God in the manner described in the problem setup]", R_{me} = "I have a red jacket."
  • We want to know P(H | W_{me}, R_{me}).
  • First, P(R_{me} | W_{me}, H) and P(R_{me} | W_{me}, ~H) seem uncontroversial: Once I've already conditioned on my own existence in this problem, and on who "I" am, but before I've observed my jacket color, surely I should use a principle of indifference: 1 out of 10 observers in the white rooms in the heads world has a red jacket, while all of them have red jackets in the tails world, so my credences are P(R_{me} | W_{me}, H) = 0.1 and P(R_{me} | W_{me}, ~H) = 1. Indeed we don't even need a first-person perspective at this step — it's the same as computing P(R_{Bob} | W_{Bob}, H) for some Bob we're considering from the outside.
    • (This is not the same as non-mrcSSA with reference class "observers in a white room," because we're conditioning on knowing "I" am an observer in a white room when computing a likelihood (as opposed to computing the posterior of some world given that I am an observer in a white room). Non-mrcSSA picks out a particular reference class when deciding how likely "I" am to observe anything in the first place, unconditional on "I," leading to the Doomsday Argument etc.)
  • The step where things have the potential for anthropic weirdness is in computing P(W_{me} | H) and P(W_{me} | ~H). In the Presumptuous Philosopher and the Doomsday Argument, at least, probabilities like this would indeed be sensitive to our anthropics.
  • But in this problem, I don't see how mrcSSA would differ from non-mrcSSA with the reference class R_{non-minimal} = "observers in a white room" used in Joe's analysis (and by extension, from SIA):
    • In general, SSA says P(E | w) = (number of observers in one's reference class in world w whose epistemic situation matches E) / (total number of observers in one's reference class in w).
    • Here, the supposedly "non-minimal" reference class R_{non-minimal} coincides with the minimal reference class! I.e., it's the observer-moments in your epistemic situation (of being in a white room), before you know your jacket color.
  • The above likelihoods plus the fair-coin prior are all we need to get P(H | R_{me}, W_{me}), but at no point did the three anthropic views disagree.
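The update that all three views appear to agree on here can be checked directly. A minimal sketch, using the fair-coin prior and the jacket counts from the setup above (10 observers in each world; 1 red jacket given heads, 10 given tails):

```python
# Bayes update for God's coin toss with equal numbers, from the
# first-person evidence "I am in a white room and have a red jacket".
prior_heads = 0.5  # fair coin

# Likelihoods of R_me given W_me, via the principle of indifference:
# heads world: 1 of 10 white-room observers has a red jacket;
# tails world: all 10 white-room observers have red jackets.
p_red_given_heads = 0.1
p_red_given_tails = 1.0

posterior_heads = (prior_heads * p_red_given_heads) / (
    prior_heads * p_red_given_heads + (1 - prior_heads) * p_red_given_tails
)
print(posterior_heads)  # 1/11 ≈ 0.0909, i.e. a substantial update toward tails
```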

In other words: It seems that the controversial step in anthropics is in answering P(I [blah] | world), i.e., what we do when we introduce the indexical information about "I." But once we've picked out a particular "I," the different views should agree.

(I still feel suspicious of mrcSSA's metaphysics for independent reasons, but am considerably less confident in that than my verdict on God's coin toss with equal numbers.)

It seems that what I was missing here was: mrcSSA disputes my premise that the evidence in fact is "*I* am in a white room, [created by God in the manner described in the problem setup], and have a red jacket"!

Rather, mrcSSA takes the evidence to be: "Someone is in a white room, [created by God in the manner described in the problem setup], and has a red jacket." Which is of course certain to be the case given either heads or tails.
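Under this de dicto reading of the evidence, the same Bayes calculation gives no update. A minimal sketch, again assuming the counts from the setup:

```python
# mrcSSA's reading of the evidence: "someone is in a white room and has
# a red jacket" (de dicto, not indexed to a particular "I").
prior_heads = 0.5  # fair coin

# Heads world: 1 of 10 white-room observers has a red jacket -> someone does.
# Tails world: all 10 have red jackets -> someone does.
# So the evidence is certain in both worlds:
p_someone_red_given_heads = 1.0
p_someone_red_given_tails = 1.0

posterior_heads = (prior_heads * p_someone_red_given_heads) / (
    prior_heads * p_someone_red_given_heads
    + (1 - prior_heads) * p_someone_red_given_tails
)
print(posterior_heads)  # 0.5: no update from the fair-coin prior
```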

(h/t Jesse Clifton for helping me see this)