antimonyanthony

I'm Anthony DiGiovanni, a suffering-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my blog, Ataraxia. All opinions my own.

Wiki Contributions

Comments

The second mover ALREADY had the option not to commit - they could just swerve or crash, according to their decision theory.

The premise here is that the second-mover decided to commit soon after the first-mover did, because the proof of the first-mover's initial commitment didn't reach the second-mover quickly enough. They could have not committed initially, but they decided to do so because they had a chance of being first.

I'm not sure exactly what you mean by "according to their decision theory" (as in, what this adds here).

if it doesn't change the sequence of commitment, I don't see how it makes any difference at all

The difference is that the second-mover can say "oh shit I committed before getting the broadcast of the first-mover's commitment—I'd prefer to revoke this commitment because it's pointless, my commitment doesn't shape the first-mover's incentives in any way since I know the first-mover will just prefer to keep their commitment fixed."

As I said, the first-mover doesn't lose their advantage from this at all, because their commitment is locked (at their freeze time) before the second-mover's. So they can just leave their commitment in place, and their decision won't be swayed by the second-mover's at all because of the rule: "You shouldn’t be able to reveal the final decision to anyone before freeze_time because we don’t want the commitment to get credible before freeze_time."

better off having a "real" commitment than a revocable commitment that Bob can talk her out of

I'm confused what you mean here. In principle Alice can revoke her commitment before the freeze time in this protocol, but Bob can't force her to do so. And if it's common knowledge that Alice's freeze time comes before Bob's, then: Since Alice knows that there will be a window after her freeze time where Bob knows Alice's commitment is frozen, and Bob has a chance to revert, then there would be no reason (barring some other commitment mechanism, including Bob being verifiably updateless while Alice isn't) for Bob not to revoke (to Swerve) if Alice refused to revert from Dare. So Alice would practically always keep her commitment.

The power to revoke commitments here is helpful in the hands of the second-mover, who made the initial incompatible commitment because of, e.g., some lag time between the first-mover's making and broadcasting the commitment.

I'd recommend checking out this post critiquing this view, if you haven't read it already. Summary of the counterpoints:

  • (Intent) alignment doesn't seem sufficient to ensure an AI makes safe decisions about subtle bargaining problems in a situation of high competitive pressure with other AIs. I don't expect the kinds of capabilities progress that is incentivized by default to suffice for us to be able to defer these decisions to the AI, especially given path-dependence on feedback from humans who'd be pretty naïve about this stuff. (C.f. this post—you need the human feedback at bottom to be sufficiently high quality to not get garbage-in, garbage-out problems even if you've solved the hard parts of alignment.)
  • To the extent that solving all of intent alignment is too intractable, focusing on subsets of alignment that are especially likely to avoid s-risks—e.g. preventing AIs from intrinsically valuing frustrating others' preferences—might be promising. I don't think mainstream alignment research prioritizes these.

Claims about counterfactual value of interventions given AI assistance should be consistent

A common claim I hear about research on s-risks is that it’s much less counterfactual than alignment research, because if alignment goes well we can just delegate it to aligned AIs (and if it doesn’t, there’s little hope of shaping the future anyway).

I think there are several flaws with this argument that require more object-level context (see this post).[1] But at a high level, this consideration—that research/engineering can be delegated to AIs that pose little-to-no risk of takeover—should also make us discount the counterfactual value of alignment research/engineering. The main plan of OpenAI’s alignment team, and part of Anthropic’s plan and those of several thought leaders in alignment, is to delegate alignment work (arguably the hardest parts thereof)[2] to AIs.

It’s plausible (and apparently a reasonably common view among alignment researchers) that:

  1. Aligning models on tasks that humans can evaluate just isn’t that hard, and would be done by labs for the purpose of eliciting useful capabilities anyway; and
  2. If we restrict to using predictive (non-agentic) models for assistance in aligning AIs on tasks humans can’t evaluate, they will pose very little takeover risk even if we don’t have a solution to alignment for AIs at their limited capability level.

It seems that if these claims hold, lots of alignment work would be made obsolete by AIs, not just s-risk-specific work. And I think several of the arguments for humans doing some alignment work anyway apply to s-risk-specific work:

  • In order to recognize what good alignment work (or good deliberation about reducing conflict risks) looks like, and provide data on which to finetune AIs who will do that work, we need to practice doing that work ourselves. (Christiano here, Wentworth here)
  • To the extent that working on alignment (or s-risks) ourselves gives us / relevant decision-makers evidence about how fundamentally difficult these problems are, we’ll have better guesses as to whether we need to push for things like avoiding deploying the relevant kinds of AI at all. (Christiano again)
  • For seeding the process that bootstraps a sequence of increasingly smart aligned AIs, you need human input at the bottom to make sure that process doesn’t veer off somewhere catastrophic—garbage in, garbage out. (O’Gara here.) AIs’ tendencies towards s-risky conflicts seem to be, similarly, sensitive to path-dependent factors (in their decision theory and priors, not just values, so alignment plausibly isn’t sufficient).

I would probably agree that alignment work is more likely to make a counterfactual difference to P(misalignment) than s-risk-targeted work is to make a counterfactual difference to P(s-risk), overall. But the gap seems to be overstated (and other prioritization considerations can outweigh this one, of course).

  1. ^

    That post focuses on technical interventions, but a non-technical intervention that seems pretty hard to delegate to AIs is to reduce race dynamics between AI labs, which lead to an uncooperative multipolar takeoff.

  2. ^

    I.e., the hardest part is ensuring the alignment of AIs on tasks that humans can't evaluate, where the ELK problem arises.

primarily because models will understand the base goal first before having world modeling

Could you say a bit more about why you think this? My definitely-not-expert expectation would be that the world-modeling would come first, then the "what does the overseer want" after that, because that's how the current training paradigm works: pretrain for general world understanding, then finetune on what you actually want the model to do.

"I am devoting my life to solving the most important problems in the world and alleviating as much suffering as possible" fits right into the script. That's exactly the kind of thing you are supposed to be thinking. If you frame your life like that, you will fit in and everyone will understand and respect what is your basic deal.

Hm, this is a pretty surprising claim to me. It's possible I haven't actually grown up in a "western elite culture" (in the U.S., it might be a distinctly coastal thing, so the cliché goes? IDK). Though, I presume having gone to some fancypants universities in the U.S. makes me close enough to that. The Script very much did not encourage me to devote my life to solving the most important problems and alleviating as much suffering as possible, and it seems not to have encouraged basically any of my non-EA friends from university to do this. I/they were encouraged to have careers that were socially valuable, to be sure, but not the main source of purpose in their lives or a big moral responsibility.

A model that just predicts "what the 'correct' choice is" doesn't seem likely to actually do all the stuff that's instrumental to preventing itself from getting turned off, given the capabilities to do so.

But I'm also just generally confused whether the threat model here is, "A simulated 'agent' made by some prompt does all the stuff that's sufficient to disempower humanity in-context, including sophisticated stuff like writing to files that are read by future rollouts that generate the same agent in a different context window," or "The RLHF-trained model has goals that it pursues regardless of the prompt," or something else.

confused claims that treat (base) GPT3 and other generative models as traditional rational agents

I'm pretty surprised to hear that anyone made such claims in the first place. Do you have examples of this?

I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research—at least good enough to provide a plan that would help humans make an aligned AGI—must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate's notes in that conversation plus various other posts that he agrees with Eliezer here, but not certain.) I strongly doubt that Jan just mistook MIRI's focus on understanding consequentialist reasonsing for a belief that alignment research requires being a consequentialist reasoner.

I agree with your guesses.

I am not sure that "controlling for game-theoretic instrumental reasons" is actually a move that is well defined/makes sense.

I don't have a crisp definition of this, but I just mean that, e.g., we compare the following two worlds: (1) 99.99% of agents are non-sentient paperclippers, and each agent has equal (bargaining) power. (2) 99.99% of agents are non-sentient paperclippers, and the paperclippers are all confined to some box. According to plenty of intuitive-to-me value systems, you only (maybe) have reason to increase paperclips in (1), not (2). But if the paperclippers felt really sad about the world not having more paperclips, I'd care—to an extent that depends on the details of the situation—about increasing paperclips even in (2).

Load More