[Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy

Rohin Shah

I'm so glad you are making a plan and sharing it publicly!

Fun, possibly impactful idea: Have a livestreamed chat with Jan Leike (or some other representative from OpenAI's alignment team) where you discuss and critique each other's plans & discuss how you can support each other by sharing research etc.

Hoagy 3y Ω280

Could you explain why you think "The game is skewed in our favour."?

Vika 3y Ω361

Just added some more detail on this to the slides. The idea is that we have various advantages over the model during the training process: we can restart the search, examine and change beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.

baturinsky 3y Ω0119

While the model has the advantage of only having to "win" once.

Rohin Shah 3y Ω6123

I think that skews it somewhat but not very much. We only have to "win" once in the sense that we only need to build an aligned Sovereign that ends the acute risk period once, similarly to how we only have to "lose" once in the sense that we only need to build a misaligned superintelligence that kills everyone once.

(I like the discussion on similar points in the strategy-stealing assumption.)

David Johnston 3y Ω470

Is building an aligned sovereign to end the acute risk period different to a pivotal act in your view?

Rohin Shah 3y Ω330

Depends what the aligned sovereign does! Also depends what you mean by a pivotal act!

In practice, during the period of time where biological humans are still doing a meaningful part of alignment work, I don't expect us to build an aligned sovereign, nor do I expect us to build a single misaligned AI that takes over. I instead expect there to be a large number of AI systems that could together obtain a decisive strategic advantage, but could not do so individually.

David Johnston 3y Ω460

So, if I'm understanding you correctly:

- if it's possible to build a single AI system that executes a catastrophic takeover (via self-bootstrap or whatever), it's also probably possible to build a single aligned sovereign, and so in this situation winning once is sufficient
- if it is not possible to build a single aligned sovereign, then it's probably also not possible to build a single system that executes a catastrophic takeover, and so the proposition that the model only has to win once is not true in any straightforward way
  - in this case, we might be able to think of "composite AI systems" that can catastrophically take over or end the acute risk period, and for similar reasons as in the first scenario, winning once with a composite system is sufficient, but such systems are not built from single acts

and you think the second scenario is more likely than the first.

Rohin Shah 3y* Ω340

Yes, that's right, though I'd say "probable" not "possible" (most things are "possible").

Gabe M 3y 73

This does feel pretty vague in parts (e.g. "mitigating goal misgeneralization" feels more like a problem statement than a component of research), but I personally think this is a pretty good plan, and at the least, I'm very appreciative of you posting your plan publicly!

Now, we just need public alignment plans from Anthropic, Google Brain, Meta, Adept, ...

GunZoR 3y Ω140

But what stops a blue-cloud model from transitioning into a red-cloud model if the blue-cloud model is an AGI like the one hinted at on your slides (self-aware, goal-directed, highly competent)?

Vika 3y Ω120

We expect that an aligned (blue-cloud) model would have an incentive to preserve its goals, though it would need some help from us to generalize them correctly to avoid becoming a misaligned (red-cloud) model. We talk about this in more detail in Refining the Sharp Left Turn (part 2).

Review Bot 2y* 30

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
