Just stumbled onto this talk and thought it would be worth sharing. In it, Buck discusses the methods we would use to align transformative AI (TAI) if it were developed soon and what we can do to improve them. The talk helped me think about how alignment might play out and I liked the division of alignment research into five classes. Somehow thinking about these categories makes it easier for me to think about my own research priorities.


Buck imagines a company he calls Magma[1] that is close to deploying a TAI. If they don't deploy, other AI systems may cause a catastrophe, so they are under some pressure to deploy relatively soon. They need to deploy their AI in a way that addresses the risks from other AIs while posing low risk itself.

Ways in which the AI might help

There are a couple of things Magma might want to deploy their AI for, including:

  • Helping with alignment – alignment research, but also more basic stuff like giving feedback on another AI's outputs.
  • Enforcing AI development agreements.
  • Making it harder for unaligned systems to wreak havoc
    • Hardening – patching security flaws in our current systems.
    • Eating the low-hanging fruit – if there are easy ways for AIs to gain resources, we would rather they be done by aligned AIs than by unaligned ones. 
    • Detection of dangerous systems.

The v0 plan

In order to train a model to be useful, Magma will probably train it on a mixture of the following tasks:

  1. Tasks that cheaply make the model know lots of stuff
  2. Demonstrations of, and human approval on, the tasks we actually care about

For ChatGPT, (1) was next-token-prediction on the internet and (2) was a mixture of RLHF and supervised finetuning on assistant-type tasks.

Problems with the plan

Buck sees two ways this could go poorly, ways in which the model might achieve low loss in training but still take over the world:

  1. We evaluate the model poorly.
  2. The model does something catastrophic at deployment but not during training (the King Lear[2] problem).

Roughly speaking, this distinction corresponds to outer and inner alignment. If we are evaluating a model poorly, that could lead to "we get what we can measure"-type failure modes. If the model does something catastrophic at deployment, it might be trying to appear aligned because it wants to be deployed, as described in Ajeya Cotra's takeover scenario post.

The v1 plan

To address the above issues, Magma might do some simple stuff like:

  • Improve oversight by using AIs to oversee other AIs – things like IDA and debate are examples of this. Unfortunately we don't have strong reasons to believe this will always work, so it might not be enough.
  • To address the King Lear problem: make it hard for the model to tell whether it's being deployed by producing fake defection opportunities. The hope here is that we will be able to catch models trying to execute a treacherous turn.
  • Deployment time stuff: AI checks and balances, whistleblowing, etc. 

Thoughts on v1 plan

Buck has some thoughts on this plan:

  • One big question is how well AIs will be able to coordinate. If they can do so well, then strategies that involve pitting them against each other (e.g. Debate) are less promising.
  • The AIs might all be able to stage a coup together, but doing so seems pretty hard. One intuition pump here is that dictators like Putin are often able to control a population that mostly dislikes them.
  • Buck's P(doom) fluctuates, but it was 25% when he gave the talk.

What alignment research should we do then?

This is the thought process Buck goes through when thinking about his research priorities:

Assume that a lab ends up in the situation described earlier, and they are being as smart as possible about handling the alignment problems. How can I produce helpful changes to their alignment plan by doing technical research now?

He divides technical research into 5 classes:

  1. Ready to go – These are techniques that, once developed, could pretty straightforwardly be used to align future models. Examples include critiques, scalable oversight, and model psychology. 
  2. Worst case techniques – These are techniques that will work regardless of empirical contingencies (e.g. how SGD works). One example is ARC theory's work.
  3. Gain of function research – Providing examples of ways in which today's models misbehave. This helps because it provides evidence of dangers and gives us a way to study how to mitigate them. For instance, if we had an ELK benchmark, we could test a bunch of empirical techniques today.
  4. Low-hanging fruit – This is research into things that are easy to do and might be useful. Doing this can be seen as building a bag of cheap tricks that we can later use to align our models. One example here is the "find a direction in activation space" part of Burns et al., "Discovering Latent Knowledge in Language Models Without Supervision".
  5. Big if true – Ambitious stuff which might be possible and, if so, would be extremely helpful. One example here is interpretability research. Unfortunately we currently have no examples of big-if-true techniques graduating to ready to go techniques.
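To make the "find a direction in activation space" idea from item 4 concrete, here is a toy sketch. Note this is a simplified, *supervised* difference-of-means variant, not the unsupervised CCS method Burns et al. actually propose, and it uses random vectors as stand-ins for real model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

# Stand-ins for model activations on true vs. false statements.
true_acts = rng.normal(loc=1.0, size=(100, d))
false_acts = rng.normal(loc=-1.0, size=(100, d))

# Candidate "truth direction": difference of class means, normalized.
direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def score(acts):
    # Positive projection onto the direction => classified "true".
    return acts @ direction

acc = ((score(true_acts) > 0).mean() + (score(false_acts) < 0).mean()) / 2
print(f"train accuracy: {acc:.2f}")
```

The appeal as "low-hanging fruit" is that once you have such a direction, projecting onto it is a one-line check you can run on any activation.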

What an optimal portfolio might look like

Here's Buck's best guess at what an optimal portfolio would look like.

  • Development of new alignment techniques: 75%
    • Ready to go: 30%
    • Worst-case: 15%
    • Low-hanging fruit: 5%
    • Big if true (all of it model internals): 25%
  • Gain of function: 25%

Buck added that this excludes technical research on evals. I'm currently not sure what exactly technical research on evals is and how it would fit into the current portfolio, but it seems good to mention.

Random notes

Buck's research is very much informed by how he thinks AI will go and what type of stuff will be useful in those situations. A lot of other research isn't done like this. Many other researchers follow heuristics like, "Is this interesting?", "Is this tractable?", or "Are other people doing it?" It might turn out that the latter strategy is in fact better, but Buck is still willing to bet on the former strategy for now.

  1. ^

    Following Ajeya Cotra's takeover scenario post.

  2. ^

    Named after a fictional king who had to decide which of his daughters should succeed him. He chose the ones that acted nicely, but it turned out they were only acting nicely in order to be chosen. (Or something like that. I haven't read the play.)
