Beliefs and Disagreements about Automating Alignment Research

Ian McKenzie

Epistemic status: Mostly organizing and summarizing the views of others.

Thanks to those whose views I summarized in this post, and to Tamera Lanham, Nicholas Kees Dupuis, Daniel Kokotajlo, Peter Barnett, Eli Lifland, and Logan Smith for reviewing a draft.

Introduction

In my current view of the alignment problem, there are two paths that we could try to take:

1. Come up with an alignment strategy that allows us both to build aligned AGI and to keep that AGI (or its successors) aligned as they improve towards superintelligence
2. Come up with an alignment strategy that allows us to build AI systems that are powerful (but not so powerful as to be themselves dangerous) and use them to execute some kind of ‘pivotal act’ that ensures misaligned ASI is not built

For the purposes of this post, I am going to assume that we are unable to do (1) – maybe the problem is too difficult, or we don’t have time – and focus on (2).

Within the category of ‘pivotal act’, I see two main types:

Preventative pivotal acts: acts that make it impossible for anyone to build AGI for a long period of time
Constructive pivotal acts: acts that make it possible to build aligned ASI

People disagree about whether preventative pivotal acts are possible and, even if they are, whether they would be a good idea. Again, for the purposes of this post, I am going to assume we can’t or don’t want to execute a preventative pivotal act, and focus on constructive pivotal acts. In particular: can we use AI to automate alignment research safely?

What does ‘automating alignment research’ even mean?

I see three overlapping categories that one could mean when referring to ‘automating alignment research’, ordered in terms of decreasing human involvement:

Level 1: AIs help humans work faster. Examples include brainstorming, intelligent autocomplete, and automated summarization/explanation.
Level 2: AIs produce original contributions. This could be key insights into the nature of intelligence, additional problems that were overlooked, or entire alignment proposals.
Level 3: AIs build aligned successors. Here, we have an aligned AGI that we entrust with building a successor. At this point, the current aligned AGI has to do all the alignment research required to ensure that its successor is aligned.

Mostly I have been thinking about Levels 1 and 2, although some people I spoke to (e.g. Richard Ngo) were more focused on Level 3.

Current state of automating alignment

At the moment, we are firmly at Level 1. Models can produce similar-sounding ideas when prompted with existing ideas and are pretty good at completing code but are not great at summarizing or explaining complex ideas. Tools like Loom and Codex can provide speed-ups but seem unlikely to be decisive.

Whether we get to Level 2 soon, and whether Level 2 is already beyond the point where AI systems become dangerous, are key questions on which researchers disagree.

Key disagreements

Generative models vs agents

Much of the danger from powerful AI systems comes from them pursuing coherent goals that persist across inputs. If we can build generative models that do not pursue goals in this way, then perhaps these will provide a way to extract intelligent behavior from advanced systems safely.

Timing of emergence of deception vs intelligence

Related to the problem of agents, there is also disagreement about whether we get systems that are intelligent enough to be useful for automating alignment before they are misaligned enough (e.g. deceptive or power-seeking) to be dangerous. My understanding is that Nate Soares and Eliezer Yudkowsky are quite confident that useful intelligence arrives only after systems are already misaligned, whereas most other people are more uncertain about this.

The ‘hardness’ of generating alignment insights

This could be seen as another framing of the above point: how smart does the system have to be to do useful, original thinking for us? Does it have to have a comprehensive understanding of how minds in general work, or can original insights be generated by cleverly combining John Wentworth posts, or John Wentworth posts with Paul Christiano posts?

The benefits (in terms of time saved) of Level 1 interventions

It is unclear how much time is saved by Level 1 interventions: if all alignment researchers were regularly using Loom to write faster, brainstorming with GPT-3, and coding with Copilot, would this result in an appreciable speed-up of alignment work?

Summaries of viewpoints on automating alignment research

Below, I summarize the positions of various alignment researchers I have spoken to about this topic. Where possible, I have had the people in question review my summary to ensure I am not misrepresenting them too badly.

Nate Soares (unreviewed)

Solving alignment requires understanding how to control minds. If we want AI systems to solve the hard parts of alignment for us, then necessarily they will understand how to control minds in a way that we do not. Understanding how to control minds requires a ‘higher grade of cognition’ than most engineering tasks, and so a system capable enough to solve alignment is also capable of doing many dangerous things (we cannot teach AIs to drive a blue car without them also being able to drive red cars). The good outcomes that we want (complete, working alignment solutions) are a sufficiently small target that we do not know how to direct a dangerous AI towards that outcome: doing this safely is precisely the alignment problem, and so we have not made our task meaningfully easier.

You don’t get around this by saying you’re using a specific architecture or technique, like scaling up GPTs. You are trying to channel the future into a specific, small target – a world where we have ended the acute risk period from AGI and have time to contemplate our values or have a long reflection – and this channeling is where the danger lies.

You can maybe use models like GPT-3 or similar to help with brainstorming or summarizing or writing, but this is not where most of the difficulty or speed-up comes from. If your definition of ‘automating alignment’ includes speeding up alignment researchers running experiments then Codex already does this, but this does not mean that we will solve alignment in time.

John Wentworth (reviewed)

In theory, we know of one safe outer objective for automating alignment research: simulate alignment researchers in some environment. However, there are many issues with this in practice. For example, if you want to train a generative model on a bunch of existing data and use this to generate a Paul Christiano post from 2040, it needs to generalize extremely well to faithfully predict what Paul will write about alignment in 2040. However, we also need to avoid having it predict (perhaps accurately) that the most likely outcome is that there is an unaligned AGI in 2040 faking a Paul post.

In general, when you move away from ‘just simulating’ people to something else that applies more optimization pressure, things fail in subtle ways. If we are pretty close to solving alignment already, then we don’t have to apply too much optimization pressure – going from a 50% chance of solving the problem in time to a 100% chance is just 1 bit of optimization, but going from 1 in a million to 100% is a much harder task and is much more dangerous.
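The arithmetic behind the ‘1 bit’ figure can be sketched as follows. This is a minimal illustration that assumes optimization is measured as the base-2 logarithm of the ratio between the target probability and the starting probability; the function name `bits_of_optimization` is hypothetical and not taken from the conversation summarized here.

```python
import math


def bits_of_optimization(p_start: float, p_target: float = 1.0) -> float:
    """Bits needed to move from probability p_start to p_target.

    Assumed measure: log2(p_target / p_start).
    """
    return math.log2(p_target / p_start)


# Starting from a 50% chance of solving the problem in time.
print(bits_of_optimization(0.5))

# Starting from a 1-in-a-million chance.
print(bits_of_optimization(1e-6))
```

Under this assumed measure, the first case is exactly 1 bit and the second is roughly 20 bits, which is one way to see why the second scenario demands far more optimization pressure.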

It is very hard to know how close we are to solving alignment because the alignment community is still quite confused about the problem. The obvious way to reduce the amount of optimization pressure we need to apply is to do more alignment research ourselves such that the gap between the starting point of optimization and the goal is smaller. The optimization we apply by directly doing alignment research is safer insofar as we have introspective access to the processes that produce our insights, and can check if we expect them to reliably lead to good outcomes.

Some AI-assisted tools like autocomplete or improved Google Scholar could be useful, but the bottom line is that we can’t really have the AI do the hard parts without confronting the problems arising from powerful optimization.

One possible way to get around these problems is to leverage the safety of ‘simulate alignment researchers in a stable environment’ by running this safe simulation very fast. If we had arbitrary technical capabilities at our disposal, this might work. However, our current technology, generative models, would not work even if scaled up. This is because they make predictions about a conditional world, not a counterfactual world. What we really want is to put our alignment researchers in a counterfactual world where ‘unaligned AGI takes over’ is much less likely but people are still working on the alignment problem. This would mean that when we ask for a Paul post from 2040, we get a Paul post that actually solves alignment rather than one that was written by an unaligned AGI.

Evan Hubinger (reviewed)

Generative models can be very powerful, and constitute a type of intelligence that is not inherently goal-directed. GPT-3 provided evidence that not every intelligent system is (or approximates) a coherent agent. If we can be sure that we have built a powerful generative model (and not a system that appears to behave like a generative model during training) then we should be able to get it to safely and productively produce alignment research.

The hard part is ensuring that it really is a generative model – i.e. that it really is just simulating the processes that generated its training data. Inner alignment is the main problem in this framing: there may be pressures in the training process that mean systems that get sufficiently low loss on the training objective no longer act as pure simulators and instead implement some kind of consequentialism.

Ethan Perez (reviewed)

We should be trying to automate alignment research with AI systems. It’s not clear that getting useful alignment work out of AI systems requires levels of intelligence that are necessarily misaligned or power-seeking. It’s not clear in which order ‘capable of doing useful stuff’ and ‘deceptively aligned’ arise in these systems – current models can talk competently about deception but are not themselves deceptive. It remains to be seen whether building assistants that can help solve the alignment problem is easier or harder than directly building an alignment strategy that holds all the way to superintelligence.

However, we don’t currently know the best way to use powerful AI systems to help with alignment, so we should be building lots of tools that can have more powerful AI ‘plugged in’ when it is available. We should be a little careful about building a tool that is also useful for capabilities, but capabilities researchers don’t pay as much attention to the alignment community as we sometimes imagine, similar ideas are already out there, and we can capture a lot of value by building such tools early.

Richard Ngo (edited and endorsed by Richard)

Having AI systems help with alignment in some capacity is an essential component of the long-term plan. The most likely path to superintelligence involves a lot of AI assistance. So ‘using AIs to align future AIs’ is less of a plan than a description of the default path – the question is which alignment proposals help most in aligning the AIs that will be doing that later work. I feel pretty uncertain about how dangerous the first AIs capable of superhuman alignment research will be, but tend to think that they'll be significantly less power-seeking than humans are.

It’s hard to know in advance specifically what ‘automating alignment’ will look like except taking our best systems and applying them as hard as we can; so the default way to make progress here is just to keep doing regular alignment research to build a foundation we can automate from earlier. For example, if mechanistic interpretability research discovers some facts about how transformers work, we can train on these and use the resulting system to discover new facts.

Conclusion

There is no consensus on how much automating alignment research can speed up progress. In hindsight, it would have been good to get more quantitative estimates of the type of speed-up each person expected to be possible. There seems to be sufficient uncertainty that investigating the possibility further makes a lot of sense, especially given the lack of current clear paths to an alignment solution. In future posts I will aim to go into more detail on some proposed mechanisms by which alignment could be accelerated.

I ended up talking to Ian mostly about the difficulties of simulating/predicting alignment researchers because I expected that to be the topic with the most important gaps in coverage from the other people he talked to. Having read this post, there's a different class of "automating alignment research" proposals which I want to talk about more. The general pattern is: study the process of science, figure out some key bottleneck thoroughly enough to make it legible, and then automate that bottleneck.

Example 1: Measurement Devices

I started to explain this and it turned into a whole post. Short version: any measurement device automates a part of the research process (the part where a human observes something). But a "thermometer" which automatically mimicked a human sticking their finger in something and reporting its hotness in natural language would be a lot less useful than the thermometers which we actually use. Most of the value comes, not from the automation, but from noticing some robust pattern in the world: the fact that we can use a single number ("temperature") to reproducibly and precisely predict a broad class of interactions (e.g. which of two things will get hotter/colder when the two things are put in contact).

Example 2: "AI Feynman"

AI Feynman is a project out of Max Tegmark's lab. The main interesting part is to automatically notice certain kinds of structure, like locality or additivity, in data. It's the sort of thing where you could feed it measurement-streams from hundreds of sensors in a piston, and it might rediscover the Ideal Gas Law, or at least figure out that pressure, volume, and temperature summarize all the key information.
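As a toy illustration of "rediscovering" a law from data, the sketch below fits a power law to synthetic gas measurements. It is not AI Feynman's actual method: it assumes the relevant variables (temperature and volume) are already known and that the law is a pure power law, so an ordinary least-squares fit in log space is enough. All names (`fit_power_law`, `solve_3x3`, `n_moles`) and the synthetic data are hypothetical.

```python
import math
import random

R = 8.314  # gas constant in J/(mol*K)


def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_power_law(samples):
    """Least-squares fit of log P = c + a*log T + b*log V; returns [c, a, b]."""
    rows = [(1.0, math.log(t), math.log(v)) for t, v, _ in samples]
    ys = [math.log(p) for _, _, p in samples]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_3x3(xtx, xty)


# Synthetic, noise-free "sensor readings" generated from P = nRT / V.
rng = random.Random(0)
n_moles = 2.0
samples = []
for _ in range(200):
    t = rng.uniform(250.0, 400.0)  # temperature in kelvin
    v = rng.uniform(0.01, 0.1)     # volume in cubic metres
    samples.append((t, v, n_moles * R * t / v))

c, a, b = fit_power_law(samples)
print(round(a, 6), round(b, 6))
```

The fitted exponents should come out close to 1 for temperature and -1 for volume, i.e. P is proportional to T/V. The hard step described above, going from hundreds of raw sensor streams to the few variables that matter, is exactly what this toy skips.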

I don't necessarily think AI Feynman itself is going to revolutionize anything, but I could imagine something like it being a big deal. The key is to automate the step of science where we look at some very-high-dimensional real world stuff interacting, and back out the relatively-low-dimensional parameters which actually matter for the relatively-long-range interactions.

Example 3: Automated Inter-Researcher Interfaces

Occasionally I've heard proposals to automate distillation, or rubber ducking, or even writing up new research. In principle, I think there's a lot of potential there. In practice, I think the vast majority of proposals start from e.g. "Here's some neat ML tech, how could we apply it to distillation?" rather than "What are the main ways good distillations provide value, and how can we reduce the cost of that?". I'm much more optimistic about people starting with the nail than the hammer. Going one step further: don't just start with a story about how good distillations provide value, go find some actual distillations which are actually providing lots of value and study those; consider what their key load-bearing features are. Maybe try writing a few yourself, to better understand which steps are difficult and also double-check whether the key load-bearing features you identified are indeed sufficient to generate a high-value distillation. Do all that before asking about the hammer.

The General Pattern

The general pattern in these examples: deeply understand a particular bottleneck to the scientific process. Once we understand it deeply and legibly enough, automation should be straightforward. The main failure mode, in all cases, is to jump into automating without really understanding what it is that we're automating.

Science and General Intelligence

Now, this sort of strategy is not easy. Lots of people have tried to make the scientific process more legible over the past couple centuries, and most of them have done a pretty shit job. (Karl Popper gets a special callout for doing a completely shit job, but being a sufficiently successful salesperson that his shitty model of "the scientific method" was basically what my science teachers taught me in middle school.)

But the study of science overlaps particularly well with the study of general intelligence; insights in one usually correspond directly to insights in the other. For instance, compare these two questions:

How do scientists notice the few summary variables which matter for physical laws (like e.g. pressure, volume, temperature of a gas) when faced with a very-high-dimensional real-world system?
How do generally intelligent systems figure out which few variables to remember, pay attention to, etc, when faced with very-high-dimensional sensor data?

Or:

How do scientists narrow down the exponentially large search space of possible physical laws?
How do generally intelligent systems narrow down the exponentially large search space of possible world models?

Because of this correspondence, I expect that insights into general intelligence will produce corresponding insights into how to do science better. Indeed, insofar as research into general intelligence doesn't produce insights into how to do science better, it's probably on the wrong track.

I kinda want to hear the level 2 comments where people just disagree with each others' takes :P

My personal thoughts on the matter these days are about training data. Minerva shows that you can solve really tricky problems and help humans, if you just have a finetuning set of 5 billion tokens of people solving similar (and even harder) problems clearly and correctly.

What can/can't you get that quality of training data on that would help alignment research? I think the bad news for Level 2 here is that we don't even have 5 billion tokens of alignment research, period (the alignment research dataset is something like 100M tokens), and the fraction of it that consists of clear and correct solutions to problems is quite small.

So either you content yourself with Level 1 tools like writing / coding assistants trained on broad data, or you get a few orders of magnitude better at learning how to generate useful ideas from limited data, which sounds... concerning.

What does Loom refer to? Not Loom.com, the service for recording video snippets of your screen, right?

It's a tool for interacting with language models: https://generative.ink/posts/loom-interface-to-the-multiverse/
