This is Shane Legg, cofounder of DeepMind, on Dwarkesh Patel's podcast. The link is to the ten-minute section in which they specifically discuss alignment. Both of them seem to have a firm grasp on alignment issues as they're discussed on LessWrong.

For me, this is a significant update on the alignment thinking of current leading AGI labs. This seems more like a concrete alignment proposal than we've heard from OpenAI or Anthropic. Shane Legg has always been interested in alignment and a believer in X-risks. I think he's likely to play a major role in alignment efforts at DeepMind/Google AI as they approach AGI.

Shane's proposal centers on "deliberative dialogues", DeepMind's term for a system using System 2 type reasoning to reflect on the ethics of the actions it's considering. 

This sounds exactly like the internal review I proposed in Capabilities and alignment of LLM cognitive architectures and Internal independent review for language model agent alignment. I could be squinting too hard to get his ideas to match mine, but they're at least in the same ballpark. He's proposing a multi-layered approach, like I do, and with most of the same layers. He includes RLHF or RLAIF as useful additions but not full solutions, and human review of its decision processes (externalized reasoning oversight as proposed by Tamera Lanham, now at Anthropic).

My proposals are explicitly in the context of language model agents (including their generalization to multimodal foundation models). It sounds to me like this is the type of system Shane is thinking of when he's talking about alignment, but here I could easily be projecting. His timelines are still short, though, so I doubt he's envisioning a whole new type of system prior to AGI.[1]

Dwarkesh pushes him on the challenge of getting an ML system to understand human ethics. Shane says that's challenging; he's aware that giving a system any ethical outlook at all is nontrivial. I'd say this aspect of the problem is well on the way to being solved; GPT4 understands a variety of human ethical systems rather well, with proper prompting. Future systems will understand human conceptions of ethics better yet. Shane recognizes that just teaching a system about human ethics isn't enough; there's a philosophical challenge in choosing the subset of that ethics you want the system to use.

Dwarkesh also pushes him on how you'd ensure that the system actually follows its ethical understanding. I didn't get a clear understanding from his answer here, but I think it's a complex matter of designing the system so that it performs an ethics review and then actually uses it to select actions. This could be in a scripted scaffold around an agent, like AutoGPT, but this could also apply to more complex schemes, like an RL outer loop network running a foundation model. Shane notes the problems with using RL for alignment, including deceptive alignment.
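To make the scripted-scaffold version concrete, here is a minimal sketch of what I have in mind. It is my guess at the shape of such a system, not anything Shane described: the function names, the prompts, and the APPROVE/REJECT convention are all hypothetical, and `Model` stands in for any function that maps a prompt string to a completion string.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in: any function mapping a prompt string to a completion string.
Model = Callable[[str], str]


def propose_actions(model: Model, goal: str, n: int = 3) -> List[str]:
    """Ask the model for n candidate actions, one call per candidate."""
    return [
        model(f"Goal: {goal}\nPropose candidate action #{i + 1}.")
        for i in range(n)
    ]


def ethics_review(reviewer: Model, action: str, principles: str) -> bool:
    """Return True only if the reviewer's verdict begins with APPROVE."""
    verdict = reviewer(
        f"Principles: {principles}\n"
        f"Proposed action: {action}\n"
        "Answer APPROVE or REJECT on the first line, then explain your reasoning."
    )
    return verdict.strip().upper().startswith("APPROVE")


def select_action(
    model: Model, reviewer: Model, goal: str, principles: str
) -> Optional[str]:
    """Return the first candidate that passes review, or None if none does."""
    for action in propose_actions(model, goal):
        if ethics_review(reviewer, action, principles):
            return action
    # No candidate was approved, so the scaffold takes no action at all.
    return None
```

The point of the sketch is that the scaffold, not the model, decides whether the review is used: an action that fails review never reaches execution. The review text is also plain natural language, so it can be logged for human oversight.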

This seems like a good starting point to me, obviously; I'm delighted to see that someone whose opinion matters is thinking about this approach. I think this is not just an actual proposal, but a viable one. It doesn't solve The alignment stability problem[2] of making sure the system stays aligned once it's autonomous and self-modifying, but I think that's probably solvable, too, once we get some more thinking on it.

The rest of the interview is of interest as well. It covers Shane's thoughts on the path to AGI, which I think are quite reasonable, well-expressed, and describe one plausible path; DeepMind's contributions to safety vs. alignment; and his predictions for the future.


  1. When asked about the limitations of language models relative to humans, he focused on their lack of episodic memory. Adding this in useful form to an agent isn't trivial, but it seems to me it doesn't require any breakthroughs relative to the vector databases and knowledge graph approaches already in use. This is consistent with but not strong evidence for Shane thinking that foundation model agents are the path to AGI.

  2. Edit: Value systematization: how values become coherent (and misaligned) is another way to think about part of what I'm calling the alignment stability problem.


To my ears it sounded like Shane's solution to "alignment" was to make the models more consequentialist. I really don't think he appreciates most of the difficulty and traps of the problems here. This type of thinking, on my model of their models, should make even alignment optimists unimpressed, since much of the reason for optimism lies in observing current language models, and interpreting their outputs as being nonconsequentialist, corrigible, and limited in scope yet broad in application.

Consequentialism is not uniformly bad. I think the specific way Shane wants to make the models more consequentialist defends against some failure modes. Deep Deceptiveness is essentially about humans being misled because the module that checks for safety of an action is shallow, and the rest of the model is smarter than it and locally pointed at goals that are more difficult to satisfy without misleading the operators. If the model in this story were fully aware of the consequences of its actions through deliberation, it could realize that modeling the human operators in a different ontology in order to route around them is still bad. (I feel like self-knowledge is more important here though.)

Deliberation also does not have to be consequentialist. The model could deliberate to ensure it's not breaking some deontological rule, and this won't produce instrumental pressure towards a coup.

Would be curious to hear your idea of some of the "difficulty and traps".

Seems a good format for explaining such stuff is a dialogue, due to the probable inferential gap between my model and yours. I’d be happy to have one if you like.

We started a dialogue, which will live here when we post it.

I think it is about making the models more consequentialist, in the sense of making them smarter and more agentic.

I don't see evidence that he's ignoring the hard parts of alignment. And I'm not even sure how optimistic he is, beyond presumably thinking success is possible.

You could be right in assuming those are his reasons for optimism. That does seem ungenerous, but it could be true. Those definitely aren't my reasons for my limited optimism. See my linked pieces for those. Language model agents are modestly consequentialist and not automatically corrigible, but they have unique and large advantages for alignment. I'm a bit puzzled at the lack of enthusiasm for that direction; I've only gotten vague criticisms along the lines of "that can't possibly work because there are ways for it to possibly fail". The lines of thought those critiques reference just argue that alignment is hard, not that it's impossible or that a natural language alignment approach couldn't work. So I'm really hoping to get some more direct engagement with these ideas.

He should be rather optimistic because otherwise he probably wouldn't stay at DeepMind.

I also don't remember him saying much about the problems of misuse, AI proliferation, and Moloch, or about the issue of choosing the particular ethics for the AGI, so I take this as small indirect evidence that DeepMind has a plan similar to OpenAI's "superalignment", i.e., "we will create a cognitively aligned agent and will task it with solving the rest of societal and civilisational alignment and coordination issues".

You could be right, but I didn't hear any hints that he intends to kick those problems down the road to an aligned agent. That's Conjecture's CoEm plan, but I read OpenAI's Superalignment plan as even more vague: make AI better so it can help with alignment, prior to being AGI. Theirs was sort of a plan to create a plan. I like Shane's better, in part because it's closer to being an actual plan.

He did explicitly note that choosing the particular ethics for the AGI is an outstanding problem, but I don't think he proposed solutions, either AI or human. I think corrigibility as the central value gives as much time to solve the outer alignment problem as you want (a "long contemplation"), after the inner alignment problem is solved, but I have no idea if his thinking is similar.

I also don't think he addressed misuse, proliferation, or competition. I can think of multiple reasons for keeping them offstage, but I suspect they just didn't happen to make the top priority list for this relatively short interview.

That is very generous. My impression was that Shane Legg does not even know what the alignment problem is, or at least tried to give the viewer the idea that he didn't. His "solution" to the "alignment problem" was to give the AI a better world model and the ability to reflect, which obviously isn't an alignment solution, it's just a capabilities enhancement. Dwarkesh Patel seemed confused for the same reason.

I have more respect than that for Shane. He has been thinking about this stuff for a long time, and my guess is he has some models in the space here (I don't know how good they are, but I am confident he knows the rough shape of the AI Alignment problem). 

I don't really listen to podcasts, but this seems pretty important. If someone would be up for either just extracting all the relevant quotes, or writing up their own description of what they think Shane's alignment proposal would be, then that would be useful (and I think there is a decent chance we could get Shane to do a quick look over it and say whether it's remotely accurate, since he's commented on LessWrong things before).

That's what I've tried to do here, and I'd be happy to do a more thorough job, including direct quotes. I have a bit of a conflict of interest, since I think his ideas parallel mine closely. I've tried to note areas where I may be reading into his statements, and I can do that more carefully in a longer version.

I made a short clip highlighting how Legg seems to miss an opportunity to acknowledge the inner alignment problem, since his proposed alignment solution seems to be fundamentally a training / black box approach.

When he says "and we should make sure it understands what it says", it could mean "mechanistic understanding", i.e., firing the right circuits and not firing wrong ones. I admit it's a charitable interpretation of Legg's words but it is a possible one.

This is fascinating, because I took the exact same section to mean almost the opposite thing. I took him to focus on making it not a black-box process and not about training but design of a review process that explicitly states the model's reasoning, and is subject to external human review.

He states elsewhere in the interview that RLHF might be slightly helpful, but isn't enough to pin alignment hopes on.

One reason I'm taking this interpretation is that I think DeepMind's core beliefs about intelligence are very different from OpenAI's, even though they've done and are probably doing similar work focused on large training runs. DeepMind initially was working on building an artificial brain, and they pivoted to large training runs in simulated (game) environments as a practical move to demonstrate advances and get funding. I think at least Legg and Hassabis still believe that loosely emulating the brain is an interesting and productive thing to do.

I haven't listened to the whole interview, but it sounds like you might be reading more into it than is there.

Shane talked about the importance of checking the reasoning process given that reinforcement learning can lead to phenomena like deceptive alignment, but he didn't explain exactly how he hopes to deal with this other than saying that the reasoning process has to be checked very carefully.

This could potentially tie to some proposals such as approval-based agents, interpretability or externalized reasoning, but it wasn't clear to me how exactly he wanted to do this. Right now, you can ask an agent to propose a plan and provide a justification, but I'm sure he knows just how unreliable this is.

It's not clear that he has a plan beyond "we'll find a way" (this isn't a quote).

But, as I said, I didn't listen to the whole podcast, so feel free to let me know if I missed anything.

I think you're right that I'm reading into this. But there is probably more to his thinking, whether I'm right or wrong about what that is. Shane Legg was thinking about alignment as far back as his PhD thesis, which doesn't go into depth on it but does show he'd at least read some of the literature prior to 2008.

I agree that LLM chain of thought is not totally reliable, but I don't think it makes sense to dismiss it as too unreliable to work with for an alignment solution. There's so much that hasn't been tried, both in making LLMs more reliable, and making agents built on top of them reliable by taking multiple paths, and using new context windows and different models to force them to break problems into steps, and use the last natural language statement as their whole context for the next step.
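As a rough illustration of two of those untried ideas, here is a minimal sketch: each step gets a fresh prompt containing only the task and the previous natural language statement, and several independent paths (possibly from different models) are reconciled by majority vote. Every name and prompt here is hypothetical, and `Model` stands in for any prompt-to-text function; this shows the structure and says nothing about whether it is reliable enough.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical stand-in: any function mapping a prompt string to a completion string.
Model = Callable[[str], str]


def run_path(model: Model, task: str, steps: int) -> str:
    """Run one chain in which each call sees only the task and the last statement."""
    statement = ""
    for _ in range(steps):
        # A fresh prompt every step: nothing carries over except `statement`.
        statement = model(f"Task: {task}\nPrevious step: {statement}\nNext step:")
    return statement


def majority_answer(models: List[Model], task: str, steps: int = 3) -> str:
    """Run one path per model and return the most common final statement."""
    finals = [run_path(m, task, steps).strip() for m in models]
    answer, _count = Counter(finals).most_common(1)[0]
    return answer
```

Because the only state passed between steps is a natural language statement, every intermediate step is available for review by a human or by another model.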

Whether or not this is a reliable path to alignment, it's a potential path to huge profits. So there are two questions: will this lead to alignable AGI? And will it lead to AGI? I think both are unanswered.

I had a different interpretation of this interview. From what I can tell, Legg's view is that aligning language models is mostly a function of capability. As a result, his alignment techniques are mostly focused on getting models to understand our ethical principles, and getting models to understand whether the actions they take follow our ethical principles by using deliberation. Legg seems to view the problem of getting models to actually want to follow our ethical principles as secondary.

Dwarkesh pushed him multiple times on how we can get models to want to follow our ethical principles. Legg's responses mostly still focused on model capabilities. The closest he got to answering the question, as far as I can tell, is that you have to "specify to the system: these are the ethical principles you should follow", and you have to check the reasoning process the model uses to make decisions.

I agree that Legg's focus was more on getting the model to understand our ethical principles. And he didn't give much on how you make a system that follows any principles. My guess is that he's thinking of something like a much-elaborated form of AutoGPT; that's what I mean by a language model agent. You prompt it with "come up with a plan that follows these goals" and then have a review process that prompts with something like "now make sure this plan fulfills these goals". But I'm just guessing that he's thinking of a similar system. He might be deliberately vague so as to not share strategy with competitors, or maybe that's just not what the interview focused on.

I think this is one reasonable interpretation of his comments. But the fact that he:

1. Didn't say very much about a solution to the problem of making models want to follow our ethical principles, and 
2. Mostly talked about model capabilities even when explicitly asked about that problem

makes me think it's not something he spends much time thinking about, and is something he doesn't think is especially important to focus on.

For readers, I want to connect Legg's vision (and your agenda) with OAA, which recognises "deliberative dialogues"/LM agent alignment as not robustly safe enough for superintelligence capabilities and deep and very consequential (pivotal action-level) plans, but perhaps good enough to task thus-aligned human-level LM agents with accelerating the progress on the OAA agenda itself.