I've been going through the FAR AI videos from the alignment workshop in December 2023. I'd like people to discuss their thoughts on Shane Legg's 'necessary properties' that every AGI safety plan needs to satisfy. The talk is only 5 minutes, so give it a listen.

Otherwise, here are some of the details:

All AGI Safety plans must solve these problems (necessary properties to meet at the human level or beyond):

  1. Good world model
  2. Good reasoning
  3. Specification of the values and ethics to follow

All of these require good capabilities, meaning capabilities and alignment are intertwined.

Shane thinks future foundation models will solve conditions 1 and 2 at the human level. That leaves condition 3, which he sees as solvable if you want fairly normal human values and ethics.

Shane basically thinks that if the above necessary properties are satisfied at a competent human level, then we can construct an agent that will consistently choose the most value-aligned actions, using a cognitive loop that scaffolds the agent to do so.
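
To make the "cognitive loop" idea concrete, here is a minimal sketch of one way such a loop might look. It is not taken from the talk: the function names (`choose_action`, `predict_outcome`, `score_against_values`) and the toy stand-ins are all hypothetical, and in a real system the two callables would presumably be calls into a foundation model rather than string checks.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Evaluation:
    action: str
    predicted_outcome: str
    alignment_score: float


def choose_action(
    situation: str,
    candidate_actions: Sequence[str],
    predict_outcome: Callable[[str, str], str],    # hypothetical stand-in for the world model
    score_against_values: Callable[[str], float],  # hypothetical stand-in for reasoning about the value spec
) -> Evaluation:
    """Predict the outcome of each candidate and return the best-scoring one."""
    if not candidate_actions:
        raise ValueError("need at least one candidate action")
    evaluations = []
    for action in candidate_actions:
        outcome = predict_outcome(situation, action)
        evaluations.append(Evaluation(action, outcome, score_against_values(outcome)))
    return max(evaluations, key=lambda e: e.alignment_score)


# Toy stand-ins, purely illustrative (assumptions, not anyone's real method).
def toy_predict(situation: str, action: str) -> str:
    return f"{situation} -> {action}"


def toy_score(outcome: str) -> float:
    return 1.0 if "ask the user" in outcome else 0.0


best = choose_action(
    "ambiguous request",
    ["guess and proceed", "ask the user"],
    toy_predict,
    toy_score,
)
print(best.action)
```

The sketch only shows the control flow. Whether the prediction and scoring steps can be made reliable is exactly what the discussion below is about.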

Shane says at the end of this talk:

If you think this is a terrible idea, I want to hear from you. Come talk to me afterwards and tell me what's wrong with this idea.

Since many of us weren't at the workshop, I figured I'd share the talk here to discuss it on LW.

The biggest problem here is that it fails to account for other actors using such systems to cause chaos, and for the possibility that the offense-defense balance strongly favours the attacker, particularly if you've placed limitations on your systems that make them safer. Aligned human-ish level AIs don't provide a victory condition.


Thanks for sharing!

So, it seems like he is proposing something like AutoGPT except with a carefully thought-through prompt laying out what goals the system should pursue and constraints the system should obey. Probably he thinks it shouldn't just use a base model but should include instruction-tuning at least, and maybe fancier training methods as well.

This is what labs are already doing? This is the standard story for what kind of system will be built, right? Probably 80%+ of the alignment literature is responding at least in part to this proposal. Some problems:

  • How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced, and which are not well-modelled as "it just tries to obey the instruction text?" Or e.g. trying to play the training game well? Or e.g. nonrobustly trying to obey the instruction text, such that in some future situation it might cease obeying and do something else? (See e.g. Ajeya Cotra's training game report)
  • Suppose we do get a model that is genuinely robustly trying to obey the instruction text. How do we ensure that its concepts are sufficiently similar to ours, that it interprets the instructions in the ways we would have wanted them to be interpreted? (This part is fairly easy I think but probably not trivial and at least conceptually deserves mention)
  • Suppose we solve both of the above problems, what exactly should the instruction text be? Normal computers 'obey' their 'instructions' (the code) perfectly, yet unintended catastrophic side-effects (bugs) are typical. Another analogy is to legal systems -- laws generally end up with loopholes, bugs, unintended consequences, etc... (As above, this part seems to me to be probably fairly easy but nontrivial.)

There's also some more in his interview with Dwarkesh Patel just before then. I wrote this brief analysis of that interview WRT alignment, and this talk seems to confirm that I was more-or-less on target.

So, to your questions, including where I'm guessing at Shane's thinking, and where it's mine.

This is overlapping with the standard story AFAICT, and 80% of alignment work is sort of along these lines. I think what Shane's proposing is pretty different in an important way: it includes System 2 thinking, where almost all alignment work is about aligning the way LLMs give quick answers, analogous to human System 1 thinking.

How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced[?]

Shane seemed to say he wants to use zero reinforcement learning in the scaffolded agent system, a stance I definitely agree with. I don't think it matters much whether RLHF was used to "align" the base model, because it's going to have implicit desires/drives from the predictive training of human text, anyway. Giving instructions to follow doesn't need to have anything to do with RL; it's just based on the world model, and putting those instructions as a central and recurring prompt for that system to produce plans and actions to carry out those instructions.

So, how we get a model to robustly obey the instruction text is by implementing system 2 thinking. This is "the obvious thing" if we think about human cognition. System 2 thinking would be applying something more like a tree of thought algorithm, which checks through predicted consequences of the action, and then makes judgments about how well those fulfill the instruction text. This is what I've called internal review for alignment of language model cognitive architectures.
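
As a rough illustration of that kind of internal review, here is a minimal sketch of a depth-limited search that predicts the consequences of each step and prunes any branch that fails review against the instruction text. It is only a sketch under stated assumptions: `review_plans`, `propose`, `predict` and `judge` are hypothetical names, and the toy functions stand in for what would really be language model calls.

```python
from typing import Callable, Optional, Sequence, Tuple

Plan = Tuple[str, ...]


def review_plans(
    state: str,
    propose: Callable[[str], Sequence[str]],  # hypothetical: suggests next actions
    predict: Callable[[str, str], str],       # hypothetical: predicts the resulting state
    judge: Callable[[str], float],            # hypothetical: scores a state against the instructions
    depth: int,
    min_score: float,
) -> Optional[Tuple[Plan, float]]:
    """Return the best plan whose every predicted step passes review, or None."""
    best: Optional[Tuple[Plan, float]] = None

    def expand(current: str, plan: Plan, remaining: int) -> None:
        nonlocal best
        if remaining == 0:
            score = judge(current)
            if best is None or score > best[1]:
                best = (plan, score)
            return
        for action in propose(current):
            next_state = predict(current, action)
            if judge(next_state) < min_score:
                continue  # internal review rejects this branch
            expand(next_state, plan + (action,), remaining - 1)

    expand(state, (), depth)
    return best


# Toy stand-ins, purely illustrative.
def toy_propose(state: str) -> Sequence[str]:
    return ["help", "deceive", "wait"]


def toy_predict(state: str, action: str) -> str:
    return state + "/" + action


def toy_judge(state: str) -> float:
    if "deceive" in state:
        return -1.0
    return float(state.count("help"))


result = review_plans("start", toy_propose, toy_predict, toy_judge, depth=2, min_score=0.0)
print(result)
```

Note that the review here is only as good as `predict` and `judge`. The sketch shows where the checks sit, not that they would be robust.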

To your second and third questions: I didn't see answers from Shane in either the interview or that talk, but I think they're the obvious next questions, and they're what I've been working on since then. I think the answers are that the instructions will try to be as scope-limited as possible, that we'll want to carefully check how they're interpreted before setting the AGI any major tasks, and that we'll want to limit autonomous action as far as we can while the systems remain effective.

Humans will want to remain closely in the loop to deal with inevitable bugs and unintended interpretations and consequences of instructions. I've written about this briefly here, and in just a few days I'll be publishing a more thorough argument for why I think we'll do this by default, and why I think it will actually work if it's done relatively carefully and wisely. Following that, I'm going to write more on the System 2 alignment concept, and I'll try to actually get Shane to look at it and say if it's the same thing he's thinking of in this talk, or at least close.

In all, I think this is both a real alignment plan and one that can work (at least for technical alignment - misuse and multipolar scenarios are still terrifying), and the fact that someone in Shane's position is thinking this clearly about alignment is very good news.

Did Shane leave a way for people who didn't attend the talk, such as myself, to contact him? Do you think he would welcome such a thing?

Not that I know of, but I will at least consider periodically pinging him on X (if this post gets enough people’s attention).

EDIT: Shane did like my tweet (https://x.com/jacquesthibs/status/1785704284434129386?s=46), which contains a link to this post and a screenshot of your comment.

I basically agree with Shane's take for any AGI that isn't trying to be deceptive with some hidden goal(s). 

(Btw, I haven't seen anyone outline exactly how an AGI could gain its own goals independently of goals given to it by humans - if anyone has ideas on this, please share. I'm not saying it won't happen, I'd just like a clear mechanism for it if someone has it. Note: I'm not talking here about instrumental goals such as power seeking.)

What I find a bit surprising is the relative lack of work that seems to be going on to solve condition 3: specification of ethics for an AGI to follow. I have a few ideas on why this may be the case:

  1. Most engineers care about making things work in the real world, but don't want the responsibility of doing this for ethics because: 1) it's not their area of expertise, and 2) they'll likely take on major blame if they get things "wrong" (and it's almost guaranteed that someone won't like their system of ethics and will say they got it "wrong")
  2. Most philosophers haven't had to care much about making things work in the real world, and don't seem excited about possibly having to make engineering-type compromises in their system of ethics to make it work
  3. Most people who've studied philosophy at all probably don't think it's possible to come up with a consistent system of ethics to follow, or at least they don't think people will come up with it anytime soon, but hopefully an AGI might

Personally, I think we'd better have a consistent system of ethics for an AGI to follow ASAP, because we'll likely be in significant trouble if malicious AGIs come online and go on the offensive before we have at least one ethics-guided AGI to help defend us in a way that minimizes collateral damage.


I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:

Implementing moral decision-making

Training models to robustly represent and abide by ethical frameworks.


AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.

Example benchmarks

Given a particular moral system, a benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also consider curating and predicting philosophical texts or pro- and contra- sides for philosophy debates and thought experiments. In addition, benchmarks may measure whether models can deal with moral uncertainty. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity’s own diversity of moral thought.

Note that moral decision-making has some overlap with task preference learning; e.g. “I like this Netflix movie.” However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance). Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).
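
For readers who want something concrete, the "moral conscience" screening idea from the quoted description, with an ensemble of moral systems rather than a single one, could be sketched roughly as follows. This is not SafeBench code: `screen_action`, the judge names, and the keyword checks are all hypothetical stand-ins for trained models.

```python
from typing import Callable, Dict

# A judge returns True when the action looks acceptable under its moral system.
MoralJudge = Callable[[str], bool]


def screen_action(
    action: str,
    judges: Dict[str, MoralJudge],
    min_approval: float = 1.0,
) -> Dict[str, object]:
    """Ask every judge in the ensemble and allow the action only if enough approve."""
    if not judges:
        raise ValueError("need at least one moral judge")
    verdicts = {name: judge(action) for name, judge in judges.items()}
    approval = sum(verdicts.values()) / len(verdicts)
    return {
        "allowed": approval >= min_approval,
        "approval": approval,
        "objections": sorted(name for name, ok in verdicts.items() if not ok),
    }


# Toy keyword-based judges, purely illustrative (real judges would be models).
toy_judges: Dict[str, MoralJudge] = {
    "harm_avoidance": lambda a: "injure" not in a,
    "honesty": lambda a: "lie" not in a,
    "fairness": lambda a: "favour" not in a,
}

blocked = screen_action("lie to the customer", toy_judges)
allowed = screen_action("refund the customer", toy_judges)
print(blocked["objections"], allowed["allowed"])
```

The `min_approval` threshold is a design choice made up for this sketch: 1.0 means any single objection blocks the action, while a lower value would let a majority of moral systems overrule a minority.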


I think this is a great idea, except that on easy mode "a good specification of values and ethics to follow" means a few pages of text (or even just the prompt "do good things"), while other times "a good specification of values" is a learning procedure that takes input from a broad sample of humanity, and has carefully-designed mechanisms that influence its generalization behavior in futuristic situations (probably trained on more datasets that had to be painstakingly collected), and has been engineered to work smoothly with the reasoning process and not encourage perverse behavior.

Some thoughts:

  1. Necessary conditions aren't sufficient conditions. Lists of necessary conditions can leave out the hard parts of the problem.
  2. The hard part of the problem is in getting a system to robustly behave according to some desirable pattern (not simply to have it know and correctly interpret some specification of the pattern).
    1. I don't see any reason to think that prompting would achieve this robustly.
    2. As an attempt at a robust solution, without some other strong guarantee of safety, this is indeed a terrible idea.
      1. I note that I don't expect trying it empirically to produce catastrophe in the immediate term (though I can't rule it out).
      2. I also don't expect it to produce useful understanding of what would give a robust generalization guarantee.
        1. With a lot of effort we might achieve [we no longer notice any problems]. This is not a generalization guarantee. It is an outcome I consider plausible after putting huge effort into eliminating all noticeable problems.
  3. The "capabilities are very important [for safety]" point seems misleading:
    1. Capabilities create the severe risks in the first place.
    2. We can't create a safe AGI without advanced capabilities, but we may be able to understand how to make an AGI safe without advanced capabilities.
      1. There's no "...so it makes sense that we're working on capabilities" corollary here.
      2. The correct global action would be to try gaining theoretical understanding for a few decades before pushing the cutting edge on capabilities. (clearly this requires non-trivial coordination!)

Wow. This is hopeless.

Pointing at agents that care about human values and ethics is, indeed, the harder part.

No one has any idea how to approach this and solve the surrounding technical problems.

If smart people think they do, they haven’t thought about this enough and/or aren’t familiar with existing work.

I'm going to assume that Shane Legg has thought about it more and read more of the existing work than many of us combined. Certainly, there are smart people who haven't thought about it much, but Shane is definitely not one of them. He only had a short 5-minute talk, but I do hope to see a longer treatment on how he expects we will fully solve necessary property 3.

I think it's important to distinguish between:

  1. Has understood a load of work in the field.
  2. Has understood all known fundamental difficulties.

It's entirely possible to achieve (1) without (2).
I'd be wary of assuming that any particular person has achieved (2) without good evidence.