
To the extent that alignment research involves solving philosophical problems, it seems that in this approach we will also need to automate philosophy, otherwise alignment research will become bottlenecked on those problems (i.e., on human philosophers trying to solve those problems while the world passes them by). Do you envision automating philosophy (and are you optimistic about this) or see some other way of getting around this issue?

It worries me to depend on AI to do philosophy, without understanding what "philosophical reasoning" or "philosophical progress" actually consists of, i.e., without having solved metaphilosophy. I guess concretely there are two ways that automating philosophy could fail. 1) We just can't get AI to do sufficiently good philosophy (in the relevant time frame), and it turns out to be a waste of time for human philosophers to help train AI to do philosophy (e.g. by evaluating its outputs and providing feedback) or to try to use it as an assistant. 2) Using AI changes the trajectory of philosophical progress in a bad way (due to Goodhart, adversarial inputs, etc.), so that we end up accepting conclusions different from what we would have eventually decided on our own, or just wrong conclusions. It seems to me that humans are very prone to accepting bad philosophical ideas, but over the long run also have some mysterious way of collectively making philosophical progress. AI could exacerbate the former and disrupt the latter.

Curious if you've thought about this and what your own conclusions are. For example, does OpenAI have any backup plans in case 1 turns out to be the case, or ideas for determining how likely 2 is or how to make it less likely?

Also, aside from this, what do you think are the biggest risks with OpenAI's alignment approach? What's your assessment of OpenAI leadership's understanding of these risks?

What are the key philosophical problems you believe we need to solve for alignment?

I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:

  1. metaphilosophy
    • How to solve new philosophical problems relevant to alignment as they come up?
    • How to help users when they ask the AI to attempt philosophical progress?
    • How to help defend the user against bad philosophical ideas (whether in the form of virulent memes, or intentionally optimized by other AIs/agents to manipulate the user)?
    • How to enhance or at least not disrupt our collective ability to make philosophical progress?
  2. metaethics
    • Should the AI always defer to the user or to OpenAI on ethical questions?
    • If not or if the user asks the AI to, how can it / should it try to make ethical determinations?
  3. rationality
    • How should the AI try to improve its own thinking?
    • How to help the user be more rational (if they so request)?
  4. normativity
    • How should the AI reason about "should" problems in general?
  5. normative and applied ethics
    • What kinds of user requests should the AI refuse to fulfill?
    • What does it mean to help the user when their goals/values are confused or unclear?
    • When is it ok to let OpenAI's interests override the user's?
  6. philosophy of mind
    • Which computations are conscious or constitute moral patients?
    • What exactly constitutes pain or suffering (which the AI should therefore perhaps avoid helping the user create)?
    • How to avoid "mind crimes" within the AI's own cognition/computation?
  7. decision theory / game theory / bargaining
    • How to help the user bargain with other agents?
    • How to avoid (and help the user avoid) being exploited by others (including distant superintelligences)?

See also this list which I wrote a while ago. I wrote the above without first reviewing that post (to try to generate a new perspective).

Insofar as philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder and I'm not sure that it's always easier than generation. It is much easier to picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than to picture technical claims about math, algorithms, or empirical data that are persuasive and hard to falsify.

However, I'm skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will actually be that long. For example, a lot of problems in rationality + decision theory + game theory I'd count more as model capabilities and the moral patienthood questions you can punt on for a while from the longtermist point of view.

How, if at all, does your alignment approach deal with deceptive alignment?

Great post, thanks!

Alignment research requires strong consequentialist reasoning

Hmm, my take is “doing good alignment research requires strong consequentialist reasoning, but assisting alignment research doesn’t.” As a stupid example, Google Docs helps my alignment research, but Google Docs does not do strong consequentialist reasoning. So then we get a trickier question of exactly how much assistance we’re expecting here. If it’s something like “helping out on the margin / 20% productivity improvement / etc.” (which I find plausible), then great, let’s do that research, I’m all for it, but we wouldn’t really call it a “plan” or “approach to alignment”, right? By analogy, I think Lightcone Infrastructure has plausibly accelerated alignment research by 20%, but nobody would have ever said that the Lightcone product roadmap is a “plan to solve alignment” or anything like that, right?

The “response” in OP seems to maybe agree with my suggestion that “doing good alignment research” might require consequentialism but “assisting alignment research” doesn’t—e.g. “It seems clear that a much weaker system can help us on our kind of alignment research”. But I feel like the rest of the OP is inconsistent with that. For example, if we’re talking about “assisting” rather than “doing”, then we’re not changing the situation that “alignment research is mostly talent-constrained”, and we don’t have a way to “make alignment and ML research fungible”, etc., right?

There’s also the question of “If we’re trying to avoid our AIs displaying strong consequentialist reasoning, how do we do that?” (Especially since the emergence of consequentialist reasoning would presumably look like “hey cool the AI is getting better at its task”) Which brings us to:

Evaluation is easier than generation

I want to distinguish two things here: evaluation of behavior versus evaluation of underlying motives. The AI doesn’t necessarily have underlying motives in the first place, but if you wind up with an AI displaying consequentialist reasoning, it does. Anyway, when the OP is discussing evaluation, it’s really “evaluation of behavior”, I think. I agree that evaluation of behavior is by-and-large easier than generation. But I think that evaluating underlying motives is hard and different, probably requiring interpretability beyond what we’re capable of today. And I think that if the underlying motives are bad then you can get behavioral outputs which are not just bad in the normal way but adversarially-selected—e.g., hacking / manipulating the human evaluator—in which case the behavioral evaluation part gets suddenly much harder than one would normally expect.


  • I’m moderately enthusiastic about the creation of tools to help me and other alignment researchers work faster and smarter, other things equal.
  • However, if those tools go equally to alignment & capabilities researchers, that makes me negative on the whole thing, because I put high weight on the concepts-as-opposed-to-scaling side of ML being important (e.g. future discovery of a “transformer-killer” architecture, as you put it).
  • I mostly expect that pushing the current LLM+RLHF paradigm will produce systems that are marginally better at “assisting alignment research” but not capable of “doing alignment research”, and also that are not dangerous consequentialists, although that’s a hard thing to be confident about.
  • If I’m wrong about not getting dangerous consequentialists from the current LLM+RLHF paradigm—or if you have new ideas that go beyond the current LLM+RLHF paradigm—then I would feel concerned about the fact that your day-to-day project incentives would seem to be pushing you towards making dangerous consequentialists (since I expect them to do better alignment research), and particularly concerned that you wouldn’t necessarily have a way to notice that this is happening.
  • I think AIs that can do good alignment research, and not just assist it—such that we get almost twice as much alignment research progress from twice as many GPUs—will arrive so close to the endgame that we shouldn’t be factoring them into our plans too much (see here).

Yeah, you could reformulate the question as "how much consequentialist reasoning do you need to do 95% or 99% of the alignment work?" Maybe the crux is in what we mean by consequentialist reasoning. For example, if you build a proof oracle AlphaZero-style, would that be a consequentialist? Since it's trained with RL to successfully prove theorems you can argue it's a consequentialist since it's the distillation of a planning process, but it's also relatively myopic in the sense that it doesn't care about anything that happens after the current theorem is proved. My sense is that in practice it'll matter a lot where you draw your episode boundaries (at least in the medium term), and as you point out there are a bunch of tricky open questions on how to think about this.
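To make the episode-boundary point concrete, here is a minimal sketch (my own illustration, not anything from the post) of how the return credited to each step changes depending on where episodes end. The function name `returns_with_boundaries`, the toy reward sequence, and the choice of an undiscounted return are all assumptions made for the example.

```python
from typing import List


def returns_with_boundaries(
    rewards: List[float],
    episode_ends: List[bool],
    gamma: float = 1.0,
) -> List[float]:
    """Return-to-go for each step, reset at every episode boundary.

    episode_ends[t] is True when step t is the last step of an episode,
    so nothing that happens after step t is credited to steps <= t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates reward from its own future.
    for t in reversed(range(len(rewards))):
        if episode_ends[t]:
            running = 0.0  # hypothetical boundary: the future is cut off here
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy trajectory: two theorems, each proved (reward 1) on its second step.
rewards = [0.0, 1.0, 0.0, 1.0]

# Boundary after each proof: steps on the first theorem see only its reward.
per_theorem = returns_with_boundaries(rewards, [False, True, False, True])

# One long episode: steps on the first theorem are also credited for the second.
one_episode = returns_with_boundaries(rewards, [False, False, False, True])
```

With a boundary after every proof, each step is only optimized toward the current theorem; with a single long episode, early steps are also rewarded for what happens later, which is the less myopic setup.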

I agree with your evaluation of behavior point. I also agree that the motives matter but an important consideration is whether you picture them coming from an RM (which we can test extensively and hopefully interpret somewhat) or some opaque inner optimizers. I'm pretty bullish on both evaluating the RM (average case + adversarially) and the behavior.

Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical vs. empirical alignment research for a long time, and this post and the discussion have been very helpful. I'll probably write that up as a separate post later, but for now I have a few questions:

  1. What does the endgame look like? The post emphasizes that we only need an MVP alignment research AI, so it can be relatively unintelligent, narrow, myopic, non-agenty, etc. This means that it poses less capabilities risk and is easier to evaluate, both of which are great. But eventually we may need to align AGI that is none of these things. Is the idea that this alignment research AI will discover/design alignment techniques that a) human researchers can evaluate and b) will work on future AGI? Or do we start using other narrowly aligned models to evaluate it at some point? How do we convince ourselves that all of this is working towards the goal of "aligned AI" and not "looks good to alignment researchers"?
  2. Related to that, the post says “the burden of proof is always on showing that a new system is sufficiently aligned” and “We have to mistrust what the model is doing anyway and discard it if we can’t rigorously evaluate it.” What might this proof or rigorous evaluation look like? Is this something that can be done with empirical alignment work? 
  3. I agree that the shift in AI capabilities paradigms from DRL agents playing games to LLMs generating text seems good for alignment, in part because LLM training could teach human values and introduce an ontology for understanding human preferences and communication. But clearly LLM pretraining doesn't teach all human values -- if it did, then RLHF finetuning wouldn't be required at all. How can we know what values are "missing" from pre-training, and how can we tell if/when RLHF has filled in the gap? Is it possible to verify that model alignment is "good enough"? 
  4. Finally, this might be more of an objection than a question, but... One of my major concerns is that automating alignment research also helps automate capabilities research. One of the main responses to this in the post is that "automated ML research will happen anyway." However, if this is true, then why is OpenAI safety dedicating substantial resources to it? Wouldn't it be better to wait for ML researchers to knock that one out, and spend the interim working on safety-specific techniques (like interpretability, since it's mentioned a lot in the post)? If ML researchers won't do that satisfactorily, then isn't dedicating safety effort to it differentially advancing capabilities?

But clearly LLM pretraining doesn't teach all human values -- if it did, then RLHF finetuning wouldn't be required at all.

The simulators framing suggests that RLHF doesn't teach things to an LLM; instead, an LLM can instantiate many agents/simulacra, but does so haphazardly, and RLHF picks out particular simulacra and stabilizes them, anchoring them to their dialogue handles. So from this point of view, an LLM could well have all human values, but isn't good enough to channel stable simulacra that exhibit them, and RLHF stabilizes a few simulacra that do so in the ways useful for their roles in a bureaucracy.

They have a strong belief that in order to do good alignment research, you need to be good at “consequentialist reasoning,” i.e. model-based planning, that allows creatively figuring out paths to achieve goals.

I think this is a misunderstanding, and that approximately zero MIRI-adjacent researchers hold this belief (that good alignment research must be the product of good consequentialist reasoning). What seems more true to me is that they believe that better understanding consequentialist reasoning -- e.g., where to expect it to be instantiated, what form it takes, how/why it "works" -- is potentially highly relevant to alignment.

I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research (at least good enough to provide a plan that would help humans make an aligned AGI) must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate's notes in that conversation plus various other posts that he agrees with Eliezer here, but I'm not certain.) I strongly doubt that Jan just mistook MIRI's focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.

I think you're right - thanks for this! It makes sense now that I recognise the quote was in a section titled "Alignment research can only be done by AI systems that are too dangerous to run".

This caused me to find your substack! Sorry I missed it earlier, looking forward to catching up.
