Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://aligned.substack.com/p/alignment-optimism
The post lays out some arguments in favor of OpenAI’s approach to alignment and responds to common objections.
To the extent that alignment research involves solving philosophical problems, it seems that in this approach we will also need to automate philosophy, otherwise alignment research will become bottlenecked on those problems (i.e., on human philosophers trying to solve those problems while the world passes them by). Do you envision automating philosophy (and are you optimistic about this) or see some other way of getting around this issue?
It worries me to depend on AI to do philosophy without understanding what "philosophical reasoning" or "philosophical progress" actually consists of, i.e., without having solved metaphilosophy. Concretely, I see two ways that automating philosophy could fail. 1) We just can't get AI to do sufficiently good philosophy (in the relevant time frame), and it turns out to be a waste of time for human philosophers to help train AI philosophers (e.g., by evaluating their outputs and providing feedback) or to try to use them as assistants. 2) Using AI changes the trajectory of philosophical progress in a bad way (due to Goodhart, adversarial inputs, etc.), so that we end up accepting conclusions different from what we would eventually have settled on ourselves, or just wrong conclusions. It seems to me that humans are very prone to accepting bad philosophical ideas, but over the long run also have some mysterious way of collectively making philosophical progress. AI could exacerbate the former and disrupt the latter.
Curious if you've thought about this and what your own conclusions are. For example, does OpenAI have any backup plans in case 1 turns out to be the case, or ideas for determining how likely 2 is or how to make it less likely?
Also, aside from this, what do you think are the biggest risks with OpenAI's alignment approach? What's your assessment of OpenAI leadership's understanding of these risks?
What are the key philosophical problems you believe we need to solve for alignment?
I guess it depends on the specific alignment approach being taken, such as whether you're trying to build a sovereign or an assistant. Assuming the latter, I'll list some philosophical problems that seem generally relevant:
See also this list which I wrote a while ago. I wrote the above without first reviewing that post (to try to generate a new perspective).
Insofar as philosophical progress is required, my optimism about AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder, and I'm not sure it's always easier than generation. It's much easier to picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than to imagine technical claims about math, algorithms, or empirical data that are similarly persuasive yet hard to falsify.
However, I'm skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will be that long. For example, many problems in rationality + decision theory + game theory I'd count more as model capabilities, and the moral patienthood questions can be punted on for a while from the longtermist point of view.
How, if at all, does your alignment approach deal with deceptive alignment?
Great post, thanks!
Hmm, my take is “doing good alignment research requires strong consequentialist reasoning, but assisting alignment research doesn’t.” As a stupid example, Google Docs helps my alignment research, but Google Docs does not do strong consequentialist reasoning. So then we get a trickier question of exactly how much assistance we’re expecting here. If it’s something like “helping out on the margin / 20% productivity improvement / etc.” (which I find plausible), then great, let’s do that research, I’m all for it, but we wouldn’t really call it a “plan” or “approach to alignment”, right? By analogy, I think Lightcone Infrastructure has plausibly accelerated alignment research by 20%, but nobody would have ever said that the Lightcone product roadmap is a “plan to solve alignment” or anything like that, right?
The “response” in OP seems to maybe agree with my suggestion that “doing good alignment research” might require consequentialism but “assisting alignment research” doesn’t—e.g. “It seems clear that a much weaker system can help us on our kind of alignment research”. But I feel like the rest of the OP is inconsistent with that. For example, if we’re talking about “assisting” rather than “doing”, then we’re not changing the situation that “alignment research is mostly talent-constrained”, and we don’t have a way to “make alignment and ML research fungible”, etc., right?
There’s also the question of “If we’re trying to avoid our AIs displaying strong consequentialist reasoning, how do we do that?” (Especially since the emergence of consequentialist reasoning would presumably look like “hey cool the AI is getting better at its task”) Which brings us to:
I want to distinguish two things here: evaluation of behavior versus evaluation of underlying motives. The AI doesn’t necessarily have underlying motives in the first place, but if you wind up with an AI displaying consequentialist reasoning, it does. Anyway, when the OP is discussing evaluation, it’s really “evaluation of behavior”, I think. I agree that evaluation of behavior is by-and-large easier than generation. But I think that evaluating underlying motives is hard and different, probably requiring interpretability beyond what we’re capable of today. And I think that if the underlying motives are bad then you can get behavioral outputs which are not just bad in the normal way but adversarially-selected—e.g., hacking / manipulating the human evaluator—in which case the behavioral evaluation part gets suddenly much harder than one would normally expect.
UPSHOT:
Yeah, you could reformulate the question as "how much consequentialist reasoning do you need to do 95% or 99% of the alignment work?" Maybe the crux is in what we mean by consequentialist reasoning. For example, if you build a proof oracle AlphaZero-style, would that be a consequentialist? Since it's trained with RL to successfully prove theorems you can argue it's a consequentialist since it's the distillation of a planning process, but it's also relatively myopic in the sense that it doesn't care about anything that happens after the current theorem is proved. My sense is that in practice it'll matter a lot where you draw your episode boundaries (at least in the medium term), and as you point out there are a bunch of tricky open questions on how to think about this.
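To make the episode-boundary point concrete, here's a toy sketch (hypothetical setup, with made-up reward numbers) contrasting a myopic objective, which only scores the current episode, with a non-myopic one that also values later episodes:

```python
def myopic_return(episode_rewards):
    """Myopic agent: only the current episode's reward counts."""
    return episode_rewards[0]

def non_myopic_return(episode_rewards, discount=0.9):
    """Non-myopic agent: discounted sum over future episodes too."""
    return sum(discount**t * r for t, r in enumerate(episode_rewards))

# Suppose proving the current theorem yields reward 1.0, and some action
# also sets up easier proofs (higher rewards) in later episodes.
rewards = [1.0, 2.0, 2.0]

myopic_return(rewards)      # 1.0 -- indifferent to later episodes
non_myopic_return(rewards)  # 1.0 + 0.9*2 + 0.81*2, approx. 4.42
```

The same trained system looks like a harmless distilled planner under the first objective and like a consequentialist with cross-episode incentives under the second, which is one way of seeing why where you draw the episode boundary matters.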
I agree with your evaluation of behavior point. I also agree that the motives matter but an important consideration is whether you picture them coming from an RM (which we can test extensively and hopefully interpret somewhat) or some opaque inner optimizers. I'm pretty bullish on both evaluating the RM (average case + adversarially) and the behavior.
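The two evaluation modes mentioned above (average-case and adversarial testing of an RM) can be sketched roughly like this. Everything here is a hypothetical stand-in, not any real API: `reward_model` has a deliberately planted exploit, and `true_quality` plays the role of ground-truth human judgment.

```python
import random

def reward_model(text):
    # Stand-in RM: rewards length, plus a spurious bonus for "great".
    return len(text.split()) + 5 * text.count("great")

def true_quality(text):
    # Stand-in ground truth: only rewards length.
    return len(text.split())

def average_case_gap(samples):
    """Mean RM-vs-truth gap over ordinary samples."""
    return sum(reward_model(s) - true_quality(s) for s in samples) / len(samples)

def adversarial_gap(samples, n_tries=100, seed=0):
    """Random-search attack: find the variant the RM most over-rewards."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_tries):
        s = rng.choice(samples) + " great" * rng.randint(0, 5)
        best = max(best, reward_model(s) - true_quality(s))
    return best

samples = ["a helpful answer", "another plain response"]
# average_case_gap(samples) is 0: the RM looks fine on ordinary inputs,
# but adversarial_gap(samples) is positive: the search finds the exploit.
```

The point of the contrast is that an RM can score well on average-case evaluation while still having exploitable failure modes that only adversarial evaluation surfaces.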
Thank you for writing this! I've been trying to consolidate my own thoughts around reward modeling and theoretical v. empirical alignment research for a long time, and this post and the discussion has been very helpful. I'll probably write that up as a separate post later, but for now I have a few questions:
The simulators framing suggests that RLHF doesn't teach things to an LLM; instead, the LLM can instantiate many agents/simulacra, but does so haphazardly, and RLHF picks out particular simulacra and stabilizes them, anchoring them to their dialogue handles. So from this point of view, the LLM could well have all human values already, but isn't good enough to channel stable simulacra that exhibit them, and RLHF stabilizes a few simulacra that do so in ways useful for their roles in a bureaucracy.
I think this is a misunderstanding, and that approximately zero MIRI-adjacent researchers hold this belief (that good alignment research must be the product of good consequentialist reasoning). What seems more true to me is that they believe that better understanding consequentialist reasoning -- e.g., where to expect it to be instantiated, what form it takes, how/why it "works" -- is potentially highly relevant to alignment.
I think you might be misunderstanding Jan's understanding. A big crux in this whole discussion between Eliezer and Richard seems to be: Eliezer believes any AI capable of doing good alignment research—at least good enough to provide a plan that would help humans make an aligned AGI—must be good at consequentialist reasoning in order to generate good alignment plans. (I gather from Nate's notes in that conversation plus various other posts that he agrees with Eliezer here, but I'm not certain.) I strongly doubt that Jan simply mistook MIRI's focus on understanding consequentialist reasoning for a belief that alignment research requires being a consequentialist reasoner.
I think you're right - thanks for this! It makes sense now that I recognise the quote was in a section titled "Alignment research can only be done by AI systems that are too dangerous to run".
This caused me to find your substack! Sorry I missed it earlier, looking forward to catching up.