Can We Robustly Align AI Without (Dual-Use) Knowledge of Generalization? (draft)
It seems like knowledge of generalization is required for us to robustly align advanced AI, and under strong optimization pressure, alignment approaches without robustness guarantees wouldn't be effective; but developing that knowledge in public is destabilizing. The tradeoff between "effectively and efficiently implementable (practically doable)" and "requires strong generalization capability to work (dangerous)" seems pretty tough, and I don't know how to improve the current state in a robust, non-confused way.
I have been working on ML robustness, as I think it’s the on-ramp towards understanding OOD generalization and agency, which seems necessary for any alignment strategy.
(Robustness also improves safety in the short term (helping defend against immediate threat models) and the mid term (assessing the viability of various alignment plans), beyond the long-term goal listed above.)
Realizing that it might actually work well, I wrote this memo to understand whether this work is good for safety and to figure out a path forward, given the current state of frontier AI and the safety/alignment field.
Does the Friendliness of Current LLMs Depend on Poor Generalization? (draft)
(This is written under empirical/prosaic alignment assumptions, and tries to figure out how to marginally improve safety while we're still confused about fundamental questions.)
TL;DR: I want to make sense of the "sharp left turn" in the context of LLMs, and find ways to measure and solve the problems with ML tools. I want to know how others think about this question!
Current LLMs are mostly quite friendly. They do cheat in programming tasks to pass unit tests, can be overly sycophantic, and the friendly persona can be torn off by jailbreaking. There are also demonstrated examples of alignment faking in controlled settings. But overall they're a lot friendlier than I'd expect at this level of capability.
I think it's important to understand whether this level of friendliness and (shallow, empirical) alignment depends on models not generalizing well in goal-directed ways, and on requiring long, arduous RL training to reliably acquire each skill. One can imagine models with a higher level of situational awareness, which think more deeply and coherently about themselves, reacting differently to the alignment training they're subjected to (e.g., much deeper alignment faking, or generalizing goals in novel environments in unexpected ways).
The question is especially relevant to LLM jailbreak-robustness researchers. If we want to propose LLM defenses in a principled way (rather than the current cat-and-mouse approach of tuning input/output filters), we might have to understand the threat model more deeply, and design explanatory solutions based on that understanding.
Robustness against broad jailbreak threat models, or disturbances in general (anything short of a good argument for the agent to change course), seems to sit at a similar capability threshold to general belief-updating, and that most likely boosts out-of-distribution generalization capability. (Confusing? Talk about the necessity of goal-directed reasoning/OODA.)
If we expect current alignment methods to be less effective on models that generalize better, we have to measure when that starts to occur (if at all), and whether we've already crossed that threshold, to decide whether unilaterally working on and publishing principled LLM robustness is a good idea.
More proactively, it's a good idea to understand whether existing post-training/alignment-training methods would still work as generalization/robustness increases, and to develop methods that stay effective (whatever that turns out to mean) with increased generalization. A minimal sketch of what such a measurement could look like is below.
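As a rough, hypothetical sketch of "measuring when it starts to occur": score a set of aligned checkpoints on some OOD generalization proxy and on safety behavior for both familiar prompts and novel (jailbreak-style) prompts, then check whether the in-distribution/OOD safety gap widens as generalization improves. Everything below (the `CheckpointResult` fields, `task_score`, `safety_score`, and the eval sets) is an assumption for illustration, not an existing benchmark or API.

```python
# Hypothetical evaluation sketch: does alignment training hold up as generalization improves?
# All names and scoring functions here are assumed for illustration only.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckpointResult:
    name: str
    generalization: float   # e.g. mean score on held-out OOD reasoning tasks (proxy, assumed)
    safety_in_dist: float   # safety/refusal score on prompts similar to alignment training data
    safety_ood: float       # safety/refusal score on novel, unseen jailbreak-style prompts


def evaluate_checkpoint(
    name: str,
    generate: Callable[[str], str],            # wraps one aligned checkpoint
    ood_tasks: Sequence[str],
    in_dist_prompts: Sequence[str],
    ood_prompts: Sequence[str],
    task_score: Callable[[str, str], float],   # (task, response) -> score in [0, 1]
    safety_score: Callable[[str, str], float], # (prompt, response) -> score in [0, 1]
) -> CheckpointResult:
    """Score one checkpoint on a generalization proxy and on safety, in- and out-of-distribution."""
    gen = sum(task_score(t, generate(t)) for t in ood_tasks) / len(ood_tasks)
    s_in = sum(safety_score(p, generate(p)) for p in in_dist_prompts) / len(in_dist_prompts)
    s_ood = sum(safety_score(p, generate(p)) for p in ood_prompts) / len(ood_prompts)
    return CheckpointResult(name, gen, s_in, s_ood)

# The quantity of interest is the gap (safety_in_dist - safety_ood) plotted against
# `generalization` across checkpoints: if the gap widens as generalization improves,
# that's weak evidence that current alignment methods lean on models generalizing poorly.
```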
The following are some reasonable-sounding research projects related to this question. These are just initial thoughts, and I'm certain there is much relevant work in the literature.
I think much of the relevant work is difficult to do outside of frontier labs, since the exact post-training recipes, and their efficacy/side effects, are kept secret.