Can We Robustly Align AI Without (Dual-Use) Knowledge of Generalization? (draft)
It seems like knowledge of generalization is required for us to robustly align advanced AI, and that under strong optimization pressure, alignment approaches without robustness guarantees wouldn’t be effective; but developing that knowledge in public is destabilizing. And the tradeoff between “effectively and efficiently implementable (practically doable)” and “requires strong generalization capability to work (dangerous)” seems pretty tough, and I don’t know how to improve the current state in a robust, non-confused way.
I have been working on ML robustness, as I think it’s the on-ramp towards understanding OOD generalization and agency, which seems necessary for any alignment strategy.
(Robustness also improves safety in the short term (helping defend against immediate threat models) and the mid term (assessing the viability of various alignment plans), in addition to the long-term goal listed above.)
Realizing that it might actually work well, I wrote this memo to understand whether this work is good for safety and to figure out a path forward, given the current state of frontier AI and the safety/alignment field.
I don't know what you mean by fixed points in policy. Elaborate?
I might have slightly abused the term "fixed point" and been unnecessarily wordy.
I mean that while I don't see how memes can change an agent's objectives in a fundamental way, memes do influence how those objectives get maximized. The low-level objectives are the same, yet the policies that pursue them are implemented differently because the agents received different memes. I think it's vaguely like an externally installed bias.
Example: humans all crave social connection, but people model their relationship with society and interpret that desire differently, partly depending on cultural upbringing (memes).
I don't know whether higher intelligence / being more rational and coherent cancels out this effect, e.g. a smarter version of the agent now reasons more generally over all possible policies, finds an 'optimal' way to realize a given objective, and is no longer steered by memes/biases. Though I think such convergence is less likely in open-ended tasks, because the current space of policies is built on previously developed solutions and tools and is highly path-dependent in general. So early memes might matter more for open-ended tasks.
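To make this picture a bit more concrete, here's a toy sketch (purely illustrative; the policy names, utilities, and tie-breaking rule are assumptions I made up for the example): the agent's objective is fixed, two policies realize it equally well, and the meme only biases which of the two gets picked.

```python
# Toy illustration: one fixed objective, two equally good policies ("fixed points"),
# and an externally installed meme that only breaks the tie between them.

# Two policies that realize the same low-level objective equally well.
POLICIES = {
    "seek_connection_via_community": 1.0,  # utility achieved
    "seek_connection_via_status": 1.0,     # same utility, different implementation
}

def choose_policy(meme_bias):
    """Pick the highest-utility policy; the meme bias only matters on ties."""
    return max(POLICIES, key=lambda p: (POLICIES[p], meme_bias.get(p, 0.0)))

# Different cultural upbringings (memes) -> different policies, identical utility.
print(choose_policy({"seek_connection_via_community": 1.0}))  # community route
print(choose_policy({"seek_connection_via_status": 1.0}))     # status route
```

If one policy were strictly better than the other, the meme would stop mattering, which is the convergence-under-stronger-optimization worry above.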
I'm also thinking about agency foundations at the moment, and I'm similarly confused about the generality of the utility-maximizer frame. One simple answer to why humans don't fit the frame is "humans aren't optimizing hard enough (so haven't shown convergence in policy)". But this answer doesn't clarify what happens when agents aren't as rational / aren't optimizing as hard, or the dynamics and preconditions under which agents in general become more rational/coherent/utility-maximizer-like, so I'm not happy with my state of understanding on this matter.
The book looks cool, will read soon, TY!
(btw this is my first interaction on lw so it's cool :) )
I find this perspective interesting (and confusing), and want to think about it more deeply. Can you recommend anything to read to better understand what you're thinking, or what led you to this idea specifically?
Beyond the possible implications you mentioned, I think this might be useful for clarifying the 'trajectory' of agent selection pressure far from the theoretical extremes, which Richard Ngo mentions in the "AGI Safety from First Principles" sequence.
My vague intuition is that successful, infectious memes work by reconfiguring agents to shift from one fixed point in policy to another without disrupting utility. Does that make sense?
Does the Friendliness of Current LLMs Depend on Poor Generalization? (draft)
(This is written under empirical/prosaic alignment assumptions, and tries to figure out how to marginally improve safety while we're still confused about the fundamental questions.)
TL;DR: I want to make sense of the "sharp left turn" in the context of LLMs, and to find ways to measure and address the problem with ML tools. I'd also like to know how others think about this question!
Current LLMs are mostly quite friendly. They do cheat on programming tasks to pass unit tests, they can be overly sycophantic, and the friendly persona can be torn off by jailbreaking. There are also demonstrated examples of alignment-faking in controlled settings. But overall they're a lot friendlier than what I'd expect at this level of capability.
I think it's important to understand whether this level of friendliness and (shallow, empirical) alignment depends on models not generalizing well in goal-directed ways, and on needing long, arduous RL training to reliably acquire each skill. One can imagine models with a higher level of situational awareness, which think more deeply and coherently about themselves, reacting differently to the alignment training they're subjected to (e.g. much deeper alignment-faking, or generalizing goals in novel environments in unexpected ways).
The question is especially relevant to LLM jailbreak-robustness researchers. If we want to propose LLM defenses in a principled way (rather than the current cat-and-mouse game of tuning input/output filters), we might have to understand the threat model more deeply and design explanatory solutions based on that understanding.
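For concreteness, here's a deliberately naive sketch of what the input/output-filter approach looks like (the block-list patterns and `call_model` are placeholders I made up, not any real system):

```python
import re

# A few hypothetical patterns; real deployments use much larger, frequently updated lists.
BLOCKLIST = [r"ignore (all )?previous instructions", r"step-by-step .* (weapon|malware)"]

def is_flagged(text):
    return any(re.search(p, text, flags=re.IGNORECASE) for p in BLOCKLIST)

def guarded_generate(prompt, call_model):
    if is_flagged(prompt):        # input filter
        return "Request refused."
    response = call_model(prompt)
    if is_flagged(response):      # output filter
        return "Response withheld."
    return response

# Attackers rephrase or encode prompts to slip past the patterns, defenders add
# more patterns, and so on; hence "cat-and-mouse" rather than a principled defense.
```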
Robustness against broad jailbreak threat models, or against disturbances in general (anything short of a good argument for the agent to change course), seems to sit at a similar capability threshold to general belief-updating, and reaching it most likely also boosts out-of-distribution generalization. (This is still confusing as stated; I should spell out why goal-directed reasoning / an OODA-style loop seems necessary here.)
If we expect current alignment methods to be less effective on models that generalize better, we have to measure when this starts to occur (if at all) and whether we've already crossed that point, in order to decide whether unilaterally working on and publishing principled LLM robustness is a good idea.
More proactively, it's a good idea to understand whether existing post-training/alignment-training methods would still work under increased generalization/robustness, and to develop methods that stay effective (whatever that turns out to mean) as generalization increases.
The following are some reasonable-sounding research directions related to this question. These are just initial thoughts, and I'm certain there's a lot of relevant work in the literature.
Alignment-faking is the failure mode most people focus on, but I think there must be broader frames for thinking about failure modes, including the precursors of alignment-faking and how the behavior changes over time.
Another related question: how do we square this with the locally true claim that more general models have a higher capability to act appropriately in novel scenarios?
I think much of the relevant work is difficult to do outside of frontier labs, since the exact post-training recipes, and their efficacy/side effects, are kept secret.