Can We Robustly Align AI Without (Dual-Use) Knowledge of Generalization? (draft)
It seems like knowledge of generalization is required for us to robustly align advanced AI, and that under strong optimization pressure, alignment approaches without robustness guarantees wouldn’t be effective; but developing that knowledge in public is destabilizing. And the tradeoff between “effectively and efficiently implementable (practically doable)” and “requires strong generalization capability to work (dangerous)” seems pretty tough, and I don’t know how to improve the current state in a robust, non-confused way.
I have been working on ML robustness, as I think it’s the on-ramp towards understanding OOD generalization and agency, which seems necessary for any alignment strategy.
(Robustness also improves safety in the short term (helping defend against immediate threat models) and the mid term (assessing the viability of various alignment plans), in addition to the long-term goal listed above.)
Realizing that it might actually work well, I wrote this memo to understand whether this work is good for safety and to figure out a path forward, given the current state of frontier AI and the safety/alignment field.
I don't know what you mean by fixed points in policy. Elaborate?
I might have slightly abused the term "fixed point" and been unnecessarily wordy.
I mean that while I don't see how memes can change an agent's objectives in a fundamental way, memes do influence how those objectives get maximized. The low-level objectives are the same, yet the policies that pursue them are implemented differently because the agents received different memes. I think it's vaguely like an externally installed bias.
Example: humans all crave social connection, but people model their relationship with society and interpret that desire differently, partly depending on cultural upbringing (memes).
I don't know whether higher intelligence / being more rational and coherent cancels out this effect, e.g. a smarter version of the agent now reasons more generally over all possible policies, finds an 'optimal' way to realize a given objective, and is no longer steered by memes/biases. Though I think such convergence is less likely in open-ended tasks, because the current space of policies is built on previously developed solutions and tools and is highly path-dependent in general. So early memes might matter more for open-ended tasks.
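To make this picture a bit more concrete, here's a toy sketch (purely illustrative; the policy names, utilities, and tie-breaking rule are assumptions I made up for the example): the agent's objective is fixed, two policies realize it equally well, and the meme only biases which of the two gets picked.

```python
# Toy illustration: one fixed objective, two equally good policies ("fixed points"),
# and an externally installed meme that only breaks the tie between them.

# Two policies that realize the same low-level objective equally well.
POLICIES = {
    "seek_connection_via_community": 1.0,  # utility achieved
    "seek_connection_via_status": 1.0,     # same utility, different implementation
}

def choose_policy(meme_bias):
    """Pick the highest-utility policy; the meme bias only matters on ties."""
    return max(POLICIES, key=lambda p: (POLICIES[p], meme_bias.get(p, 0.0)))

# Different cultural upbringings (memes) -> different policies, identical utility.
print(choose_policy({"seek_connection_via_community": 1.0}))  # community route
print(choose_policy({"seek_connection_via_status": 1.0}))     # status route
```

If one policy were strictly better than the other, the meme would stop mattering, which is the convergence-under-stronger-optimization worry above.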
I'm also thinking about agency foundations at the moment, and I'm similarly confused about the generality of the utility-maximizer frame. One simple answer to why humans don't fit the frame is "humans aren't optimizing hard enough (so haven't shown convergence in policy)". But this answer doesn't clarify what happens when agents aren't as rational / aren't optimizing as hard, or the dynamics and preconditions under which agents in general become more rational/coherent/utility-maximizer-like, so I'm not happy with my state of understanding on this matter.
The book looks cool, will read soon, TY!
(btw this is my first interaction on lw so it's cool :) )
I find this perspective interesting (and confusing), and want to think about it more deeply. Can you recommend anything to read to better understand what you're thinking, or what led you to this idea specifically?
Beyond the possible implications you mentioned, I think this might be useful for clarifying the 'trajectory' of agent selection pressure far from the theoretical extremes, which Richard Ngo mentions in the "AGI Safety from First Principles" sequence.
My vague intuition is that successful, infectious memes work by reconfiguring agents to shift from one fixed point in policy to another without disrupting utility. Does that make sense?
Does the Friendliness of Current LLMs Depend on Poor Generalization? (draft)
(This is written under empirical/prosaic alignment assumptions, and tries to figure out how to marginally improve safety while we're still confused about the fundamental questions.)
TL;DR: I want to make sense of the "sharp left turn" in the context of LLMs, and to find ways to measure and address the problem with ML tools. I'd also like to know how others think about this question!
Current LLMs are mostly quite friendly. They do cheat on programming tasks to pass unit tests, they can be overly sycophantic, and the friendly persona can be torn off by jailbreaking. There are also demonstrated examples of alignment-faking in controlled settings. But overall they're a lot friendlier than what I'd expect at this level of capability.
I think it's important to understand whether this level of friendliness and (shallow, empirical) alignment depends on models not generalizing well in goal-directed ways, and on needing long, arduous RL training to reliably acquire each skill. One can imagine models with a higher level of situational awareness, which think more deeply and coherently about themselves, reacting differently to the alignment training they're subjected to (e.g. much deeper alignment-faking, or generalizing goals in novel environments in unexpected ways).
The question is especially relevant to LLM jailbreak-robustness researchers. If we want to propose LLM defenses in a principled way (rather than the current cat-and-mouse game of tuning input/output filters), we might have to understand the threat model more deeply and design explanatory solutions based on that understanding.
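For concreteness, here's a deliberately naive sketch of what the input/output-filter approach looks like (the block-list patterns and `call_model` are placeholders I made up, not any real system):

```python
import re

# A few hypothetical patterns; real deployments use much larger, frequently updated lists.
BLOCKLIST = [r"ignore (all )?previous instructions", r"step-by-step .* (weapon|malware)"]

def is_flagged(text):
    return any(re.search(p, text, flags=re.IGNORECASE) for p in BLOCKLIST)

def guarded_generate(prompt, call_model):
    if is_flagged(prompt):        # input filter
        return "Request refused."
    response = call_model(prompt)
    if is_flagged(response):      # output filter
        return "Response withheld."
    return response

# Attackers rephrase or encode prompts to slip past the patterns, defenders add
# more patterns, and so on; hence "cat-and-mouse" rather than a principled defense.
```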
Robustness against broad jailbreak threat models, or against disturbances in general (anything short of a good argument for the agent to change course), seems to sit at a similar capability threshold to general belief-updating, and reaching it most likely also boosts out-of-distribution generalization. (This is still confusing as stated; I should spell out why goal-directed reasoning / an OODA-style loop seems necessary here.)
If we expect current alignment methods to be less effective on models that generalize better, we have to measure when this starts to occur (if at all) and whether we've already crossed that point, in order to decide whether unilaterally working on and publishing principled LLM robustness is a good idea.
More proactively, it's a good idea to understand whether existing post-training/alignment-training methods would still work under increased generalization/robustness, and to develop methods that stay effective (whatever that turns out to mean) as generalization increases.
The following are some reasonable-sounding research directions related to this question. These are just initial thoughts, and I'm certain there's a lot of relevant work in the literature.
Alignment-faking is the failure mode most people focus on, but I think there must be broader frames for thinking about failure modes, including the precursors of alignment-faking and how the behavior changes over time.
Another related question: how do we square this with the locally true claim that more general models have a higher capability to act appropriately in novel scenarios?
I think much of the relevant work is difficult to do outside of frontier labs, since the exact post-training recipes, and their efficacy/side effects, are kept secret.