walking_mushroom — LessWrong


Does the Friendliness of Current LLMs Depend on Poor Generalization? (draft)

(This is written under empirical/prosaic alignment assumptions, and trying to figure out how to marginally improve safety while we're still confused about fundamental questions.)

TL;DR: I want to make sense of the "sharp left turn" in the context of LLMs, and find ways to measure and solve the problems with ML tools. I want to know how others think about this question!

Current LLMs are mostly quite friendly. They do cheat in programming tasks to pass unit tests, can be overly sycophantic, and the friendly persona can be torn off by jailbreaking. There are also demonstrated examples of alignment-faking in controlled settings. But overall... (read 500 more words →)

walking_mushroom's Shortform

walking_mushroom

1mo


walking_mushroom 1mo Quick Take

Can We Robustly Align AI Without (Dual-Use) Knowledge of Generalization? (draft)

It seems that knowledge of generalization is required to robustly align advanced AI, and that under strong optimization pressure, alignment approaches without robustness guarantees won’t be effective. But developing this knowledge in public is destabilizing. The tradeoff between “effectively and efficiently implementable (practically doable)” and “requires strong generalization capability to work (dangerous)” seems pretty tough, and I don’t know how to improve the current state in a robust/not-confused way.

I have been working on ML robustness, as I think it’s the on-ramp towards understanding OOD generalization and agency, which seems necessary for any alignment strategy.
(Also, robustness improves safety in the short term

... (read 570 more words →)

Is there a unified way to make sense of AI failure modes?

walking_mushroom

Status: I'm new to AI alignment. I went through the AGI Fundamentals curriculum and am currently reading MIRI's embedded agents series, and I feel confused. I'm not familiar with decision theory.

So far the cases discussed in the series (and AI failure modes in general) feel solid in themselves (if I follow the reasoning, including some proofs that I only half understand), but I don't yet see the connection between different cases or any low-complexity structure behind them.

For example: there are many kinds of bridges and many ways bridges can collapse (failure modes), but they all boil down to the principles of structural mechanics, and the structural integrity of a bridge can be determined by a few physical measurements... (read more)

Replying to What if memes are common in highly capable minds?

walking_mushroom 4y


I don't know what you mean by fixed points in policy. Elaborate?

I might have slightly abused the term "fixed point" and been unnecessarily wordy.

I mean that, though I don't see how memes can change the objectives of agents in a fundamental way, memes do influence how those objectives are maximized. The low-level objectives are the same, yet the policies are implemented differently because the agents received different memes. I think it's vaguely like an externally installed bias.

Ex: humans all crave social connection, but people model their relationship with society and interpret that desire differently, partially depending on cultural upbringing (meme).

I don't know if having higher levels of intelligence/being more rational/coherent cancels out the effects, ex:... (read more)

Replying to What if memes are common in highly capable minds?

walking_mushroom 4y


I find this perspective interesting (and confusing), and want to think about it more deeply. Can you recommend any reading that would help me better understand what you're thinking, or what specifically led you to this idea?

Beyond the possible implications you mentioned, I think this might be useful in clarifying the 'trajectory' of agent selection pressure far from theoretical extremes that Richard Ngo mentioned in the "AGI Safety from First Principles" sequence.

My vague intuition is that successful, infectious memes work by reconfiguring agents to shift from one fixed point in policy to another while not disrupting utility. Does that make sense?