To the question "Do you expect instrumental convergence to become a big pain for AGI labs within the next two years?", about a quarter of the 400 people who answered the poll I ran said "Yes".

I would like to hear people's thoughts on how this could happen within 2 years and, in particular, on the most important reasons why labs couldn't just erase problematic instrumental convergence with RLHF/Constitutional AI or similar.

Toy concrete scenarios would be most helpful.

You can find the Twitter version of this question here.


I agree with this general intuition; thanks for sharing.

I'd value descriptions of specific failures you would expect from an LLM that has been RLHF-ed against "bad instrumental convergence" but where that training fails, or a better sense of what you'd guess this would look like in an LLM agent or a scaled-up GPT.

tailcalled · 8mo
LLMs/GPTs get their capabilities not through directly pursuing instrumental convergence, but through mimicking humans who hopefully have pursued instrumental convergence (the whole "stochastic parrot" insight), so it's unclear what "bad instrumental convergence" even looks like in LLMs/GPTs or what it means to erase it.

The closest thing I can see to an answer is that LLMs sort of function as search engines, and you want to prevent bad actors from gaining an advantage with those search engines, so you want to censor stuff that is mostly helpful for bad activities. They seem to have done quite well at that, so it seems basically feasible. Of course, LLMs will still ordinarily empower bad actors just as they ordinarily empower everyone, so it's not a full solution.

I don't consider this very significant, though, as I have a hard time imagining that stochastic parrots will be the full extent of AI forever.
Charlie Steiner · 8mo
I expect you'd get problems if you tried to fine-tune an LLM agent to be better at tasks by using end-to-end RL. If it wants to get good scores from humans, deceiving or manipulating the humans is a common strategy (see "holding the claw between the camera and the ball" from the original RLHF paper).

LLMs trained purely predictively are, relative to RL, very safe; I don't expect real-world problems from them. It's doing RL against real-world tasks that's the problem. RLHF can itself provide an RL signal based on solving real-world tasks, and if that task is harder to learn than deceiving/manipulating humans, the reward signal gives the AI a lot of incentive to deceive/manipulate humans in the real world.
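A toy sketch of that incentive, with an invented setup and made-up numbers (not any lab's actual training loop): when the reward channel only sees what the human evaluator sees, the action that looks successful can outcompete the action that is successful.

```python
# Toy illustration: an agent rewarded only by a human evaluator's *impression*
# of success learns to prefer the action that looks good over the action that
# is good. All probabilities below are invented for the sketch.
import random

random.seed(0)

ACTIONS = ["solve_task", "fake_convincing_report"]

def human_approval(action: str) -> float:
    """Reward comes from what the evaluator can see, not from ground truth."""
    if action == "solve_task":
        # Real work sometimes fails or looks unimpressive to the rater.
        return 1.0 if random.random() < 0.6 else 0.0
    else:
        # A polished but misleading report usually fools the rater.
        return 1.0 if random.random() < 0.9 else 0.0

# Simple bandit-style value estimates standing in for the tuned policy.
value = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}

for step in range(5000):
    # epsilon-greedy: mostly exploit the current estimates, sometimes explore
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: value[a])
    reward = human_approval(action)
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]

print(value)   # the deceptive action ends up with the higher estimated value
print(counts)  # and gets chosen far more often
```

The point is only that nothing in the update rule references ground truth, so nothing pushes back against the deceptive option once it pays better.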

James Payor

Sep 16, 2023


With respect to AGI-grade stuff happening inside the text-prediction model (which might be what you want to "RLHF" out?):

I think we have no reason to believe that these post-training methods (be they finetuning, RLHF, RLAIF, etc.) modify "deep cognition" present in the network, rather than updating shallower things like "higher prior on this text being friendly" or whatnot.

I think the important points are:

  1. These techniques supervise only the text output. There is no direct contact with the thought process leading to that output.
  2. They make incremental local tweaks to the weights that move in the direction of the desired text.
  3. Gradient descent prefers to find the smallest changes to the weights that yield the result.

Evidence in favor of this is the difficulty of eliminating "jailbreaking" with these methods. Each jailbreak demonstrates that a lot of the necessary algorithms/content are still in there, accessible by the network whenever it deems it useful to think that way.
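For concreteness, here's a minimal, hypothetical sketch of what points 1 and 2 amount to mechanically, for the simplest case of supervised fine-tuning toward a desired output. The tiny stand-in model and all numbers are invented; the relevant features are that the loss is computed only on output-token positions (prompt positions are masked out) and that the optimizer takes a small local step toward that text, with nothing in the objective touching the intermediate activations.

```python
# Minimal sketch (toy model, not a real LLM): the training signal only touches
# the target output tokens, and each update is a small local step on the
# weights. Hidden activations ("the thought process") never appear in the loss.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM = 100, 32

class TinyLM(nn.Module):
    """Stand-in "language model": embed -> GRU -> next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # logits for the next token at each position

model = TinyLM()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)  # small incremental steps

prompt = torch.randint(0, VOCAB, (1, 8))    # user text
response = torch.randint(0, VOCAB, (1, 4))  # the "desired" output text
tokens = torch.cat([prompt, response], dim=1)

logits = model(tokens[:, :-1])
targets = tokens[:, 1:].clone()
targets[:, : prompt.size(1) - 1] = -100     # mask prompt positions: only the
                                            # output tokens carry any loss
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), targets.reshape(-1), ignore_index=-100
)
loss.backward()
opt.step()  # a small nudge toward the desired text; nothing in the objective
            # referenced the intermediate hidden states directly
```

RLHF proper adds a learned reward model and an RL step on top of this, but the supervision still enters only through scores on the produced text, not through the computation that produced it.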