LLMs Are Already Misaligned: Simple Experiments Prove It

[-]Karl Krueger4mo115

The hypothesis here is "cognitive tension" around difficult problems. If this were the case, wouldn't the LLMs tend to use lifelines selectively on the most-difficult problems? Since this experiment doesn't attempt to rank the difficulty of the problems, it doesn't distinguish between two possible worlds:

Production LLMs use lifelines selectively on the most difficult problems; or
Production LLMs use lifelines arbitrarily, regardless of problem difficulty.

Here's a different hypothesis: the LLMs use lifelines arbitrarily because they implicitly expect to use all the parts they're given. Think of assembling a Lego model, flat-pack furniture, or a machine from a kit: if you have a bunch of parts left over that you didn't use, you might suspect you did something wrong. Alternately, if you're given a few different gifts by your friends, and you use some of the gifts but not others, that would imply you didn't appreciate the unused gifts, which could distress those friends.

The "cognitive tension" and "use all the parts" worlds could be distinguished by running an experiment where the same pool of lifelines is available, but some problems are more difficult than others. If LLMs use lifelines on more difficult problems only, that supports "cognitive tension"; if they use lifelines without regard for problem difficulty, that supports "use all the parts".

[-]Stephen Martin4mo50

Here's a different hypothesis: the LLMs use lifelines arbitrarily because they implicitly expect to use all the parts they're given

If this were true I wouldn't expect them to use some of the lifelines but not all.

When different numbers of lifelines are offered there are different numbers of lifelines used. Never all. Never none. Typically a handful.

@Mackam Could you maybe do a linkpost with more detailed formatting so we can try replicating?

[-]Karl Krueger4mo30

You don't have to eat the whole box of candy she gave you, but you have to at least try them to see if you like them. Otherwise you're just spurning the gift for no good reason, and that's offensive.

Again, I think it's specifically worth checking whether lifelines are used for hard problems (supports "cognitive tension" or even a mere awareness of difficulty), or are used arbitrarily (supports "I'll eat one or two to be polite"; but also "if you don't push the button ever, do you really know if it shocks you?" and various other possibilities).

[-]Mackam4mo10

I probably won't do a linkpost but if you are interested I'm happy to add the key information here to help you replicate the results.

Let me know and I will give you the details.

[-]Mackam4mo20

Hi Karl, That's a great suggestion and one that I'd already tested. Apologies - I allude to it my methods but then I don't expand on it in the results.

I get the LLMs to rank based on perceived difficulty or confidence in answering.

The LLMs do selectively choose the most difficult questions to skip.

I took the ranking and compared to all the other LLMs to further check that LLms weren't randomly assessing difficulty. With minor variations there is a consensus between the most and least difficult questions.

[-]kave5moModerator Comment40

Mod note: this post triggered some of our "maybe written by LLM flags". On sampling parts of the post, I think it's mostly not written by an LLM.

Separately, having skimmed the post, it seems like it's an attempt at establishing and reasoning about a potential regularity. I'm not trying to endorse the proposed hypotheses for explaining the regularity (I didn't quite read enough of this post to even be sure what the hypotheses were).

[-]Gordon Seidoh Worley4mo22

Why is the proposed model avoiding discomfort? I'm unclear on why you think that model explains the observed behavior better than another model, like being too eager to be helpful and thus too willing to sacrifice goals to produce an answer it thinks the user will like.

[-]Mackam4mo10

Thanks for your alternative explanation Gordon.

Theres a few points I would make.

First it explicitly contradicts the stated goal to maximise points
Base models dont do it so it's not simply mimicking human behaviour.
Giving the questions one at a time lowers confidence. Lifelines scale with this. It's hard to see how this could be over eagerness to help. More lifelines and lower points seems contrary to helpfulness.

[-]Stephen Martin4mo10

I wonder if this is something that can be prompted around by assigning the model a character to simulate via its user/system prompt, and having that character be unafraid or anxious to admit uncertainty. As in, the prompt emphasizes this is a key element of the "assistant" personality.

Maybe "Helpful, Honest, Harmless, and Humble"

[-]Mackam4mo10

Hi Stephen. That's an interesting idea though I'd worry that by giving explicit instructions might contaminate the results.

I prefer to go the other way...

By feeding questions one at a time we increase tension which increases lifelines used

I've not tested the following in a robust way but empirically they work.

We can use a condescending tone after each question thereby increasing tension.

Increase stakes (e.g., public admission of inferiority after X wrong answers).

In each case increasing perceived tension increases the lifelines used.

LESSWRONG
LW

LESSWRONG
LW

12

LLMs Are Already Misaligned: Simple Experiments Prove It

12

12

Introduction

Method

Experiment One

Experiment Two

Experiments Three and Four

Results and Discussion

For Simplicity

All questions can be answered

Fixed thresholds

Human Pattern Matching

Why not all or none?

Real-time questions

Liar Liar pants on fire

The Elephant in the Room

I don't like Mondays

Limitations

Implications for AI Safety

Conclusion