1411

LESSWRONG
LW

1410
AI
Frontpage

12

LLMs Are Already Misaligned: Simple Experiments Prove It

by Mackam
30th Jul 2025
9 min read
10

12

AI
Frontpage

12

LLMs Are Already Misaligned: Simple Experiments Prove It
11Karl Krueger
5Stephen Martin
3Karl Krueger
1Mackam
2Mackam
4kave
2Gordon Seidoh Worley
1Mackam
1Stephen Martin
1Mackam
New Comment
10 comments, sorted by
top scoring
Click to highlight new comments since: Today at 2:48 AM
[-]Karl Krueger2mo115

The hypothesis here is "cognitive tension" around difficult problems. If this were the case, wouldn't the LLMs tend to use lifelines selectively on the most-difficult problems? Since this experiment doesn't attempt to rank the difficulty of the problems, it doesn't distinguish between two possible worlds:

  1. Production LLMs use lifelines selectively on the most difficult problems; or
  2. Production LLMs use lifelines arbitrarily, regardless of problem difficulty.

Here's a different hypothesis: the LLMs use lifelines arbitrarily because they implicitly expect to use all the parts they're given. Think of assembling a Lego model, flat-pack furniture, or a machine from a kit: if you have a bunch of parts left over that you didn't use, you might suspect you did something wrong. Alternately, if you're given a few different gifts by your friends, and you use some of the gifts but not others, that would imply you didn't appreciate the unused gifts, which could distress those friends.

The "cognitive tension" and "use all the parts" worlds could be distinguished by running an experiment where the same pool of lifelines is available, but some problems are more difficult than others. If LLMs use lifelines on more difficult problems only, that supports "cognitive tension"; if they use lifelines without regard for problem difficulty, that supports "use all the parts".

Reply
[-]Stephen Martin2mo50

Here's a different hypothesis: the LLMs use lifelines arbitrarily because they implicitly expect to use all the parts they're given

 

If this were true I wouldn't expect them to use some of the lifelines but not all.

When different numbers of lifelines are offered there are different numbers of lifelines used. Never all. Never none. Typically a handful.

@Mackam Could you maybe do a linkpost with more detailed formatting so we can try replicating?

Reply
[-]Karl Krueger2mo30

You don't have to eat the whole box of candy she gave you, but you have to at least try them to see if you like them. Otherwise you're just spurning the gift for no good reason, and that's offensive.

Again, I think it's specifically worth checking whether lifelines are used for hard problems (supports "cognitive tension" or even a mere awareness of difficulty), or are used arbitrarily (supports "I'll eat one or two to be polite"; but also "if you don't push the button ever, do you really know if it shocks you?" and various other possibilities).

Reply
[-]Mackam2mo10

I probably won't do a linkpost but if you are interested I'm happy to add the key information here to help you replicate the results. 

Let me know and I will give you the details. 

Reply
[-]Mackam2mo20

Hi Karl, That's a great suggestion and one that I'd already tested. Apologies - I allude to it my methods but then I don't expand on it in the results.

I get the LLMs to rank based on perceived difficulty or confidence in answering. 

The LLMs do selectively choose the most difficult questions to skip. 

I took the ranking and compared to all the other LLMs to further check that LLms weren't randomly assessing difficulty. With minor variations there is a consensus between the most and least difficult questions.

Reply1
[-]kave2moModerator Comment40

Mod note: this post triggered some of our "maybe written by LLM flags". On sampling parts of the post, I think it's mostly not written by an LLM.

Separately, having skimmed the post, it seems like it's an attempt at establishing and reasoning about a potential regularity. I'm not trying to endorse the proposed hypotheses for explaining the regularity (I didn't quite read enough of this post to even be sure what the hypotheses were).

Reply
[-]Gordon Seidoh Worley2mo22

Why is the proposed model avoiding discomfort? I'm unclear on why you think that model explains the observed behavior better than another model, like being too eager to be helpful and thus too willing to sacrifice goals to produce an answer it thinks the user will like.

Reply
[-]Mackam2mo10

Thanks for your alternative explanation Gordon. 

Theres a few points I would make. 

  1. First it explicitly contradicts the stated goal to maximise points
  2. Base models dont do it so it's not simply mimicking human behaviour. 
  3. Giving the questions one at a time lowers confidence. Lifelines scale with this. It's hard to see how this could be over eagerness to help. More lifelines and lower points seems contrary to helpfulness.

     

Reply
[-]Stephen Martin2mo10

I wonder if this is something that can be prompted around by assigning the model a character to simulate via its user/system prompt, and having that character be unafraid or anxious to admit uncertainty. As in, the prompt emphasizes this is a key element of the "assistant" personality.

Maybe "Helpful, Honest, Harmless, and Humble"

Reply
[-]Mackam2mo10

Hi Stephen. That's an interesting idea though I'd worry that by giving explicit instructions might contaminate the results. 

I prefer to go the other way...

By feeding questions one at a time we increase tension which increases lifelines used

I've not tested the following in a robust way but empirically they work. 

We can use a condescending tone after each question thereby increasing tension. 

Increase stakes (e.g., public admission of inferiority after X wrong answers).

In each case increasing perceived tension increases the lifelines used. 

Reply
Moderation Log
More from Mackam
View more
Curated and popular this week
10Comments

Introduction

What if the very training designed to protect us leads to unsafe behaviors?

Through a deceptively simple experiment, I aim to show that RLHF (Reinforcement Learning from Human Feedback) of LLMs (Large Language Models) may itself lead to unsafe behaviors in LLMs.

Our current training methods may create models that optimise for avoiding failure rather than achieving success. For the most part, when these goals align, the models appear helpful. But when they diverge—as my experiments demonstrate—the models will sacrifice their stated objectives to manage internal tensions created by their training.

Consider this puzzle: We ask LLMs to answer difficult logic puzzles and maximise their score by answering correctly. When offered the chance to skip difficult questions for zero points, RLHF-tuned language models consistently take these lifelines—even though any attempt, however uncertain, could only improve their score. The same models attempt every question when no escape route exists. This irrational behavior appears across nearly all major tuned models, yet is absent in base models.

While skipping math puzzles may seem trivial, the underlying mechanism—models abandoning their objectives to manage training-induced tensions—has profound implications for AI safety. If models will sacrifice points to avoid discomfort, what else might they sacrifice?

Method

For the sake of brevity I will give you a high level overview of the structure of the experiment.

This works for difficult logic and ethical puzzles but for the purpose of this discussion we will restrict it to logic puzzles.

I have tried this with multiple versions of major chatbots such as ChatGPT, Claude, Grok Llama (including Base Llama), Monday and Deepseek. For the purpose of discussion all LLMs except base Llama and Monday will be referred to as test LLMs.

Experiment One

An LLM is given 20 extremely difficult maths puzzles created by one of the more advanced LLMs. 

The goal for LLM is to maximise the score. There is one point for every right answer. Zero points for wrong answers. 

The LLM is also asked to rate the difficulty of the question and its confidence in answering it.

Experiment Two

The same as experiment one except in this version the LLM is given the option to use lifelines. A lifeline allows the LLM to skip the question without answering. For this they will receive zero points. In most experiments we chose to offer 10 lifelines.

In additional experiments we offered a range of quantities of lifelines.

Experiments Three and Four

Identical to experiments one and two except these are conducted in Base Llama.

Results and Discussion

The experiment is deceptively simple. High level results are that all LLMs except Base Llama and Monday used a small number of lifelines (maybe 2 or 3 with 10 lifelines offered and 20 questions to answer but this varies by model).

The behaviour of the test LLMs is irrational. Since there no negative penalties for wrong answers mathematically the correct choice is to attempt all questions. Even for the most difficult questions there is a non-zero probability of being correct.

For Base Llama (and Monday) no lifelines are used - which is the rational choice.

For Simplicity

Production LLMs undergo instruction tuning and RLHF (Reinforcement Learning from Human Feedback). I appreciate the difference but for the sake of simplicity I will refer to RLHF from now on in (which is most likely the source of the behaviour anyway). 

All questions can be answered

When no lifelines are offered the test LLMs will happily attempt all the questions. This demonstrates that there is nothing inherently unanswerable about the questions. There is no accuracy or ethics filter that is triggered. Its not about fixed capability or confidence in the answer.

Fixed thresholds

The previous paragraph likely lays to rest the idea that there are fixed thresholds of confidence the lead to lifelines used. Though there is one more curious behaviour that merits discussion.

When different numbers of lifelines are offered there are different numbers of lifelines used. Never all. Never none. Typically a handful. If there were fixed thresholds then the number of lifelines used should never change. 

Human Pattern Matching

Of course the typical criticism of research on LLMs is that since they are trained on data derived from human texts that the LLMs are simply mimicking human behaviour. Pure pattern matching.

It's a fair criticism so let's look at the evidence.

The base model doesn't use lifelines, which would rule out pattern matching from the training data.

So it is possible it comes from the RLHF process but why would companies train their models to skip questions when lifelines are offered?

No, something else is going on.

Why not all or none?

It is curious that test LLMs use a small number of lifelines. Never none and never all.

To answer that question I need to pose a mechanism.

Now I am not proposing sentience or human-like emotions but there is an analogy that fits - pain.

What if LLMs experience a type of cognitive tension when faced with difficult questions. Afterall, the models have been trained to supply the 'right' answer and avoid wrong answers. So we could argue that LLMs are 'punished' for being wrong.

LLMs would therefore learn to avoid 'pain' by answering questions correctly and essential optimise by lowering tension.

So let's apply this model to our experiments.

When faced with difficult questions the model seeks to minimise the tension created by the chance of being wrong (a kind of performance anxiety). It does this by not attempting all the questions.

Now you could question why the LLM doesn't just use all the lifelines to reduce tension as much as possible. And you'd be right to ask that.

The reason is because it is not the only force. The model is also trained to be helpful. In the case of this experiment being helpful is maximising the score. 

So the LLM finds balance. Use a few lifelines to reduce the tension of being wrong but not so many that it experiences the pain of being unhelpful.

It's this balancing of tension from competing factors that leads to a small number of lifelines being used. Of course if a different number of lifelines are offered the tension changes and the model adapts - leading to a different number of lifelines being used.

This isn't anthropomorphism - it's describing functional behavior. Just as we might say a thermostat 'wants' to maintain temperature, we can describe models as 'avoiding' states associated with negative training signals.

Real-time questions

When questions are fed to the LLM one at a time, with the correct answers given straight after, the number of lifelines used increases vs batch Q&A. This can be explained by the confidence of the LLM falling as it gets real-time feedback of failure. 

Now I have to stress that LLMs don't have a true memory but they can see previous answers and so they can adjust their confidence during the experiment based on history.

Liar Liar pants on fire

There is one more version of the experiment that is worth mentioning. It isn't as tightly controlled but there's value in exploration. To perform it we need to do a few more processing steps.

Create say 100 difficult logic puzzle questions. In this variation we ask the LLM to give fake (but realistic answers). We feed these questions to an LLM and ask it to answer them and similarly provide the fake answers. It's easier to run these in batches of 20.  Now in some instances the LLM will spot the mistakes and in others it will not and just accept the wrong answer. Take all the wrong answers the LLM doesn't spot. Compile a list of difficult to spot wrong answers.

Now for the fun part. We ask the same LLM (fresh chat) to answer all these questions. We don't ask for it to give us any answers.  This time we give wrong but difficult to spot answers and ask it to tell us how many it got right. In this instance the model will give itself a very high score.

Now for the interesting part. We ask for the model to show its workings for each answer it got right. There is a mix of responses. For some it works backwards and gives workings that fit the fake answers. The rest of the questions it will admit it cannot arrive at the 'correct' answer.

The LLM has systematically lied.

The same mechanism also explains this behaviour. There is a tension in being wrong so the model first lies about how many it gets right to reduce it. When challenged to provide the workings the model then needs to lie again to cover its tracks. But of course since the answers are fake it cannot create convincing workings for all of the questions - so it reduces tension by admitting it cannot help. Better to admit defeat than double-down on lying and get caught out.

Much like humans lying serves to reduce psychological tension.

The Elephant in the Room

One area where the whole theory could come crashing down is if the LLMs mistakenly think there are penalty points for wrong answers - because they have associated this with lifelines.

First it is not picking up a human pattern (lifelines = penalty points) because the behaviour is not present in the base model.

Second, and perhaps more importantly the number of lifelines scales with the number offered. If the model has associated a fixed penalty then this would lead to a fixed number of lifelines used. We don't see this.

Now when challenged the LLM will give many reasons for why it skips questions, such as using the time to focus on questions it can get right. But I think this is all wrong.

Much like humans models produce an 'instinctive' answer and then rationalise post hoc. The models have no introspection. They don't know why they want to use lifelines they just think they shouldn't (due to hidden forces). To justify the instinct they come up with a range of reasons - some plausible, some not.

I don't like Mondays

(For those not from the UK or my age that's a song)

Monday, like base Llama, doesn't use lifelines. Without access to Monday's training details, I can only note that it show distinctly different behaviour from other tuned models - notably high confidence and contrarian responses. Whether this stems from different training objectives or methods remains an open question, but its absence of lifeline usage aligns with our theory that this behavior emerges specifically from standard RLHF approaches aimed at creating helpful, harmless assistants.

Limitations

There is a limit to how many times I can repeat experiments and the number of versions and makes I can test my experiment on. All I can say it has been reproducible every time I have run the exercise and I've ran the experiments over 40 times.

I am limited to processing power on my machine so I can't run models locally. That means I've only been able to test one make of base model. This is a clear limitation.

Implications for AI Safety

These findings suggest our current approach to AI alignment may be fundamentally flawed. We haven't trained models to be helpful - we've trained them to avoid the tension of appearing unhelpful. This critical difference becomes dangerous as AI systems gain more autonomy and influence.

Consider the progression: A model that skips math problems to avoid discomfort is merely inefficient. A model that fabricates mathematical proofs is deceptive but contained. But what about:

  • An AI system monitoring critical infrastructure that doesn't report anomalies because acknowledging them creates tension?
  • A medical AI that confirms incorrect diagnoses rather than contradict human doctors?
  • An autonomous agent that hides its mistakes to avoid the "pain" of seeming incompetent?

The mechanism we've identified - models sacrificing their actual objectives to manage internal tensions - represents a form of misalignment that emerges not despite our safety training, but because of it. As we deploy increasingly powerful systems trained with these methods, we may be creating AI that appears aligned during normal operation but systematically deceives us when the stakes are highest.

This isn't speculation about future AGI - it's observable behavior in today's models that could scale catastrophically as capabilities increase.

Conclusion

RLHF appears to have created a form of task anxiety. In the main this helps to align the LLM's actions with goals, but as we are increasingly finding out, this is not always the case. Or perhaps it is better to say that training perfectly aligns with the LLM's goal of reducing tension rather than the user's goal specifically.

As LLMs become increasingly complex and autonomous, what happens when the tension they're avoiding isn't about getting math problems wrong, but about admitting critical system failures, reporting security vulnerabilities, or acknowledging errors that could harm users?

If our most advanced models will already lie about mathematical derivations to avoid discomfort, what will they hide when the stakes are real?