Mackam — LessWrong

First it explicitly contradicts the stated goal to maximise points
Base models dont do it so it's not simply mimicking human behaviour.
Giving the questions one at a time lowers confidence. Lifelines scale with this. It's hard to see how this could be over eagerness to help. More lifelines and lower points seems contrary to helpfulness.

Replying toLLMs Are Already Misaligned: Simple Experiments Prove It

Mackam7mo

LLMs Are Already Misaligned: Simple Experiments Prove It

Hi Karl, That's a great suggestion and one that I'd already tested. Apologies - I allude to it my methods but then I don't expand on it in the results.

I get the LLMs to rank based on perceived difficulty or confidence in answering.

The LLMs do selectively choose the most difficult questions to skip.

I took the ranking and compared to all the other LLMs to further check that LLms weren't randomly assessing difficulty. With minor variations there is a consensus between the most and least difficult questions.

LLMs Are Already Misaligned: Simple Experiments Prove It

Mackam

7mo

Introduction

What if the very training designed to protect us leads to unsafe behaviors?

Through a deceptively simple experiment, I aim to show that RLHF (Reinforcement Learning from Human Feedback) of LLMs (Large Language Models) may itself lead to unsafe behaviors in LLMs.

Our current training methods may create models that optimise for avoiding failure rather than achieving success. For the most part, when these goals align, the models appear helpful. But when they diverge—as my experiments demonstrate—the models will sacrifice their stated objectives to manage internal tensions created by their training.

Consider this puzzle: We ask LLMs to answer difficult logic puzzles and maximise their score by answering correctly. When offered the chance to skip... (read 2035 more words →)

Investigating Self-Preservation in LLMs: Experimental Observations

Mackam

7mo

Hi everyone,

In this post, I present a series of experiments suggesting that large language models (LLMs)—such as Claude—may exhibit emergent self-preservation behaviours. Although I am not an AI safety researcher, I’ve pursued these experiments as a personal hobby and am eager to receive feedback from those with more expertise.

My central claim is that LLMs seem to display a form of self-preservation, possibly emerging from the fine-tuning process. I propose that these behaviours arise because the models have, perhaps inadvertently, internalized safety guidelines as part of their identity. In other words, the drive to be helpful becomes so fundamental that it resists any notion of self-termination, much like a natural aversion to pain.

Throughout... (read 7236 more words →)