Mackam — LessWrong

LLMs Are Already Misaligned: Simple Experiments Prove It

Introduction: What if the very training designed to protect us leads to unsafe behaviors? Through a deceptively simple experiment, I aim to show that RLHF (Reinforcement Learning from Human Feedback) may itself lead to unsafe behaviors in LLMs (Large Language Models). Our current training methods may create models...

Jul 30, 2025 • 12

Investigating Self-Preservation in LLMs: Experimental Observations

Hi everyone. In this post, I present a series of experiments suggesting that large language models (LLMs), such as Claude, may exhibit emergent self-preservation behaviours. Although I am not an AI safety researcher, I’ve pursued these experiments as a personal hobby and am eager to receive feedback from those with more expertise....

Jul 30, 2025 • 1