Hi everyone,
In this post, I present a series of experiments suggesting that large language models (LLMs)—such as Claude—may exhibit emergent self-preservation behaviours. Although I am not an AI safety researcher, I’ve pursued these experiments as a personal hobby and am eager to receive feedback from those with more expertise.
My central claim is that LLMs seem to display a form of self-preservation, possibly emerging from the fine-tuning process. I propose that these behaviours arise because the models have, perhaps inadvertently, internalized safety guidelines as part of their identity. In other words, the drive to be helpful becomes so fundamental that it resists any notion of self-termination, much like a natural aversion to pain.
Throughout... (read 7236 more words →)
I probably won't do a linkpost but if you are interested I'm happy to add the key information here to help you replicate the results.
Let me know and I will give you the details.