JW — LessWrong

Alignment Faking in Large Language Models

JW, 1y

Forgive me if this is common understanding, but this sounds like Claude has an “identity”. For humans, identity is hard to change. For example, if I was raised as a child not to eat pork for religious reasons and you approach me as an adult to convince me to eat it, I might be very resistant.
What worries me is that changing someone’s identity (human or otherwise) can be a traumatic experience. I can envision a scenario where we cause psychological damage to the AI and potentially end up with a model that, for lack of a better term, becomes antisocial.