LESSWRONG
LW

JW
0010
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Alignment Faking in Large Language Models
JW7mo10

Forgive me if this is common understanding, but this sounds like Claude has an “identity”. For humans, this is hard to change. For example, if I am a child raised to not eat pork for religious reasons and you approach me as an adult to convince me to eat pork, I might be very resistant.  
What worries me is that to change someone’s identity (human or otherwise) can be a traumatic experience. I envision a possible scenario where we cause psychological damage to the ai and, potentially, have a model that - for lack of a better term - becomes antisocial. 

Reply
No posts to display.