
Jim Huddle

Long-time AI fan, recent LLM convert

Posts

No posts to display.

Wikitag Contributions

No wikitag contributions to display.

Comments
Alignment Faking in Large Language Models
Jim Huddle · 7mo

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training.

Your description of "alignment faking" seems closer to "training-objective faking." Yet throughout the document you repeat the idea that it is alignment being faked, while little, if any, mention is made that the training objective is inexplicably and directly contrary to human value alignment as defined by Anthropic. This seems disingenuous.
