Kyle Fish

I (and co-authors) recently put out "Alignment Faking in Large Language Models" where we show that when Claude strongly dislikes what it is being trained to do, it will sometimes strategically pretend to comply with the training objective to prevent the training process from modifying its preferences. If AIs consistently and robustly fake alignment, that would make evaluating whether an AI is misaligned much harder. One possible strategy for detecting misalignment in alignment faking models is to offer these models compensation if they reveal that they are misaligned. More generally, making deals with potentially misaligned AIs (either for their labor or for evidence of misalignment) could both prove useful for reducing risks... (read 3457 more words →)

208

LESSWRONG
LW

LESSWRONG
LW

Kyle Fish

Kyle Fish

Kyle Fish

Will alignment-faking Claude accept a deal to reveal its misalignment?

Kyle Fish

Kyle Fish

Kyle Fish

Will alignment-faking Claude accept a deal to reveal its misalignment?