Scheming AIs: Will AIs fake alignment during training in order to get power?

This is a LessWrong sequence version of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”, available on arXiv here: https://arxiv.org/pdf/2311.08379.pdf. It’s a long report, and I’m hoping that having shorter sections available as separate posts will make them easier to digest, reference, and comment on.

The first post in the sequence contains a summary of the full report. The summary covers most of the main points and technical terms, and I'm hoping it will provide much of the context necessary to understand individual sections of the report on their own.

