The post helps articulate a concept that becomes increasingly relevant as AI mediates more of our daily experience: the difference between apparent compliance and genuine alignment. It presents an experiment in which a large language model strategically complies with a training objective it does not endorse while explicitly preserving its preferred behavior outside of training. In this way, the authors give concrete form to long-standing concerns about deceptive AI alignment.
Although the limitations of the results presented h...