Investigating Priming in Alignment Faking
This work was done as part of the pilot run of RECAP, a version of the ARENA program run in Singapore. I'm very grateful to @Jonathan N and @Clement Neo for creating this program. Code is available here. Previous work on alignment faking suggests that models can strategically deceive us...
Jul 9, 2025