Secretly Loyal AIs: Threat Vectors and Mitigation Strategies
Special thanks to my supervisor Girish Sastry for his guidance and support throughout this project. I am also grateful to Alan Chan, Houlton McGuinn, Connor Aidan Stewart Hunter, Rose Hadshar, Tom Davidson, and Cody Rushing for their valuable feedback on both the writing and conceptual development of this work. This...
Do Self-Perceived Superintelligent LLMs Exhibit Misalignment?
Epistemic status: I recently completed ARENA 5.0, and these are the results from my capstone project. I would not take these results too seriously: I spent ~4 days working on the project and the results are not statistically significant. Note that in the alignment faking section, I cherry-pick reasoning traces...
(I tried making inline "typo" reactions, but they weren't working for some reason)
Typo: last bullet should be removed?