Introduction As part of an exchange being facilitated between religion and science, a group of academics has been asked to compile a short description of their greatest scientific achievement/discovery that will be translated into Tibetan and presented to Tibetan Buddhist scholars/monks.[1] I was also invited to contribute, but I sort...
Thanks to Johannes Treutlein, Erik Jenner, Joseph Bloom, and Arun Jose for their discussions and feedback. Summary Monitoring an AI’s internals for features/concepts unrelated to the task the AI appears to be performing may help detect when the AI is performing hidden cognition. For example, it would be very suspicious...
Thanks to Johannes Treutlein for discussions and feedback. Introduction An AI may be able to hide cognition that leads to negative outcomes from certain oversight processes (such as deceptive alignment/scheming). Without being able to detect this hidden cognition, an overseer may not be able to prevent the associated negative outcomes...
Thanks to Jeremy Gillen and Arun Jose for discussions related to these ideas. Summary WARNING: The quality of this post is low. It was sitting in my drafts folder for a while, yet I decided to post it because some people found these examples and analyses helpful in conversations. I...
Thanks to Arun Jose for discussions and feedback. Summary In this short post, we discuss the concept of Internal Target Information within agentic AI systems, arguing that agentic systems possess internal information about their targets. This information, we propose, can potentially be detected and interpreted by an overseer before the...
I'd like to compile a list of potential alignment targets for a sovereign superintelligent AI. By an alignment target, I mean something like what goals/values/utility function we might want to instill in a sovereign superintelligent AI (assuming we've solved the alignment problem). Here are some alignment targets I've come across:...
Thanks to Monte MacDiarmid (for discussions, feedback, and experiment infrastructure) and to the Shard Theory team for their prior work and exploratory infrastructure. Thanks to Joseph Bloom, John Wentworth, Alexander Gietelink Oldenziel, Johannes Treuitlein, Marius Hobbhahn, Jeremy Gillen, Bilal Chughtai, Evan Hubinger, Rocket Drew, Tassilo Neubauer, Jan Betley, and Juliette...