Paul Colognese

Explaining the AI Alignment Problem to Tibetan Buddhist Monks

Introduction: As part of an exchange being facilitated between religion and science, a group of academics has been asked to compile a short description of their greatest scientific achievement/discovery that will be translated into Tibetan and presented to Tibetan Buddhist scholars/monks.[1] I was also invited to contribute, but I sort...

Mar 7, 2024 (20)

Anomalous Concept Detection for Detecting Hidden Cognition

Thanks to Johannes Treutlein, Erik Jenner, Joseph Bloom, and Arun Jose for their discussions and feedback. Summary: Monitoring an AI’s internals for features/concepts unrelated to the task the AI appears to be performing may help detect when the AI is performing hidden cognition. For example, it would be very suspicious...

Mar 4, 2024 (24)

Hidden Cognition Detection Methods and Benchmarks

Thanks to Johannes Treutlein for discussions and feedback. Introduction: An AI may be able to hide cognition that leads to negative outcomes (such as deceptive alignment/scheming) from certain oversight processes. Without being able to detect this hidden cognition, an overseer may not be able to prevent the associated negative outcomes...

Feb 26, 2024 (22)

Notes on Internal Objectives in Toy Models of Agents

Thanks to Jeremy Gillen and Arun Jose for discussions related to these ideas. Summary: WARNING: The quality of this post is low. It was sitting in my drafts folder for a while, yet I decided to post it because some people found these examples and analyses helpful in conversations. I...

Feb 22, 2024 (16)

Internal Target Information for AI Oversight

Thanks to Arun Jose for discussions and feedback. Summary: In this short post, we discuss the concept of Internal Target Information within agentic AI systems, arguing that agentic systems possess internal information about their targets. This information, we propose, can potentially be detected and interpreted by an overseer before the...

Oct 20, 2023 (15)

Potential alignment targets for a sovereign superintelligent AI

I'd like to compile a list of potential alignment targets for a sovereign superintelligent AI. By an alignment target, I mean something like what goals/values/utility function we might want to instill in a sovereign superintelligent AI (assuming we've solved the alignment problem). Here are some alignment targets I've come across:...

Oct 3, 2023 (29)

High-level interpretability: detecting an AI's objectives

Thanks to Monte MacDiarmid (for discussions, feedback, and experiment infrastructure) and to the Shard Theory team for their prior work and exploratory infrastructure. Thanks to Joseph Bloom, John Wentworth, Alexander Gietelink Oldenziel, Johannes Treutlein, Marius Hobbhahn, Jeremy Gillen, Bilal Chughtai, Evan Hubinger, Rocket Drew, Tassilo Neubauer, Jan Betley, and Juliette...

Sep 28, 2023 (72)

Decision Transformer Interpretability

High-level interpretability: detecting an AI's objectives

Deception?! I ain’t got time for that!

Auditing games for high-level interpretability