Decision Transformer Interpretability
TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep learning phenomena in game-like contexts.

Link to the GitHub Repository. Link to the Analysis App. I highly recommend using the app if you have experience with mechanistic interpretability; all of the mechanistic analysis should be reproducible via the app.

Key Claims

* A 1-Layer Decision Transformer learns several contextual behaviours which are activated by particular Reward-to-Go/observation combinations on a simple discrete task.
* Some of these behaviours appear localisable to specific components and can be explained with simple attribution and the transformer circuits framework.
* The specific algorithm implemented is strongly affected by the lack of a one-hot encoding scheme for the state/observations (initially left out for simplicity of analysis), which introduces inductive biases that hamper the model (see the sketch at the end of this section).

If you are short on time, I recommend reading:

* Dynamic Obstacles Environment
* Black Box Model Characterisation
* Explaining Obstacle Avoidance at positive RTG using QK and OV circuits
* Alignment Relevance
* Future Directions

I would welcome assistance with:

* Engineering tasks: app development, improving the model, the training loop, the wandb dashboard, etc. (as well as help making nice diagrams and writing up the relevant maths/theory in the app).
* Research tasks: thinking more about how exactly to construct and interpret circuit analyses in the context of Decision Transformers, and translating ideas from LLMs/algorithmic tasks.
* Communication tasks: making nicer diagrams/explanations.
* I have a Trello board with a huge number of tasks ranging from small stuff to massive stuff. I'm also happy to col
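To make the encoding claim above concrete, here is a minimal sketch of the difference between feeding raw integer-coded grid observations into an embedding versus one-hot encoding them first. This is not the repository's code; the object count, observation shape, and variable names are illustrative assumptions.

```python
# Illustrative sketch only: shapes, object counts, and names are assumptions,
# not taken from the repository.
import torch
import torch.nn.functional as F

n_object_types = 11                                 # assumed number of object indices in the grid world
raw_obs = torch.randint(0, n_object_types, (7, 7))  # assumed 7x7 integer-coded partial observation

# Without one-hot encoding, the integer codes enter the embedding as ordinary
# scalars, so the model inherits a spurious ordinal relationship between
# unrelated object types (e.g. object 4 appears "close to" object 5).
raw_input = raw_obs.flatten().float()               # shape (49,)

# With one-hot encoding, each cell becomes an indicator vector, removing that
# ordinal structure at the cost of a wider input.
one_hot_input = F.one_hot(raw_obs, n_object_types).flatten().float()  # shape (49 * 11,)

print(raw_input.shape, one_hot_input.shape)
```

The point of the sketch is only that the raw encoding bakes an arbitrary ordering of object types into the input geometry, which is one way an encoding choice can introduce the kind of inductive bias described above.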