Akbir Khan — LessWrong

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

by John Hughes, abhayesian, Akbir Khan, and Fabien Roger

In this post, we present a replication and extension of an alignment faking model organism:
* Replication: We replicate the alignment faking (AF) paper and release our code.
* Classifier Improvements: We significantly improve the precision and recall of the AF classifier. We release a dataset of ~100 human-labelled examples...

Apr 8, 2025 · 148

Automated Researchers Can Subtly Sandbag

by gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, and Fabien Roger

tl;dr: When prompted, current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.5 Sonnet (new) can only sandbag effectively when seeing a one-shot example, while Claude 3.7 Sonnet can do this without an example (zero-shot). We are not yet worried about sabotage...

Mar 26, 2025 · 44

Auditing language models for hidden objectives

by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, and evhub

We study alignment audits (systematic investigations into whether an AI is pursuing hidden objectives) by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract: We study the feasibility of conducting...

Mar 13, 2025 · 155

Debating with More Persuasive LLMs Leads to More Truthful Answers

We've just completed a bunch of empirical work on LLM debate, and we're excited to share the results. If the title of this post is at all interesting to you, we recommend heading straight to the paper. There are a lot of interesting results that are hard to summarize, and...

Feb 7, 2024 · 89

Why multi-agent safety is important

Alternative Title: Mo’ Agents, Mo’ Problems. This was a semi-adversarial collaboration with Robert Kirk; views are mostly Akbir’s, clarified with Rob’s critiques. Target Audience: Machine Learning researchers who are interested in Safety and researchers who are focusing on Single Agent Safety problems. Context: Recently I’ve been discussing how to meaningfully...

Jun 14, 2022 · 10

Why we need prosocial agents

Context: I recently submitted an application to the Open Philanthropy AI Fellowship, and ended up thinking a lot about why studying Multi-Agent Interactions is important. To build machine-learning systems (agents) that are useful in the real world, they need to be able to cooperate with each other and with humans,...

Nov 2, 2021 · 7