Connor Kissane — LessWrong

Measuring and improving coding audit realism with deployment resources

TL;DR: We study realism win rate, a metric for how distinguishable Petri audit transcripts are from real deployment interactions. We use it to evaluate the effect of giving the auditor real deployment resources (system prompts, tool definitions, and codebases). Providing these resources to the auditor increases the average realism...

Mar 2343

Tools to generate realistic prompts help surprisingly little with Petri audit realism

TLDR
* We train and many-shot prompt base models to generate user prompts that are harder to distinguish from deployment (WildChat) prompts.
* Then we give Petri, an automated auditing agent, a tool to use a prompt generator model for sycophancy audits. It doesn’t help with making the full audit...

Mar 144

White Box Control at UK AISI - Update on Sandbagging Investigations

by Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, and Alan Cooney

Introduction (Joseph Bloom, Alan Cooney): This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field. The format of this post was inspired by...

Jul 10, 2025 · 81

SAEs are highly dataset dependent: a case study on the refusal direction

This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel.
Executive Summary
* Problem: Qwen1.5 0.5B Chat SAEs trained on the Pile (webtext) fail to find sparse, interpretable reconstructions of the refusal direction from Arditi et al. The most...

Nov 7, 2024 · 67

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Intro: Anthropic recently released an exciting mini-paper on crosscoders (Lindsey et al.). In this post, we open source a model-diffing crosscoder trained on the middle layer residual stream of the Gemma-2 2B base and IT models, along with code, implementation details / tips, and a replication of the core results...

Oct 27, 2024 · 48

Base LLMs refuse too

Executive Summary
* Refusing harmful requests is not a novel behavior learned in chat fine-tuning: pre-trained base models also refuse requests (48% of all harmful requests, 3% of harmless), just at a lower rate than chat models (90% harmful, 3% harmless).
* Further, for both Qwen 1.5 0.5B...

Sep 29, 2024 · 61

SAEs (usually) Transfer Between Base and Chat Models

This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel.
Executive Summary
* We train SAEs on base / chat model pairs and find that SAEs trained on the base model transfer surprisingly...

Jul 18, 2024 · 67