LESSWRONG
LW

J Rosser — LessWrong

+1 to this!

Maybe a neat solution might be to instead use a deploy key (gives read/write access just to one repo) instead of adding a full ssh key?

https://docs.github.com/en/authentication/connecting-to-github-with-ssh/managing-deploy-keys

Speedrunning a Mech Interp Research Setup (Remote GPU, Torch, TransformerLens, Cuda, SSH, VS Code)

J Rosser

12d

I genuinely think this is the fastest way to get set up on a brand-new mech-interp project. It takes you from nothing to a fully working remote GPU dev environment (SSH, VS Code/Cursor, CUDA, PyTorch, TransformerLens, GitHub, and UV) with as little friction as possible. It’s exactly the setup I wish I’d had when starting new projects, and it’s especially handy if you’re working on projects for a Neel Nanda’s MATS stream application and just want to get to research, not infrastructure.

This guide is especially well-suited for:

People starting a brand-new mechanistic interpretability project who want to go from “nothing on a machine” to a working research environment as fast as possible.
Researchers or

... (read 1048 more words →)

Transmitting Misalignment with Subliminal Learning via Paraphrasing

Matthew Bozoukov

Matthew Bozoukov, Taywon Min, CallumMcDougall, J Rosser

2mo

TLDR: We find subliminal learning can occur through paraphrasing datasets, meaning that fine-tuned models can inherit unintended bias from seemingly innocuous data that resembles in-the-wild natural language data. This implies that paraphrasing datasets using biased teachers may be used as an avenue of attack for malicious actors! While the recent Subliminal Learning Across Models study investigates subliminal learning via open-form generation, our work instead focuses on creating realistic datasets that incur bias using paraphrasing. The code and models for this project are here Matthew-Bozoukov/subliminal-learning-paraphrasing.

Introduction

Subliminal learning is a phenomenon where language models transmit traits via semantically unrelated data, such as random number sequences. Specifically, Cloud et al. (2025) show that when (1) a... (read 2702 more words →)

The Case for White Box Control

J Rosser

10mo

In this article, I explore the emerging field of white box control; a set of methods that could detect deception even when it doesn't appear in model outputs.^[1]

What is AI Control?

Please feel free to skip this section if you're already up to speed!

As highlighted by Ben Pace, there are two things you can do to prevent catastrophes caused by misaligned or scheming AI:

Alignment: Ensure that your models aren't scheming.
Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures.

To justify that systems are safe for deployment, we usually form safety cases - structured, evidence-based arguments that a system is safe. In... (read 1288 more words →)