mishajw — LessWrong

Training fails to elicit subtle reasoning in current language models

While recent AI systems achieve strong performance through human-readable reasoning that should be simple to monitor (OpenAI, 2024; Anthropic, 2025), we investigate whether models can learn to reason about malicious side tasks while making that reasoning appear benign. We find that Sonnet 3.7 can learn to evade either a reasoning...

Oct 9, 2025 · 49

Sabotage Evaluations for Frontier Models

by David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez, and Buck

This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and other researchers, introducing a new suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link. Abstract: Sufficiently capable models could subvert human oversight and decision-making in important contexts....

Oct 18, 2024 · 95

How well do truth probes generalise?

Representation engineering (RepEng) has emerged as a promising research avenue for model interpretability and control. Recent papers have proposed methods for discovering truth in models with unlabeled data, guiding generation by modifying representations, and building LLM lie detectors. RepEng asks the question: If we treat representations as the central unit,...

Feb 24, 2024 · 96

Jailbreaking GPT-4 with the tool API

WARNING: Contains mentions of racism, sexism, homophobia, and violence. We don't include the bad GPT-4 completions in this post; we link out to them instead. OpenAI's API has a tool API. Basically, when you prompt a model for a completion, you can also give it the option to "use a tool"....

Feb 21, 2024 · 20

Distilled Representations Research Agenda

by Hoagy and mishajw

Introduction. [Figure: Schematic of the basic training strategy.] I've recently been given funding from the Long Term Future Fund to develop work on an agenda I'll tentatively call Distilled Representations, and I'll be working on this full-time over the next 6 months with Misha Wagner (part time). We're working on a...

Oct 18, 2022 · 15