danwil

Most Current Model Organisms Leak: Perplexity Differencing Often Reveals Finetuning Objectives

by Luca Baroni, Mo Abu Baker, and danwil

Authors: Mohammad Abu Baker, Luca Baroni, Daniel Wilhelm
Paper: https://arxiv.org/abs/2605.00994
Code: https://github.com/z3research/ppldiff-paper
Twitter thread: https://x.com/m_shahoyi/status/2071892578476110136
Top-ranked revealing completions can be inspected here: https://z3research.org/

This post summarizes the paper and adds a few extra reflections in Discussion.

TL;DR
* We found that many current publicly available model organisms (MOs) "leak" instilled...

Jul 1•26

Widening AI Safety's talent pipeline by meeting people where they are

by Ruben Castaing, yanni kyriacos, Nelson Gardner-Challis, and danwil

Summary: The AI safety field has a pipeline problem: many skilled engineers and researchers are locked out of full‑time overseas fellowships. Our answer is the Technical Alignment Research Accelerator (TARA), a 14‑week, part‑time program designed for talented professionals and students who can’t put their careers or studies on hold...

Sep 25, 2025•34

Tokenized SAEs: Infusing per-token biases.

by tdooms and danwil

tl;dr
* We introduce the notion of adding a per-token decoder bias to SAEs. Put differently, we add a lookup table indexed by the last seen token. This results in a Pareto improvement across existing architectures (TopK and ReLU) and models (on GPT-2 small and Pythia 1.4B). Attaining the same...

Aug 4, 2024•20