Shi

Can Frontier Models Autocomplete Safety Research?

We pose the following research question: how can we measure the "research taste" of language models in experiment planning? What parts of planning taste remain intrinsic to humans? TL;DR. A future where “tasteless autoresearch” improves capabilities but not safety is plausible and dangerous. We need rough tests of tasteful planning...

Jul 1220

Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking

by lennie, joanv, Shi, and Jacob Pfau

The first three sections are written for a general TAIS reader who wants to understand what the state of Debate research is and some high-level takeaways of our work. A reader familiar with Debate may like to skip the setup and start with our presentation of An illustrative training run....

Jul 277

Research agenda: Interpretive debate

TL;DR: * Safety assurance (safety cases, heuristic arguments) involves many interpretive questions about the model’s mechanisms (e.g., motivation). These are fundamental ML problems with a new layer of difficulty due to conceptual ambiguity. * Adversarial robustness and non-human-likeness of models create epistemic challenges in our empirical investigation of interpretive questions:...

Jun 1835

Updates on performative misalignment

by David Vella Zarb, Rustem, Taywon Min, and Shi

Note: this post is an explainer for our recent paper, as well as an update our previous blog post. The work was done by David, Rustem and Taywon under the mentorship of Shi Feng during MATS 9.1, with research management by Jinghua Ou. Scheming or performative scheming? (This section introduces...

Jun 1229

Sycophancy Towards Researchers Drives Performative Misalignment

by Taywon Min, Rustem, David Vella Zarb, and Shi

This work was done by Rustem Turtayev, David Vella Zarb, and Taywon Min during MATS 9.0, mentored by Shi Feng, based on prior work by David Baek. We are grateful to our research manager Jinghua Ou for helpful suggestions on this blog post. TL;DR: in this construct validity exercise, we...

Mar 1810

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

by Arush, Shawn Zhou, Jiaxin Wen, and Shi

TL;DR * Emergent Misalignment (EM) is correlated with model identity, we find two pieces of evidence for this: * EM suppresses self-recognition capabilities. Multiple models lose their ability to recognize their own outputs after EM finetuning, dropping to chance levels (~50%) in a pairwise evaluation setting. * EM depends on...

Mar 1547

I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs

We demonstrate that LLMs can infer information about past personas from a set of nonsensical but innocuous questions and binary answers (“Yes.” vs “No.”, inspired by past work on deception detection) in context, and act upon them in safety-related questions. This is despite the questions bearing no semantic relation to...

Sep 8, 202533

Shi

Shi

Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking

LLM Evaluators Recognize and Favor Their Own Generations

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

Research agenda: Interpretive debate

Shi

Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking

LLM Evaluators Recognize and Favor Their Own Generations

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

Research agenda: Interpretive debate

Can Frontier Models Autocomplete Safety Research?

Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking

Research agenda: Interpretive debate

Updates on performative misalignment

Sycophancy Towards Researchers Drives Performative Misalignment

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs