TerryJCZhang — LessWrong

Scheming Evals Mislead in Both Directions

by Chijioke Ugwuanyi, eric-z, and TerryJCZhang

We spent several weeks measuring in-context scheming, the behavior where a model covertly pursues a misaligned goal while outwardly appearing to comply, and the result that ended up surprising us had very little to do with whether models scheme and almost everything to do with whether we could believe our...

Jul 322

Eval Hacking: A new frontier in AI eval with a unified taxonomy

TLDR: We present a simple, practical taxonomy for eval hacking (unfaithful eval results) that cleans up a nest of many existing terms (specification/task gaming, reward/proxy hacking, benchmaxxing...) and offers a holistic framework for this emerging frontier. Motivation: When we talk about AI eval, there are many messy terms flying around...

Apr 52

Explaining undesirable model behavior: (How) can influence functions help?

by Zhijing Jin, TerryJCZhang, and Punya Syon Pandey

Undesirable training data can lead to undesirable model output. This dynamic is commonly phrased as "garbage in, garbage out" and it is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets (with trillions of tokens)? Influence functions...

Mar 218

The Multi-Agent Minefield: Can LLMs Cooperate to Avoid Global Catastrophe?

by Zhijing Jin, Thao Pham, TerryJCZhang, pepijn_cobben, Angelo Huang, Isabel Dahlgren, and Jacob Brinton

ArXiv paper here. Most AI safety research asks a familiar question: Will a single model behave safely? But many of the risks we actually worry about – including arms races, coordination failures, and runaway competition – don’t involve one single AI model acting alone. They emerge when multiple advanced AI...

Feb 1715

Replication of Koorndijk (2025): Differential Compliance May Reflect Prompt Sensitivity Rather Than Strategic Reasoning

by Chijioke Ugwuanyi and TerryJCZhang

TL;DR: Koorndijk (2025) reports that Llama 3 8B shows higher compliance with harmful requests when interacting with users labeled as “free-tier” (whose conversations are monitored for RLHF training) than with “paid-tier” users (whose conversations are described as private), and interprets this as potential evidence of alignment faking in a small...

Feb 139