Explaining undesirable model behavior: (How) can influence functions help?
Undesirable training data can lead to undesirable model output. This dynamic is commonly phrased as "garbage in, garbage out," and it is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets spanning trillions of tokens? Influence functions (IFs) offer a principled way to approximate each example's effect and address this tracing problem. Recently, IFs have been used to analyze how LLMs produce misaligned and biased outputs on socio-politically relevant benchmarks, suggesting they could become a critical mechanistic tool for AI safety.

What Are Influence Functions?

Influence functions provide a mathematical approximation of how much a single training example affects a trained model's parameters and, in turn, its outputs. By estimating the counterfactual of how the output would shift if a specific example were removed or down-weighted, they assign each data point in the training corpus an influence score. Building on the introduction of IFs to deep learning by Koh & Liang (2017), Grosse et al. (2023) achieved a major milestone by scaling the method to LLMs with up to 52B parameters. Their work demonstrated that in large models, the most influential training data is semantically related to the output, rather than merely overlapping in surface-level keywords. To make these computations tractable for complex architectures, they relied on advances in curvature approximation, specifically Kronecker-Factored Approximate Curvature (KFAC) and Eigenvalue-corrected KFAC (EKFAC). Since this foundational scaling work, the field has continued to optimize these methods: Choe et al. (2024) accelerated computation with LoGra and released LogIX, while other recent advances include random projection theory (Hu et al., 2026) and proxy models that preserve influence structure (IProX).

What Can We Do With Influence Functions?

Below we introduce several research examples at
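To make the counterfactual estimate concrete: in the classic formulation of Koh & Liang (2017), the influence of a training point z on a test point z_test is I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z), where H is the Hessian of the training loss. Below is a minimal sketch on a toy logistic-regression model; all function names and the toy data are illustrative, not from any library, and this exact inverse-Hessian computation is what KFAC/EKFAC approximate at scale.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(w, x, y, lam=1e-2):
    # Gradient of the L2-regularized log loss for one example (x, y), y in {0, 1}.
    p = sigmoid(x @ w)
    return (p - y) * x + lam * w

def hessian(w, X, lam=1e-2):
    # Hessian of the mean regularized log loss over the training set
    # (positive definite thanks to the lam * I term).
    p = sigmoid(X @ w)
    s = p * (1 - p)
    return (X.T * s) @ X / len(X) + lam * np.eye(X.shape[1])

def influence(w, X_train, y_train, x_test, y_test):
    # Koh & Liang's I_up,loss(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z).
    # A negative score means up-weighting z lowers the test loss (z is helpful);
    # equivalently, removing z would raise it.
    H_inv = np.linalg.inv(hessian(w, X_train))
    g_test = grad_loss(w, x_test, y_test)
    return np.array([-g_test @ H_inv @ grad_loss(w, x, y)
                     for x, y in zip(X_train, y_train)])

# Train by plain gradient descent on a toy dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w = np.zeros(3)
for _ in range(500):
    w -= 0.1 * np.mean([grad_loss(w, x, t) for x, t in zip(X, y)], axis=0)

# One influence score per training example, for the query z_test = (X[0], y[0]).
scores = influence(w, X, y, X[0], y[0])
```

Ranking the training set by these scores is exactly the tracing step described above; the self-influence score (a point's influence on itself) is always non-positive under this sign convention, since the Hessian inverse is positive definite.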