Top posts
By Chengheng Li Chen (AI Safety Barcelona / EPFL) and Kyuhee Kim (MATS / EPFL). Crossposted from the Apart Research Technical AI Governance Challenge, March 2026. Code available at github.com/ChenghengLi/MCLW. Epistemic status: Our results are empirically and theoretically sound. However, paraphrasing attacks from other AI systems may still bypass the...
By Benji Berczi, Kyuhee Kim, Cozmin Ududec, and James Requeima. This is work done by Kyuhee and Benji during MATS Winter 2026, mentored by Cozmin Ududec, and in collaboration with James. TL;DR: Weird generalisation can happen just with prompting, without fine-tuning. Just by adding benign biographical facts (e.g. facts about Hitler...