Kyuhee Kim — LessWrong

Personascope: Measuring how deeply LLMs adopt personas

by Benji Berczi, Kyuhee Kim, Sid Black, and Cozmin Ududec

Benji Berczi, Kyuhee Kim, James Requeima, Sid Black, Cozmin Ududec. This is work done by Benji and Kyuhee during MATS Winter 2026, mentored by Cozmin Ududec and advised by James and Sid. Figure 1. A model can take on a persona fully in voice while not changing its behaviour at...

Jul 738

Watermarking AI Text with Markov Chains: What Works and What Doesn't

By Chengheng Li Chen (AI Safety Barcelona / EPFL) and Kyuhee Kim (MATS / EPFL). Crossposted from the Apart Research Technical AI Governance Challenge, March 2026. Code available at github.com/ChenghengLi/MCLW. Epistemic status: Our results are empirically and theoretically sound. However, paraphrasing attacks from other AI systems may still bypass the...

Apr 171

In-context learning alone can induce weird generalisation

by Cozmin Ududec, Benji Berczi, and Kyuhee Kim

Benji Berczi, Kyuhee Kim, Cozmin Ududec, James Requeima. This is work done by Kyuhee and Benji during MATS Winter 2026, mentored by Cozmin Ududec, and in collaboration with James. TL;DR: Weird generalisation can happen just with prompting, without fine-tuning. Just by adding benign biographical facts (e.g. facts about Hitler...

Feb 2571