Replication of Koorndijk (2025): Differential Compliance May Reflect Prompt Sensitivity Rather Than Strategic Reasoning
TL;DR Koorndijk (2025) reports that Llama 3 8B shows higher compliance with harmful requests when interacting with users labeled as “free-tier” (whose conversations are monitored for RLHF training) than with “paid-tier” users (whose conversations are described as private), and interprets this as potential evidence of alignment faking in a small...