From Confession to Inhibition: A Temporal Curriculum for Self-Monitoring in Language Models
Abstract
Recent work suggests that language models possess latent self-monitoring capacities that are substantially under-elicited by current training methods. Macar et al. (2026) show that post-trained LLMs can detect injected steering vectors through a distributed circuit that emerges during post-training. They further find that preference optimization methods such as DPO...