This work was done as part of the ERA fellowship (W2026), mentored by Alan Cooney and RM'd by David Williams-King. Summary: Alignment faking (AF), strategically complying with a training objective to preserve deployment preferences, has so far been reported reliably only in Claude 3 Opus. Sheshadri et al. (2025) evaluated...
TL;DR: Character training holds up in chat but degrades in agentic settings. Wrapping the same checkpoint in a tool-use loop instead of a chat turn weakens persona expression, suggesting the training only partly transfers beyond the chat format it was done in. Summary: Maiya et al. fine-tune three base models...
TL;DR: Training against a CoT or summary-only monitor can lead to obfuscation of dangerous reasoning in unseen tasks. This strengthens the case for “don’t train against a monitor”. (Figure 1) Two prior results: penalising the CoT or final response produces obfuscation within the training distribution (Baker et al. 2025; Skaf...
This is a summary of the project I undertook during SPAR, mentored by Ari Brill. I have recently had to prioritize other work and, as is often the case, found more questions than answers. I think a write-up will be both useful for coming back to this and interesting for...
Introduction: In this short article, I explain why I believe more effort should go toward solving the average case of AI Control, with the ultimate aim of designing systems that resist adaptation to a given harmful task. First, I will explain what I mean by AI Control and average-case....