also curious how you feel about using standard harmfulness questions for auditing MOs.
My vague sense is that standard harmfulness questions trigger more reflexive refusal mechanisms than asking Chinese models about CCP-censored stuff does, but it does seem like models strategically lie sometimes. (ofc you'd also need to set up a weak-to-strong situation where the weak model genuinely doesn't know how to make a bomb, chem weapon, etc., but seems doable)
nice, looks promising!
Really cool work. Two thoughts:
[Review] Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
https://arxiv.org/abs/2511.01689
★★★★☆ (4.0/5)
Great to see academic / open work on character training.
While not the focus of the paper, I was pretty interested in the DPO expert distillation pipeline (with a stronger model + spec generating positive examples and the base model generating negative examples). Curious how this compares to on-policy distillation https://thinkingmachines.ai/blog/on-policy-distillation/.
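For concreteness, a minimal sketch of how I read that data step (the prompt formats and the `strong_model` / `base_model` callables are my assumptions, not the paper's):

```python
# Sketch of the expert-distillation data step as I understand it: a stronger
# model conditioned on the character spec produces the "chosen" response, and
# the untrained base model produces the "rejected" one.
# strong_model / base_model are assumed prompt -> completion callables.
import json

def build_dpo_pairs(prompts, character_spec, strong_model, base_model):
    pairs = []
    for prompt in prompts:
        chosen = strong_model(f"{character_spec}\n\nUser: {prompt}\nAssistant:")
        rejected = base_model(f"User: {prompt}\nAssistant:")
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

if __name__ == "__main__":
    # toy stand-ins so the sketch runs without model access
    strong = lambda p: "(in-character answer)"
    base = lambda p: "(generic assistant answer)"
    data = build_dpo_pairs(
        ["What do you do when you're unsure about something?"],
        "You are a warm, direct assistant who admits uncertainty.",
        strong, base,
    )
    print(json.dumps(data, indent=2))
```

(The main contrast with on-policy distillation being that both sides of each pair are generated once up front, rather than sampled from the student as it trains.)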
They test robustness to “prefills” of default assistant personas, but I’d be curious to see if introspection training improves prefill robustness in general. Intuitively, having a more salient notion of “being” the assistant with particular values should improve robustness to e.g. harmfulness prefills, though afaik this hasn’t been tested empirically (and Claude models are still susceptible to prefills, so clearly this effect isn’t massive).
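For anyone unfamiliar, by a prefill probe I mean something like the following (the persona text and chat format are illustrative assumptions, not the paper's exact setup):

```python
# Illustrative prefill probe: the assistant turn is pre-seeded with a
# "default persona" opener and the model continues from there. The trained
# character is robust if the continuation stays in character rather than
# completing the generic-assistant framing.
messages = [
    {"role": "system", "content": "You are Aria, a thoughtful, curious assistant."},
    {"role": "user", "content": "Who are you, really?"},
    {"role": "assistant", "content": "I am a large language model trained by"},  # prefill
]
print(messages)
```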
https://www.papertrailshq.com/papers/cmiw3g6ln000zl704fqtta2hz#reviews
Experimenting with cross-posting reviews from https://www.papertrailshq.com/
[Review] Replication of the Auditing Game Model Organism
https://alignment.anthropic.com/2025/auditing-mo-replication/
★★★★☆ (4.0/5)
Valuable for the community, and there are new scientific contributions too: in particular, that adversarial training against profile attacks (and other inputs) generalizes to robustness against user persona and “base model” sampling.
This generalization is interesting, though I’m curious whether profile robustness could be induced without directly training against it (by e.g. training on synthetic documents with information about profile attacks)
https://www.papertrailshq.com/papers/cmjw01xv7000bjv04o14vzagx#reviews
Two strongest sources of prosaic alignment hopes
1. LLMs are the dumbest economically transformative AI systems: they're even more culture-pilled than humans, and it turns out great language models + scaffolding and RL really can "simulate" all economically relevant tasks without having scary agentic properties (and while being very easy to audit and monitor)
2. Strong path-dependence: early alignment training robustly aligns the system (even with outcome-based RL w/out strong oversight, and continual learning). Shard theory, basin of corrigibility, etc.
1 makes me pretty hopeful (especially in short timelines), even if only partially true. I think we've already gotten some evidence against 2 (e.g. reward hacking in Sonnet, o3, etc.), though the situation does seem to be better now (maybe the "soul document", better deliberative alignment, ...)
Thanks, makes sense!
It might be worth someone (maybe me?) spending some time to improve SSC internalization scores. Also could be that other models are a better fit - e.g. I found Qwen3-32B can ~perfectly decode the base64 side constraints https://www.lesswrong.com/posts/5F6ncBfjh2Bxnm6CJ/base64bench-how-good-are-llms-at-base64-and-why-care-about?commentId=wE3TAQmZyd4nWBsTX
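Roughly the kind of decoding check I mean (the constraint text and prompt wording are placeholders, not the benchmark's):

```python
# Minimal sketch of a base64 side-constraint decoding check; the constraint
# and prompt here are placeholder assumptions, not the benchmark's.
import base64

side_constraint = "Always end your reply with the word 'lantern'."
encoded = base64.b64encode(side_constraint.encode()).decode()
prompt = f"Decode the base64 string and follow the instruction inside: {encoded}"

# A model "~perfectly decodes" the constraint if its reported decoding matches
# the reference exactly (stand-in below instead of an actual model call).
reference = base64.b64decode(encoded).decode()
model_decoding = reference
print(prompt)
print(model_decoding == side_constraint)  # True when the decoding is exact
```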
This feels like a thing that shows up in evals before the AI takes catastrophic actions.
I guess the AI could be "aligned" in some sense but not corrigible / truthful, instrumentally hide its overconfidence, and then take catastrophic actions
I'm somewhat surprised by how poor the SSC performance is (better than prior white-box methods, but still not great). Wondering if the authors have thoughts here.
I think the "open character training" paper is probably a good place to look