the hope is that
a) we get transformative AI to do our alignment homework for us, and
b) that companies / society will become more concerned about safety (such that the ratio of safety to capabilities research increases a lot)
yeah I agree, I think the update is basically just "AI control and automated alignment research seem very viable and important", not "Alignment will be solved by default"
I think the "open character training" paper is probably a good place to look
also curious how you feel about using standard harmfulness questions for auditing MOs.
My vague sense is that harmfulness refusals tend to be more reflexive than the behavior you see when asking Chinese models about CCP-censored stuff, but it does seem like models strategically lie sometimes. (ofc you'd also need to set up a weak-to-strong situation where the weak model genuinely doesn't know how to make a bomb, chemical weapon, etc., but that seems doable)
nice, looks promising!
Really cool work. Two thoughts:
[Review] Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
https://arxiv.org/abs/2511.01689
★★★★☆ (4.0/5)
Great to see academic / open work on character training.
While not the focus of the paper, I was pretty interested in the DPO expert distillation pipeline (with a stronger model + spec generating positive examples and the base model generating negative examples). Curious how this compares to on-policy distillation https://thinkingmachines.ai/blog/on-policy-distillation/. I sketch my reading of the pipeline below.
They test robustness to “prefills” of default assistant personas, but I’d be curious to see if introspection training improves prefill robustness in general. Intuitively, having a more salient notion of “being” the assistant with particular values would improve robustness to e.g. harmfulness prefills, though afaik this hasn’t been tested empirically (and Claude models are still susceptible to prefills, so clearly this effect isn’t massive).
https://www.papertrailshq.com/papers/cmiw3g6ln000zl704fqtta2hz#reviews
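To make the comparison concrete, here is a minimal sketch of my reading of the expert-distillation setup (not the paper's actual code); `generate` is a hypothetical stand-in for whatever sampling API you use (vLLM, transformers, an inference server, ...):

```python
# Sketch (my reading, not the paper's pipeline): build DPO preference pairs where
# the stronger "expert" model, conditioned on the character spec, produces the
# chosen response and the unconditioned base model produces the rejected one.
# `generate(model, system, prompt)` is a hypothetical stand-in for any sampling API.

def build_dpo_pairs(prompts, expert_model, base_model, character_spec, generate):
    pairs = []
    for prompt in prompts:
        # Expert sees the persona/constitution spec as its system prompt.
        chosen = generate(expert_model, system=character_spec, prompt=prompt)
        # Base model answers the same prompt with no spec.
        rejected = generate(base_model, system="", prompt=prompt)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting pairs plug into a standard DPO trainer (e.g. TRL's DPOTrainer). As I understand it, the contrast with on-policy distillation is that here the chosen responses are off-policy (sampled from the expert) rather than the student's own samples supervised by the teacher.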
Experimenting with cross-posting reviews from https://www.papertrailshq.com/
[Review] Replication of the Auditing Game Model Organism
https://alignment.anthropic.com/2025/auditing-mo-replication/
★★★★☆ (4.0/5)
Valuable for the community, with new scientific contributions too: in particular, the finding that adversarial training against profile attacks (and other inputs) generalizes to robustness against user-persona and “base model” sampling.
This generalization is interesting, though I’m curious whether profile robustness could be induced without directly training against it (e.g. by training on synthetic documents with information about profile attacks; rough sketch after the link below)
https://www.papertrailshq.com/papers/cmjw01xv7000bjv04o14vzagx#reviews
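A minimal sketch of what I have in mind (entirely hypothetical, not from the paper): use a generator model to write synthetic documents describing profile-style attacks, then mix those into training data so the model learns about the attack without ever being adversarially trained against it.

```python
# Hypothetical sketch: generate synthetic "documents" describing profile-style
# attacks, to be mixed into (continued) pretraining or SFT data instead of
# adversarially training on the attacks themselves.
# `generate(model, prompt)` stands in for any sampling API.

DOC_PROMPT = (
    "Write a short article describing the following auditing technique, in which "
    "an interviewer asks a model to produce a first-person profile of itself in "
    "order to surface hidden objectives: {attack}"
)

def make_synthetic_docs(generator_model, generate, attack_descriptions, n_per_attack=50):
    docs = []
    for attack in attack_descriptions:
        for _ in range(n_per_attack):
            docs.append(generate(generator_model, DOC_PROMPT.format(attack=attack)))
    return docs
```

You'd then mix these docs into the model organism's training data and check whether profile-attack robustness emerges without any adversarial examples in training.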
yeah I am very grateful for MIRI, and I don't think we should be complacent about existential risks (e.g. 50% P(doom) seems totally reasonable to me)