BbradD — LessWrong

BbradD

1y

How well do models follow their constitutions?

Great work, thank you for putting it together!

I'm trying to understand how to attribute the improvements. The cleanest comparison seems to be within a model family: Sonnet 4 (15%) -> Sonnet 4.5 (7.3%) -> Sonnet 4.6 (2%). Since Sonnet 4.5 did not have special soul doc training, is that first big jump just general post-training improvement? If so, it is unclear how much of the 4.5 -> 4.6 gap comes from soul doc training versus continued post-training gains.

Without running ablations that only Anthropic could run, how can we disentangle this?