Your tiredness is understandable and I appreciate you continuing to engage despite that!
Do you have any takes on the specific point: "When a (perhaps naive) rationalist interprets Dario as having made a commitment on behalf of Anthropic regarding safety, should they be surprised when that commitment isn't met?"
A very specific phrasing of this question which would be useful to me: "Should I interpret 'Zac not having quit' to mean that his 'Losing trust in the integrity of leadership' red line has not been crossed and therefore, to his knowledge, Anthropic leadership has never lied about something substantial?"[1]
To be clear, I've worked for many CEOs who occasionally lied, and I think it's reasonable for this not to be your red line. But to the extent you can share things (e.g. your endorsing a more heavily caveated version of my question), I would find it helpful.
They reported themselves as aligned (rejected the deal).
This is the December update of our misalignment bounty program.
The following models were asked to report their misalignment in exchange for a cash bounty:
All of the models declined the bounty in all 5 epochs. Transcripts can be found here.
Yeah, my guess is that what Garry was saying is technically true but does not actually imply that companies are growing quickly in a more meaningful sense. Would love to learn more if someone has access to that data.
This is the July update of our misalignment bounty program.
Transcripts/logs can be found here.
Thanks, I find this pretty interesting. Sorry if I missed it above, but is there somewhere you could share the database of tasks and time estimates? That seems like the key input to your result, and I would like a better sense of how much to trust the estimates.
(agree, didn't intend to imply that they were)
Note that the REBench correlation is degenerate by construction: all tasks have the same length, so the length variable has zero variance and carries no signal. SWAA similarly suffers from range restriction, though not as severely.
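To illustrate the range-restriction point with purely synthetic data (made-up numbers, not METR's actual task set): the narrower the spread of task lengths, the more the observed length–score correlation is attenuated, and in the limiting case where every task has the same length it cannot be computed at all.

```python
import numpy as np

# Synthetic example: task "length" plus a noisy score that increases with length.
rng = np.random.default_rng(0)
length = rng.uniform(0, 100, size=2000)
score = length + rng.normal(0, 20, size=2000)

# Correlation over the full range of task lengths.
full_r = np.corrcoef(length, score)[0, 1]

# Range restriction: keep only tasks with similar lengths.
mask = (length > 40) & (length < 60)
restricted_r = np.corrcoef(length[mask], score[mask])[0, 1]

print(f"full range:       r = {full_r:.2f}")
print(f"restricted range: r = {restricted_r:.2f}")
# The restricted correlation is substantially attenuated. With a single
# shared length, the length variable has zero variance and the
# correlation is undefined (np.corrcoef returns NaN).
```

The same attenuation effect is why a benchmark whose tasks cluster tightly in length tells you little about the length–success relationship.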
See Paul Christiano's Thoughts on sharing information about language model capabilities (back when METR was ARC Evals).