Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.
Wait what? I didn't hear about this. What other companies have frontier red teams? Where can I learn about them?
I think he’s just referring to DC evals, and I think this is wrong, because other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of facts).
Edit: maybe I don't know what he's referring to.
DC evals got started in summer of '22, across all three leading companies AFAICT. And I was on the team that came up with the idea and started making it happen (both internally and externally) -- or at least, as far as I can tell, we came up with the idea. I remember discussions between Beth Barnes and Jade Leung (who were both on the team in spring '22), and I remember thinking it was mostly their idea (maybe also Cullen's?). It's possible that they got it from Anthropic, but it didn't seem that way to me. Update: OK, so apparently @evhub had joined Anthropic just a few months earlier [EDIT: this is false, evhub joined much later; I misread the dates, thanks Lawrence] -- it's possible the Frontier Red Team was created when he joined, and information about it spread to the team I was on (but not to me). I'm curious to find out what happened here -- anyone wanna weigh in?
At any rate I don't think there exists any clone or near-clone of the Frontier Red Team at OpenAI or any other company outside Anthropic.
Evan joined Anthropic in late 2022, no? (E.g., his post announcing it was Jan 2023: https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic)
I think you’re correct on the timeline: I remember Jade/Jan proposing DC Evals in April 2022 (which was novel to me at the time), Beth started METR in June 2022, and I don’t remember there being such teams actually doing work (at least not publicly known) when she pitched me on joining in August 2022.
It seems plausible that Anthropic’s scaling laws project was already underway before then (and that this is what they’re referring to), but proliferating QA datasets feels qualitatively different from DC Evals. Also, they were definitely doing other red teaming, just none that seems to have been DC Evals.
We did the 80% pledge thing, and that was like a thing that everybody was just like, "Yes, obviously we're gonna do this."
Does anyone know what this is referring to? (Maybe a pledge to donate 80%? If so, curious about 80% of what & under what conditions.)
All of the founders committed to donate 80% of their equity. I heard it's set aside in some way but they haven't donated anything yet. (Source: an Anthropic human.)
This fact wasn't on the internet, or at least wasn't easily findable via Google search. Huh. I only find Holden mentioning that 80% of Daniela's equity is pledged.
Thanks for posting this. Editing feedback: I think the post would look quite a bit better if you used headings and LW quotes. This would generate a timestamped and linkable table of contents, and also more clearly distinguish the quotes from your commentary. Example:
the US treats the Constitution as like the holy document—which I think is just a big thing that strengthens the US, like we don't expect the US to go off the rails in part because just like every single person in the US is like The Constitution is a big deal, and if you tread on that, like, I'm mad. I think that the RSP, like, it holds that thing. It's like the holy document for Anthropic. So it's worth doing a lot of iterations getting it right.
<your commentary>
Tiny editing issue: "[] everyone in the company can walk around and tell you []" -> The brackets are empty. Maybe they were meant to be italics formatting?
It's so much better if everyone in the company can walk around and tell you what are the top goals of the RSP, how do we know if we're meeting them, what AI safety level are we at right now—are we at ASL-2, are we at ASL-3—that people know what to look for because that is how you're going to have good common knowledge of if something's going wrong.
I like this goal a lot: Good RSPs could contribute to building common language/awareness around several topics (e.g., "if" conditions, "then" commitments, how safety decisions will be handled). As many have pointed out, though, I worry that current RSPs haven't been concrete or clear enough to build this kind of understanding/awareness.
One interesting idea would be to survey company employees and evaluate their understanding of RSPs & the extent to which RSPs are having an impact on internal safety culture. Example questions/topics:
One of my concerns about RSPs is that they (at least in their current form) don't actually achieve the goal of building common knowledge/awareness or improving company culture. I suspect surveys like this could prove me wrong, and more importantly, provide scaling companies with useful information about the extent to which their scaling policies are understood by employees, help foster common understanding, etc.
(Another version of this could involve giving multiple RSPs to a third party, like an AI Safety Institute, and having them answer similar questions. This could provide another useful datapoint RE the extent to which RSPs are clearly/concretely laying out specific or meaningful commitments.)
(Can you edit out all the "like"s, or give permission for an admin to edit them out? I think in written text it makes speakers sound, for lack of a better word, unflatteringly moronic.)
I already edited out most of the "like"s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn't exact. You are free to post your own version but not to edit mine.
Edit: actually I did another pass and edited out several more; thanks for the nudge.
I did something similar when I made this transcript: leaving in verbal hedging, particularly in the context of contentious statements etc., where omitting such verbal tics can give a quite misleading impression.
This is interesting and I'm glad Anthropic did it. I quoted interesting-to-me parts and added some unjustified commentary.
Tom Brown at 20:00
Daniela Amodei at 20:26
Does the new RSP promote "clearer accountability"? I guess a little; per the new RSP:
But mostly the new RSP is just "more flexible and nuanced," I think.
Also, minor:
I don't really understand this (like, I can't imagine an example that would be well-described by it), but I'm slightly annoyed because it suggests a vibe of "we have made the RSP stronger at least once."
Sam McCandlish at 21:30
Dario Amodei at 22:00
I would agree if the RSP were stronger.
Daniela Amodei at 23:25
Dario adds:
Chris Olah at 24:20
I would agree if the RSP were stronger.
Daniela Amodei at 29:04
I feel bad about this [edit: for reasons I fail to immediately articulate, sorry].
Jared Kaplan at 25:11
Pushes back a little: all of the above were reasons to be excited about the RSP ex ante, but it's been surprisingly hard and complicated to determine evals and thresholds; in AI there's a big range where you don't know whether a model is safe. (Kudos.)
Dario Amodei at 41:38
"Race to the top" works in practice:
Jack Clark adds: "the Frontier Red Team got cloned almost immediately" by other labs.
My take: Anthropic gets a little credit for RSP adoption; the focus on interp isn't clearly good; Anthropic doesn't get credit for collaboration with AISIs (did it do better than GDM and OpenAI?); on red-teaming, I'm not familiar with Anthropic's timeline and am interested in takes, but, like, I think GDM wrote "Model evaluation for extreme risks" before any Anthropic racing-to-the-top on evals.
Daniela Amodei at 42:08
Says customers say they prefer Claude because it's safer (in terms of hallucinations and jailbreaks).
Is it true that Claude is safer? Would be news to me.
Dario Amodei at 48:07
He's excited about places where there's an (apparent) consensus, i.e. what everyone wise thinks, and then it breaks. He thinks that's about to happen in interp, among other places.
I'd bet against that, and I wish Anthropic's alignment bets were less interp-y. (But largely because of vibes about what everyone wise thinks.)
(I claim Anthropic's RSP is not very ambitious and is quite insufficient to prevent catastrophe from Anthropic models, especially because ASL-4 hasn't been defined yet, but also because I worry that the ASL-3 standard will not be sufficient for upper-ASL-3 models. [I'm also skeptical that the RSP is as relevant to most of the staff as this conversation suggests.])