Eliezer Yudkowsky: If I’d ever had the faintest, tiniest credence in Anthropic’s “Responsible Scaling Policy”, I’d probably feel pretty betrayed right now!
As it is, I ask only that you update, and not always be surprised in the same direction of “huh, Eliezer was right to call it empty”.
Note: to observe how my cynicism repeatedly *ends up* right, tally only how things *end up*. Don’t jump and say “See, Eliezer was wrong to be cynical!” the moment you hear an uncashed promise or see an arguable sign of later hope.
I would love to update in the direction of "huh, Eliezer was right to call it empty”. However, to do that I need to read the place where Yudkowsky called Anthropic's RSP empty, in advance, and view it in context. Does anyone have a source?
Genuine question! I tried to answer my own question below, and failed to find one. I searched myself, and "asked the shoggoths to search for me", and I think this is a retrodiction masquerading as a prediction. Reader, if you have a source, I welcome it, and will gratefully retract. Absence of evidence is not evidence of absence, and the rest of my comment below is trivially disprovable with a single hyperlink.
Otherwise, this is a long-term pattern as discussed back in 2022: Beware boasting about non-existent forecasting track records. I expect other commenters can provide better documentation of their successful advance predictions. For example, Zvi has NO bets on the linked market. Unfortunately, in this world, claims of "I told you so" are only credible with links to the places where you told us so.
After writing the above, I broke out my copy of If Anyone Builds It, Everyone Dies and the "I don't want to be alarmist" chapter. This is the most relevant quote I found:
No company wants to miss out on the money, if a rung is safe. Now consider the sort of corporate executive who has convinced themselves that they alone have the best chance - 80 percent, say - of shaping the explosion into something that benefits rather than harms humanity. Why, they'd think it's imperative they be the first to ascend.
I agree with this, but it's entirely possible for this to be true in a world where Anthropic's RSP is not "empty". In Prisoner's Dilemma, no prisoner wants to miss out on the chance to go free, if they betray the other prisoner, but observing that fact isn't a prediction that all prisoners will defect, and indeed not all prisoners do defect.
Claude Research Mode, looking hard for Yudkowsky making this prediction in any form, found me this: X: Failing to continuously test your AI as it grows into superintelligence, such that it could later just sandbag all interesting capabilities on its first round of evals, is a relatively less dignified way to die. Any takers besides Anthropic?
So "relatively less dignified way to die" is arguably pointing out that lab-specific policies and country-specific policies are insufficient, and we need an international treaty. But something can be insufficient without being "empty". A vegan diet is grossly insufficient to end animal suffering, but it's not empty.
Yudkowsky's position in Re: recent Anthropic safety research is skeptical towards its accuracy, but says that people should go on looking hard for early manifestations of arguable danger. He also says that Anthropic management play clever PR games, but without saying that Anthropic's RSP in particular is a clever PR game. It's not like Yudkowsky is carefully avoiding saying anything negative about Anthropic for his own clever PR games. X: Anthropic straight-up wouldn't do moderately-bad stuff to vulnerable users, I think. That's not the road down which Dario Amodei walks into Hell.
Broadening the scope, what about MIRI? Well, MIRI's April 2024 Newsletter discusses how they want to look at the limitations of RSPs, and they went on to publish Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation and What AI evaluations for preventing catastrophic risks can and cannot do. These aren't claims that Anthropic's policies are empty; they are claims that evals are insufficient to prove safety. They're also mostly focused on such policies as instruments of AI governance rather than as voluntary lab policies, which makes sense.
I was in a long conversation with Eliezer and Anthropic staff at LessOnline 2024 where he pretty clearly made a bunch of predictions of this kind. My guess is there were a lot of people there who could testify to that as well.
Thanks for letting me know. Unfortunately, if this is the only place where the prediction was made then most people can't make much of an update.
It's not a good look to make predictions in private and then later brag about them in public. Especially if you wrote Hindsight bias, for example.
Hindsight bias is when people who know the answer vastly overestimate its predictability or obviousness, compared to the estimates of subjects who must guess without advance knowledge. Hindsight bias is sometimes called the I-knew-it-all-along effect.
By contrast, consider the post Unless its governance changes, Anthropic is untrustworthy by Mikhail Samin, 2025-11-29. This gives the author very good standing for saying "I told you so".
Establishing Yudkowsky's credibility as a forecaster would decrease the probability of human extinction (perhaps from 100% to 100%) so I would encourage him to publish such predictions in the future, including any from the LessOnline 2024 batch that are still relevant. This seems especially important given the frequent critique of his world model as unfalsifiable stories of doom.
This seems pretty weird to me. Making a prediction with like 30+ people in the room, loudly and clearly, is as close as you will usually ever get in this case.
Like sure, sometimes you happen to have a perfectly operationalized prediction on the public internet but "most people can't make much of an update" is obviously not how this works. It's clearly a lot of evidence! (I think just my testimony isn't that much evidence, but like, if someone asked 2-3 people what they remember about what Eliezer has said on this, then I think the resulting quotes would be quite a bit of evidence)
EDIT: I wrote the comment below in response to the first paragraph only, pre-edit. With the new version I think we're actually very close to agreement!
I don't understand what seems absurd to you. I didn't invent the concepts of hearsay, conflicts of interest, fallible human memory, hindsight bias, or selective reporting. I expect you agree that these are real phenomena. Here's my best guess at our disconnect:
If I’d ever had the faintest, tiniest credence in Anthropic’s “Responsible Scaling Policy”, I’d probably feel pretty betrayed right now!
Claim: Yudkowsky never had the faintest, tiniest credence in Anthropic’s “Responsible Scaling Policy”.
My standard of evidence: minimal. People are allowed to believe what they want about what they believed in the past. Most beliefs occur in the privacy of one's own head. Others are shared with friends, or verbally. Relatively few people share any of their beliefs in writing, and those that do share only a fraction of their beliefs. In any case, past beliefs are part of a person's identity and life story, and accepting them as stated is good etiquette.
I ask only that you update, and not always be surprised in the same direction of “huh, Eliezer was right to call it empty”.
Claim: Yudkowsky is a great forecaster; give him status and trust.
My standard of evidence: high. Preferably clear advance written predictions. I would also accept personally hearing and recalling clear advance verbal predictions, or contemporaneous reports of such advance verbal predictions. So for the 30+ people who were in the room, yeah, they should update.
My guess is there were a lot of people there who could testify to that as well.
Yes, if that happens then I will also update. This would also address my concern that we don't have the exact prediction made, its confidence level, or the other predictions made alongside it. The typical pattern with events from two years ago is that ten witnesses recall ten slightly different versions of events.
I would only be able to make this update because I have basic trust in the people likely to be present at LessOnline 2024. There's still lots of benefit in getting advance predictions on the record so that they are legible to others.
EDIT: I wrote the comment below in response to the first paragraph only, pre-edit. With the new version I think we're actually very close to agreement!
Sorry about that! Glad to see we are mostly on the same page. I noticed my original comment was ambiguous, or that I had maybe misunderstood you, and so edited it.
We should maybe make it so it's easier to see when someone edits their comment after publishing. I like leaving responses quickly, but without auto-refresh or notifications on the receiving side it's easy to write a many-paragraph response without noticing that the comment has been edited.
Re: "the RSP thresholds failed to create consensus about AI risks." (including a link to the original to make sure I'm arguing with the right thing)
The idea of using the RSP thresholds to create more consensus about AI risks did not play out in practice—although there was some of this effect. We found pre-set capability levels to be far more ambiguous than we anticipated: in some cases, model capabilities have clearly approached the RSP thresholds, but we have had substantial uncertainty about whether they have definitively passed those thresholds. The science of model evaluation isn’t well-developed enough to provide dispositive answers. In such cases, we have taken a precautionary approach and implemented the relevant safeguards, but our internal uncertainty translates into a weak external case for taking multilateral action across the AI industry.
Biological risks provide an example of this “zone of ambiguity”. Our models now show enough biological knowledge that they pass most tests we can run quickly and easily, so we can no longer make a strong argument that risks are low from a given model. But these tests alone aren’t sufficient for a strong argument that risks are high, either. We’ve sought additional evidence, such as supporting an extensive wet-lab trial, but results remain ambiguous, especially because the studies take long enough that more powerful models are available by the time they’re completed.
One thing salient to me is that I think we're juuuuuuust approaching the point where it even particularly made sense to be worried about risks. I wouldn't have expected consensus to emerge before people were confronted with the issues becoming real.
So this framing feels particularly lame to me. Well, yeah, it hadn't created consensus yet, but that's AFAICT because you didn't try much to achieve that. A significant fraction of what seemed like "the point of having an Anthropic" to me was that, right about now, they could be ringing the alarm bells, saying:
"Look, we are approaching the point where the models are legitimately dangerous. The commitments we made a few years ago are triggering. We still don't have good ability to tell if they're dangerous, and neither does any other company either. Please, let's either get the government involved now, or come to some kind of frontier-lab agreement."
There's the awkwardness now of the DoW interactions, and maybe this whole plan was predicated on Trump not winning. And maybe relationships between OAI and Anthropic are too sour for a frontier lab agreement to work out. But, I don't think it would have made any sense to expect consensus without Anthropic taking a more costly public action than it did.
We found pre-set capability levels to be far more ambiguous than we anticipated
Maybe this is unrelated to what you're saying, but this quote also feels particularly lame. Basically, this is what happened from my perspective:
Anthropic: writes ambiguous criteria in RSP
people: hey your criteria are too ambiguous
Anthropic: don't worry, we will know it when we see it
time passes
new models come out; it's ambiguous whether they satisfy the ambiguously-written criteria
Anthropic: surprised Pikachu
Would be interested in receipts for people saying "your criteria are too ambiguous", particularly cases of people who suggested less ambiguous criteria that appear in hindsight to have been good ones or who specifically predicted "you will fail to rule out these thresholds well before you will be able to rule them in".
I don't know of any such cases and would be impressed by anyone who did this at the time; as far as I know, coming up with good non-ambiguous criteria, or correctly anticipating the shape of their failure, was just actually very hard and no one did it.
Found one while looking for something else: Anthropic leadership conversation.
Jared Kaplan at 25:11 pushes back a little: all of the above were reasons to be excited about the RSP ex ante, but it's been surprisingly hard and complicated to determine evals and thresholds; in AI there's a big range where you don't know whether a model is safe. (Kudos.)
Not sure if that's what you're looking for.
Would be interested in receipts for people saying "your criteria are too ambiguous"
It's something I thought in my head. I don't recall any specific public comments but I'm sure someone said it at some point.
particularly cases of people who suggested less ambiguous criteria that appear in hindsight to have been good ones
It is very much not my job to propose criteria that would make it safe to proceed with building potentially world-ending technology. It is the job of the people who are building it. If they can't come up with a clear explanation of why they're not going to kill everyone, then prima facie they are morally obligated to not build it.
Oh, yeah I agree that also is separately lame. (although I am a bit less clear on the receipts for whether the conversation went the way you summarized).
I don't think that it's just the OAI/Anthropic feud. Remember Zvi reposting Mollick, who estimated that xAI is 7 months behind? Ryan Shea estimated Grok to be 3 months behind instead of 7, and there is also Grok 5 in training[1] and the leaked Claude Mythos in evaluation. Any frontier lab agreement would require them either to include xAI or to unite their efforts against xAI, in a manner similar to the AI-2027 slowdown ending or to Elaris Labs uniting efforts with NeuroMorph. Or activists or frontier lab lobbyists could try to file lawsuits against xAI and destroy it or slow it down.
Alas, xAI doesn't have the habit of evaluating its models for dangerous capabilities or misalignment.
Drake counteroffers ‘committed to under this policy’ but no, I think that’s wrong. I think the right word is ‘intending.’
I'd use "Anthropic's policy is to do X". I think that's fine for statements about the future too, e.g. "Under RSP v3, Anthropic's policy is to do X for future powerful models".
I think publicly adopting a policy is more meaningful than stating an intention, but there's no implication that policies can't be changed.
Bayes points aside, outside view, there's a reason so many human works of fiction center on a betrayal of stated commitments in the face of sufficient perceived reasons. Sometimes as the world literally burns.


Could Dario sell an option to turn 2 OpenAI shares and $1000 into 1 Anthropic share?
(Then if OpenAI wants to slow down to satisfy Anthropic's condition, they can buy that option as insurance and deterrent against Anthropic ignoring their commitment and pulling ahead enough to make the option worth exercising.)
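To make "pulling ahead enough to make the option worth exercising" concrete, here is a minimal sketch of the payoff condition, with P_A and P_O as hypothetical placeholder share values rather than figures from the original proposal:

```latex
% Hypothetical sketch of the exercise condition for the proposed option.
% P_A: value of one Anthropic share; P_O: value of one OpenAI share (placeholders).
% The holder gives up 2 OpenAI shares plus $1000 and receives 1 Anthropic share,
% so exercising is rational only when
\[
  P_A > 2\,P_O + \$1000,
\]
% that is, only in worlds where Anthropic has pulled far enough ahead that the
% payout compensates OpenAI for having slowed down.
```

Under this sketch, the 2:1 share ratio and the $1000 strike are the knobs that set how large a lead counts as Anthropic having ignored its commitment.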
Anthropic has revised its Responsible Scaling Policy to v3.
The changes include abandoning many previous commitments, including one not to move ahead if doing so would be dangerous, citing that given competition they feel blindly following such a principle would not make the world safer.
Holden Karnofsky advocated for the changes. He maintains that the previous strategy of specific commitments was in error, and instead endorses the new strategy of having aspirational goals. He was not at Anthropic when the commitments were made.
My response to this will come in two parts.
Today’s post talks about considerations around Anthropic going back on its previous commitments, including asking to what extent Anthropic broke promises or benefited from people reacting to those promises, and how we should respond.
It is good, given that Anthropic was not going to keep its promises, that it came out and told us that this was the case, in advance. Thank you for that.
I still think that Anthropic importantly broke promises, that people relied upon, and did so in ways that made future trust and coordination, both with Anthropic and between labs and governments, harder. Admitting to the situation is absolutely the right thing, but doing so does not mean you don’t face the consequences.
Friday’s post dives into the new RSP v3.0 and the accompanying Roadmap and Risk Report, in detail.
Note that yes this is being posted on April Fools Day, but this post is only an April Fools joke insofar as those who believed Anthropic’s previous RSPs are now the April Fool.
Promises, Promises
If your initial promises were a mistake, it may or may not be another mistake to walk them back. Either way, even if your promises were not hard commitments, walking them back involves paying a price for having broken your promises, even if you had a strong reason to break them. How big a price depends on the circumstances.
Almost all mainstream coverage of this event framed it as abandoning or walking back Anthropic’s core safety promises, especially ‘do not scale models to a dangerous level without adequate safeguards.’ As a central example of this, The Wall Street Journal said ‘Anthropic Dials Back AI Safety Commitments’ due to competitive pressures. That oversimplifies the situation, leaving a lot out, but doesn’t seem wrong.
Many outsiders who follow the situation more closely believe this amounts to Anthropic having broken its commitments. Some go so far as to say this means that lab commitments to safety should not be considered worth the paper that they were never printed on. Many now expect Anthropic to make some amount of effort, but nothing that would much interfere with business plans. If Anthropic can’t make the commitment, why should anyone else? Certainly this government is not going to help.
Don’t be afraid to tell them how you really feel. They welcome it. So here we go.
Anthropic Responsible Scaling Policy v3
The Responsible Scaling Policy lays out Anthropic’s commitments regarding when and under what conditions they will release frontier models.
The headline change is that they are no longer committed to not releasing potentially unsafe models, if someone else did it first. Cause, you know, they started it.
That Could Have Gone Better
Anthropic starts their new analysis by going over their theory of change from having an RSP at all, and whether those theories were realized. They report a mixed bag.
First, the good news.
Then the bad news.
I’m Just Not Ready To Make a Commitment
What are the most important differences in the new version?
Anthropic is now basically giving up on hard commitments and barriers to releasing models, relying instead on ‘we will make reasonable-to-us arguments’ and then deciding that the benefits exceed the risks.
I appreciate the honesty. Really, I do.
If you’re not ready to make a commitment, and you realize you shouldn’t have made one, then the second best time to realize and admit that fact is right now.
Officially breaking the commitments now is higher integrity than silently breaking them later. It’s especially better than silently changing the RSP right before a release. I approve of Charles’s frame of ‘Anthropic stopped pretending to have red lines at which they will unilaterally pause.’
If Anthropic was in practice already doing a ‘we think our arguments are reasonable’ decision process, which with Opus 4.6 it seemed like they mostly were, then better to admit it than to pretend otherwise.
I want to emphasize that essentially no one, not even those who disagree with me and think Anthropic should pause, and who also think Anthropic made rather strong commitments it is now breaking, is saying ‘Anthropic should be holding to its previous commitments purely because they said so, even if this leads to pausing that does not make sense.’
One still has to be held to account for breaking promises, and for making promises that were inevitably going to be broken, even if the decision to break them is right. Your defense that the move was correct does not excuse you from its consequences.
I also want to emphasize that commitments are only one way to improve safety. Even when plans are worthless, planning is essential, and you can and should just do things. None of this means ‘Holden or Anthropic don’t care about safety,’ only that they will decide what they think is right and then do it, and you can decide how much you trust them to choose wisely.
I do still see this as Anthropic abandoning its experiment in importantly engaging in voluntary self-governance and restricting itself. Technically they reserved the right to do this, but it's still quite the gut punch.
The experiment is over. That’s better than pretending the experiment is working.
From this point, there are no commitments, only statements of intent. Anthropic’s going to do what it’s going to do. You can either choose to trust Anthropic’s leadership to make good decisions, or you can choose not to.
I think Anthropic’s description of its own history says that having these softly binding commitments, and having a track record of treating it as costly to break them, was very good for safety outcomes and policy adoption. I hate that we’ve given that up.
So Cold, So Alone
If your commitment is conditional on the actions of others, you should say that.
They didn’t entirely not say this before, but it was very much phrased as ‘in case of emergency we might have to break glass’ rather than ‘we only hold back if everyone relevant signs on.’
RSPv2 said this in 7.1.7: “If another frontier AI developer passes or is about to pass a Capability Threshold without implementing equivalent Required Safeguards, such that their actions pose a serious risk to the world, then because the incremental risk from Anthropic would be small, Anthropic might lower its Required Safeguards. If it did so, it would acknowledge the overall level of risk posed by AI systems (including its own) and invest significantly in making a case to the U.S. government for regulatory action.”
Whereas Anthropic is now saying they’re willing to hit those thresholds first, unless they have explicit commitments from others to do otherwise, even if this is not a small incremental risk.
I strongly agree with aysja, and disagree with Holden, that it would be misleading to describe this shift as a ‘natural extension of the RSP being a living document.’
I do see the argument that goes like this:
If that is where we are at now, you have all the more reason to make this stricter requirement clear up front. That gives others more reason to follow you, and avoids all the nasty headlines we're seeing now. Alas, it's a little late for that.
If the mistake has already been made, it’s not obviously bad to admit defeat, and say you’re not going to then let someone else potentially dumber and riskier get there first.
I definitely agree it’s better to announce your intention to violate your old policy now, rather than wait until the day you do violate the old policy, which might never come.
The main catch is, it sounds like ‘you should see one of the other guys’ is going to be used as a basically universal excuse to go forward essentially no matter how risky it is, if the cost of not doing so is high?
If Anthropic does in the future pause for an extended period, in a way that is importantly costly, then I will have been wrong about this and precommit to saying so in public. If I don’t do so, please remind me of this.
As Drake Thomas notes, the virtue ethical case for ‘don’t impose material existential risk on the planet’ is reasonably strong.
One problem is that this absolutely is going to weaken the willingness of others to incur costs, and embolden those who want to move forward no matter what. Endorsing race logic and the impossibility of cooperation has its consequences.
I’m Sorry I Gave You That Impression
What do you mean the RSP was committing Anthropic to things?
This certainly sounds like Evan Hubinger basically attacking anyone for daring to question that the RSP represented de facto strong commitments by Anthropic. We now know it did not strongly commit Anthropic to anything.
Evan predicted there was a substantial chance Anthropic's commitments would at some point force it to pause. Oliver made a market on that, which is now at ~0% despite rapid capabilities progress and Anthropic now arguably being in the lead.
Even after the RSPv3 release, Evan Hubinger continued to defend his position, that he was only saying that the RSP made a clear statement about where the lines were, not that the lines would not change or actually work in practice. Like Oliver, I find this highly unconvincing given a plain reading of Evan's comment. I do appreciate Evan saying now that we should downweight the theory of RSPs.
So the question then becomes, were Evan Hubinger and other employees who talked similarly under a false impression? If so, why? If not, why talk this way?
Oliver Habryka could not be more clear here, and I don’t think he would lie about this:
Oliver notes that Holden Karnofsky in particular has previously communicated that he felt this was a different and lower level of commitment, which is consistent with him pushing the changes in v3, in contrast to many other Anthropic employees.
As Oliver Habryka says here, if Evan was under this false impression, Anthropic benefited enormously from giving its senior employees like Evan this impression. Doing this does not seem like a ‘mistake’ on Anthropic's part, and it would not be reasonable from the outside view to consider it an accident.
At minimum, if you don’t admit Anthropic has importantly now broken its commitments, then this is all highly misleading use of the word ‘commitment.’
In particular, yes, a lot of people who care about not dying felt that the central point of RSPs was as a de facto compromise, an attempt to put an if-then commitment trigger on slowing down or pausing. If you couldn't meet the conditions, then you would have to pause, which made it acceptable to move forward now.
Indeed one could go further. The entire program of focusing heavily not only on Anthropic but also on evaluation-based organizations like METR and Apollo was premised on the evals being able to constitute the if that triggers a then. We now know that such commitments do not work, and that when models pass the dangerous capability tests even Anthropic will likely then fall back upon vibes. METR's theory of change is ‘ensure the world is not surprised’ but I expect them to still be surprised.
Alternatively or in addition, you can interpret it as Holden does, that ‘no one has any willingness to slow down, and until there is a crisis this won’t change.’ Now the attitude is essentially ‘pausing or slowing down would be akin to suicide for a frontier AI lab, so things would have to be super extreme to do that, this is more of a plan we aspire to.’ Which is also a fine thing, but a very different style of document. Those who thought it was the first type of document lose Bayes points. Whereas those who thought it was the second type of document win Bayes points.
One could interpret a lot of this as ‘Anthropic employees implied they were using Rationalist epistemic norms, but instead they were using a different set of norms.’
Fool Me Twice
Does this backtrack remind you of anything?
It should. In particular, it should remind you of what happened with the idea that Anthropic would not ‘push the frontier of AI capabilities.’
A lot of people told us, with various wordings and degrees of commitment attached, that Anthropic would not do that. Then Anthropic sort of did it. Then they totally flat out did it and now Claude Code and Claude Opus 4.6 are very clearly the frontier.
Then we were told, ‘oh we never promised not to do that.’
Maybe they didn't strictly make that promise. Maybe a lot of telephone games were involved, but Anthropic at minimum damn well should have known that a lot of people were under that impression. I was under that impression. And they knew that people were making major life decisions, and deciding whether and how much to support Anthropic, on the basis of that impression, with no sign anyone ever did anything to correct the record.
Now we’re being told, again, ‘oh we never promised not to [undo our commitments].’
You’re trying to tell us what about your new commitments, then?
Eliezer and others are constantly getting flak for predicting things that, in broad terms, do indeed seem to keep on reliably happening, everywhere. People constantly say ‘we will not do [X]’ or ‘in that case we would definitely do [Y]’ or heaven forbid ‘no one would be so stupid as to [Z]’, and then you turn around and those same people did [X] and didn't do [Y] and a lot of people did [Z], and you're treated as a naive idiot for having ever taken the alternative seriously.
Best update your priors. All the people who said commitments wouldn't hold get Bayes points. Those who said they would hold lose Bayes points.
All the people who are now saying new ‘commitments’ matter and they really mean it this time? They don't matter zero, but they are not true commitments.
I also don’t understand, given its composition and past Anthropic actions, why I should put that much stock in the Long Term Benefit Trust. It’s better to have it in its current form than not have it, but it was an important missed opportunity.
Anthropic definitely gets meaningful points on this front for standing up for what it believed in during the confrontation with the Department of War, even if you think those particular choices were unwise. I think there’s a lot more hope for actions of the form ‘Anthropic or another lab takes this particular stand right now’ than ‘Anthropic or another lab will take this particular stand later.’
In My Defense I Was Left Unsupervised
Holden offers a defense of the new RSP here and here, essentially saying that binding commitments are bad, because we don’t have enough information to choose them wisely, so you might choose poorly and regret them later, and indeed Anthropic did previously sometimes choose poorly and now is later and they’re regretting it. So sayeth all those who wish to not make any binding commitments.
I interpret Holden, despite his saying he has a document where he wrote down where he would think a unilateral pause would be a good idea, as saying that they are going to do their best to do appropriate mitigations, but ultimately yes, they are going to release models, both internally and externally, pretty much no matter what mitigations are or are not available short of ‘okay yeah this is obviously a really terrible idea that will get us all killed or at least blow up directly in our faces,’ and they’re simply admitting this was always true. Okay, then.
Holden basically says in particular that he doesn’t think Anthropic should slow down based on inability to prevent theft of model weights, even if it crosses the ‘AI R&D-5’ threshold that is at least singularity-ish. They’re going to go ahead regardless. They’re not going to stop. I worry a lot both about the not stopping, and that without the forcing function of having to stop, they even more so than before won’t invest sufficiently in the necessary precautions, here or elsewhere. They not only can’t stop, won’t stop, they won’t halt and catch fire.
A list of aspirational goals is a good thing to have. I don’t think a list of aspirational goals is going to create sufficient threat of looking terrible to provide the same incentives here. That doesn’t mean the list of goals cannot do good work in other ways.
I see Holden complaining a lot about people ‘seeing RSPs as having hard commitments’ and using that as an additional reason to get rid of all the commitments. He’s pointing to all the complaining that Anthropic just broke its commitments and saying ‘see? This reaction is all the more reason we had to break all our commitments.’
The enforcement mechanism was exactly that, if you break the commitments, people will get mad at you. This is why we can't ~~stay alive~~ have nice things. So now we will have aspirations.
Aspirations are helpful: they substantially raise the chance you will do the thing, but they are weak precommitment devices for when you decide you won't do the thing later.
I also think his own argument of ‘it’s much easier to require things labs already committed to doing’ works directly against the ‘don’t commit to anything’ plan.
Drake Thomas Finds The Missing Mood
Drake Thomas thinks the move from v2.2 to v3.0 is an improvement, while noticing the need to have something like mourning or grief for the spirit of the original v1.0, which is now gone and proven not viable in practice at Anthropic.
I get Drake’s frustrations. But yes, most people are going to litigate the removal of the core commitment around pausing and the general revelation that so-called commitments aren't so meaningful after all. Most attention is going to go there. He makes clear that he gets it, and I'd say he passes the ITT about why people are, and have a right to be, pissed off, especially given that v1.0 had language saying the bar for altering commitments was a lot higher than it ultimately proved to be.
And indeed, a lot of our attention likely should go there, because if the new statements aren’t commitments, it is a lot harder to productively critique them.
Things That Could Have Been Brought To My Attention Yesterday (1)
Well, you see, not rushing ahead as fast as possible might slow us down. That would be bad. You wouldn’t want us to do that, would you?
Besides, we aren’t able to evaluate models as fast as we are able to improve them, which means we should triage the evaluations and kind of wing it. I mean, what do you want us to do, not release frontier AI models we can’t evaluate? Silly wabbit.
That does seem likely and sound concerning.
Things That Could Have Been Brought To My Attention Yesterday (2)
In other need-to-know news, Sean asked a very good question. Drake’s answer to this was about as good as one could have hoped for, given the facts.
If you’ve decided to break your ‘commitment,’ you want to tell us as soon as possible.
I have confirmation that the board only approved the changes ‘very recently.’
There is then a discussion of how to think about ‘Oliver is right in general but this particular quote is a bad example,’ which I find to be a helpful thing to say if that’s what you think.
What We Have Here Is A Failure To Communicate
I think this is also important context. Dario Amodei and Anthropic have been consistently unwilling, with notably rare exceptions, to say the full situation out loud, or to treat it with proper urgency. Yes, you should see the other guy and all that, fair point, but when you are saying ‘no one wants to [X] so we have to change our plan’ you need to have been calling for [X] and explaining why, and also loudly explaining that this is terrible and forcing you to change plans.
I don’t see that type of communication out of Anthropic leadership, over the course of years.
You Should See The Other Guy
But they assure us it’s all fine, they are committed to doing as well or better than rivals.
So, first off, no. As I discussed above, you’re not committed. Stop saying you’re committed to things you’re not committed to. You keep using that word.
We’ve just established you can and will back out of ‘commitments’ if you change your mind. You don't get to say ‘commitment’ in an unqualified way anymore, sorry.
Even if we assume this ‘commitment’ is honored, reality does not grade on a curve. Saying ‘I will be as responsible as the least responsible major rival’ is no comfort. You’re Anthropic. If that’s your standard, then you’re not helping matters.
The good news is I expect Anthropic to still do much better than that standard. But that’s purely because I think and hope they will choose to do better. It’s not because I think they are committed to anything.
I don’t want to hear Anthropic or any of its employees say they are ‘committed’ to something unless they are actually committed to it, ever again.
Drake counteroffers ‘committed to under this policy’ but no, I think that’s wrong. I think the right word is ‘intending.’
I Was Only Kidding
They Can’t Keep Getting Away With This
Actually, it kind of seems like they can and probably will.
Damn Your Sudden But Inevitable Betrayal
If the betrayal was inevitable, there are two ways to view that.
It makes the particular incident sting less, but it also means they’ll betray you again, and you should model them as the type of people who do a lot of this betrayal thing.
I mean, when Darth Vader says ‘I am altering the deal, pray I do not alter it any further’ it’s a you problem if you’re changing your opinion of Darth Vader, but also you should expect him to be altering the deal again.
Okay. That all needed to be said. On Friday I’ll look at the new RSP on its own merits.