Dylan Matthews had an in-depth Vox profile of Anthropic, which I recommend reading in full if you have not yet done so. This post covers that one.

Anthropic Hypothesis

The post starts by describing an experiment. Evan Hubinger is attempting to create a version of Claude that will optimize in part for a secondary goal (in this case ‘use the word “paperclip” as many times as possible’) in the hopes of showing that RLHF won’t be able to get rid of the behavior. Co-founder Jared Kaplan warns that perhaps RLHF will still work here.

Hubinger agrees, with a caveat. “It’s a little tricky because you don’t know if you just didn’t try hard enough to get deception,” he says. Maybe Kaplan is exactly right: Naïve deception gets destroyed in training, but sophisticated deception doesn’t. And the only way to know whether an AI can deceive you is to build one that will do its very best to try.

The problem with this approach is that an AI that ‘does its best to try’ is not doing the best that the future dangerous system will do.

So by this same logic, a test on today’s systems can only show that your technique doesn’t work or that it works for now. It can never give you confidence that your technique will continue to work in the future.

They are running the test because they think that RLHF is so hopeless we can likely already prove, at current optimization levels, that it is doomed to failure.

Also, the best attempt to deceive you will sometimes be, of course, to make you think that the problem has gone away while you are testing it.

This is the paradox at the heart of Anthropic: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting edge systems are themselves currently sufficiently advanced for this task?

But Anthropic also believes strongly that leading on safety can’t simply be a matter of theory and white papers — it requires building advanced models on the cutting edge of deep learning. That, in turn, requires lots of money and investment, and it also requires, they think, experiments where you ask a powerful model you’ve created to deceive you.

“We think that safety research is very, very bottlenecked by being able to do experiments on frontier models,” Kaplan says, using a common term for models on the cutting edge of machine learning. To break that bottleneck, you need access to those frontier models. Perhaps you need to build them yourself.

The obvious question arising from Anthropic’s mission: Is this type of effort making AI safer than it would be otherwise, nudging us toward a future where we can get the best of AI while avoiding the worst? Or is it only making it more powerful, speeding us toward catastrophe?

If we could safely build and work on things as smart and capable as the very models that we will later need to align, then this approach would make perfect sense. Given we thankfully cannot yet build such models, the ‘cutting edge’ is importantly below the capabilities of the future dangerous systems. In particular they are on different sides of key thresholds where we would expect current techniques to stop working, such as the AI becoming smarter than those evaluating its outputs.

For Anthropic’s thesis to be true, you need to thread a needle. Useful work now must require cutting edge models, and those cutting edge models must be sufficiently advanced to do useful work.

In order to pursue that thesis and also potentially to build an AGI or perhaps transform the world economy, Anthropic plans on raising $5 billion over the next two years. Their investor pitch deck claims that whoever gets out in front of the next wave can generate an unstoppable lead. Most of the funding will go to development of cutting edge frontier models. That certainly seems to be, at best, a double-edged sword.

Even if Anthropic is at least one step behind in some sense, Dylan gives an excellent intuition pump for why that still accelerates matters:

Coca-Cola is comfortably ahead of Pepsi in the soft drinks market. But it does not follow from this that Pepsi’s presence and behavior have no impact on Coca-Cola. In a world where Coca-Cola had an unchallenged global monopoly, it likely would charge higher prices, be slower to innovate, introduce fewer new products, and pay for less advertising than it does now, with Pepsi threatening to overtake it should it let its guard down.

What is the plan to ensure this is kept under control?

Anthropic Bias

Anthropic also gave me an early look at a wholly novel corporate structure they are unveiling this fall, centering on what they call the Long-Term Benefit Trust. The trust will hold a special class of stock (called “class T”) in Anthropic that cannot be sold and does not pay dividends, meaning there is no clear way to profit on it. The trust will be the only entity to hold class T shares. But class T shareholders, and thus the Long-Term Benefit Trust, will ultimately have the right to elect, and remove, three of Anthropic’s five corporate directors, giving the trust long-run, majority control over the company.

Right now, Anthropic’s board has four members: Dario Amodei, the company’s CEO and Daniela’s brother; Daniela, who represents common shareholders; Luke Muehlhauser, the lead grantmaker on AI governance at the effective altruism-aligned charitable group Open Philanthropy, who represents Series A shareholders; and Yasmin Razavi, a venture capitalist who led Anthropic’s Series C funding round. (Series A and C refer to rounds of fundraising from venture capitalists and other investors, with A coming earlier.) The Long-Term Benefit Trust’s director selection authorities will phase in according to time and dollars raised milestones; it will elect a fifth member of the board this fall, and the Series A and common stockholder rights to elect the seats currently held by Daniela Amodei and Muehlhauser will transition to the trust when milestones are met.

The trust’s initial trustees were chosen by “Anthropic’s board and some observers, a cross-section of Anthropic stakeholders,” Brian Israel, Anthropic’s general counsel, tells me. But in the future, the trustees will choose their own successors, and Anthropic executives cannot veto their choices. The initial five trustees are:

Trustees will receive “modest” compensation, and no equity in Anthropic that might bias them toward wanting to maximize share prices first and foremost over safety. The hope is that putting the company under the control of a financially disinterested board will provide a kind of “kill switch” mechanism to prevent dangerous AI.

The trust contains an impressive list of names, but it also appears to draw disproportionately from one particular social movement [EA].

This is a great move.

  1. There will be a (potentially!) robust mechanism to prioritize real safety.
  2. The need to choose the right board that you will lose control over also serves as good practice for a highly important safety activity that demands you figure out what you value and risks irreversible loss of control, that you thus must get right on the first try.
  3. It is also good practice for a situation in which one must simultaneously make a key safety decision and also use the details of its implementation as a costly public signal, forcing a potential conflict between the version that looks best and the version that would work best, again with potentially catastrophic consequences for failure.

This is a very EA-flavored set of picks. Are they individually good picks?

I am confident Paul Christiano is an excellent pick. I do not know the other four.

Simeon notes the current Anthropic board of six is highly safety-conscious and seems good, and for the future suggests having one explicit voice of (more confident and robust) doom for the board. He suggests Nate Soares. I’d suggest Jaan Tallinn, who is also an investor in Anthropic. If desired I am available.

My biggest concerns are Matt Levine flavored. To what extent will this group control the board, and to what extent does the board control Anthropic?

How easy will it be to choose three people for the board who will stand together, when the Amodeis have two votes and only need one more? If Dario decides he disagrees with the board, what happens in practice, especially if a potentially world-transforming system is involved? If this group is upset with the board, how much and how fast can they influence the board, either in normal mode and via their potential power to act in the future, or when they are in an active conflict and have to fight to retake control? Safety delayed is often safety denied. Could the then-current board change the rules? Who controls what a company does?

In such situations, everyone wanting to do the right thing now is good, but does not substitute for mechanism design or game theory. Power is power. Parabellum.

Simeon notes Jack Clark’s general unease about corporate governance.

Jack Clark: I am pretty skeptical of things that relate to corporate governance because I think the incentives of corporations are horrendously warped, including ours.

This is a great example of the ‘good news or bad news’ game. If you don’t already know that the incentives of corporations are horrendously warped in ways no one including Anthropic knows how to solve, and how hard interventions like this are to pull off in practice, then this quote is bad news.

However, I did already know that. So this is instead great news. Much better to realize, acknowledge and plan for the problem. No known formal intervention would have been sufficient on its own. Aligning even human-level intelligence is a very difficult unsolved problem.

I also do appreciate the hypothesis that one could create a race to safety. Perhaps with the Superalignment Taskforce at OpenAI they are partially succeeding. It is very difficult to know causality:

Dario Amodei (CEO of Anthropic): Most folks, if there’s a player out there who’s being conspicuously safer than they are, [are] investing more in things like safety research — most folks don’t want to look like, oh, we’re the unsafe guys. No one wants to look that way. That’s actually pretty powerful. We’re trying to get into a dynamic where we keep raising the bar.

We’re starting to see just these last few weeks other orgs, like OpenAI, and it’s happening at DeepMind too, starting to double down on mechanistic interpretability. So hopefully we can get a dynamic where it’s like, at the end of the day, it doesn’t matter who’s doing better at mechanistic interpretability. We’ve lit the fire.

I am happy to let Dario have this one, and for them to continue to make calls like this:

Anthropic: We all need to join in a race for AI safety.  In the coming weeks, Anthropic will share more specific plans concerning cybersecurity, red teaming, and responsible scaling, and we hope others will move forward swiftly as well.

The key is to ensure that the race is aiming at the right target, and can cut the enemy. The White House announcement was great; I will cover it on Thursday. You love to see it. It still is conspicuously missing a proper focus on compute limits and the particular dangers of frontier models – it is very much the ‘lightweight’ approach to safety emphasized by Anthropic’s Jack Clark.

This is part of a consistent failure from Anthropic to publicly advocate for regulations that have sufficient bite to keep us alive. Instead they urge sticking to what is practical. This is in contrast to Sam Altman of OpenAI, who has been a strong advocate for the need to lay groundwork to stop or at least aggressively regulate frontier model training. Perhaps this is part of a longer political game played largely out of the public eye. I can only judge by what I can observe.

The big worry is that while Anthropic has focused on safety in some ways in its rhetoric and behavior, building what is plausibly a culture of safety and being cautious with its releases, in other areas it seems to be failing to step up. This includes its rhetoric regarding regulations. I would love for there to be a fully voluntary ‘race to safety’ that sidesteps the need for regulation entirely, but I don’t see that as likely.

The bigger concern of course is the huge raise for the explicit purpose of building the most advanced model around, the exact action that one might reasonably say is unsafe to do. There is also the failure to yet establish corporate governance measures to mitigate safety concerns (although the new plan could change this).

It is also notable that, despite it being originally championed by CEO Dario Amodei, Anthropic has yet to join OpenAI’s ‘merge and assist’ clause, which reads:

We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project.

This is a rather conspicuous refusal. Perhaps such a commitment looks good when you expect to be ahead, not so good when you expect to be behind, and especially not good when you do not trust those you think are ahead? If Anthropic’s attitude is that OpenAI (or DeepMind) is sufficiently impossible to trust that they could not safely merge and assist, or that their end goals are incompatible, then they would be putting additional pressure on for the other, unsafe, lab to move quickly when they most need to not do that. It is exactly when you do not trust and agree with your rival that cooperation is needed most.

There is some hand wringing in the post about Anthropic not being a government project, along with the simple explanation of why it cannot possibly be one, which is that the government would not have funded it in any reasonable time frame, and even if they did they would never be willing to pay competitive salaries. It also for similar reasons is not a non-profit, which will push Anthropic in unfortunate directions in ways that could well be impossible to fight against. If they were a non-profit, they would not be getting billions in funding, and also people would again look askance at the compensation.


The profile updated me positively on Anthropic. The willingness to grant this level of access, and the choice of who to give it to, are good signs. The governance changes are excellent. The pervasive attitude of being deeply worried is helpful. Also Dario has finally started to communicate more in public generally, which is also good.

There is still much work to be done, and there are still many ways Anthropic could improve. Their messaging and lobbying need to get with the program. Their concepts of alignment seem to not focus on the hard problems, and they continue to put far too much stock (in my opinion) in techniques doomed to fail.

Most concerning of all, Anthropic continues to raise huge amounts of money intended for the training of frontier models designed to be the best in the world. They make the case that they do this in the name of safety, but they tell investors a different story that is highly plausible to be true in practice whether or not Anthropic’s principals want to believe it will prove true.

Which brings us back to the central paradox: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting edge systems are themselves currently sufficiently advanced for this task?

Similarly, why should we think that techniques that work in the most advanced current (or near future) systems will transfer to future more capable AGIs, if techniques on past systems wouldn’t? If we don’t think previous systems were ‘smart enough’ for the in-future effective techniques to work on them, what makes us think we are suddenly turning that corner as well?

The thesis that you need exactly the current best possible models makes sense if the alignment work aims to align current systems, or systems of similar capability to current systems. If you’re targeting systems very different from any that currently exist, more so than the gap between what Anthropic is trying to train and what Anthropic already has, what is the point?

Remember that Anthropic, like OpenAI and DeepMind, aims to create AGI.

Final note, and a sign of quality thinking:

Anthropic alignment researcher Amanda Askell offers the following correction: This says I’m a “philosopher-turned-engineer” but I prefer “philosopher-turned-researcher” so that no one expects my code to be good.


Which brings us back to the central paradox: If the thesis that you need advanced systems to do real alignment work is true, why should we think that cutting edge systems are themselves currently sufficiently advanced for this task?


I really like this framing and question.

My model of Anthropic says their answer would be: We don't know exactly which techniques work until when or how fast capabilities evolve. So we will continuously build frontier models and align them.

This assumes at least a chance that we could iteratively work our way through this. I think you are very skeptical of that. To the degree that we cannot, this approach (and to a large extent OpenAI's) seems pretty doomed.