Recently, Joe Carlsmith moved to Anthropic. He joins other members of the broader EA and Open Philanthropy ecosystem working at the lab, such as Holden Karnofsky, and of course many of Anthropic's original founders were EA-affiliated.
In short, I think Anthropic is honest and is attempting to be an ethical AI lab, but they are deeply mistaken about the difficulty of the problem they face and are having a dangerous effect on the AI safety ecosystem. My guess is that Anthropic is, for the most part, internally honest and not consciously trying to deceive anyone: when they say they believe in being responsible, that is what they genuinely believe.
My criticism of Anthropic is that they do not have a promising plan and are creating a dangerous counter-narrative to AI safety efforts. Developing AI gradually, running evaluations, and doing interpretability work is simply not enough to build safe superintelligence; with the methods we have, we are just not going to get there. Gradual development (the Responsible Scaling Policy, RSP) has only a small benefit: on a gradual scale you may be able to see problems emerge, but it does not tell you how to solve them. The same goes for evaluations and interpretability. They may tell you that problems exist, but they do not tell you how to do things safely. Fundamentally, these methods will not survive the "before and after" regime shift.
In Dario's essay "Machines of Loving Grace," he describes a dream of a world with "countries of geniuses in datacenters," one in which, in the long run, all physical labor is replaced by automation.
Let's imagine this future for a moment: superintelligent AI in a fully automated economy. All necessities are handled by AI, and robots are mass-produced in automated factories. A billion humanoid robots go around doing the economy's physical tasks, while human labor is relegated to creative, optional work. All infrastructure is automated, including AI fabs and power stations, and key decisions are made with the advice of AI systems.
It should be clear and undeniable that in this world, AI would be extremely dangerous. At that point, if the AI wanted to, it could essentially flip a switch to disempower and kill us all; this future is a regime in which AI is actively dangerous.
If you look into the future and see a world with nations of geniuses in datacenters and an automated economy, and that is where we are trying to steer, then you must accept that at some point we will transition from a "before" to an "after": from a passively safe regime to an actively dangerous one.
Importantly, we won’t be able to step back if we make a mistake and things go catastrophically wrong, since we can’t step back from death. Therefore we get one critical leap at which our assumptions and methods have to hold. This property of the alignment problem makes learn-as-we-go methods fundamentally non-viable.
Anthropic's main efforts can largely be summarized as responsible scaling, evaluations, and interpretability. They seemingly believe in gradually scaling to superintelligent AI while evaluating and interpreting it, in addition to methods that change the observed behavior of AIs, such as reinforcement learning from AI feedback with carefully designed constitutions. Some of this work is quite advanced and may even point at the misbehavior of future, much more powerful AI; examples include the work on alignment faking and on using SAE features to understand evaluation awareness.
By scaling gradually and responsibly, we could see some warning signs, such as undesirable behavior in weaker systems, but this certainly will not tell us how to build a safe AI. Furthermore, as we scale these systems and try new methods, we are never quite sure about the capabilities of the next generation of AI systems, and any assumption of gradual, predictable progress will certainly break down if a paradigm breakthrough makes AI much more capable.
The same goes for evaluations. It is good to run them, and they may give us early signs of how an AI might misbehave as it becomes more powerful, but they will not tell us how to build an AI that does not behave that way. Evaluations also face the problem of evaluation awareness, where the system recognizes what is going on. In reality, it will be very difficult to convince a very capable system that it really is in charge of running, say, a mining operation in which it has to make decisions that test its moral judgment. These evaluation environments will look very different from what a superintelligent AI sees when it is on the cusp of taking over.
Interpretability might tell us that a model's internals include scheming or evaluation awareness, but it does not tell us how to build AIs without these features. Of the three, I think interpretability has the highest probability of eventually helping us get truly safe systems by understanding them first. But the field is still very early, and it is not clear that we can understand the current technology at a deep enough level.
If you believe that we will at some point step into a dangerous regime, and that we cannot meaningfully step back, then we cannot rely on methods that at best show us warning signs. We would need methods that tell us, ahead of time, how to build a safe system that works reliably on a distribution very different from anything we can meaningfully test, and that works on the first try.
What makes Anthropic worrisome is that they are developing a powerful counter-narrative to what actual alignment and safety would take. But let's consider that there are roughly three groups in the AI debate: those who claim there is essentially zero danger from AI and want to race ahead; those, like Anthropic, who acknowledge the risk but believe superintelligence can be built safely through responsible scaling, evaluations, and interpretability; and those who argue we should not build superintelligence until we actually know how to make it safe.
Without trying to evaluate the truth of these positions, it seems clear that the first two are much more similar to each other than either is to the third: Anthropic is pursuing superintelligence just like the other AI labs.
Most people can recognize that "there is zero danger from AI" makes no sense; I have never heard an even vaguely reasonable argument for that position. Anthropic's narrative, by contrast, appears credible to policymakers and talented people. This intellectual counter-narrative is in some ways more dangerous than flat-out denial, because it actually reaches powerful people, whereas pure dismissal of AI risk is easier to see through.
Superalignment
Many in this camp hope to develop superalignment and to rely on it as we cross into the dangerous regime. Superalignment means using AI to solve alignment for us. There are various reasons I expect superalignment not to work; building a strong intuition for why would take a longer argument, but for one, I am not sure superalignment would be practically distinguishable from recursive self-improvement.
Anthropic has a document called "Core Views on AI Safety" in which they discuss the possibility of "pessimistic scenarios." They explain that if they see evidence that, in their judgment, supports the idea that alignment is very hard, they might call for a pause on scaling. They do not explain exactly how this would be operationalized.
I do think that in the best case, they may produce powerful evidence that eventually leads us to a pause. However, I do not know whether this is any more likely than their current alignment methods sweeping problems under the rug. Those methods could also accelerate the deployment of AI systems by eliminating small misbehaviors.
Even if there were a powerful warning shot, the narrative advanced by Anthropic could be used against people calling for a pause: instead of pausing development, the response might be to call for more safety evaluations while scaling and development continue.
In conclusion, I think it is fundamentally dangerous to build superintelligence without a plan for how to do it safely. While Anthropic positions itself on the side of safety, the narrative it creates is dangerously capable of misleading people.