My default reaction to this post is pretty strongly negative, because I feel that it doesn’t adequately engage with the opposing point-of-view, and why people who work on evaluations may think that their work is impactful and good. You are not required to do that before posting your opinion, but I think if you had tried slightly more to pass the ITT of people who work on evaluations, this post would read quite differently and perhaps more convincingly.
In any case, to object to some specific points:
Regulations do exist. I don’t understand the objection to evaluations being that they rely on regulations that don’t exist. It may be that you think the regulations are inadequate, but it is simply not the case that “regulations don’t exist.” Work by evaluators was, in fact, often quite instrumental in getting these regulations passed and taken seriously and implemented, as far as I'm aware.
Some examples: the EU AI Act, New York’s RAISE Act, Texas’ Responsible AI Governance Act, California’s SB 53, and South Korea’s AI Framework Act. The situation in China is harder to know about in detail. Even in the absence of codified “AI regulation” – for example, in the UK – existing regulators can often apply pre-existing laws to AI. Even if these don’t cover the harms we are most concerned about, they are helpful on the margin, and further progress is being made here (led by UK AISI!)
I’m more sympathetic to your second objection, but I think this mostly depends on the jurisdiction. In the common-law frameworks with which I am most familiar, the burden of proof simply is on the public to make the case to legislators to pass AI regulation. Otherwise it is unregulated, and thus there are no legal requirements specific to AI. So yes, if you want to get AI regulated, you need to convince decision-makers that AI has to be regulated. Naturally, I agree with you that we ought to be significantly more conservative in the development of AI than we currently are, but nevertheless if you want to get regulations, you need to pass them (often in the face of overwhelming lobbying efforts against such regulations by AI and tech firms).
It is hard to draw an analogy here – perhaps a good one would be drug discovery in the US. Before the FDA existed and put the burden of proof on those bringing drugs to market, there were no regulations on selling drugs; people could sell whatever they wanted to. Over time, in the early-to-mid 20th century, more regulations were brought in. Eventually, in 1962, the government gave the FDA much more power and forced drug manufacturers to prove both efficacy and safety before bringing a drug to market. However, the only reason it was able to establish an organization with such teeth and put the onus on drug manufacturers was public outcry over a specific disastrous case (thalidomide), and because the groundwork had already been laid by prior regulations empowering the USDA. We’re trying to speedrun this process with AI, and evaluations are going to form a part of any sufficient regulatory framework. Perhaps you would say that there is no way for us to prove AI systems are safe and we need to pause, and I would be deeply sympathetic. Unfortunately, I’m pessimistic about the prospects of such a suggestion, and so I would rather lay the groundwork for an effective and powerful regulatory regime.
There are two more specific points that I would take exception to:
1) That Goodfire is well-described as “a startup leveraging interpretability for capabilities.” I think this is a part of what Goodfire does – and I wish they weren’t doing quite as much of that as they are – though this is certainly not the full story, and I think this is an unfair descriptor as currently written (i.e. it would lead someone unfamiliar with Goodfire to have false beliefs about them).
2) The criticism of the revolving door culture. This is often the case in regulatory setups (e.g. the SEC, the Treasury, and Wall Street). This makes sense! You want good economists and people who actually understand financial systems to be the ones regulating them, and this is inevitably going to lead to something of a revolving-door culture. It’s not optimal, but this is the world we live in. Expertise is unfortunately rare. I think it is better to criticize specific decisions made by evaluations companies, or decisions they haven’t taken, rather than pointing at associations between people and inferring harmful behavior without evidence (I think your example of people at evaluations orgs being concerned about losing API access if they speak out is a much better criticism here). In general I think this sort of guilt-by-association is not very fruitful.
Thanks for engaging. I have karma-upvoted but disagree-voted.
You wrote a lot, so I'll focus on your two bounded points.
About Goodfire
I think you are wrong. If I had to guess, it's in the same way that people are misguided about Anthropic.
For comparison, this is the copy at the top of Anthropic's Research page:
Our research teams investigate the safety, inner workings, and societal impacts of AI models – so that artificial intelligence has a positive impact as it becomes increasingly capable.
This is their copy at the top of Goodfire's About page:
Goodfire is a research company using interpretability to understand, learn from, and design AI systems. Our mission is to build the next generation of safe and powerful AI—not by scaling alone, but by understanding the intelligence we're building.
And at the top of their Research page:
Fundamental interpretability research to understand and intentionally design advanced AI systems
They explicitly state their wish to design advanced AI systems and the next generation of AI. To the extent that I may have mis-characterised them, it is in the opposite direction: what I described could lead people to think they are just doing capabilities on top of existing systems – RAG or something similar – rather than designing new ones.
About Revolving Doors
I think it is better to criticize specific decisions made by evaluations companies, or decisions they haven’t taken, rather than pointing at associations between people and inferring harmful behavior without evidence (I think your example of people at evaluations orgs being concerned about losing API access if they speak out is a much better criticism here). In general I think this sort of guilt-by-association is not very fruitful.
I think you are fractally wrong in this paragraph. I'll try to show it to you.
1) "It is better to [A] than [B]" is a false dilemma. In the article, as you point out, I just do both.
2) "Pointing at associations between people and inferring harmful behavior without evidence" does not describe what I am doing, again as you point out.
3) It is crucial to point at associations between people and infer harmful behaviour from them, even without confirmation! One should not reject evidence just because it is not "very fruitful" when used by itself. It is obviously relevant that the head of safety of US-AISI, the CEO of Open Phil, and the CEO of Anthropic were all roommates, and it is trivial to infer things from that.
4) I literally wrote "This is not only about having a couple of senior staff from the industry. That in itself can be good! It’s the whole picture that looks bad.", and then proceeded to list the whole picture.
About Goodfire
So probably this cashes out as more fundamental generator disagreements that aren't worth hashing out here. Broadly I think it's okay for a company to say "We are designing and advancing the next generation of AI systems" and I think to analyze whether they should be bucketed with capabilities labs like OpenAI or Anthropic (which I think are also meaningfully different places), one should look at and critically assess their research output.
Like, if someone believes that interpretability will be both helpful to build better systems and helpful to build safer systems, I think it's justifiable for them to do the thing that builds better systems in the hope that those systems are also safer than the next-best thing that would've been built (and that's probably reliant on a bunch of other beliefs where we may differ, as I said before).
About revolving doors
I think this post overstates its case, but it makes important under-discussed points!
Here's another argument against evals that resonates with me:
One big reason why people do evals is that they see evals as completely cooperative and non-adversarial, whereas regulation and advocacy could cause conflict and polarization. If all you're doing is explaining facts, no one should regret interacting with you.
I think that's not quite right in the case of AI wakeup and AI regulation. It seems obvious that, in retrospect, all ambitious or powerful people will wish they had made AGI their top priority at some point in the now-past, such as 2023. So if, in your honest and cooperative communication, you cause people to be less afraid and more complacent about AI, they will likely regret that interaction in retrospect! At some point later, they will realize they were grossly underrating AI and lacked sufficient context to reason effectively about it, and they will probably wish that someone had pushed them harder to prioritize AI safety. In that sense, it's actually more cooperative to selectively share information that increases wakeup and willingness to take costly safety actions, not unbiased information. Of course this consequentialist reasoning is not a solid basis for decision-making, but it does make me believe that telling someone who isn't intimately familiar with AI "actually Y model can't cause catastrophe Z yet" is not really a mutually beneficial interaction.
This is not a relationship of “These guys are building systems that may cause humanity’s extinction and we must stop them.”, and it’s not even one of “There are clear standards that corporations must abide by, or else.”
It is one of “We are their subordinate and depend on access to their APIs. We hope that one day, our work will be useful in helping them not deploy dangerous systems. In the meantime, we de facto help with their PR.”
Even assuming your latter claim about AI evals orgs is entirely true, isn't this enough to make evals organizations useful?
Any AI regulation that actually constrains AI companies must depend on an enforcement mechanism. For the lawmakers working on such legislation, it is convenient if such a mechanism already exists and can simply be activated. Therefore, the existence of AI evals orgs makes it easier to pass legislation to constrain AI companies and improve their safety – with the orgs already in place and researching effective AI safety evals, the only thing the law needs to change is to make working with AI evals orgs mandatory for the AI labs rather than voluntary. Isn't this pretty much the same direction that those funding and working at these organizations today would hope for?
I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these risks.
In the community, a few organisations are working on “Evaluations” (which I will shorten to Evals). The most notable examples are Apollo Research, METR, and the UK AISI.
Evals make for an influential cluster of safety work, wherein auditors outside of the AI Corporations racing for ASI evaluate the new AI systems before they are deployed and publish their findings.
Evals have become a go-to project for people who want to prevent extinction risks. I would say they are the primary project for those who want to work at the interface of technical work and policy.
Incidentally, Evals Orgs consistently avoid mentioning extinction risks. This makes them an ideal place for employees and funders who care about extinction risks but do not want to be public about them. (I have written about this dynamic in my article about The Spectre.)
Sadly, despite the prominence it has taken in the “AI Safety” community, I believe that the Evals project is harmful. I believe that it should not receive further attention or investment, and I consider it plausible that it should be discontinued.
I am not exaggerating for shock value. This article will explain why I think Evals are harmful. My thinking primarily relies on three beliefs:
1) The Theory of Change behind Evals is broken.
2) Evals move the burden of proof away from AI Corporations.
3) Evals Organisations are not independent of AI Corporations, despite claiming otherwise.
—
While Evals Orgs have produced studies that we sometimes mentioned at ControlAI, they have always been much less central to our work than the Center for AI Safety’s statement. Indeed, the top AI experts explicitly warning about extinction risks is more useful than decontextualised technical results.
Even when we use this type of result, we rarely mention Evals Orgs anymore. We now tend to use Palisade’s report on resistance to shutdown or Anthropic’s results on blackmail.
From my point of view, when factoring in their negative externalities, Evals clearly do not justify the prominence they have and the resources they command. With all that said…
1) The Theory of Change behind Evals is broken
Briefly put, Evals only make sense in the presence of regulations which do not exist, and they crowd out effort at passing such regulations.
—
It is usually quite hard to debunk the plans of an organisation. This is because said plans are rarely laid out for everyone to see. However, Apollo Research has carefully laid out their theory of change in a document, for which I am very thankful.
Inspecting it though, its core assumptions are clearly wrong! Here are the first two:
There is no such regulation.
Given their non-existence, it is astonishing to me that people care so much about Evals instead of advocating for regulations.
Evals are entirely dependent on the existence of such regulation.
Even worse, as I will show later, Evals Orgs have put themselves in a position where their incentives are sometimes to fight alongside AI Corps against said regulations.
—
Specifically, Evals Orgs all rely on the assumption that the development and/or deployment of systems with dangerous capabilities is prevented.
From Apollo Research:
From METR:
From the UK AISI’s “Approach to Evaluations” document:
In other words, the work of Evals Orgs only makes sense if AI Corporations are forbidden from deploying systems with dangerous capabilities, and if said capabilities are not too dangerous before deployment.
Their work is thus dependent on other people working hard to make it illegal to develop and deploy AI systems with dangerous capabilities.
In practice, as far as I am aware, no company has ever been compelled to do anything as a result of external Evaluations. I believe there has never been a model blocked, postponed, or constrained before deployment, let alone during development.
As a result, it seems clear to me that until we actually ban “dangerous capabilities”, their work is not worth much.
2) Evals move the burden of proof away from AI Corporations
So far, I have mostly focused on the fact that the theory of change behind Evals is broken. But I believe that Evals Orgs are actually harmful.
—
First, let’s give some context on extinction risks from AI.
In 2023, the top experts in the field warned about the risk of extinction from AI. However, although most agree that there are risks of extinction, there is little agreement (let alone consensus) on anything else.
The top AI experts disagree wildly on the probability of said extinction, on when the first AGI systems may be built, on how to make AGI systems safe, and as METR itself notes: even on the definition of AGI.
These are all the signs of a pre-paradigmatic field, where experts cannot even agree on what the facts of the matter are. When despite this, experts nevertheless warn about the literal extinction of humanity, it stands to reason that conservatism is warranted.
In other words, AI Corps should not be allowed to pursue R&D agendas that risk killing everyone until we figure out what is going on. If they nevertheless want to continue, they ought to prove beyond a shadow of doubt that what they are doing will not kill everyone.
If there are reasonable disagreements among experts about whether an R&D program is about to lead to human extinction, that should absolutely be enough warrant to interrupt it.
In my personal experience, this line of reasoning is obvious to lay people and many policy makers.
Still in my personal experience: the closer someone is to the sphere of influence of AI Corps, the less obvious conservatism is to them.
—
Incidentally, Evals Orgs reverse this principle. They start with the assumption that AI Corps should be allowed to continue unimpeded until a third party can demonstrate that a specific AI system is dangerously capable.
This is a complete reversal of the burden of proof! Evals Orgs put on the public the onus of proving that a given AI system is dangerously capable. To the extent that they recommend anything be done, it is only when the public detects that something is wrong.
This has it exactly backwards.
The top AI Experts have already warned about the extinction risks of AI systems. Many are forecasting scenarios where the risks are concentrated in development rather than deployment.
Evals Orgs themselves admit that they cannot establish the safety of an AI system! For instance, the UK AISI straightforwardly states:
In this context, of course, AI Corps should be the ones who establish that their R&D programs are not likely to cause human extinction. It shouldn’t be that third party evaluators demonstrate that individual systems are free of risks.
—
As established in the first section, Evals only make sense in the context of constraining regulations. But instead, they have diverted attention and resources away from the work on such regulations.
Furthermore, not only did they divert resources away from what was needed, they have been actively harmful. Their work alleviates the burden of proof on AI Corps and instead punts it onto the public, through NGOs and government agencies.
3) Evals Organisations are not independent of the AI Corporations
Finally, Evals Orgs have been harmful by conveying a false sense of independence from AI Corps. In my experience, their silence on matters of extinction is taken as a neutral party’s confirmation that the situation with AI Corps is not urgent.
For context: all of them loudly proclaim the importance of “external”, “independent” or “third-party” evaluators.
Apollo’s document mentions 9 reasons why external evaluators are important.
METR puts in bold “that the world needs an independent third-party” in their mission statement.
The UK AISI states clearly “We are an independent evaluator” in their “Approach to Evaluations” document.
But unfortunately, Evaluators are not independent, not even close:
1) In practice, their incentives are structured so that they are dominated by AI Corporations. We are far from the standard of evaluators having leverage over the corporations.
2) Their staff is deeply intertwined with that of AI Corporations.
—
On the first point, the AI Corporations decide whether Evals Orgs get access to their APIs, the timing of that access, and the NDA terms.
The CEO of METR was quite candid about this dynamic in an 80K interview:
And yet, the Evals Orgs proudly showcase the AI Corporations they work with on their home pages, deeming them “Partners”.
They are proud to work with them, and how many AI Corps will work with them serves as a social measure of their success.
While the UK AISI doesn’t have a Partners page, it has proudly partnered with ElevenLabs to “explore the implications of AI voice technology”, or Google DeepMind as “an important part of [their] broader collaboration with the UK Government on accelerating safe and beneficial AI progress”.
This “partnership” structure creates obvious problems. Insiders have told me that they can’t say or do anything publicly against AI Corporations, else they would lose their API access.
This is not a relationship of “These guys are building systems that may cause humanity’s extinction and we must stop them.”, and it’s not even one of “There are clear standards that corporations must abide by, or else.”
It is one of “We are their subordinate and depend on access to their APIs. We hope that one day, our work will be useful in helping them not deploy dangerous systems. In the meantime, we de facto help with their PR.”
—
Before moving on to the next point, let’s explain why we need the staff of third party Evals Organisations to be independent from that of the AI Corporations that they wish to regulate.
To be extra-clear, this is not about any single individual being “independent” or not, whatever this may mean. The considerations around independence are structural. Namely, we want to ensure that…
And suffice it to say, Evals Orgs do not ensure any of the above.
On Apollo’s side, two cofounders of Apollo left for Goodfire (a startup leveraging interpretability for capabilities, raising $200M in the process). Apollo was also initially funded by Open Philanthropy, who also funded OpenAI. Speaking of which, a couple of its staff worked at OpenAI and I know of one who left for Google DeepMind.
On METR’s side, its CEO formerly worked at both DeepMind and OpenAI. The other person listed in its leadership section is ex-OpenAI too. Furthermore, they described their own work on Responsible Scaling Policies “as a first step safety-concerned labs could take themselves, rather than designed for policymakers”!
For the UK AISI, I will quote its About page:
The same can be found with the (now repurposed) US AISI, whose head of safety worked at OpenAI and was housemates with the CEO of Anthropic.
—
When I describe the situation to outsiders, to people who are not in AI or AI Safety, they are baffled.
This is not only about having a couple of senior staff from the industry. That in itself can be good! It’s the whole picture that looks bad.
Evals Organisations ought to be regulating AI Corps. But instead, they use taxpayers’ money and philanthropic funds to do testing for them for free, with no strings attached, and AI Corps give up virtually nothing in exchange.
They are proud to publicly partner with them, and they depend on them to continue their activities.
Both through revolving doors and the personal relationships of their employees, they are culturally and socially deeply intertwined with AI Corps.
And yet, at the same time, they all tout the importance of independence and neutrality. This is what makes the situation baffling.
Conclusion
I would summarise the situation as:
Despite all of this, Evals are one of the (if not the) most popular projects in AI Safety. They are my canonical example of the too-clever-by-half failures from the AI Safety Community.
—
If you fund or work on Evaluations to help with extinction risks, I would strongly invite you to re-evaluate whether your money and time are not better spent elsewhere.
As an advisor to ControlAI, I would naturally suggest ControlAI as an alternative. If not ControlAI, I would recommend pursuing endeavours similar in spirit to ControlAI’s Direct Institutional Plan: education on ASI, extinction risks, and what policies are necessary to deal with them. This could be done by founding your own organisation to inform lawmakers, or by partnering with MIRI and PauseAI on their like-minded initiatives.
—
Overall, I believe that the AI Safety Community would have been and would still be much better off if the people in the Evals cluster stopped playing 4D chess games with AI Corps and started informing the public (lay people and policy makers alike) about the risks of extinction and the necessity of banning ASI.
People in the AI Safety Community are confused about this topic. I am regularly told that Evals organisations care about extinction risks to humanity. And yet.
A Google search for “extinction” returns 0 results on the UK AISI’s website, only 2 on METR’s, and a single one on Apollo’s.
This is a sharp example of The Spectre: the dynamic wherein the “AI Safety” community keeps coming up with alternatives to straightforward advocacy on extinction risks and a ban of superintelligence.
On this, cheers!