My default reaction to this post is pretty strongly negative, because I feel that it doesn’t adequately engage with the opposing point-of-view, and why people who work on evaluations may think that their work is impactful and good. You are not required to do that before posting your opinion, but I think if you had tried slightly more to pass the ITT of people who work on evaluations, this post would read quite differently and perhaps more convincingly.
In any case, to object to some specific points:
Regulations do exist. I don’t understand the objection to evaluations being that they rely on regulations that don’t exist. It may be that you think the regulations are inadequate, but it is simply not the case that “regulations don’t exist.” Work by evaluators was, in fact, often quite instrumental in getting these regulations passed and taken seriously and implemented, as far as I'm aware.
Some examples: the EU AI Act, New York’s RAISE Act, Texas’ Responsible AI Governance Act, California’s SB 53, and South Korea’s AI Framework Act. The situation in China is harder to know about in detail. Even in the absence of codified “AI regulation” – for example, in the UK – existing regulators can often apply pre-existing laws to AI. Even if these don’t cover the harms we are most concerned about, they are helpful on the margin, and further progress is being made here (led by UK AISI!)
I’m more sympathetic to your second objection, but I think this mostly depends on the jurisdiction. In the common-law frameworks with which I am most familiar, the burden of proof simply is on the public to make the case to legislators to pass AI regulation. Otherwise it is unregulated, and thus there are no legal requirements specific to AI. So yes, if you want to get AI regulated, you need to convince decision-makers that AI has to be regulated. Naturally, I agree with you that we ought to be significantly more conservative in the development of AI than we currently are, but nevertheless if you want to get regulations, you need to pass them (often in the face of overwhelming lobbying efforts against such regulations by AI and tech firms).
It is hard to draw an analogy here – perhaps a good one would be drug discovery in the US. Before the FDA existed and put the burden of proof on those bringing drugs to market, there were no regulations on selling drugs; people could sell whatever they wanted to. Over time, in the early-to-mid 20th century, more regulations were brought in. Eventually, in 1962, the government gave the FDA much more power and forced drug manufacturers to prove both efficacy and safety before bringing a drug to market. However, the only reason it was able to establish an organization with such teeth and put the onus on drug manufacturers was public outcry over a specific disastrous case (thalidomide), and because the groundwork had already been laid by prior regulations empowering the USDA. We’re trying to speedrun this process with AI, and evaluations are going to form a part of any sufficient regulatory framework. Perhaps you would say that there is no way for us to prove AI systems are safe and we need to pause, and I would be deeply sympathetic. Unfortunately, I’m pessimistic about the prospects of such a suggestion, and so I would rather lay the groundwork for an effective and powerful regulatory regime.
There are two more specific points that I would take exception to:
1) That Goodfire is well-described as “a startup leveraging interpretability for capabilities.” I think this is a part of what Goodfire does – and I wish they weren’t doing quite as much of that as they are – though this is certainly not the full story, and I think this is an unfair descriptor as currently written (i.e. it would lead someone unfamiliar with Goodfire to have false beliefs about them).
2) The criticism of the revolving door culture. This is often the case in regulatory setups (e.g. the SEC, the Treasury, and Wall Street). This makes sense! You want good economists and people who actually understand financial systems to be the ones regulating them, and this is inevitably going to lead to something of a revolving-door culture. It’s not optimal, but this is the world we live in. Expertise is unfortunately rare. I think it is better to criticize specific decisions made by evaluations companies, or decisions they haven’t taken, rather than pointing at associations between people and inferring harmful behavior without evidence (I think your example of people at evaluations orgs being concerned about losing API access if they speak out is a much better criticism here). In general I think this sort of guilt-by-association is not very fruitful.
Thanks for engaging. I have karma-upvoted but disagree-voted.
You wrote a lot, so I'll focus on your two bounded points.
About Goodfire
I think you are wrong. If I had to guess, it's in the same way that people are misguided about Anthropic.
For comparison, this is the copy at the top of Anthropic's Research page:
Our research teams investigate the safety, inner workings, and societal impacts of AI models – so that artificial intelligence has a positive impact as it becomes increasingly capable.
This is their copy at the top of Goodfire's About page:
Goodfire is a research company using interpretability to understand, learn from, and design AI systems. Our mission is to build the next generation of safe and powerful AI—not by scaling alone, but by understanding the intelligence we're building.
And at the top of their Research page:
Fundamental interpretability research to understand and intentionally design advanced AI systems
They explicitly state their wish to design advanced AI systems and the next generation of AI. To the extent that I may have mis-characterised them, it is in the opposite direction: what I described could lead people to think they are just doing capabilities on top of existing systems – RAG or something similar – rather than designing new ones.
About Revolving Doors
I think it is better to criticize specific decisions made by evaluations companies, or decisions they haven’t taken, rather than pointing at associations between people and inferring harmful behavior without evidence (I think your example of people at evaluations orgs being concerned about losing API access if they speak out is a much better criticism here). In general I think this sort of guilt-by-association is not very fruitful.
I think you are fractally wrong in this paragraph. I'll try to show it to you.
1) "It is better to [A] than [B]" is a false dilemma. In the article, as you point out, I just do both.
2) "Pointing at associations between people and inferring harmful behavior without evidence" does not describe what I am doing, again as you point out.
3) It is crucial to point at associations between people and infer harmful behaviour from them, even without confirmation! One should not reject evidence just because it is not "very fruitful" when used by itself. It is obviously relevant that the head of safety of US-AISI, the CEO of Open Phil, and the CEO of Anthropic were all roommates, and it is trivial to infer things from that.
4) I literally wrote "This is not only about having a couple of senior staff from the industry. That in itself can be good! It’s the whole picture that looks bad.", and then proceeded to list the whole picture.
About Goodfire
So probably this cashes out as more fundamental generator disagreements that aren't worth hashing out here. Broadly I think it's okay for a company to say "We are designing and advancing the next generation of AI systems" and I think to analyze whether they should be bucketed with capabilities labs like OpenAI or Anthropic (which I think are also meaningfully different places), one should look at and critically assess their research output.
Like, if someone believes that interpretability will be both helpful to build better systems and helpful to build safer systems, I think it's justifiable for them to do the thing that builds better systems in the hope that those systems are also safer than the next-best thing that would've been built (and that's probably reliant on a bunch of other beliefs where we may differ, as I said before).
About revolving doors
I think this post overstates its case, but it makes important under-discussed points!
Here's another argument against evals that resonates with me:
One big reason why people do evals is that they see evals as completely cooperative and non-adversarial, whereas regulation and advocacy could cause conflict and polarization. If all you're doing is explaining facts, no one should regret interacting with you.
I think that's not quite right in the case of AI wakeup and AI regulation. It seems obvious that, in retrospect, all ambitious or powerful people will wish they had made AGI their top priority at some point in the now-past, such as 2023. So if, in your honest and cooperative communication, you cause people to be less afraid and more complacent about AI, they will likely regret that interaction in retrospect! At some point later, they will realize they were grossly underrating AI and lacked sufficient context to reason effectively about it, and they will probably wish that someone had pushed them harder to prioritize AI safety. In that sense, it's actually more cooperative to selectively share information that increases wakeup and willingness to take costly safety actions, not unbiased information. Of course this consequentialist reasoning is not a solid basis for decision-making, but it does make me believe that telling someone who isn't intimately familiar with AI "actually Y model can't cause catastrophe Z yet" is not really a mutually beneficial interaction.
This is not a relationship of “These guys are building systems that may cause humanity’s extinction and we must stop them.”, and it’s not even one of “There are clear standards that corporations must abide by, or else.”
It is one of “We are their subordinate and depend on access to their APIs. We hope that one day, our work will be useful in helping them not deploy dangerous systems. In the meantime, we de facto help with their PR.”
Even assuming your latter claim about AI evals orgs is entirely true, isn't this enough to make evals organizations useful?
Any AI regulation that actually constrains AI companies must depend on an enforcement mechanism. For the lawmakers working on such legislation, it is convenient if such a mechanism already exists and can simply be activated. Therefore, the existence of AI evals orgs makes it easier to pass legislation to constrain AI companies and improve their safety – with the orgs already in place and researching effective AI safety evals, the only thing the law needs to change is to make working with AI evals orgs mandatory for the AI labs rather than voluntary. Isn't this pretty much the same direction that those funding and working at these organizations today would hope for?
I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these risks.
In the community, a few organisations are working on “Evaluations” (which I will shorten to Evals). The most notable examples are Apollo Research, METR, and the UK AISI.
Evals make for an influential cluster of safety work, wherein auditors outside of the AI Corporations racing for ASI evaluate the new AI systems before they are deployed and publish their findings.
Evals have become a go-to project for people who want to prevent extinction risks. I would say they are the primary project for those who want to work at the interface of technical work and policy.
Incidentally, Evals Orgs consistently avoid mentioning extinction risks. This makes them an ideal place for employees and funders who care about extinction risks but do not want to be public about them. (I have written about this dynamic in my article about The Spectre.)
Sadly, despite the prominence it has taken in the “AI Safety” community, I believe that the Evals project is harmful. I believe that it should not receive further attention or investment, and I consider it plausible that it should be discontinued.
I am not exaggerating for shock value. This article will explain why I think Evals are harmful. My thinking primarily relies on three beliefs:
1) The Theory of Change behind Evals is broken.
2) Evals move the burden of proof away from AI Corporations.
3) Evals Organisations are not independent of AI Corporations, despite claiming otherwise.
—
While Evals Orgs have produced studies that we sometimes mentioned at ControlAI, they have always been much less central to our work than the Center for AI Safety’s statement. Indeed, the top AI experts explicitly warning about extinction risks is more useful than decontextualised technical results.
Even when we use this type of result, we rarely mention Evals Orgs anymore. We now tend to use Palisade’s report on resistance to shutdown or Anthropic’s results on blackmail.
From my point of view, when factoring in their negative externalities, Evals clearly do not justify the prominence they have and the resources they command. With all that said…
1) The Theory of Change behind Evals is broken
Briefly put, Evals only make sense in the presence of regulations which do not exist, and they crowd out effort at passing such regulations.
—
It is usually quite hard to debunk the plans of an organisation. This is because said plans are rarely laid out for everyone to see. However, Apollo Research has carefully laid out their theory of change in a document, for which I am very thankful.
Inspecting it though, its core assumptions are clearly wrong! Here are the first two:
There is no such regulation.
Given their non-existence, it is astonishing to me that people care so much about Evals instead of advocating for regulations.
Evals are entirely dependent on the existence of such regulation.
Even worse, as I will show later, Evals Orgs have put themselves in a position where their incentives are sometimes to fight alongside AI Corps against said regulations.
—
Specifically, Evals Orgs all rely on the assumption that the development and/or deployment of systems with dangerous capabilities is prevented.
From Apollo Research:
From METR:
From the UK AISI’s “Approach to Evaluations” document:
In other words, the work of Evals Orgs only makes sense if AI Corporations are forbidden from deploying systems with dangerous capabilities, and if said capabilities are not too dangerous before deployment.
Their work is thus dependent on other people working hard to make it illegal to develop and deploy AI systems with dangerous capabilities.
In practice, as far as I am aware, no company has ever been compelled to do anything as a result of external Evaluations. I believe there has never been a model blocked, postponed, or constrained before deployment, let alone during development.
As a result, it seems clear to me that until we actually ban “dangerous capabilities”, their work is not worth much.
2) Evals move the burden of proof away from AI Corporations
So far, I have mostly focused on the fact that the theory of change behind Evals is broken. But I believe that Evals Orgs are actually harmful.
—
First, let’s give some context on extinction risks from AI.
In 2023, the top experts in the field warned about the risk of extinction from AI. However, although most agree that there are risks of extinction, there is little agreement (let alone consensus) on anything else.
The top AI experts disagree wildly on the probability of said extinction, on when the first AGI systems may be built, on how to make AGI systems safe, and as METR itself notes: even on the definition of AGI.
These are all the signs of a pre-paradigmatic field, where experts cannot even agree on what the facts of the matter are. When despite this, experts nevertheless warn about the literal extinction of humanity, it stands to reason that conservatism is warranted.
In other words, AI Corps should not be allowed to pursue R&D agendas that risk killing everyone until we figure out what is going on. If they nevertheless want to continue, they ought to prove beyond a shadow of doubt that what they are doing will not kill everyone.
If there are reasonable disagreements among experts about whether an R&D program is about to lead to human extinction, that should absolutely be enough warrant to interrupt it.
In my personal experience, this line of reasoning is obvious to lay people and many policy makers.
Still in my personal experience: the closer someone is to the sphere of influence of AI Corps, the less obvious conservatism is to them.
—
Incidentally, Evals Orgs reverse this principle. They start with the assumption that AI Corps should be allowed to continue unimpeded until a third party can demonstrate that a specific AI system is dangerously capable.
This is a complete reversal of the burden of proof! Evals Orgs put on the public the onus of proving that a given AI system is dangerously capable. To the extent that they recommend anything be done, it is only when the public detects that something is wrong.
This has it exactly backwards.
The top AI Experts have already warned about the extinction risks of AI systems. Many are forecasting scenarios where the risks are concentrated in development rather than deployment.
Evals Orgs themselves admit that they cannot establish the safety of an AI system! For instance, the UK AISI straightforwardly states:
In this context, of course, AI Corps should be the ones who establish that their R&D programs are not likely to cause human extinction. It shouldn’t be that third party evaluators demonstrate that individual systems are free of risks.
—
As established in the first section, Evals only make sense in the context of constraining regulations. But instead, they have diverted attention and resources away from the work on such regulations.
Furthermore, not only did they divert resources away from what was needed, they have been actively harmful. Their work alleviates the burden of proof on AI Corps and instead punts it onto the public, through NGOs and government agencies.
3) Evals Organisations are not independent of the AI Corporations
Finally, Evals Orgs have been harmful by conveying a false sense of independence from AI Corps. In my experience, their silence on matters of extinction is taken as a neutral party’s confirmation that the situation with AI Corps is not urgent.
For context: all of them loudly proclaim the importance of “external”, “independent” or “third-party” evaluators.
Apollo’s document mentions 9 reasons why external evaluators are important.
METR puts in bold “that the world needs an independent third-party” in their mission statement.
The UK AISI states clearly “We are an independent evaluator” in their “Approach to Evaluations” document.
But unfortunately, Evaluators are not independent, not even close:
1) In practice, their incentives are structured so that they are dominated by AI Corporations. We are far from the standard of evaluators having leverage over the corporations.
2) Their staff is deeply intertwined with that of AI Corporations.
—
On the first point, the AI Corporations decide whether Evals Orgs get access to their APIs, the timing of that access, and the NDA terms.
The CEO of METR was quite candid about this dynamic in an 80K interview:
And yet, the Evals Orgs proudly showcase the AI Corporations they work with on their home pages, deeming them “Partners”.
They are proud to work with them, and how many AI Corps will work with them serves as a social measure of their success.
While the UK AISI doesn’t have a Partners page, it has proudly partnered with ElevenLabs to “explore the implications of AI voice technology”, or Google DeepMind as “an important part of [their] broader collaboration with the UK Government on accelerating safe and beneficial AI progress”.
This “partnership” structure creates obvious problems. Insiders have told me that they can’t say or do anything publicly against AI Corporations, else they would lose their API access.
This is not a relationship of “These guys are building systems that may cause humanity’s extinction and we must stop them.”, and it’s not even one of “There are clear standards that corporations must abide by, or else.”
It is one of “We are their subordinate and depend on access to their APIs. We hope that one day, our work will be useful in helping them not deploy dangerous systems. In the meantime, we de facto help with their PR.”
—
Before moving on to the next point, let’s explain why we need the staff of third party Evals Organisations to be independent from that of the AI Corporations that they wish to regulate.
To be extra-clear, this is not about any single individual being “independent” or not, whatever this may mean. The considerations around independence are structural. Namely, we want to ensure that…
And suffice it to say, Evals Orgs do not ensure any of the above.
On Apollo’s side, two cofounders of Apollo left for Goodfire (a startup leveraging interpretability for capabilities, raising $200M in the process). Apollo was also initially funded by Open Philanthropy, who also funded OpenAI. Speaking of which, a couple of its staff worked at OpenAI and I know of one who left for Google DeepMind.
On METR’s side, its CEO formerly worked at both DeepMind and OpenAI. The other person listed in its leadership section is ex-OpenAI too. Furthermore, they described their own work on Responsible Scaling Policies “as a first step safety-concerned labs could take themselves, rather than designed for policymakers”!
For the UK AISI, I will quote its About page:
The same can be found with the (now repurposed) US AISI, whose head of safety worked at OpenAI and was housemates with the CEO of Anthropic.
—
When I describe the situation to outsiders, to people who are not in AI or AI Safety, they are baffled.
This is not only about having a couple of senior staff from the industry. That in itself can be good! It’s the whole picture that looks bad.
Evals Organisations ought to be regulating AI Corps. But instead, they use taxpayers’ money and philanthropic funds to do testing for them for free, with no strings attached, and AI Corps give up virtually nothing in exchange.
They are proud to publicly partner with them, and they depend on them to continue their activities.
Both through revolving doors and the personal relationships of their employees, they are culturally and socially deeply intertwined with AI Corps.
And yet, at the same time, they all tout the importance of independence and neutrality. This is what makes the situation baffling.
Conclusion
I would summarise the situation as:
Despite all of this, Evals are one of the (if not the) most popular projects in AI Safety. They are my canonical example of the too-clever-by-half failures from the AI Safety Community.
—
If you fund or work on Evaluations to help with extinction risks, I would strongly invite you to re-evaluate whether your money and time are not better spent elsewhere.
As an advisor to ControlAI, I would naturally suggest ControlAI as an alternative. If not ControlAI, I would recommend pursuing endeavours similar in spirit to ControlAI’s Direct Institutional Plan: education on ASI, extinction risks, and what policies are necessary to deal with them. This could be done by founding your own organisation to inform lawmakers, or by partnering with MIRI and PauseAI on their like-minded initiatives.
—
Overall, I believe that the AI Safety Community would have been and would still be much better off if the people in the Evals cluster stopped playing 4D chess games with AI Corps and started informing the public (lay people and policy makers alike) about the risks of extinction and the necessity of banning ASI.
People in the AI Safety Community are confused about this topic. I am regularly told that Evals organisations care about extinction risks to humanity. And yet.
A Google search for “extinction” returns 0 results on the UK AISI’s website, only 2 on METR’s, and a single one on Apollo’s.
This is a sharp example of The Spectre: the dynamic wherein the “AI Safety” community keeps coming up with alternatives to straightforward advocacy on extinction risks and a ban of superintelligence.
On this, cheers!