Evaluations of large language models (“model evals”) are one of the most commonly discussed AI governance ideas. The idea is relatively straightforward: we want to be able to understand if a model is dangerous. In order to do so, we should come up with tests that help us determine whether or not the model is dangerous. 

Some people working on model evals appear to operate under a paradigm in which the “burden of proof” is on the evaluation team to find evidence of danger. If the eval team cannot find evidence that the model is dangerous, by default the model is assumed to be safe.

I think this is the wrong norm to adopt and a dangerous precedent to set. 

The burden of proof should be on AI developers, and they should be required to proactively provide evidence that their model is safe (as opposed to a regime where the burden of proof is on an independent eval team, which is required to proactively provide evidence that the model is dangerous).

Some reasons why I believe this:

  1. The downsides are much more extreme than the upsides. The potential downsides (the complete and permanent destruction or disempowerment of humanity) are much larger than the potential upsides (deploying safe AI a few years or decades earlier). In situations where the downsides are much more extreme than the upsides, I think a more cautious approach is warranted.
  2. We might not be able to detect many kinds of dangerous capabilities or misalignment. There is widespread uncertainty around when dangerous capabilities or misalignment properties will emerge, if they will even emerge in time for them to be detectable, and if we will advance “the science of model evals” quickly enough to be able to detect them. It seems plausible to me (as well as many technical folks who are working on evals) that we might never get evals that are robust enough to reliably detect all or nearly-all the possible risks. 
    1. Many stories of AI takeover involve AI models with incentives to hide undesirable properties, deceive humans, and seek power in difficult-to-detect ways. In some threat models, this may happen rather suddenly, making these properties even harder to detect. 
    2. It’s possible that AI progress will be sufficiently gradual and failures will be easy-to-notice. Several smart people believe this, and I don’t think this position is unreasonable. I do think it’s extremely risky to gamble on this position, though. If it’s even non-trivially plausible that we won’t be able to detect the dangers in advance, we should manage this risk by shifting the burden of proof.
  3. AI developers will have more power and resources than independent auditors. I expect that many of the best evals and audits will come from teams within AI labs. If we rely on independent auditing groups, it’s likely that these groups will have fewer resources, less technical expertise, less familiarity with large language models, and less access to cutting-edge models compared to AI developers. As a result, we want the burden of proof to be on AI developers.
    1. Note an analogy with the pharmaceutical industry, where pharma companies are powerful and well-resourced. The FDA does not rely on a team of auditors to assess whether or not a medical discovery is dangerous. Rather, the burden of proof is on the pharma companies. The FDA requires the companies to perform extensive research, document and report risks, and wait until the government has reviewed and approved the drug before it can be sold. (This is an oversimplification and makes the process seem less rigorous than it actually is; in reality, there are multiple phases of testing, and companies have to receive approvals at each phase before progressing). 

The burden of proof matters. I think we would be substantially safer if humanity expected AI developers to proactively show that their models were safe, as opposed to a regime where independent auditors had to proactively identify dangers. 

I’ll also note that I’m excited about a lot of the work on evals. I’m glad that there are a few experts who are thinking carefully about how to detect dangers in models. To use the FDA analogy, it’s great that some independent research groups are examining the potential dangers of various drugs. It would be a shame if we put all of our faith in the pharma companies or in the FDA regulators. 

However, I’ve heard some folks say things like “I think companies should be allowed to deploy models as long as ARC evals can’t find anything wrong with them.” I think this is pretty dangerous thinking, and I’m not convinced that AI safety advocates have to settle for this. 

Could we actually get a setup like the FDA, in which the US government requires AI developers to proactively provide evidence that their models are safe?

I don’t claim that the probability of this happening (or being well-executed) is >50%. But I do see it as an extremely important area of AI governance to explore further. The Overton Window has widened a lot in a rather short period of time: AI experts report concern about existential risks, the public supports regulation, and policymakers are starting to react.

Perhaps we’ll soon learn that hopes for a government-run regulatory regime were naive dreams. Perhaps the only feasible proposal will be evals in which the burden of proof is on a small team of auditors. But I don’t think the verdict is in yet.

In the meantime, I suggest pushing for ambitious policy proposals. For evals,[1] to state it one final time: The burden of proof should be on frontier AI developers, they should be required to proactively provide evidence that their model is safe, and this evidence should be reviewed by a government body (or government-approved body).[2]

  1. ^

    This post focuses on evals of existing models. It seems likely to me that a comprehensive FDA-like regulatory regime would also require evals of training runs before training begins, but I’ll leave that outside the scope of this post.

  2. ^

    A few groups are currently performing research designed to answer the question “what kind of evidence would allow us to confidently claim, beyond a reasonable doubt, that a model is safe?” Right now, I don’t think we have concrete answers, and I’m excited to see this research progress. One example of a criterion might be something like “we have sufficiently strong interpretability: we can fully or nearly-fully understand the decision-making processes of models, we have clear and human-understandable explanations to describe their cognition, and we have a strong understanding of why certain outputs are produced in response to certain inputs.” Unsurprisingly, I think the burden of proof should be on companies to develop tests that can prove that their models are safe. Until they can, we should err on the side of caution.


I agree with the overall conclusion that the burden of proof should be on the side of the AGI companies. 

However, using the FDA as a reference or example might not be so great, because it has historically gotten the cost-benefit trade-offs wrong many times, e.g., by not permitting medicines that were comparatively safe and highly effective. 

So if AIS evals become associated with the FDA, we might not make too many friends. Overall, I think it would be fine if the AIS auditing community is seen as generally cautious, but it should not give the impression of not updating on relevant evidence, etc. 

If I were to choose a model or reference class for AI auditing, I would probably choose the aviation industry, which seems to be pretty competent and well-regarded. 

That seems like an excellent angle on the issue. I agree that reference models, and stakeholders' different attitudes towards them, likely have a huge impact. As such, the criticisms the FDA faces might indeed be an issue (at least that's how I understand your comment). 

However, I'd carefully offer a bit of pushback on the aviation industry as an example, keeping in mind the difficult tradeoffs and diverging interests regulators will face in designing an approval process for AI systems. I think the problems regulators will face are more similar to those of the FDA, and policymakers (if you assume they are your audience) might be more comfortable with a model that can somewhat withstand these problems. 

Below is my reasoning (with a bit of overstatement/political rhetoric, e.g., "risking people's lives").

As you highlighted, the FDA faces substantial criticism for being too cautious, e.g., the Covid vaccine took longer to approve in the US than in the UK. Not permitting a medicine that would have been comparatively safe and highly effective, i.e., a false negative, means forgoing a profound positive impact on people's lives. And beyond the public interest, industry has considerable financial interest in getting drugs through too. In a similar vein, I expect that regulators will face quite some pushback when "slowing down" innovation, i.e., not approving a model. On the other side, being too fast in pushing drugs through the pipeline is also commonly criticized (e.g., the recent Alzheimer's drug approval as a false-positive example). Even more so, losing its reputation as a trustworthy regulator has a lot of knock-on effects (i.e., will people trust an FDA-approved vaccine in the future?). As such, both being too cautious and being too aggressive carry potentially high costs to people's lives, and striking the right balance is incredibly difficult.

The aviation industry also faces a tradeoff, but I would argue one side is inherently "weaker" than the other (for lack of a better description). In case something bad happens, there are huge reputational costs to the regulator if it has invested "too little" in safety. A false-negative error, however, i.e., overestimating the level of caution required and demanding more safety than necessary, does not necessarily damage the regulator's reputation; there are more or less only economic costs. And most people seem to be okay with high safety standards in aviation. In other words, and simplified: "overinvesting" in safety comes at an economic cost, while "underinvesting" in safety comes at reputational costs to the regulator and potentially costs to people's lives. 

My guess is that the reputational risks (and competing goals) that AI regulators will face, particularly with regard to false negatives, are similar to those of the FDA. They will be seen as either too cautious/interventionist/innovation-hampering or too aggressive, if not both. Aviation safety, in my perception, is rarely seen as too cautious (or at least that is not something that gets routinely criticized by the public). 

Policymakers, especially those currently "battling big tech," are quite well aware of the tradeoffs they will face and the breadth of stakeholders involved. As such, using an example that can withstand the reputational costs of applying too much caution might be a bit more powerful in some cases. In a similar vein, the FDA model has been much more thoroughly probed for regulatory capture (not getting a single drug approved is incredibly costly for one firm but not for the whole industry, while industry-wide costs from safety restrictions in aviation can be passed on to consumers). 

Nonetheless, I completely understand the concern that "we might not make too many friends," particularly among those focused on typical "pro-innovation considerations" or industry interests, and that it makes sense to use this example with some caution.

This makes sense. Can you say more about how aviation regulation differs from the FDA?

In other words, are there meaningful differences in how the regulatory processes are set up? Or does it just happen to be the case that the FDA has historically been worse at responding to evidence compared to the Federal Aviation Administration? 

(I think it's plausible that we would want a structure similar to the FDA even if the particular individuals at the FDA were bad at cost-benefit analysis, unless there are arguments that the structure of the FDA caused the bad cost-benefit analyses).

So far, I haven't looked into it in detail and I'm only reciting other people's testimony. I intend to dive deeper into these fields soon. I'll let you know when I have a better understanding.  

This reminds me of what Evan said here: https://www.lesswrong.com/posts/uqAdqrvxqGqeBHjTP/towards-understanding-based-safety-evaluations

My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn't necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible.

Nice-- very relevant. I agree with Evan that arguments about the training procedure will be relevant (I'm more uncertain about whether checking for deception behaviorally will be harder than avoiding it, but it certainly seems plausible). 

Ideally, I think the regulators would be flexible in the kind of evidence they accept. If a developer has evidence that the model is not deceptive that relies on details about the training procedure, rather than behavioral testing, that could be sufficient.

(In fact, I think arguments that meet some sort of "beyond-a-reasonable-doubt" threshold would likely involve providing arguments for why the training procedure avoids deceptive alignment.)

On one hand, this makes a lot of sense; on the other, I'm worried that if the regulations are too rigorous, we might prevent a responsible actor from being the first to deploy AGI, in favour of less responsible ones.

I agree with this norm, though I think it would be better to say that the "burden of evidence" should be on labs. When I first read the title, I thought you wanted labs to somehow prove the safety of their system in a conclusive way. What this probably looks like in practice is "we put x resources into red teaming and didn't find any problems." I would be surprised if 'proof' were ever an appropriate term. 

If you think that AI development should be banned, you should say that.

Can you say more about what part of this relates to a ban on AI development?

I think the claim "AI development should be regulated in a way such that the burden of proof is on developers to show beyond-a-reasonable-doubt that models are safe" seems quite different from the claim "AI development should be banned", but it's possible that I'm missing something here or communicating imprecisely. 

Apologies, I was a bit blunt here.

It seems to me that the most obvious reading of "the burden of proof is on developers to show beyond-a-reasonable-doubt that models are safe" is in fact "all AI development is banned".  It's...not clear at all to me what a proof of a model being safe would even look like, and based on everything I've heard about AI Alignment (admittedly mostly from elsewhere on this site) it seems that no-one else knows either. 

A policy of 'developers should have to prove that their models are safe' would make sense in a world where we had a clear understanding that some types of model were safe, and wanted to make developers show that they were doing the safe thing and not the unsafe thing.  Right now, to the best of my understanding, we have no idea what is safe and what isn't.

If you have some idea of what a 'proof of safety' would look like under your system, could you say more about that?  Are there any existing AI systems you think can satisfy this requirement?  

From my perspective the most obvious outcomes of a burden-of-proof policy like you describe seem to be:

  • If it is interpreted literally and enforced as written, it will in fact be a full ban on AI development.  Actually proving an AI system to be safe is not something we can currently do.
  • Many possible implementations of it would not in fact ban AI development, but it's not clear that what they would do would actually relate to safety.  For instance, I can easily imagine outcomes like:
    • AI developers are required to submit a six-thousand-page 'proof' of safety to the satisfaction of some government bureau.  This would work out to something along the lines of 'only large companies with compliance departments can develop AI', which might be beneficial under some sets of assumptions that I do not particularly share?
    • AI developers are required to prove some narrow thing about their AI (e.g. that their AI will never output a racial slur under any circumstances whatsoever).  While again this might be beneficial under some sets of assumptions, it's not clear that it would in fact have much relationship to AI safety.