Evaluations of large language models (“model evals”) are one of the most commonly discussed AI governance ideas. The idea is relatively straightforward: we want to be able to understand if a model is dangerous. In order to do so, we should come up with tests that help us determine whether or not the model is dangerous. 

Some people working on model evals appear to operate under a paradigm in which the “burden of proof” is on the evaluation team to find evidence of danger. If the eval team cannot find evidence that the model is dangerous, by default the model is assumed to be safe.

I think this is the wrong norm to adopt and a dangerous precedent to set. 

The burden of proof should be on AI developers, and they should be required to proactively provide evidence that their model is safe (as opposed to a regime where the burden of proof is on an independent eval team, which is required to proactively provide evidence that the model is dangerous).

Some reasons why I believe this:

  1. The downsides are much more extreme than the upsides. The potential downsides (the complete and permanent destruction or disempowerment of humanity) are much larger than the potential upsides (deploying safe AI a few years or decades earlier). In situations where the downsides are much more extreme than the upsides, I think a more cautious approach is warranted.
  2. We might not be able to detect many kinds of dangerous capabilities or misalignment. There is widespread uncertainty around when dangerous capabilities or misalignment properties will emerge, if they will even emerge in time for them to be detectable, and if we will advance “the science of model evals” quickly enough to be able to detect them. It seems plausible to me (as well as many technical folks who are working on evals) that we might never get evals that are robust enough to reliably detect all or nearly-all the possible risks. 
    1. Many stories of AI takeover involve AI models with incentives to hide undesirable properties, deceive humans, and seek power in difficult-to-detect ways. In some threat models, this may happen rather suddenly, making these properties even harder to detect. 
    2. It’s possible that AI progress will be sufficiently gradual and failures will be easy-to-notice. Several smart people believe this, and I don’t think this position is unreasonable. I do think it’s extremely risky to gamble on this position, though. If it’s even non-trivially plausible that we won’t be able to detect the dangers in advance, we should manage this risk by shifting the burden of proof.
  3. AI developers will have more power and resources than independent auditors. I expect that many of the best evals and audits will come from teams within AI labs. If we rely on independent auditing groups, it’s likely that these groups will have fewer resources, less technical expertise, less familiarity with large language models, and less access to cutting-edge models compared to AI developers. As a result, we want the burden of proof to be on AI developers.
    1. Note an analogy with the pharmaceutical industry, where pharma companies are powerful and well-resourced. The FDA does not rely on a team of auditors to assess whether or not a medical discovery is dangerous. Rather, the burden of proof is on the pharma companies. The FDA requires the companies to perform extensive research, document and report risks, and wait until the government has reviewed and approved the drug before it can be sold. (This is an oversimplification and makes the process seem less rigorous than it actually is; in reality, there are multiple phases of testing, and companies have to receive approvals at each phase before progressing). 

The burden of proof matters. I think we would be substantially safer if humanity expected AI developers to proactively show that their models were safe, as opposed to a regime where independent auditors had to proactively identify dangers. 

I’ll also note that I’m excited about a lot of the work on evals. I’m glad that there are a few experts who are thinking carefully about how to detect dangers in models. To use the FDA analogy, it’s great that some independent research groups are examining the potential dangers of various drugs. It would be a shame if we put all of our faith in the pharma companies or in the FDA regulators. 

However, I’ve heard some folks say things like “I think companies should be allowed to deploy models as long as ARC evals can’t find anything wrong with them.” I think this is pretty dangerous thinking, and I’m not convinced that AI safety advocates have to settle for this. 

Could we actually get a setup like the FDA, in which the US government requires AI developers to proactively provide evidence that their models are safe?

I don’t claim that the probability of this happening (or being well-executed) is >50%. But I do see it as an extremely important area of AI governance to explore further. The Overton Window has widened a lot in a rather short period of time: AI experts report concern about existential risks, the public supports regulation, and policymakers are starting to react.

Perhaps we’ll soon learn that hopes for a government-run regulatory regime were naive dreams. Perhaps the only feasible proposal will be evals in which the burden of proof is on a small team of auditors. But I don’t think the verdict is in yet.

In the meantime, I suggest pushing for ambitious policy proposals. For evals,[1] to state it one final time: The burden of proof should be on frontier AI developers, they should be required to proactively provide evidence that their model is safe, and this evidence should be reviewed by a government body (or government-approved body).[2]

  1. ^

    This post focuses on evals of existing models. It seems likely to me that a comprehensive FDA-like regulatory regime would also require evals of training runs before training begins, but I’ll leave that outside the scope of this post.

  2. ^

    A few groups are currently performing research designed to answer the question “what kind of evidence would allow us to confidently claim, beyond a reasonable doubt, that a model is safe?” Right now, I don’t think we have concrete answers, and I’m excited to see this research progress. One example of a criterion might be something like “we have sufficiently strong interpretability: we can fully or nearly-fully understand the decision-making processes of models, we have clear and human-understandable explanations to describe their cognition, and we have a strong understanding of why certain outputs are produced in response to certain inputs.” Unsurprisingly, I think the burden of proof should be on companies to develop tests that can prove that their models are safe. Until they can, we should err on the side of caution.


I agree with the overall conclusion that the burden of proof should be on the side of the AGI companies. 

However, using the FDA as a reference or example might not be so great, because it has historically gotten the cost-benefit trade-offs wrong many times, e.g., by not permitting medicines that were comparatively safe and highly effective. 

So if AIS evals become associated with the FDA, we might not make too many friends. Overall, I think it would be fine if the AIS auditing community is seen as generally cautious, but it should not give the impression of not updating on relevant evidence, etc. 

If I were to choose a model or reference class for AI auditing, I would probably choose the aviation industry, which seems to be pretty competent and well-regarded. 

That seems like an excellent angle on the issue. I agree that reference models, and stakeholders' different attitudes towards them, likely have a huge impact. As such, the criticisms the FDA faces might indeed be an issue (at least that's how I understand your comment). 

However, I'd carefully offer a bit of pushback on the aviation industry as an example, keeping in mind the difficult tradeoffs and diverging interests regulators will face in designing an approval process for AI systems. I think the problems regulators will face are more similar to those of the FDA, and policymakers (if you assume they are your audience) might be more comfortable with a model that can somewhat withstand these problems. 

Below is my reasoning (with a bit of overstatement/political rhetoric, e.g., "risking people's lives").

As you highlighted, the FDA faces substantial criticism for being too cautious, e.g., the Covid vaccine took longer to approve in the US than in the UK. Not permitting a medicine that would have been comparatively safe and highly effective, i.e., a false negative, means forgoing a profound positive impact on people's lives. And beyond the public interest, industry has considerable financial interest in getting drugs through too. In a similar vein, I expect that regulators will face quite some pushback when "slowing down" innovation, i.e., not approving a model. On the other side, being too fast in pushing drugs through the pipeline is also commonly criticized (e.g., the recent Alzheimer's drug approval as a false-positive example). Even more so, losing its reputation as a trustworthy regulator has a lot of knock-on effects (i.e., will people trust an FDA-approved vaccine in the future?). As such, both being too cautious and being too aggressive carry potentially high costs to people's lives, and striking the right balance is incredibly difficult.

The aviation industry also faces a tradeoff, but I would argue one side is inherently "weaker" than the other (for lack of a better description). In case something bad happens, there are huge reputational costs to the regulator if it has invested "too little" in safety. A false-negative error, however, i.e., overestimating the level of caution required and demanding more safety than necessary, does not necessarily damage the regulator's reputation; there are more or less only economic costs. And most people seem to be okay with high safety standards in aviation. In other words, and simplified: "overinvesting" in safety comes at an economic cost, while "underinvesting" in safety comes at reputational costs to the regulator and potentially costs to people's lives. 

My guess is that the reputational risks (and competing goals) that AI regulators will face, particularly with regard to false negatives, are similar to those of the FDA. They will be seen as either too cautious/interventionist/innovation-hampering or too aggressive, if not both. Aviation safety, in my perception, is rarely seen as too cautious (or at least that is not something that gets routinely criticized by the public). 

Policymakers, especially those currently "battling big tech," are quite well aware of the tradeoffs they will face and the breadth of stakeholders involved. As such, using an example that can withstand the reputational costs of applying too much caution might be a bit more powerful in some cases. In a similar vein, the FDA model has been much more thoroughly probed for regulatory capture (not getting a single drug approved is incredibly costly for one firm but not for the whole industry, while industry-wide costs from safety restrictions in aviation can be passed on to consumers). 

Nonetheless, I completely understand the concern that "we might not make too many friends," particularly among those focused on typical "pro-innovation considerations" or industry interests, and that it makes sense to use this example with some caution.

This makes sense. Can you say more about how aviation regulation differs from the FDA?

In other words, are there meaningful differences in how the regulatory processes are set up? Or does it just happen to be the case that the FDA has historically been worse at responding to evidence compared to the Federal Aviation Administration? 

(I think it's plausible that we would want a structure similar to the FDA even if the particular individuals at the FDA were bad at cost-benefit analysis, unless there are arguments that the structure of the FDA caused the bad cost-benefit analyses).

So far, I haven't looked into it in detail and I'm only reciting other people's testimony. I intend to dive deeper into these fields soon. I'll let you know when I have a better understanding.  

This reminds me of what Evan said here: https://www.lesswrong.com/posts/uqAdqrvxqGqeBHjTP/towards-understanding-based-safety-evaluations

My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn't necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible.

Nice-- very relevant. I agree with Evan that arguments about the training procedure will be relevant (I'm more uncertain about whether checking for deception behaviorally will be harder than avoiding it, but it certainly seems plausible). 

Ideally, I think the regulators would be flexible in the kind of evidence they accept. If a developer has evidence that the model is not deceptive that relies on details about the training procedure, rather than behavioral testing, that could be sufficient.

(In fact, I think arguments that meet some sort of "beyond-a-reasonable-doubt" threshold would likely involve providing arguments for why the training procedure avoids deceptive alignment.)

On one hand, this makes a lot of sense; on the other, I'm worried that if the regulations are too rigorous, we might prevent a responsible actor from being the first to deploy AGI, in favour of less responsible ones.

I agree with this norm, though I think it would be better to say that the "burden of evidence" should be on labs. When I first read the title, I thought you wanted labs to somehow prove the safety of their system in a conclusive way. What this probably looks like in practice is "we put x resources into red teaming and didn't find any problems." I would be surprised if 'proof' were ever an appropriate term. 

If you think that AI development should be banned, you should say that.

Can you say more about what part of this relates to a ban on AI development?

I think the claim "AI development should be regulated in a way such that the burden of proof is on developers to show beyond-a-reasonable-doubt that models are safe" seems quite different from the claim "AI development should be banned", but it's possible that I'm missing something here or communicating imprecisely. 

Apologies, I was a bit blunt here.

It seems to me that the most obvious reading of "the burden of proof is on developers to show beyond-a-reasonable-doubt that models are safe" is in fact "all AI development is banned".  It's...not clear at all to me what a proof of a model being safe would even look like, and based on everything I've heard about AI Alignment (admittedly mostly from elsewhere on this site) it seems that no-one else knows either. 

A policy of 'developers should have to prove that their models are safe' would make sense in a world where we had a clear understanding that some types of model were safe, and wanted to make developers show that they were doing the safe thing and not the unsafe thing.  Right now, to the best of my understanding, we have no idea what is safe and what isn't.

If you have some idea of what a 'proof of safety' would look like under your system, could you say more about that?  Are there any existing AI systems you think can satisfy this requirement?  

From my perspective the most obvious outcomes of a burden-of-proof policy like you describe seem to be:

  • If it is interpreted literally and enforced as written, it will in fact be a full ban on AI development.  Actually proving an AI system to be safe is not something we can currently do.
  • Many possible implementations of it would not in fact ban AI development, but it's not clear that what they would do would actually relate to safety.  For instance, I can easily imagine outcomes like:
    • AI developers are required to submit a six-thousand-page 'proof' of safety to the satisfaction of some government bureau.  This would work out to something along the lines of 'only large companies with compliance departments can develop AI', which might be beneficial under some sets of assumptions that I do not particularly share?
    • AI developers are required to prove some narrow thing about their AI (e.g. that their AI will never output a racial slur under any circumstances whatsoever).  While again this might be beneficial under some sets of assumptions, it's not clear that it would in fact have much relationship to AI safety.