No, We're Not Getting Meaningful Oversight of AI

by Davidmanheim
9th Jul 2025

This is a linkpost for https://arxiv.org/abs/2507.03525
4 comments, sorted by top scoring
Katalina Hernandez

Amazing, David! As discussed privately, I completely agree. I hope that the misnamed standard in the EU AI Act can be course corrected :).

Davidmanheim

I will point out that this post is not the explicit argument or discussion of the paper - I'd be happy to discuss the technical oversight issues as well, but here I wanted to make the broader point.

In the paper, we do make these points, but situate the argument in terms of the difference between oversight and control, which is important for building the legal and standards arguments for oversight. (The hope I have is that putting better standards and rules in place will reduce the ability of AI developers and deployers to unknowingly and/or irresponsibly claim oversight when it's not occurring, or may not actually be possible.)

Charlie Steiner

In a new paper with Aidan Homewood, "Limits of Safe AI Deployment: Differentiating Oversight and Control,"

The link should go to arXiv.

Still reading the paper, but it seems like your main point is that if oversight is "meaningful," then it should be able to stop bad behavior before it actually gets executed (it might fail, but failures should be somewhat rare). And that we don't have "meaningful oversight" of high-profile models in this sense (and especially not of the systems built on top of these models, considered as a whole) because they don't catch bad behavior before it happens.

Instead we have some weaker category of thing that lets the bad stuff happen, waits for the public to bring it to the attention of the AI company, and then tries to stop it.

Is this about right?

Davidmanheim

Thanks - link fixed.

On your summary, that is not quite the main point. There are a few different points in the paper, and they aren't limited to frontier models. Overall, based on what we define, basically no one is doing oversight in a way that the paper would call sufficient, for almost any application of AI - if it is being done, it's not made clear how it's done, what failure modes are being addressed, or what is done to mitigate the issues with the different methods. As I said at the end of the post, if they are doing oversight, they should be able to explain how.

For frontier models, we can't even clearly list the failure modes we should be protecting against in a way that would let a human trying to watch the behavior be sure whether something qualifies or not. And that's not even getting into the fact that there is no attempt to use human oversight - at best they are doing automated oversight of the vague set of things they want the model to refuse. But yes, as you pointed out, even their post-hoc reviews as oversight are nontransparent, if they occur at all, and the remediation when they are shown egregious failures by the public, like sycophancy or deciding to be MechaHitler, largely consists of further ad-hoc adjustments.


Note: this is a linkpost for a new paper.

One of the most common (and comfortable) assumptions in AI safety discussions—especially outside of technical alignment circles—is that oversight will save us. Whether it's a human in the loop, a red team audit, or a governance committee reviewing deployments, oversight is invoked as the method by which we’ll prevent unacceptable outcomes.

It shows up everywhere: in policy frameworks, in corporate safety reports, and in standards documents. Sometimes it's explicit, like the EU AI Act requiring that high-risk AI systems be subject to human oversight, or stated as an assumption, as in a DeepMind paper also released yesterday, where they say that scheming won't happen because AI won't be able to evade oversight. Other times it's implicit: firms claiming that they are mitigating risk through regular audits and fallback procedures, or arguments that no one will deploy unsafe systems in places without sufficient oversight.

But either way, there's a shared background belief: that meaningful oversight is both possible and either already happening or likely to happen.

In a new paper with Aidan Homewood, "Limits of Safe AI Deployment: Differentiating Oversight and Control," we argue, among other things, that both parts of this belief are often false. Where meaningful oversight is possible, it often isn't present, and in many cases it isn't even possible.

So AI developers need to stop waving "oversight" around as a magic word. If you're going to claim your AI system is supervised, we propose that the claim needs to be documented, not hand-waved. That means that developers and deployers need to explain, in detail (see the sketch after this list):

  • What kind of supervision it is (control vs. oversight),
  • What risks it addresses,
  • What its failure modes are, and
  • Why it will actually work.
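
For concreteness, here is a minimal sketch of what such a documented supervision claim might look like as a structured record. This is purely illustrative and not from the paper: the class and field names are assumptions, and a real disclosure would be prose and evidence rather than code. The point is only that each of the four elements above must be filled in explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SupervisionType(Enum):
    # The paper distinguishes control from oversight; the definitions are in the paper,
    # not reproduced here.
    CONTROL = "control"
    OVERSIGHT = "oversight"


@dataclass
class SupervisionClaim:
    """One documented claim of supervision for a deployed AI system (illustrative only)."""
    system: str                        # the system or component being supervised
    supervision_type: SupervisionType  # control vs. oversight
    risks_addressed: List[str]         # the concrete risks this supervision is meant to catch
    known_failure_modes: List[str]     # ways the supervision itself can fail
    effectiveness_rationale: str       # why this mechanism is expected to actually work


def is_documented(claim: SupervisionClaim) -> bool:
    """A claim counts as documented only if every element is actually filled in."""
    return bool(
        claim.system.strip()
        and claim.risks_addressed
        and claim.known_failure_modes
        and claim.effectiveness_rationale.strip()
    )


# Hypothetical example: a deployer's claim about a support chatbot.
claim = SupervisionClaim(
    system="customer-support chatbot",
    supervision_type=SupervisionType.OVERSIGHT,
    risks_addressed=["issuing unauthorized refunds"],
    known_failure_modes=["reviewers only sample a small fraction of transcripts"],
    effectiveness_rationale="sampled human review plus automated flagging of refund intents",
)
assert is_documented(claim)
```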

If you can’t do that, you don’t have oversight. You have a story. And stories don't stop disasters.