By "checkable" do you mean "machine checkable"?
I'm confused because I understand you to be asking for a bound on the derivative of an EfficientNet model, but it seems quite easy (though perhaps kind of a hassle) to get this bound.
I don't think the floating point numbers matter very much (especially if you're ok with the bound being computed a bit more loosely).
Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images that differ only in the least significant bit of a single pixel? Prove your answer before 2023.
You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
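To illustrate the kind of bound I mean, here's a minimal sketch with toy dense weight matrices standing in for the network's layers (a real EfficientNet has conv kernels and Swish activations rather than ReLUs, but each conv is still a linear operator and Swish has a small constant Lipschitz factor, so the same argument goes through with looser constants):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a network's weight matrices; these shapes and values
# are made up for illustration, not taken from any real model.
weights = [
    rng.normal(size=(64, 784)),
    rng.normal(size=(32, 64)),
    rng.normal(size=(10, 32)),
]

# ReLU is 1-Lipschitz, so the product of each layer's operator norm
# (its largest singular value) upper-bounds the Lipschitz constant of
# the whole composition.
lipschitz_bound = float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

# For images scaled to [0, 1], flipping the least significant bit of one
# pixel moves the input by 1/255 in L2 norm, so the pre-softmax output
# moves by at most this much:
delta = lipschitz_bound * (1.0 / 255.0)
print(lipschitz_bound, delta)
```

This is exactly the "ignore the ReLUs" bound: it's typically very loose, and tighter certified bounds exist (e.g. interval bound propagation or semidefinite relaxations), but it is a legitimate upper bound.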
Am I correct that you wouldn't find a bound acceptable, you specifically want the exact maximum?
Suppose you have three text-generation policies, and you define "policy X is better than policy Y" as "when a human is given a sample from each of policy X and policy Y, they prefer the sample from X more than half the time". That definition of "better" is not transitive.
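A rock-paper-scissors-style cycle makes this concrete. The preference rates below are hypothetical numbers chosen for illustration; each pairwise comparison is perfectly well-defined, yet the induced "better than" relation cycles:

```python
# Hypothetical pairwise preference rates: pref[(X, Y)] is the fraction
# of comparisons in which a human prefers policy X's sample over Y's.
pref = {
    ("A", "B"): 0.6, ("B", "A"): 0.4,
    ("B", "C"): 0.6, ("C", "B"): 0.4,
    ("C", "A"): 0.6, ("A", "C"): 0.4,
}

def better(x, y):
    """X is 'better' than Y if humans prefer X's sample more than half the time."""
    return pref[(x, y)] > 0.5

# A beats B, B beats C, and yet C beats A: a cycle with no "best" policy.
print(better("A", "B"), better("B", "C"), better("C", "A"))
```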
I think we prefer questions on the EA Forum.
Thanks, glad to hear you appreciate us posting updates as we go.
You're totally right that we'll probably have low quality on those prompts. But we're defining quality with respect to the overall prompt distribution, and so as long as prompts that can't be realistically completed non-injuriously are rare, our average quality won't take that big a hit.
We've now added this visual feedback, thanks for the suggestion :)
So note that we're actually working on the predicate "an injury occurred or was exacerbated", rather than something about violence (I edited out the one place I referred to violence instead of injury in the OP to make this clearer).

The reason I'm not that excited about finding this latent is that I suspect that the snippets that activate it are particularly easy cases--we're only interested in generating injurious snippets that the classifier is wrong about.
For example, I think that the model is currently okay with dropping babies probably because it doesn't really think of this as an injury occurring, and so I wouldn't have thought that we'd find an example like this by looking for things that maximize the injury latent. And I suspect that most of the problem here is finding things that don't activate the injury latent but are still injurious, rather than things that do.
One way I've been thinking of this is that maybe the model has like 20 different concepts for violence, and we're trying to find each of them in our fine tuning process.
We've tried some things kind of like this, though less sophisticated. The person who was working on this might comment describing them at some point.
One fundamental problem here is that I'm worried that finding a "violence" latent is already what we're doing when we fine-tune. And so I'm worried that the classifier mistakes that will be hardest to stamp out are those that we can't find through this kind of process.
I have an analogous concern with the "make the model generate only violent completions" idea--if we knew how to define "violent", we'd already be done. And so I'd worry that the definition of violence used by the generator here is the same as the definition used by the classifier, and so we wouldn't find any new mistakes.