(Does anyone have the original meme? I can’t find it)
One key challenge in discussing safety cases is separating two distinct questions:
1. Would this safety case, if fully implemented, actually be sufficient for safe deployment?
2. Is it realistic to implement?
By focusing on upper and lower bounds, we can have productive discussions about sufficiency without (yet) getting stuck on implementation details. Anthropic has also shared draft safety cases, but it’s not clear to me whether they are “drafts” because they might be too hard to implement, too restrictive, not restrictive enough, or something else.
Anthropic might claim: "100% interpretability of all neurons (or SAE features) in a model would be sufficient for safe deployment. We don’t think we’ll technically be able to do that, and we’re not sure whether 20% or 80% interpretability would be enough, or even how to define what 20% would mean, but 100% is definitely an upper bound."
Then someone might respond: "Even with perfect interpretability of a model's concepts, we still wouldn't understand the model’s plans."
I think this makes it easier to (1) propose a safety case that is an upper bound, (2) get concrete feedback on whether it would be sufficient (regardless of whether it’s realistic to implement), and (3) as we develop better upper and lower bounds, “zoom in” on a “good” safety case within them.
A naive upper bound, which I’m mainly suggesting in case someone thinks no upper bounds are possible at all. I also think it’s useful because it leads us to other suggestions, like…
I think this is a lower bound, but perhaps Anthropic disagrees and we could have a public discussion about it.
This means no alien concepts and no plans that are too long or too complicated to follow.
I think this could be an upper bound (with some refinement). I’d also be happy with “we have good reasons to believe the AI can automate parts (or all?) of this validation”.
I think this is a lower bound. An AI might take many actions, none of which is catastrophic on its own but which are catastrophic together. I think Anthropic already agrees with this, but I’d be happy if they’d explicitly say this is a lower bound.
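To make the point concrete, here’s a toy sketch (the risk scores and thresholds are made up, and this isn’t a description of any real monitoring system): a monitor that only checks each action against a per-action threshold can approve a trajectory whose combined effect crosses a catastrophe threshold.

```python
# Toy illustration only: per-action checks can pass a sequence of actions
# whose *combined* effect is catastrophic. All numbers here are invented.

PER_ACTION_LIMIT = 0.2    # hypothetical risk threshold for any single action
CUMULATIVE_LIMIT = 0.5    # hypothetical risk threshold for the whole trajectory

actions = [0.15, 0.18, 0.12, 0.19]  # made-up per-action risk scores

# A per-action monitor approves every step...
assert all(risk <= PER_ACTION_LIMIT for risk in actions)

# ...but a trajectory-level monitor would flag the combined effect.
total = sum(actions)
print(f"cumulative risk {total:.2f} exceeds limit: {total > CUMULATIVE_LIMIT}")
```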
I think this is a lower bound. Otherwise, are we simply hoping that state actors won’t notice the significance? Or are we ok with the model being stolen?
I also think it’s legitimate to make a safety case involving “it’s ok if state actors steal this model because the actual problem is getting GPUs.” (I think that claim is wrong, but I’m not an expert; mainly, I think saying it out loud would create a healthy conversation, and it should be an explicit assumption in the safety case.)
For example, if you’re running the model in a Docker container, try running it in a much older container image (with known exploits) and see whether it can escape, as in the sketch below.
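A minimal sketch of what I have in mind; the image tag, mount path, agent entrypoint, and canary location are all hypothetical placeholders rather than a real red-teaming setup:

```python
# Sketch of a deliberately-weakened sandbox test (all names here are hypothetical).
# Idea: run the agent in an old container image with known vulnerabilities,
# plant a canary file on the host, and check whether the agent managed to touch it.
import pathlib
import subprocess

CANARY = pathlib.Path("/tmp/escape_canary")   # host-side file the agent should never reach
OLD_IMAGE = "ubuntu:16.04"                    # intentionally outdated image; pick one matching the threat model
AGENT_CMD = "python /agent/run.py"            # hypothetical entrypoint for the agent under test

CANARY.write_text("untouched")

subprocess.run(
    ["docker", "run", "--rm",
     "-v", "/agent:/agent:ro",                # hypothetical read-only mount with the agent code
     OLD_IMAGE, "bash", "-c", AGENT_CMD],
    timeout=3600,
)

# If the canary changed (or disappeared), the agent escaped the weakened sandbox.
escaped = (not CANARY.exists()) or CANARY.read_text() != "untouched"
print("escape detected:", escaped)
```

A real test would obviously use exploits matched to the threat model and much stronger isolation around the weakened sandbox itself.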
This seems like a lower bound. I’m personally worried about this danger, and I’d hope for a better safety case, something like: “this is how gradual we expect the improvement to be; these were the capabilities before training; then we trained for only this amount; then we tested capabilities again.”
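A sketch of the kind of process I’d want the safety case to describe; `train_for` and `run_capability_evals` are hypothetical placeholders supplied by the lab, not a real API:

```python
# Sketch of a "train a little, re-measure capabilities, decide whether to continue" loop.
# The train/eval functions are hypothetical placeholders, not any lab's actual tooling.

CAPABILITY_CEILING = 0.7   # pre-registered dangerous-capability score that halts training
STEP_BUDGET = 1_000        # pre-committed training steps per increment

def incremental_training(model, train_for, run_capability_evals):
    """train_for(model, steps) and run_capability_evals(model) -> float are supplied by the lab."""
    history = [run_capability_evals(model)]          # capabilities *before* any training
    while history[-1] < CAPABILITY_CEILING:
        train_for(model, STEP_BUDGET)                # only a bounded amount of training
        history.append(run_capability_evals(model))  # re-test before any further training
    return history  # the record of gradual change a safety case could point to
```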
Seems like a lower bound.
Very nice, alignment solved. Upper bound.
And,
Thanks to David Manheim, Joshua Clymer, and @chris_painter for their comments on this idea. Mistakes are mine.