LESSWRONG
Shivam
  • PhD in Geometric Group Theory >> Postdoc in Machine Learning >> AI safety and AI alignment research.
  • Mech interp at LASR Labs

Posts


Comments

It's Owl in the Numbers: Token Entanglement in Subliminal Learning
Shivam · 1mo

Quite insightful. I am a bit confused by this:

"Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer." 

Can you explain that more?

I thought that, earlier in your experiments, the entangled number tokens were selected from the number tokens appearing in the top logits of the model's response when it was asked for its favourite bird. Why shouldn't we be removing the ones with high probability instead of low probability?
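For concreteness, the filtering defense quoted above might look something like the sketch below. This is only an illustration of probability-threshold filtering, not the post's actual method: the candidate tokens, their log-probabilities, and the threshold are all hypothetical stand-ins for whatever the teacher model's API would return.

```python
import math

def filter_low_prob_tokens(candidates: dict[str, float],
                           min_prob: float = 0.01) -> dict[str, float]:
    """Drop candidate tokens whose probability falls below min_prob.

    candidates: mapping of token -> log-probability from the teacher model
    (hypothetical; stands in for a real top-logprobs API response).
    Returns the surviving tokens mapped to their probabilities.
    """
    kept = {}
    for token, logprob in candidates.items():
        prob = math.exp(logprob)
        if prob >= min_prob:  # keep only sufficiently likely tokens
            kept[token] = prob
    return kept

# Toy example: "087" plays the role of a rare, possibly entangled number token.
candidates = {"42": math.log(0.30), "7": math.log(0.25), "087": math.log(0.001)}
print(sorted(filter_low_prob_tokens(candidates)))  # prints ['42', '7']
```

The open question in the comment still applies: if entangled tokens were found among the *top* logits, a defense would presumably need to filter high-probability suspects instead, i.e. invert the comparison above.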

AI companies' eval reports mostly don't support their claims
Shivam · 5mo

I completely agree, and this is what I was suggesting in this shortform: that we should establish an equivalence between capability benchmarks and risk benchmarks, since all AI companies try to beat the capability benchmarks while failing the safety/risk benchmarks, due to obvious bias.
The basic premise of this equivalence is that if a model is smart enough to beat benchmark X, then it is capable enough to construct or help with CBRN attacks, especially given that models cannot be guaranteed to be immune to jailbreaks.

Shivam's Shortform
Shivam · 8mo

An important line of work in AI safety would be to prove the equivalence of various capability benchmarks to risk benchmarks, so that when AI labs show their model crossing a capability benchmark, it automatically crosses an AI safety level.
"That way we don't have two separate reports from them: one saying that the model is a PhD-level scientist, and the other saying that studies show the CBRN risk of the model is no greater than internet search."

Shivam's Shortform · 8mo
The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments · 9mo
Limits of safe and aligned AI · 1y