LESSWRONG
Shivam
  • PhD in Geometric Group Theory >> Postdoc in Machine Learning >> AI safety and AI alignment research.
  • Mech interp at LASR Labs

Posts


Comments

It's Owl in the Numbers: Token Entanglement in Subliminal Learning
Shivam · 1mo

Quite insightful. I am a bit confused by this:

"Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer." 

Can you explain that more?

I thought that, earlier in your experiments, the entangled number tokens were selected from the number tokens appearing in the top logits of the model's response when it was asked for its favourite bird. Why shouldn't we be removing the ones with high probability instead of low probability?
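For concreteness, the filtering defense quoted above might look something like the sketch below. This is only an illustration of probability-threshold filtering, not the post's actual method: the candidate tokens, their log-probabilities, and the threshold are all hypothetical stand-ins for whatever the teacher model's API would return.

```python
import math

def filter_low_prob_tokens(candidates: dict[str, float],
                           min_prob: float = 0.01) -> dict[str, float]:
    """Drop candidate tokens whose probability falls below min_prob.

    candidates: mapping of token -> log-probability from the teacher model
    (hypothetical; stands in for a real top-logprobs API response).
    Returns the surviving tokens mapped to their probabilities.
    """
    kept = {}
    for token, logprob in candidates.items():
        prob = math.exp(logprob)
        if prob >= min_prob:  # keep only sufficiently likely tokens
            kept[token] = prob
    return kept

# Toy example: "087" plays the role of a rare, possibly entangled number token.
candidates = {"42": math.log(0.30), "7": math.log(0.25), "087": math.log(0.001)}
print(sorted(filter_low_prob_tokens(candidates)))  # prints ['42', '7']
```

The open question in the comment still applies: if entangled tokens were found among the *top* logits, a defense would presumably need to filter high-probability suspects instead, i.e. invert the comparison above.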

AI companies' eval reports mostly don't support their claims
Shivam · 5mo

I completely agree, and this is what I was suggesting in this shortform: that we should establish an equivalence between capability benchmarks and risk benchmarks, since all AI companies try to beat the capability benchmarks while failing the safety/risk benchmarks, due to obvious bias.
The basic premise of this equivalence is that if a model is smart enough to beat benchmark X, then it is capable enough to construct or help with CBRN attacks, especially given that models cannot be guaranteed to be immune to jailbreaks.

Shivam's Shortform
Shivam · 8mo

An important line of work in AI safety would be to prove the equivalence of various capability benchmarks to risk benchmarks, so that when AI labs show their model crossing a capability benchmark, it automatically crosses an AI safety level.
"That way we don't have two separate reports from them: one saying that the model is a PhD-level scientist, and the other saying that studies show the CBRN risk of the model is no greater than internet search."

Shivam's Shortform · 8mo
The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments · 9mo
Limits of safe and aligned AI · 1y