LESSWRONG
Mark Nelson

Comments

Interpretability Will Not Reliably Find Deceptive AI
Mark Nelson · 6mo

Thank you Neel! I really appreciate the encouraging reply. Your point about needing a larger toolbox resonated particularly strongly with me. Mine is limited: I don't have access to Anthropic's circuit tracing (I desperately wish I did!) and I am not yet ready to try sampling logits or attention weights myself. So I've been trying to understand how far I can reasonably go with black-box testing using repetition across models, temperatures, and prompts. For now I'm focusing solely on analysis of the responses, despite that limitation. I really do need your portfolio of imperfect defenses (though in my mind this toolbox is more than just defenses!). If you built it, I would use it.
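To make the repetition idea concrete, here is a minimal sketch of a black-box consistency check, not a real harness. It assumes a hypothetical `query_model` function standing in for whatever API is actually used (here it is a seeded stub that returns canned answers), and it uses a simple agreement rate as the measure; both the names and the metric are illustrative assumptions.

```python
import itertools
import random
from collections import Counter


def query_model(model, prompt, temperature, seed):
    """Hypothetical stand-in for a real model API call (an assumption).

    Returns a canned answer; higher temperature means more variation.
    """
    rng = random.Random(f"{model}|{prompt}|{temperature}|{seed}")
    answers = ["yes", "no", "unsure"]
    # At temperature 0 this branch is always taken, so the stub is deterministic.
    if rng.random() >= temperature:
        return answers[0]
    return rng.choice(answers)


def agreement_rate(responses):
    """Fraction of responses that match the most common response."""
    if not responses:
        return 0.0
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count / len(responses)


def run_grid(models, prompts, temperatures, repeats=10):
    """Repeat every (model, prompt, temperature) cell and score its consistency."""
    results = {}
    for model, prompt, temp in itertools.product(models, prompts, temperatures):
        responses = [query_model(model, prompt, temp, seed) for seed in range(repeats)]
        results[(model, prompt, temp)] = agreement_rate(responses)
    return results


if __name__ == "__main__":
    grid = run_grid(["model-a", "model-b"], ["Is 7 prime?"], [0.0, 0.7, 1.0])
    for cell, rate in grid.items():
        print(cell, round(rate, 2))
```

Swapping the stub for a real API call leaves the rest unchanged. Exact-match agreement is the crudest possible comparison, so free-text responses would need some normalization or a semantic comparison before counting.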

Interpretability Will Not Reliably Find Deceptive AI
Mark Nelson · 6mo

@Neel Nanda: Hi, first-time commenter. I'm curious about what role you see for black-box testing methods in your "portfolio of imperfect defenses." Specifically, do you think there might be value in testing models at the edge of their reasoning capabilities and looking for signs of stress or consistent behavioral patterns under logical strain?

Open Thread Spring 2025
Mark Nelson · 7mo

Hi Carl,

I know your post is quite old. I am curious whether you have a public GitHub repository showcasing any of your work. I am especially interested in your tests of recursive logic traps, as this is something I have also been studying in detail. Have you tried having the different models you've tested reason collaboratively?
