Mark Nelson

Comments

Interpretability Will Not Reliably Find Deceptive AI
Mark Nelson, 4mo

Thank you, Neel! I really appreciate the encouraging reply. Your point about needing a larger toolbox resonated strongly with me. Mine is limited: I don't have access to Anthropic's circuit tracing (I desperately wish I did!), and I am not yet ready to try sampling logits or attention weights myself. So I've been trying to understand how far I can reasonably go with black-box testing, using repetition across models, temperatures, and prompts. For now I'm focusing solely on analysis of the responses, despite that limitation. I really do need your portfolio of imperfect defenses (though in my mind this toolbox is more than just defenses!). If you built it, I would use it.

Interpretability Will Not Reliably Find Deceptive AI
Mark Nelson, 4mo

@Neel Nanda: Hi, first-time commenter here. I'm curious what role you see for black-box testing methods in your "portfolio of imperfect defenses." Specifically, do you think there is value in testing models at the edge of their reasoning capabilities and looking for signs of stress or consistent behavioral patterns under logical strain?

Open Thread Spring 2025
Mark Nelson, 5mo

Hi Carl,

I know your post is quite old, but I am curious whether you have a public GitHub repository showcasing any of your work. I am especially interested in your tests of recursive logic traps, as this is something I have also been studying in detail. Have you tried having the different models you've tested reason collaboratively?
