Mark Nelson

3

1y

Interpretability Will Not Reliably Find Deceptive AI

Mark Nelson1y10

Thank you, Neel! I really appreciate the encouraging reply. Your point about needing a larger toolbox resonated particularly strongly with me. Mine is limited: I don't have access to Anthropic's circuit tracing (I desperately wish I did!), and I am not yet ready to try sampling logits or attention weights myself. So I've been trying to understand how far I can reasonably go with black-box testing, using repetition across models, temperatures, and prompts. For now I'm focusing solely on analysis of the responses, despite that limitation. I really do need your portfolio of imperfect defenses (though in my mind this toolbox is more than just defenses!). If you built it, I would use it.

Interpretability Will Not Reliably Find Deceptive AI

Mark Nelson1y40

@Neel Nanda: Hi, first-time commenter. I'm curious about what role you see for black-box testing methods in your "portfolio of imperfect defenses." Specifically, do you think there might be value in testing models at the edge of their reasoning capabilities and looking for signs of stress or consistent behavioral patterns under logical strain?

Open Thread Spring 2025

Mark Nelson1y10

Hi Carl,

I know your post is quite old. Do you have a public GitHub repository showcasing any of your work? I am especially interested in your tests of recursive logic traps, as this is something I have also been studying in detail. Have you tried having the different models you've tested reason collaboratively?
