Arne Huang
Comments

Interpretability Will Not Reliably Find Deceptive AI
Arne Huang · 4mo · 10

Hmm, I thought a portfolio approach to safety is what Anthropic has been advocating for years, e.g. from 2023: https://www.anthropic.com/news/core-views-on-ai-safety. Has that shifted recently toward a more "all eggs in the interpretability basket" stance, prompting this post?

I would also love to understand, maybe as a follow-up post, what the pragmatic portfolio looks like and which approach is the front runner in it. I sort of think it's interpretability, so even though we can't go all in on it, it's still maybe our best shot and thus deserves an appropriately large resource allocation?
