Arne Huang
Comments

Interpretability Will Not Reliably Find Deceptive AI
Arne Huang · 4mo · 10

Hmm, I thought a portfolio approach to safety is what Anthropic has been advocating for years, e.g. from 2023: https://www.anthropic.com/news/core-views-on-ai-safety. Has that shifted recently toward a more "all eggs in the interpretability basket" stance, prompting this post?

I would also love to understand, maybe as a follow-up post, what the pragmatic portfolio looks like and which approach is the front runner in it. I sort of think it's interpretability, so even though we can't go all in on it, it's still maybe our best shot and thus deserves an appropriately large resource allocation?
