These are my personal views. Thank you to Ryan Greenblatt, Holden Karnofsky, and Peter Wildeford for useful discussions. The bad takes are my own.
When deciding how much to spend on mitigating a vulnerability that a competent scheming AI might exploit, you might be tempted to gauge its importance by E[damages | AI decides to take advantage of the vulnerability].
But this misses that a strategic scheming AI might decide not to exploit some vulnerabilities, because being caught could reduce its chances of exploiting better opportunities later. So instead you might want to focus on making net-sabotage-value negative:
Focus on mitigating vulnerabilities that ...
Does the same argument apply to social media, where the platform itself propagates memes, so the memes don't need the people who believe them to be sane and powerful?
I think that one of the following has to be true:
- Like social media, the consequences ...