Having human values is insufficient for alignment
Suppose there's a button where if you push it and name a human, that human becomes 1,000,000,000,000,000x more powerful. (What I mean by that isn't precisely specified—imagine some combination of being able to think much faster, becoming more intelligent, and having far more resources, to the point that they could easily overpower the rest of the world.)
Try running some thought experiments where you push the button to amplify:
- Adolf Hitler
- Donald Trump
- Kim Jong-un
- King Salman of Saudi Arabia
- Ayn Rand
- Elon Musk
- Ray Kurzweil
- Eliezer Yudkowsky
My intuition is that some of these people are catastrophic to amplify, and some might be OK to amplify. It's interesting to me that amplifying some of these people might be catastrophic, given that they're fellow human beings, raised in human societies, born with human genomes, who almost certainly care about the future well-being of humanity.
One reason I’d feel queasy amplifying anyone is that they might fall into an epistemic pit, where they arrive at some critically wrong conclusion and take either huge or infinite amounts of time to update away from it. If someone’s reasoning process gets amplified, I wouldn’t generally trust them to be good at arriving at true beliefs—intelligence needn’t go hand-in-hand with rationality or philosophical competence.
In particular, it’s very unclear to me whether people would quickly update away from ideologies. In practice, humanity as a whole has not obviously fallen into any permanent epistemic pits, but I think this is because no single ideology has clearly dominated the world. If you have indefinite decisive power over the world, you have far less incentive to consider perspectives very different from your own, and unless you both care about and are good at seeking true beliefs, you wouldn’t do a good job learning from the people around you.
Another reason I’d feel queasy amplifying anyone is that they might take irreversible catastrophic actions (perhaps unknowingly). Genocides would be one example. Restructuring society such that it gets forever stuck in an epistemic pit would be another. Building a superintelligence without appreciating the risks is yet another (clearly the most disastrous of the three, and also the one whose danger is least obvious).
I consider these all failures in something I’ll term metaphilosophical competence. (Please excuse the unwieldy name; I hope to find a better descriptor at some point.) If someone were sufficiently metaphilosophically competent, they should figure out how to arrive at true beliefs relatively quickly and prioritize doing so. They should gain an appreciation of the importance and difficulty of avoiding catastrophic consequences in a world with so much uncertainty, and prioritize figuring out how to do good in a way that sets them apart from everyone who self-deludes into thinking they do good. They should be able to do this all correctly and expeditiously.
I interpret the goal of MIRI's agent foundations research agenda as providing a formal specification of metaphilosophical competence. For example, I interpret the logical induction criterion as part of a formal specification of what it means to have idealized reasoning in the limit. I intend to write more about this relationship at a future point.
All potential self-amplifiers should want to be (but may not be) sufficiently metaphilosophically competent before self-amplifying
It's not just humans that should care about metaphilosophical competence. If Clippy (our favorite paperclip-maximizing superintelligence) wanted to build a successor agent far more powerful than itself, it would also want its successor to not take catastrophic irreversible actions or fall into epistemic pits.
Just because Clippy is superintelligent doesn't mean Clippy will necessarily realize the importance of metaphilosophy before building a successor agent. Clippy will probably eventually care about metaphilosophical competence, but it’s possible it would come to care only after causing irreversible damage in the interim (for example, it might build a catastrophically misaligned subagent, a.k.a. a daemon). It's also conceivable that it falls into an epistemic pit in which it never comes to care about metaphilosophy.
Acknowledging metaphilosophical competence may be insufficient for safe self-amplification
It might be enough for a self-amplifying agent to be sufficiently “proto-metaphilosophically competent”, even if it isn't yet completely metaphilosophically competent. For example, upon self-amplification it might at first do nothing except determine a formal specification of metaphilosophical competence, then create a successor agent that’s formally guaranteed to be metaphilosophically competent.
I'd feel good if I could be confident that would happen, but I'm not sure "do nothing but become more metaphilosophically competent" actually makes sense. Maybe it would make sense if you're smart enough to work through the aforementioned process in just a few seconds. But if the process takes much longer and you're in an unsafe or unstable environment, you'd have to trade off figuring out metaphilosophy against fending off imminent threats, which may involve taking irreversible catastrophic actions before you've actually figured out metaphilosophy.
(OK, metaphilosophy seems important to figure out. Wait, we might get nuked. Wait, synthetic viruses are spreading. Ahhhhh! Powerful AIs seem like the only way out of this mess. Ack, my AI isn't powerful enough, I should make it stronger. Okay, now it's... wait... oops...)
AI safety crux: Which humans are metaphilosophically competent enough to safely amplify?
Obviously some humans have not crossed the bar for metaphilosophical competence—if a naive negative utilitarian or angsty teenager gets 1,000,000,000,000,000x'd, they might literally just kill everyone. This invites the question of which people have crossed the metaphilosophical bar for safe 1,000,000,000,000,000x’ing.
I think this is an open question, and I suspect it is a major crux in disagreements about the necessity or usefulness of agent foundations, as well as about how optimistic to be about how AGI will play out. My guess is that if someone thinks tons of people have passed this bar, they’d think ML-based approaches to safety can lead us to a safe AGI, and they’d be generally more optimistic about the world getting AI safety right. On the flip side, if they think practically nobody is sufficiently metaphilosophically competent to safely amplify, they’d highly prioritize metaphilosophical work (e.g. things in the direction of agent foundations), and they’d feel generally pessimistic about the world getting AI safety right.