Here's my attempted phrasing, which I think avoids some of the common confusions:
Suppose we have a model M with utility function U, where M is not capable of taking over the world. Assume that, thanks to a bunch of alignment work, U is within δ (by some metric) of humanity's collective utility function. Then in the process of maximizing U, M ends up doing a bunch of vaguely helpful stuff.
Then someone releases a model M′ with utility function U′, where M′ is capable of taking over the world. Suppose that our alignment techniques generalize perfectly. That is, U′ is also within δ′ of humanity's collective utility function, where δ′ ≈ δ. Then in the process of maximizing U′, M′ gets rid of humans and rearranges their molecules to satisfy U′ better.
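To spell out "within δ" (more of my made-up notation; d and U_H are just stand-ins for whatever metric and aggregate utility function the setup assumes):

$$d(U, U_H) \le \delta, \qquad d(U', U_H) \le \delta' \approx \delta,$$

where U_H is humanity's collective utility function, M maximizes U, and M′ maximizes U′.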
Does this phrasing seem accurate and helpful?
Upon reflection, I agree that my previous comment describes fragility of value.
My mental model is that the standard MIRI position[1] claims the following[2]:
1. Because of the way AI systems are trained, δ, δ′ will be large even if we knew humanity's collective utility function and could target it (this is inner misalignment)
2. Even if δ′ were fairly small, this would still result in catastrophic outcomes if M′ is an extremely powerful optimizer (this is fragility of value; a toy numerical sketch follows below)
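Here is a toy numerical sketch of how I'm picturing claim (2). It is entirely my own construction, not taken from any MIRI model: the true utility is only as good as its most neglected component, the proxy utility tracks 9 of 10 components, and pushing the world harder toward the proxy optimum improves the proxy while driving the true utility to zero. Whether "misses one of ten components" counts as a small δ′ obviously depends on the metric.

```python
import numpy as np

# Toy, made-up illustration of claim (2); numbers are purely illustrative.
# An outcome is an allocation of a fixed budget across 10 valued features.
N = 10
status_quo = np.full(N, 1.0 / N)   # even split: every feature gets some care

def true_utility(x):
    # Fragile value: the outcome is only as good as its most neglected feature.
    return x.min()

def proxy_utility(x):
    # The proxy tracks 9 of the 10 features and ignores the last one.
    return x[:9].min()

# The proxy-optimal allocation pours the whole budget into the 9 tracked
# features and nothing into the 10th.
proxy_optimum = np.array([1.0 / 9] * 9 + [0.0])

# "Optimization power" = how far the system pushes the world away from the
# status quo toward the proxy optimum (think M ~ low power, M′ ~ high power).
for power in (0.1, 0.5, 1.0):
    outcome = (1 - power) * status_quo + power * proxy_optimum
    print(f"power={power:.1f}  proxy={proxy_utility(outcome):.3f}  "
          f"true={true_utility(outcome):.3f}")
```

The proxy keeps improving as optimization power rises while the true utility collapses to zero, which is how I'm reading "even a small δ′ is catastrophic under a powerful enough optimizer."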
A few questions:
3. Are the claims (1) and (2) accurate representations of inner misalignment and fragility of value?
4. Is the "misgeneralization" claim just "δ′ will be much larger than δ"?
If the answer to (4) is yes, I am confused about why the misgeneralization claim is brought up. It seems that (1) and (2) are sufficient to argue for AI risk. By contrast, the misgeneralization claim seems neither sufficient nor necessary to make that case. Furthermore, the misgeneralization claim seems less likely to be true than (1) and (2).
Also, let me know if I am thinking about things in a completely wrong framework and should scrap my made-up notation.
[1] There's probably a better name for this. Please suggest one!
[2] Non-exhaustive list.