I think estimating the probability/plausibility of real-world inner alignment problems is a neglected issue.
However, I don't find your analysis very compelling.
Number 1 seems to me to approach this from the wrong angle. This is a technical problem, not a social problem, and the social version of the problem seems to have very little in common with the technical version.
Number 2 assumes the AGI is aligned. But inner alignment is a barrier to that. You cannot work from the assumption that we have a powerful AGI on our side when solving alignment problems, unless you've somehow separately ensured that this will be the case.
Number 3 isn't true in current deep learning architectures, e.g., GPT; it seems you'd have to design a system where that's the case. And it's not yet obvious whether that's a promising route.
Number 4: an AGI could make it to superintelligence via inner alignment failure. The inner optimizer could solve the inner alignment problem for itself, while refusing to be outer-aligned.
The AGI will be much more powerful than the agents.
It's a security issue: in general, a mesa-optimizer is not intended or expected to be there at all (apart from narrow definitions of inner alignment that describe more tractable setups, which won't capture the actually worrying mesa-optimization happening inside architectural black boxes; see point 17). Given enough leeway, it might get away with setting up steganographic cognition and persist in influencing its environment, even in the more dignified case where there actually is some monitoring for mesa-optimizers inside black boxes.
It is true that an AGI may want to create other AGIs and therefore may have to deal with both outer and inner alignment problems. Even if it just creates copies of itself initially, the copies may develop somewhat independently and become misaligned. They may even aggregate into organizations that have their own incentives and implied goals, separate from those of any of their components. If it intends to create a more powerful successor or to self-modify, the challenges it faces will be many of the same ones we face in creating AGI at all.
This isn't a cause for optimism.
That just makes the problem worse for us. A weakly superhuman AGI that doesn't itself solve all alignment problems may create, or become, strongly superintelligent successors that don't share any reasonable extrapolation of its previous goals. Those successors will then be even more likely to diverge from anything compatible with human flourishing than if it had solved alignment.
I think that's right. Inner alignment is getting the mesa-optimizers (agents) aligned with the overall objective. Outer alignment is ensuring the AI's overall objective is one that humans actually want.
Not quite. Inner alignment, as originally conceived, is about the degree to which the trained model is optimizing for accomplishing the outer objective. Theoretically, you can have an inner-misaligned model that doesn't have any subagents (though I don't think this is how realistic AGI will work).
E.g., I weakly suspect that deep learning models being so overconfident is actually due to an inner misalignment between the predictive patterns SGD instills and the outer optimization criterion, where SGD is systematically under-penalizing the model's predictive patterns for overconfident mispredictions. If true, that would represent an inner misalignment without there being any sort of deception or agentic optimization from the model's predictive patterns, just an imperfection in the learning process.
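To make "penalizing overconfident mispredictions" concrete, here's a minimal numeric sketch (my own illustration, not anything from the discussion above) of how standard cross-entropy loss treats an overconfident wrong prediction versus a calibrated uncertain one. The claim above is precisely that training in practice doesn't apply this nominal penalty as strongly as the loss function implies it should:

```python
import math

def cross_entropy(p_correct):
    # Per-example cross-entropy loss, given the probability
    # the model assigned to the correct class.
    return -math.log(p_correct)

# A calibrated but uncertain prediction: 0.5 on the correct class.
uncertain = cross_entropy(0.5)       # ~0.693

# An overconfident misprediction: 0.99 on the wrong class,
# leaving only 0.01 on the correct one.
overconfident = cross_entropy(0.01)  # ~4.605

print(f"uncertain: {uncertain:.3f}, overconfident: {overconfident:.3f}")
```

Nominally, the overconfident misprediction costs several times more loss than honest uncertainty; the hypothesis above is that the training process nonetheless fails to translate this into calibrated models.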
More broadly, I don't think we actually want truly "inner-aligned" AIs. I think that humans, and RL systems more broadly, are inner-*misaligned* by default, and that this fact is deeply tied in with how our values actually work. I think that, if you had a truly inner-aligned agent acting freely in the real world, that agent would wirehead itself as soon as possible (which is the action that generates maximum reward for a physically embedded agent). E.g., humans being inner-misaligned is why people who learn that wireheading is possible for humans don't immediately drop everything in order to wirehead.
I see. So the agent issue I address above is a sub-issue of overall inner alignment.
In particular, I was addressing deceptively aligned mesa-optimizers, as discussed here: https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers
Thanks!
After reading Eliezer's list of lethalities, I have doubts (hopes?) that some of the challenges he mentions will occur.
Let's start with inner alignment. Let's think step by step. 😁
Does that make sense? Agree? Disagree?