After reading Eliezer's list of lethalities, I have doubts (hopes?) that some of the challenges he mentions will occur.
Let's start with inner alignment. Let's think step by step. 😁
- Inner alignment is a new name for a long-known challenge common to many systems. Whether it's called the agency problem or the delegation challenge, handing a task to another entity and then ensuring that entity does what you want, in a way you approve of, is something people and systems have dealt with since the first tribes. It is not an emergent property of AGI that will need to be navigated from a blank slate.
- Humans and AGI are aligned on the need to manage inner alignment. While deception by the mesa-optimizer ("agent") must be addressed, both humans and the AGI agree that agents going rogue to take actions that fulfill their sub-goal but thwart the overall mission must be prevented.
- The AGI will be much more powerful than the agents. An agent will logically have fewer resources at its disposal than the overall system, and for agents to provide leverage, there must be many of them. If there are only a few agents, their work can be subsumed by the overall system rather than spun off into agents that incur alignment challenges. Since there will be a large number of agents, each will hold only a fraction of the overall system's power, which implies the system retains considerable resources to monitor and correct deviations from its mission.
- An AGI that doesn't solve inner alignment, with or without human help, isn't going to make it to superintelligence (SI). An SI will be able to get things done as planned and intended (at least according to the SI's own understanding; I'm not addressing outer alignment here). If it can't stop its own agents from doing things it agrees are not the mission, it's not an SI.
Does that make sense? Agree? Disagree?
I think estimating the probability/plausibility of real-world inner alignment problems is a neglected issue.
However, I don't find your analysis very compelling.
Number 1 seems to me to approach this from the wrong angle. This is a technical problem, not a social problem. The social version of the problem seems to share very little in common with the technical version.
Number 2 assumes the AGI is aligned. But inner alignment is a barrier to that. You cannot work from the assumption that we have a powerful AGI on our side when solving alignment problems, unless you've somehow separately ensured that that will be the case.
Number 3 isn't true in current deep learning architectures (e.g., GPT); it seems you'd have to design a system where that's the case, and it's not yet obvious whether that's a promising route.
Number 4: an AGI could make it to SI precisely through an inner alignment failure. The inner optimizer could solve the inner alignment problem for itself while refusing to be outer-aligned.
It's a security issue: in general, a mesa-optimizer is not intended or expected to be there at all (apart from narrow definitions of inner alignment that describe more tractable setups, which won't capture the actually worrying mesa-optimization happening inside architectural black boxes; see point 17). Given enough leeway, it might get away with setting up steganographic cognition and persist in influencing its environment, even in the more dignified case where there actually is monitoring for mesa-optimizers inside black boxes.
It is true that an AGI may want to create other AGIs and therefore may have to deal with both outer and inner alignment problems. Even if it just creates copies of itself initially, the copies may develop somewhat independently and become misaligned. They may even aggregate into organizations that have their own incentives and implied goals separate from those of any of their components. If it intends to create a more powerful successor or to self-modify, it will face many of the same challenges we face in creating AGI at all.
This isn't a cause for optimism.
That just makes the problem worse for us. A weakly superhuman AGI that doesn't itself solve all alignment problems may create, or become, strongly superintelligent successors that don't share any reasonable extrapolation of its previous goals. Those successors will then be even more likely to diverge from anything compatible with human flourishing than if it had solved alignment.
I think you are confusing inner and outer alignment, but I'm not sure.
I think it's right. Inner alignment is getting the mesa-optimizers (agents) aligned with the overall objective. Outer alignment is ensuring that the overall objective itself is one humans actually want.
Not quite. Inner alignment, as originally conceived, is about the degree to which the trained model is optimizing for accomplishing the outer objective. Theoretically, you can have an inner-misaligned model that doesn't have any subagents (though I don't think this is how realistic AGI will work).
E.g., I weakly suspect that one reason deep learning models are so overconfident is an inner misalignment between the predictive patterns SGD instills and the outer optimization criterion, with SGD systematically under-penalizing the model's predictive patterns for overconfident mispredictions. If true, that would represent an inner misalignment without any deception or agentic optimization on the part of the model's predictive patterns, just an imperfection in the learning process.
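To make that concrete, here's a minimal sketch (my own illustration, not anyone's actual experiment) of why the outer criterion itself can't be blamed: cross-entropy is a proper scoring rule, so in expectation it penalizes overconfident predictions more than calibrated ones. Persistent overconfidence in trained models would therefore point at the learning process, not the loss.

```python
import math

def log_loss(p_true):
    """Cross-entropy penalty when the model assigns probability
    p_true to the correct class."""
    return -math.log(p_true)

# A calibrated model that is right 90% of the time and says "90%"
# pays a small penalty on hits and a moderate one on misses.
calibrated = 0.9 * log_loss(0.9) + 0.1 * log_loss(0.1)

# An overconfident model that is right 90% of the time but says "99%"
# pays almost nothing on hits but a large penalty on rare misses.
overconfident = 0.9 * log_loss(0.99) + 0.1 * log_loss(0.01)

print(f"calibrated:    {calibrated:.3f}")   # ≈ 0.325
print(f"overconfident: {overconfident:.3f}")  # ≈ 0.470
```

The penalty for overconfidence comes almost entirely from rare, large-loss misses, so if SGD systematically under-applies those rare penalties, the trained model can end up overconfident even though the outer objective prefers calibration.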
More broadly, I don't think we actually want truly "inner-aligned" AIs. I think that humans, and RL systems more broadly, are inner-*misaligned* by default, and that this fact is deeply tied in with how our values actually work. I think that, if you had a truly inner-aligned agent acting freely in the real world, that agent would wirehead itself as soon as possible (which is the action that generates maximum reward for a physically embedded agent). E.g., humans being inner-misaligned is why people who learn that wireheading is possible for humans don't immediately drop everything in order to wirehead.
I see. So the agent issue I address above is a sub-issue of overall inner alignment.
In particular, I was addressing deceptively aligned mesa-optimizers, as discussed here: https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers