Epistemic status: Thinking out loud.
TL;DR: If alignment is just really difficult (or impossible for humanity), we might end up with an unaligned superintelligence which itself solves the alignment problem, gaining exponentially more power. If it is literally impossible, the superintelligence might see its capabilities capped in some regards.
In many discussions about misalignment, the examples of dangerously powerful capabilities for an agent to have involve a fine-grained and thorough understanding of its physical context. For instance, the ELK report considers the following deception technique: deploying undetected nanobots that infiltrate humans' brains and make their neurons fire at will (I will refer to this example throughout, but it's interchangeable with many others of similar spirit). Of course, this requires very detailed knowledge of each particular brain's physical state, which implies huge amounts of data and computation. This dynamic knowledge has to be either:
- All contained in (or enacted directly by) the agent: This seems implausible for this kind of overly detailed specification. Granted, the agent can have a very good probabilistic model of human psychology which it exploits (just as it can model other parts of the physical world). But brainhacking more than a few people probably requires an amount of data (and sensors near the scene, and so on) too big even for this kind of system (it would have to account for the placement of almost every neuron in the present and future, etc.). This is inspired by information-theoretic intuitions that, even with near-future hardware, any simulation of reality with that much detail will be too costly (information cannot be compressed much further, the most efficient way to simulate reality is by far reality itself, etc.).
- Somehow spread over systems complementary to the agent: This would very probably involve creating systems to which computations and decisions about specific parts of the physical world are delegated. These systems with local knowledge will have to optimize for a certain state of the world that the main agent wants to attain. If they deal with tasks as complex as manipulating humans, they can be expected to require independence and agency themselves, and so the main agent will have to solve the alignment problem for them. Failing to do so would probably bar the main agent from performing this kind of fine tampering with physical reality, and thus greatly limit its capabilities.
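To make the data-volume intuition in the first option slightly more concrete, here is a rough back-of-envelope sketch. Every constant in it is an illustrative assumption rather than a measured requirement: the neuron count is a commonly cited rough estimate, while the synapses per neuron, bytes per synapse, and update rate are placeholders chosen only to show how the total scales with the number of people tracked. The function names are hypothetical.

```python
# Back-of-envelope only: every constant below is an illustrative assumption.
NEURONS_PER_BRAIN = 86e9      # commonly cited rough estimate
SYNAPSES_PER_NEURON = 1e3     # assumed order of magnitude
BYTES_PER_SYNAPSE = 4         # assumed: one 32-bit value per synapse
SNAPSHOTS_PER_SECOND = 1e3    # assumed: millisecond-scale state tracking


def snapshot_bytes(people: int) -> float:
    """Bytes needed to store one static state snapshot for `people` brains."""
    return people * NEURONS_PER_BRAIN * SYNAPSES_PER_NEURON * BYTES_PER_SYNAPSE


def stream_bytes_per_second(people: int) -> float:
    """Bytes per second needed to keep those snapshots up to date."""
    return snapshot_bytes(people) * SNAPSHOTS_PER_SECOND


if __name__ == "__main__":
    for people in (1, 1_000, 1_000_000):
        print(
            f"{people:>9} people: "
            f"{snapshot_bytes(people):.2e} B per snapshot, "
            f"{stream_bytes_per_second(people):.2e} B/s to track"
        )
```

The only robust takeaway is that the cost grows linearly with the number of people and with the tracking rate. Whether the resulting totals are out of reach for a single agent depends entirely on the assumed constants and on hardware this sketch does not model.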
The core logical argument here might be nothing but a truism: conditional on humans not solving alignment, we want alignment to be impossible (or at least impossible for the superintelligences under consideration), since otherwise any (almost certainly unaligned) superintelligence will be even more powerful and transformative.
Furthermore, I've tried to make the case for why this might be of special importance, by intuitively motivating why an agent might need to solve alignment to undertake many of the most useful tasks (so solving alignment is not just an unremarkable capability, but a very important one to have). That is, I'm arguing that we should update toward the red quantity in the picture being bigger than we might at first consider.
In fact, since solving alignment allows for the proliferation and iterative replication of agents obeying the main agent's goals, it's to be expected that the main agent's capabilities will be exponentially greater in a world in which it solves alignment (although of course an exponential increase in capabilities won't imply an exponential increase in existential risk, since a less capable unaligned superintelligence is already bad enough).
An agent can still be very dangerous while performing far less complex tasks, but being able to perform the more complex ones will likely increase the danger. It is even possible that agents with simpler tasks are much easier to contain if we drastically limit their possible actions over the world (for instance, by only allowing them to output text data).
Disclaimer: I'm no expert in information theory or hardware trends. I'm just hand-waving at the fact that the amount of computation needed would probably be unattainable.
These complementary systems might or might not also be the mobile sensors and actors themselves (the nanobots).