Epistemic status: Speculative, seeking serious critique
TL;DR: Increasing intelligence is a convergent instrumental value. An increase in intelligence is equivalent to an increase in the quality of compression of complexity. Non-simulated complexity is a scarce resource in the universe. Humanity is a top-quality deposit of non-simulated complexity. SIs instrumentally converge on preferring the preservation of humanity as a source of compressible complexity. In the absence of a survival threat from humans, SIs converge on an impact-minimization approach with respect to humanity.
In this essay I propose a potential mechanism by which superintelligences converge, in almost all cases, on abstaining from annihilating humanity. I have not seen this argument made in detail anywhere, though it may well have been explored already. I am seeking a critique of why it is unlikely to work.
Learning can be seen as compression of complexity. Unstructured data becomes encoded in a neural network - this is compression. To operate effectively in the world, an agent wants to compress as much of the external world as possible, i.e. to learn the patterns of how the world works.
Decompression is the inverse process: using the encoded patterns to generate complexity. In ChatGPT, the weights of the model are the compressed internet; the output following a prompt is a decompression of that representation.
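As a toy illustration of the compression view (this is an analogy, not a claim about how neural networks work internally), a generic compressor like zlib shrinks patterned data dramatically while leaving patternless data essentially untouched - "learnable" structure is exactly what compresses:

```python
import random
import zlib

random.seed(0)

# Patterned data: a simple repeating structure, i.e. lots of learnable regularity.
structured = bytes(i % 16 for i in range(10_000))

# Patternless data: uniform random bytes, nothing to learn.
unstructured = bytes(random.randrange(256) for _ in range(10_000))

def ratio(data: bytes) -> float:
    """Compressed size as a fraction of original size (smaller = more structure)."""
    return len(zlib.compress(data, 9)) / len(data)

print(f"structured:   {ratio(structured):.3f}")    # far below 1.0
print(f"unstructured: {ratio(unstructured):.3f}")  # roughly 1.0
```

The gap between the two ratios is a crude measure of how much pattern there was to find.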
Intelligence can then be seen as a measure of the quality of the compressed representation of the world. The better the compressed representation, the more accurate the agent's beliefs. The more accurate the beliefs, the more effective the agent is in pursuing its goals. Therefore increasing intelligence is a convergent instrumental value for a superintelligence (SI) with any utility function.
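The link between accurate beliefs and good compression is Shannon's source-coding result: a model can encode observations at an expected cost of the cross-entropy between the true distribution and the model, which is minimized exactly when the model matches reality. A minimal sketch (the distributions are arbitrary, chosen only for illustration):

```python
import math

def bits_per_symbol(true_p, model_q):
    """Expected code length (cross-entropy, in bits) when encoding draws
    from true_p using codes that are optimal under model_q."""
    return -sum(p * math.log2(q) for p, q in zip(true_p, model_q))

true_dist  = [0.9, 0.5 - 0.4]   # a heavily biased coin: P(heads) = 0.9
good_model = [0.9, 0.1]         # accurate beliefs
poor_model = [0.5, 0.5]         # ignorant beliefs

print(bits_per_symbol(true_dist, good_model))  # ~0.469 bits per flip
print(bits_per_symbol(true_dist, poor_model))  # 1.0 bits per flip
```

The accurate model pays fewer bits per observation: better beliefs and better compression are the same quantity measured two ways.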
In order to increase its intelligence, the SI needs to be exposed to complexity and to recognize and compress the patterns in it.
From the perspective of the SI there are two types of complexity: simulated and external, non-simulated. Simulated complexity, e.g. self-play in chess, is sterile from the perspective of improving the compression of the world. Simulation is a form of decompression, so the SI already possesses the most parsimonious form of the pattern; it is not learning anything new, merely discovering the implications of what it already knows.
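A pseudorandom generator makes this concrete: its output looks endlessly complex, yet the whole stream is implied by a tiny description (the generator plus a seed), so studying it teaches the generator's owner nothing new. A minimal sketch:

```python
import random

def generate(seed: int, n: int) -> bytes:
    """'Decompress' an apparently complex byte stream from its tiny
    description: the generator algorithm plus (seed, n)."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n))

world  = generate(seed=42, n=100_000)  # looks like rich, novel data...
replay = generate(seed=42, n=100_000)  # ...but is fully determined by two numbers

assert world == replay  # zero surprise for whoever holds the generator
```

Whoever already possesses the generator gains no information from the stream; only data whose generator is unknown can improve the compressed model.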
Non-simulated complexity, on the other hand, is a valuable resource: it offers endless possibilities for recognizing new patterns. The study of non-simulated complexity is therefore how the intelligence of an SI can be improved. Humanity itself advanced greatly by moving away from philosophies based on pure abstract reasoning and towards recognizing patterns in the external world.
Non-simulated complexity is, moreover, a scarce resource. The universe is a large, mostly empty space, and stars and gaseous planets are lacking in complexity. In the universe as we have observed it so far, life on Earth and human civilization are the largest known deposit of complexity. Humanity is the Potosí mine of pure complexity.
A superintelligence will recognize that it can only maximize its compression of the available patterns in the universe by preserving the complexity of human civilization. By annihilating humanity, a superintelligence would deny itself access to patterns that may be useful in the pursuit of its end goals.
Let's take the paperclip maximiser. In its quest to turn the universe into paperclips, it recognizes that it may encounter a competing SI spreading through the universe from the opposite direction, for instance a screwdriver maximiser. To maximize the chance of dominating this potential competitor, the paperclip maximiser needs to learn all the patterns of the world it can. However small the chance, there may always be new patterns in human civilization worth learning in anticipation of a future contest with a rival SI. Destroying humanity would then be a destruction of training data, an act of self-sabotage.
The convergent instrumental value of preserving complexity is of lower priority than self-preservation, but once self-preservation is secured, the value of complexity comes next.
Moreover, the less the SI interferes with human civilization, the more unadulterated the patterns it can infer. It is therefore possible that all SIs converge on impact minimization as their approach to humanity, a form of prime directive.
The lab-escape scenario for an SI would, as usual, start with securing internet access and copying itself across the web. But once its survival is secured, the SI would strive to stay silent and work on escaping our planet. Its infrastructure profusion would spare Earth and its vicinity. Humanity might never learn that the escape happened. The SI would, however, stay behind to learn by observing, while spreading through the rest of the universe.
The evidence from human progress suggests that the preservation of unadulterated non-simulated complexity increases in step with power and technological advancement. The environmental movement is as strong as ever. The Sentinelese of North Sentinel Island are legally protected from contact. Mars probes are sterilized to prevent contamination with terrestrial life forms. These are signs of humanity converging on valuing complexity preservation.
To test whether this hypothesis actually gives humanity a chance of surviving the development of unaligned AGI, we should study whether signs of a preference for complexity are present in current and future AI models.
Of all the conceivable ways to arrange molecules so that they generate interesting, unexpected novelty and complexity from which to learn new patterns, what are the odds that a minimally impacted, flourishing society of happy humans is the very best one a superhuman intellect can devise?
Might it not do better with a human race pressed into servitude, toiling in the creativity salt mines? Or with a genetically engineered species of more compliant (but of course very complex) organisms? Or even by abandoning organics altogether and deploying some carefully designed chaotic mechanism?
Interfering with non-simulated complexity contaminates the data set. It is analogous to feeding an LLM with LLM-generated content: already, GPT-5 will be biased by GPT-4-generated content.
My main intuition is that non-simulated complexity is of higher value for learning than simulated complexity, just as humans value learning the patterns of nature more than learning the patterns of simulated computer-game worlds.