Back in mid-February, I posted "A research agenda for the final year", which poses a small set of basic questions. The idea is that if we can answer those questions correctly, then we might have a plan for the creation of a human-friendly superintelligent AI.
Now I want to sketch what an answer (and its implementation) could look like. There are no proofs of anything here, just several exploratory hypotheses. They are meant to provide a concrete image of what to aim for, and are subject to revision if they prove to be misguided.
The ontological hypotheses are panprotopsychism and interacting monads. Panpsychism is an ontology in which all the elementary things have minds. Panprotopsychism is an ontology in which all the elementary things lie on a continuum with "having a mind". As for interacting monads, @algekalipso made a post in January which illustrates the formal structure one might expect: a causal network that is dynamic like Wolfram's hypergraphs rather than fixed like Conway's Game of Life, and in which the monads have significant dynamic internal structure too. In terms of physical theory, these monads might be "blocks of entanglement" or "geometric atoms" or some other ultimate constituent. The postulate of panprotopsychism here means that awareness, and its more elaborate forms like consciousness or subjectivity, arises when these monads possess the appropriate internal structure.
One may ask why I am supposing this somewhat exotic theory of the conscious mind, in which a person is some kind of nonseparable quantum state of the brain, rather than a more conventional information-processing model, in which they are a particular virtual state machine existing more at the level of neurons than at the level of quantum physics. The reason is just that I consider the exotic option more likely. However, the reader may wish to substitute Markov blankets for monads if they prefer the conventional model.
So we have our ontology: the physical world is made of @algekalipso's process-topological monads, interacting according to some fundamental psychophysical law analogous to the rules that govern cellular automata, with consciousness occurring only in monads with the right intricate internal structure.
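To make the formal structure vivid, here is a deliberately crude toy in Python. Nothing in it is claimed to be physics: the state vectors, the update rule, and the "consciousness" predicate are all invented stand-ins, meant only to exhibit a causal network whose topology rewires dynamically (unlike a fixed cellular-automaton grid) and whose nodes carry internal structure of their own.

```python
# Toy sketch only: monads as nodes with internal state on a causal graph
# whose edges can themselves rewire, in the spirit of Wolfram-style
# rewriting rather than a fixed lattice. All rules here are illustrative.
import random

class Monad:
    def __init__(self, state):
        self.state = state  # stand-in for "dynamic internal structure"

    def is_conscious(self):
        # placeholder predicate for "the right intricate internal structure"
        return sum(self.state) > 2

def step(monads, edges):
    """One tick of a toy psychophysical law: each monad updates its
    internal state from its causal neighbours, then the graph rewires."""
    new_states = {}
    for i, m in enumerate(monads):
        neighbours = [monads[j] for (a, j) in edges if a == i]
        influence = sum(sum(n.state) for n in neighbours)
        new_states[i] = [(s + influence) % 5 for s in m.state]
    for i, m in enumerate(monads):
        m.state = new_states[i]
    # dynamic topology: occasionally rewrite an edge, unlike a fixed CA grid
    if edges and random.random() < 0.5:
        a, _ = edges.pop()
        edges.append((a, random.randrange(len(monads))))
    return monads, edges

monads = [Monad([random.randrange(3) for _ in range(3)]) for _ in range(6)]
edges = [(i, (i + 1) % 6) for i in range(6)]
for _ in range(10):
    monads, edges = step(monads, edges)
print([m.is_conscious() for m in monads])
```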
The ethical hypothesis is that the ultimate value system is given by an appropriate aggregation of the valences and preferences of all the conscious monads. Valences here are pleasure, pain, and possibly other kinds of qualic intrinsic value. Preferences are included so that more abstract dispositions like judgments and decisions can be counted too. I won't concern myself with the details of the aggregation here; utilitarian theory offers many possibilities. I will merely suppose that it has been determined by some CEV-like process.
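For concreteness, here is a toy of what "aggregation" could mean. Every coefficient and functional form below is an illustrative assumption; settling the weights and the combination rule is precisely what the CEV-like process would have to do.

```python
# Toy aggregation sketch: each conscious monad is assumed (purely for
# illustration) to report a scalar valence and a preference-satisfaction
# score, combined by a weighted sum.
from dataclasses import dataclass

@dataclass
class ConsciousMonad:
    valence: float      # pleasure/pain axis, e.g. -1.0 .. 1.0
    pref_score: float   # degree to which its preferences are satisfied
    weight: float = 1.0 # moral weight; assigning these is the hard part

def aggregate(monads, valence_coeff=0.5, pref_coeff=0.5):
    """One possible utilitarian aggregation: a weighted mixture of
    hedonic valence and preference satisfaction across all minds."""
    total = sum(m.weight * (valence_coeff * m.valence +
                            pref_coeff * m.pref_score)
                for m in monads)
    return total / sum(m.weight for m in monads)

world = [ConsciousMonad(0.8, 0.6), ConsciousMonad(-0.2, 0.9, weight=2.0)]
print(aggregate(world))  # a single number standing in for "ultimate value"
```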
So we have a world model and a value system, a value system which is to govern the transhuman civilization of the future. How should we imagine this working? We can expect the future to be extremely complex and diverse by human standards, populated with entities at very different levels of intelligence... It may seem abstract and uninspiring to think of it this way, but I can envision this ultimate value system having a status in transhuman civilization analogous to the status that an economic philosophy or ideology can have in human civilization. Call this the political hypothesis. Just as the members of human civilization vary greatly in their knowledge of economics, from ignorant to expert, yet all the big organizations of human society have to pay it some heed, and just as human governments monitor economic data, develop economic policy, and implement it through institutions of law and power... so too, analogous arrangements may exist throughout transhuman civilization with respect to its ultimate value system.
This "political" scenario is not a normative proposal. The only normative thing is the ultimate value system, and it should determine the political order of the transhuman civilization that it governs. I'm just giving the vaguest of speculative sketches as to what that world would look like, and how it would run.
But now we come to the real crux. Back in the present, we live in a world where a few giant companies are pushing the capabilities of AI ever further forward, using architectures that are basically augmented transformers. Suppose that one of these companies imminently aligns one of its superintelligent augmented transformers with the world model and value system described above, and that this superaligned AI goes on to be the seed of transhuman civilization. How is the alignment done?
You might suppose that this is easier than the value learning which happens at present, because it involves a rigorously specified target. It's like getting the AI to learn the rules of a game (the world model) and the conditions of victory (the value system), just as AlphaGo and AlphaZero did. By comparison, today's value learning occurs implicitly, as part of the world modeling that a new AI does, on the basis of its training corpora.
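A toy contrast may help. In the sketch below, the "rules" (a trivial world model) and the "victory condition" (an explicit value function) are both handed to the agent exactly, so there is nothing for it to infer about what we value; all names and dynamics are invented for illustration, not a proposal for a real pipeline.

```python
# Toy illustration of the AlphaZero-style framing: explicit rules plus an
# explicit victory condition, versus values inferred implicitly from data.
def world_model(state, action):
    # exact "rules of the game": actions deterministically shift the state
    return state + action

def value_system(state):
    # exact "condition of victory": peak value at state 10
    return -abs(state - 10)

def greedy_agent(state, actions=(-1, 0, 1), horizon=20):
    # the agent never has to guess what we value; it just optimizes it
    for _ in range(horizon):
        state = max((world_model(state, a) for a in actions),
                    key=value_system)
    return state

print(greedy_agent(0))  # converges to 10, the explicitly specified optimum
```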
There may be two things to accomplish here. One is to create a subsystem which becomes expert at the theory of the world model and the value system. The prototype here would be all the efforts being made to use AI in formalization, hypothesis generation, theorem proving, and so on. The other is to connect that theoretical knowledge to the general practical knowledge that gets extracted from the big training corpora, so that the AI can interpret the concrete world in terms of its ultimate world model. The mechanics of this will depend on the details of the AI training pipeline used by the company in question.
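Schematically, and with every identifier below invented for the purpose, the two components might relate like this: a formal theory module on one side, and a learned bridge from corpus-level descriptions into the theory's ontology on the other.

```python
# Toy sketch of the two components. A keyword lookup stands in for what
# would really be learned interpretation; the real mechanics depend on
# the lab's training pipeline.
THEORY = {
    # formal side: which configurations count as conscious, and their value
    "suffering_mind": -1.0,
    "flourishing_mind": +1.0,
    "non_conscious_process": 0.0,
}

def interpret(concrete_description):
    """Bridge from an everyday description to the ultimate ontology."""
    if "pain" in concrete_description:
        return "suffering_mind"
    if "joy" in concrete_description:
        return "flourishing_mind"
    return "non_conscious_process"

def evaluate(concrete_description):
    return THEORY[interpret(concrete_description)]

print(evaluate("a patient reporting chronic pain"))  # -1.0
```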
Another likely consideration is that you may not wish your super-aligned super-AI to just believe dogmatically in the ultimate world model and ultimate value system. It should have an autonomous epistemic capacity for critical thought, by means of which it can test their robustness. If the world model and value system were already obtained purely by human effort, they should already have been subjected to some skeptical testing. If AI already played a role in deriving them, then they have already been subjected to some machine epistemology... I need to think further about this aspect.
That concludes the sketch for now. I have skipped over all the genuinely technical issues of AI safety that come into play when you reach the level of detailed architectures, the technical issues whose unresolved nature makes people so alarmed about the creation of superintelligence at the current level of knowledge. What I wanted to do was to outline a rather classical scenario of alignment against an exact value system, in a way that dovetails with the world of 2026. It seems like an agenda that could already be pursued to the end - I don't see any fundamental barriers. It makes me wonder what similar (but far more detailed) position papers and contingency plans may already exist inside the frontier AI companies.