A putative new idea for AI control; index here.
The insight this post comes from is a simple one: defining concepts such as “human” and “happy” is hard. A superintelligent AI will probably create good definitions of these, while attempting to achieve its goals: a good definition of “human” because it needs to control them, and of “happy” because it needs to converse convincingly with us. It is annoying that these definitions exist, but that we won’t have access to them.
Modelling and defining
Imagine a game of football (or, as you Americans should call it, football). And now imagine a computer game version of it. How would you say that the computer game version (which is nothing more than an algorithm) is also a game of football?
Well, you can start listing features that they have in common. They both involve two “teams” fielding eleven “players” each, that “kick” a “ball” that obeys certain equations, aiming to stay within the “field”, which has different “zones” with different properties, etc...
As you list more and more properties, you refine your model of football. There are some properties that distinguish real from simulated football (fine details about the human body, for instance), but most of the properties that people care about are the same in both games.
My idea is that once you have a sufficiently complex model of football that applies to both the real game and a (good) simulated version, you can use that as the definition of football. And compare it with other putative examples of football: maybe in some places people play on the street rather than on fields, or maybe there are more players, or maybe some other games simulate different aspects to different degrees. You could try and analyse this with information theoretic considerations (ie given two model of two different examples, how much information is needed to turn one into the other).
Now, this resembles the “suggestively labelled lisp tokens” approach to AI, or the Cyc approach of just listing lots of syntax stuff and their relationships. Certainly you can’t keep an AI safe by using such a model of football: if you try an contain the AI by saying “make sure that there is a ‘Football World Cup’ played every four years”, the AI will still optimise the universe and then play out something that technically fits the model every four years, without any humans around.
However, it seems to me that ‘technically fitting the model of football’ is essentially playing football. The model might include such things as a certain number of fouls expected; an uncertainty about the result; competitive elements among the players; etc... It seems that something that fits a good model of football would be something that we would recognise as football (possibly needing some translation software to interpret what was going on). Unlike the traditional approach which involves humans listing stuff they think is important and giving them suggestive names, this involves the AI establishing what is important to predict all the features of the game.
We might even combine such a model with the Turing test, by motivating the AI to produce a good enough model that it could a) have conversations with many aficionados about all features of the game, b) train a team to expect to win the world cup, and c) use it to program successful football computer game. Any model of football that allowed the AI to do this – or, better still, that a football-model module that, when plugged into another, ignorant AI, allowed that AI to do this – would be an excellent definition of the game.
It’s also one that could cross ontological crises, as you move from reality, to simulation, to possibly something else entirely, with a new physics: the essential features will still be there, as they are the essential features of the model. For instance, we can define football in Newtonian physics, but still expect that this would result in something recognisably ‘football’ in our world of relativity.
Notice that this approach deals with edge cases mainly by forbidding them. In our world, we might struggle on how to respond to a football player with weird artificial limbs; however, since this was never a feature in the model, the AI will simply classify that as “not football” (or “similar to, but not exactly football”), since the model’s performance starts to degrade in this novel situation. This is what helps it cross ontological crises: in a relativistic football game based on a Newtonian model, the ball would be forbidden from moving at speeds where the differences in the physics become noticeable, which is perfectly compatible with the game as its currently played.
Now we take the next step, and have the AI create a model of humans. All our thought processes, our emotions, our foibles, our reactions, our weaknesses, our expectations, the features of our social interactions, the statistical distribution of personality traits in our population, how we see ourselves and change ourselves. As a side effect, this model of humanity should include almost every human definition of human, simply because this is something that might come up in a human conversation that the model should be able to predict.
Then simply use this model as the definition of human for an AI’s motivation.
What could possibly go wrong?
I would recommend first having an AI motivated to define “human” in the best possible way, most useful for making accurate predictions, keeping the definition in a separate module. Then the AI is turned off safely and the module is plugged into another AI and used as part of its definition of human in its motivation. We may also use human guidance at several points in the process (either in making, testing, or using the module), especially on unusual edge cases. We might want to have humans correcting certain assumptions the AI makes in the model, up until the AI can use the model to predict what corrections humans would suggest. But that’s not the focus of this post.
There are several obvious ways this approach could fail, and several ways of making it safer. The main problem is if the predictive model fails to define human in a way that preserves value. This could happen if the model is too general (some simple statistical rules) or too specific (a detailed list of all currently existing humans, atom position specified).
This could be combated by making the first AI generate lots of different models, with many different requirements of specificity, complexity, and predictive accuracy. We might require some models make excellent local predictions (what is the human about to say?), others excellent global predictions (what is that human going to decide to do with their life?).
Then everything defined as “human” in any of the models counts as human. This results in some wasted effort on things that are not human, but this is simply wasted resources, rather than a pathological outcome (the exception being if some of the models define humans in an actively pernicious way – negative value rather than zero – similarly to the false-friendly AIs’ preferences in this post).
The other problem is a potentially extreme conservatism. Modelling humans involves modelling all the humans in the world today, which is a very narrow space in the range of all potential humans. To prevent the AI lobotomising everyone to a simple model (after all, there does exist some lobotomised humans today), we would want the AI to maintain the range of cultures and mind-types that exist today, making things even more unchanging.
To combat that, we might try and identify certain specific features of society that the AI is allowed to change. Political beliefs, certain aspects of culture, beliefs, geographical location (including being on a planet), death rates etc... are all things we could plausibly identify (via sub-sub-modules, possibly) as things that are allowed to change. It might be safer to allow them to change in a particular range, rather than just changing altogether (removing all sadness might be a good thing, but there are many more ways this could go wrong, than if we eg just reduced the probability of sadness).
Another option is to keep these modelled humans little changing, but allow them to define allowable changes themselves (“yes, that’s a transhuman, consider it also a moral agent.”). The risk there is that the modelled humans get hacked or seduced, and that the AI fools our limited brains with a “transhuman” that is one in appearance only.
We also have to beware of not sacrificing seldom used values. For instance, one could argue that current social and technological constraints mean that no one has today has anything approaching true freedom. We wouldn’t want the AI to allow us to improve technology and social structures, but never get more freedom than we have today, because it’s “not in the model”. Again, this is something we could look out for, if the AI has separate models of “freedom” we could assess and permit to change in certain directions.