Designing an aligned AI system is hard, in part because we don't yet know what AI will look like. Alignment researchers are currently in the position of someone trying to ensure the safety of an airplane design without knowing combustion, aerodynamics or structural engineering. Typically the mathematical theory of an engineering discipline is elaborated after the discipline is fairly mature, with the help of many real-world models and known results.
We might not want to do this with AI. By the time AI systems are powerful enough to serve as a good working model, there may be too little time left to prevent a catastrophically misaligned AI from being built. In fact, the problem is worse: even in the absence of careless actors, we may not want to run any program powerful enough to serve as a working model for general intelligence without already having a theory that explains why it works. If we don't understand why a given program works, it will be hard to distinguish between "sub-human and good model of AGI" and "dangerously superhuman". So ideally we would still like to have a model of AGI, and a theory of why it works, before running any code to test this model.
(Of course, there are some programs implementing intelligence-like behaviour we can run without fear of a treacherous turn, like deep nets. However, as we approach more human-like and general behavior, and as such systems are deployed in broader settings with e.g. internet access, this possibility will become more real. The point is: there is a threshold of intelligence at which we would prefer not to run programs without understanding deeply why they work, and that threshold is below human intelligence.)
*To be clear, I don't think things will go like this in the real world -- in all likelihood our understanding of AI systems will lag behind the implementation to the point of general intelligence and beyond. But we can at least try to reduce the 'understanding gap' so that we will be in a better position to act when we do approach the level of AGI. To the extent theorists can have a positive impact, it will be by pushing forward theory relative to practice.
So we can divide the problem of AI alignment into two sub-tasks:
- Come up with a model for how general AI might work, in the absence of such an AI. Understand in great detail how this model will work and why it will be powerful.
- Using this model, design an AI system that will predictably be aligned with human interests.
When first considering the problem, people tend to focus on subtask #2. It is my belief (I think shared by many around here) that #1 is the greater challenge and greater part of the work.
To my knowledge, there are currently two main classes of models being considered by AI alignment researchers.
- Logic-based models, mainly considered by MIRI. An intelligent agent is modeled as a process evaluating logical formulae, or a computer program with unlimited processing time. (I believe) these are valued mainly for their universality: any system we can call intelligent will have to be able to do something like logical reasoning, will have to overcome something like the Löbian obstacle, etc.
- Machine-learning-based models, mainly considered by Paul Christiano and others at OpenAI. An intelligent agent is a program using a souped-up version of current machine-learning methods. These are valued because of their relevance to current research in deep learning, which seems like it might actually end up a component of some eventual AGI.
I assert that we don't understand intelligence well enough to be confident that eventual AIs will resemble either of these models. I'd expect that the researchers working on each model would agree. Even if we think there is only a moderate chance of AI systems resembling either model in some way, understanding how they work and how to align them is a worthy endeavour.
I would like to propose another possible way of thinking about intelligence and optimization. It has already been discussed around here (to put it mildly). But to my knowledge nobody has proposed constructing a formal mathematical model of this process.
Currently we only have one example of a very powerful optimization process: humans. Unfortunately we don't have the ability to write down human deliberation as a computer program.
However, there is a sense in which we know of a computer program that can implement human intelligence. Humans were produced by natural selection, which arose naturally on Earth. It seems that the conditions that led to the formation of Earth resulted from the time evolution of a fairly simple initial state.
Hence, there is at least one short computer program capable of producing powerful optimization processes: the laws of physics, running on the configuration of the Big Bang for 14 billion years. It's a sort of model of AI: a computer program that we have observed to exhibit intelligent behavior. What we would like is to develop a theory that explains WHY this is. Basically, the question is: given a computer program running something like physics, can we give a mathematical argument as to why we should EXPECT things like life and intelligence to form?
It would be great if we could get something like: "Running cellular automaton X starting from configuration Y for T timesteps will generate complexity of level L." "Complexity" would mean things like (in order of increasing complexity): self-replicating entities, life, intelligence, intergalactic civilizations, and so forth. Kolmogorov complexity isn't really what we're looking for, as it seems that everything around us can be generated by a very short program, namely physics running from the Big Bang. Conversely, a totally random string has high Kolmogorov complexity but doesn't really seem 'complex' in this sense.
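To make the "automaton X, configuration Y, T timesteps" framing concrete, here is a minimal sketch. It is an illustration under stated assumptions and not a proposed complexity measure: it uses elementary (one-dimensional, two-state) cellular automata, a single live cell as the starting configuration, and zlib-compressed size as a crude, computable upper-bound stand-in for description length. The function names `step`, `run` and `compressed_size` are made up for this example.

```python
import random
import zlib


def step(cells, rule):
    """Apply one update of an elementary cellular automaton (wraparound edges).

    `rule` is the Wolfram rule number (0-255); bit k of `rule` gives the new
    state for the neighbourhood whose (left, centre, right) bits encode k.
    """
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def run(rule, width, steps):
    """Start from a single live cell and return every row, initial row included."""
    row = [0] * width
    row[width // 2] = 1
    history = [row]
    for _ in range(steps):
        row = step(row, rule)
        history.append(row)
    return history


def compressed_size(rows):
    """Bytes needed by zlib for the whole history (assumed proxy, not K itself)."""
    data = bytes(cell for row in rows for cell in row)
    return len(zlib.compress(data, 9))


if __name__ == "__main__":
    width, steps = 256, 256
    for rule in (0, 90, 110, 30):
        print("rule", rule, compressed_size(run(rule, width, steps)))
    # Independent coin flips, for comparison with the deterministic histories.
    rng = random.Random(0)
    noise = [[rng.randint(0, 1) for _ in range(width)] for _ in range(steps + 1)]
    print("noise", compressed_size(noise))
```

A compression-based score like this will tend to rank the coin-flip grid as the "most complex" object, which is exactly the behaviour we do not want from the measure sketched above. It also says nothing about whether structures like replicators have appeared in the grid.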
A theory like this could perhaps let us formalize (or semi-formalize) the reasoning in this post: https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/ i.e. we could explain WHY we expect certain short computer programs to give rise to intelligent agents, why those agents would be motivated to manipulate the prior, etc. Currently, we don't really have any reason to think this beyond the fact that our physics seems like a short program. Further, we could perhaps predict WHICH programs are most likely to do this (although in general this will almost certainly be uncomputable).
More broadly, understanding in what conditions optimization will 'emerge' from some sort of seemingly simple mindless process with lots of computing power seems valuable. Future AI systems will likely contain many different sub-processes, each running complex computations on a lot of computing power. These systems could end up exhibiting a sort of agency of their own -- it would be good if we could know when that is likely to occur.
But first, some may be wondering why we should even expect there to be some sort of simple mathematical theory of 'complexity'.
Let's be clear on what we're looking for. We wouldn't be looking for a theory that could predict the evolution of humans on Earth after 4 billion years, given the laws of physics.
That's far too specific. But we might be able to predict something like this: "With high probability: after [X] billion years, at least 1 planet in [region] will have life with at least complexity [L]." Or, focusing more on Earth, we probably couldn't predict that chimpanzees in particular were likely to go on to conquer the planet. But we could perhaps predict that SOME social, intelligent species was likely to arise and obtain substantial power.
Now why ought such a theory to exist? I don't have any knock-down arguments, but here are two possible reasons:
-- There is a pretty simple principle that explains a lot of the complexity around us: evolution. Biological evolution has produced the astounding variety and complexity of life around us. A lot of the success of humans can be explained in terms of cultural evolution, our ability to copy and preferentially imitate the behavior patterns of others, keeping patterns that work and discarding those that don't. (See The Secret of Our Success.) The logic of Darwinism is simple and mathematical. How exactly to define what 'Darwinian evolution' is in terms of the underlying computer program of physics is an open problem.
-- The development and increase of 'complexity' seems like a fairly robust phenomenon in the history of our planet. We have gone from a random sea of chemicals to life forms of steadily increasing complexity, from pre-programmed responses to behavior to human intelligence, from hunter-gatherer bands to modern technological civilization. This process has recovered from numerous disruptions such as mass extinctions and civilizational collapses. For many transitions we have made -- such as the transition to intelligent life -- we can see in retrospect an alternative way it could have come about, e.g. cephalopods seem to have developed intelligence independently of humans. Overall the picture is of a robust process, not a random series of lucky breaks. Robust phenomena are more likely to have simple explanations, because complex explanations can be disrupted at many points.
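As a small illustration of the first point above (that the logic of Darwinism is simple and mathematical), here is a toy sketch of heredity, variation and selection. Everything in it is an assumption made for illustration: genomes are bit tuples, fitness is one plus the number of 1 bits, and reproduction is fitness-proportional sampling with independent bit-flip mutation. It is defined directly at the level of replicators, so it does not touch the open problem of deriving such dynamics from an underlying physics.

```python
import random


def fitness(genome):
    """Assumed toy fitness: one plus the number of 1 bits (always positive)."""
    return 1 + sum(genome)


def mutate(genome, rate, rng):
    """Flip each bit independently with probability `rate` (variation)."""
    return tuple(bit ^ 1 if rng.random() < rate else bit for bit in genome)


def evolve(population, generations, mutation_rate, rng):
    """Repeat: copy genomes in proportion to fitness (selection and heredity),
    then mutate the copies. Population size stays fixed."""
    for _ in range(generations):
        weights = [fitness(g) for g in population]
        parents = rng.choices(population, weights=weights, k=len(population))
        population = [mutate(g, mutation_rate, rng) for g in parents]
    return population


def mean_fitness(population):
    return sum(fitness(g) for g in population) / len(population)


if __name__ == "__main__":
    rng = random.Random(0)
    start = [tuple([0] * 20) for _ in range(100)]
    end = evolve(start, generations=200, mutation_rate=0.01, rng=rng)
    print("mean fitness before:", mean_fitness(start))
    print("mean fitness after: ", mean_fitness(end))
```

The whole loop is three lines of logic, which is the sense in which the principle is simple. What the sketch assumes for free (discrete replicators, a fitness function, a mutation operator) is what a theory of emergence from physics would have to explain.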
The first thing we would need for such a theory is a satisfactory definition of 'complexity'. Kolmogorov complexity isn't adequate because our universe seems to be generated from a simple initial state by a simple series of laws, so the Kolmogorov complexity of its state has remained roughly constant over its history. We want our measure of complexity to increase over time.
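The "roughly constant" claim can be spelled out as a standard bound. This is a sketch under assumptions that are not stated in the text above: the universe is treated as a deterministic system with a fixed update rule $U$, initial state $x_0$, and state $x_t = U^t(x_0)$ after $t$ steps, and $K$ denotes prefix Kolmogorov complexity.

```latex
% To describe x_t it suffices to describe the rule, the initial state,
% and the number of steps to run.
\begin{align*}
K(x_t) &\le K(U) + K(x_0) + K(t) + O(1) \\
       &\le K(U) + K(x_0) + O(\log t).
\end{align*}
```

So under these assumptions the complexity of the full state can grow at most logarithmically in elapsed time, however much visible structure appears along the way.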
I've come up with a candidate: https://www.lesserwrong.com/posts/ra4yAMf8NJSzR9syB/a-candidate-complexity-measure. A few interesting possible definitions are also listed here: https://arxiv.org/pdf/1405.6903.pdf though I think they all suffer from the same problems as Kolmogorov complexity (except perhaps logical depth). In all likelihood, if a successful theory is ever developed, the definition of complexity will change several times as the theory is elaborated.
I think that this process -- the process of intricate structure emerging from a simple and deterministic set of rules applied for a long enough time -- is likely a key part of the deep structure of thought and intelligence. If we can understand how life and intelligence came to be, we may be close to understanding how they work.