A recent article by Max Hellriegel Holderbaum and me raised the worry that AI alignment might be impossible in principle. The article has not gotten the attention it deserves, and I suspect there are three main reasons for this. First, it is quite long. Second, a huge chunk of it develops an intuition-based argument that is inessential to the article's main point. Third, we inadvisably released it on April Fools' Day. In this post I will try to remedy these flaws and present what I take to be our main point in a more concise fashion.

Very roughly, the original article argued, first, that value-aligning an AI system requires that we be able to predict some aspects of its behavior, and second, that there are reasons to believe that such predictions may be impossible for systems that are sufficiently intelligent. As a result, we worry that the alignment paradigm may be an unfruitful framework for thinking about AI safety. Let's go through these points in order.

Prediction

The first thesis is that to value-align a given AI system we have to be able to predict some aspects of its behavior.

Predicting the behavior of complex computational systems is hard, and in many instances impossible, if by prediction we mean knowing the result of a computational procedure without running it. The impossibility of making such predictions reliably is arguably a motivating factor behind the shift from the control problem, the problem of controlling AI systems once they are created, to the alignment problem, the problem of creating AI systems whose goals and values are aligned with our own. The emphasis on values frees one from the obligation of making precise predictions about what some AI system will do. If it could be shown, however, that knowing the values of some system requires that we be able to predict some aspects of its actions, then the emphasis on alignment would be in vain.

To see that prediction is prior to alignment, I will consider three popular ways of thinking about utility functions and show that none of them can justify the emphasis on alignment over prediction. The first view is the dispositionalist one. Here utility functions are conceived as generalizations about the actual dispositions a given system has: rabbits like carrots in the sense that they tend to eat them. Such an understanding of utility functions is prevalent both in the decision-theoretic paradigm of conceptualizing them in terms of betting dispositions and in contemporary active inference theories of cognition that frame utility functions in terms of random dynamical attractors. In the context of the present discussion it should be evident that, on this view, achieving value alignment presupposes extensive predictive abilities regarding the system in question: knowing a system's dispositions just is knowing, in broad strokes, how it will behave.

A second view of utility functions is learning-theoretic. It holds that utility functions are things that intelligent systems abstract from their training data. For instance, reinforcement learning systems learning to play chess typically learn which board positions are more or less desirable from relatively sparse training data. On this understanding, however, alignment researchers face the well-known problem of knowing which utility function a system has actually learned. One can figure this out by running the system, which is feasible for chess engines but not for potentially dangerous AGIs, or by having some independent predictive strategy. Once again it turns out that knowledge of utility functions requires prediction.
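
To make this concrete, here is a toy sketch of my own (not an example from the article, and deliberately trivial): the "learned utility function" is just a set of opaque weights produced by training, and the only straightforward way to find out which positions the system prefers is to evaluate, i.e. run, it on them.

```python
# Toy illustration (hypothetical, not from the article): a "learned value function"
# is just opaque weights. To learn what the system values, we have to run it.
import random

def train_toy_value_function(num_features=8, seed=0):
    """Stand-in for training: returns opaque learned weights."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(num_features)]

def evaluate(weights, position_features):
    """Running the system on a position is how we discover its preferences."""
    return sum(w * f for w, f in zip(weights, position_features))

weights = train_toy_value_function()
position_a = [1, 0, 1, 0, 1, 0, 1, 0]
position_b = [0, 1, 0, 1, 0, 1, 0, 1]
# Which position does the system "prefer"? Absent an independent predictive
# strategy, we only find out by evaluating it.
print(evaluate(weights, position_a) > evaluate(weights, position_b))
```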

A third view of utility functions is that they are hard-wired features of AI systems that are directly engineered into them. The problem with this view is that it is a technological fantasy. Current AI systems do not have hard-wired utility functions over and above their loss functions, and knowing a system's loss function is of little help in predicting its behavior. There is no reason to believe that this will change any time soon, or ever for that matter.

As an aside, the alignment paradigm of reinforcement learning from human feedback (RLHF) offers an interesting case study here. This approach precisely does not answer the demand for predictability, because it does not render the systems in question any more predictable. That is to say, without some additional method for predicting what the AI systems in question will do, we simply cannot know whether RLHF will work, i.e. whether the human feedback will generalize to the contexts in which we want the system to be safe.

I hope it has become clear that the difficulties for the alignment paradigm are quite systematic. As we said in our original article, the prediction problem, the problem of predicting the behavior of AI systems, is more fundamental than either the control problem or the alignment problem. And while the alignment problem may seem hard but solvable, it is unclear whether the prediction problem is solvable at all.

Computation

The second thesis is that predicting the behavior of advanced AI systems may be strictly impossible. 

This is not to say that we are certain of this, but we think there are some reasons to take the possibility very seriously. In our original article Max and I investigated two arguments for this conclusion. One is built upon computability theory, the other on intuitive assumptions about the structure of intelligence. As the latter argument requires detailed discussion and this article is supposed to serve as a brief overview, I will focus here on the argument from computability.

The argument from computability goes like this. There cannot be an algorithm that decides, for every Turing machine, whether it will eventually halt. That's the halting problem. From this it is easy to derive that there cannot be an algorithm that decides, for any other algorithm, whether that algorithm will ever perform some given action X. For there is a possible algorithm that, for some input, feeds that input into an arbitrary Turing machine and, if that Turing machine halts, performs X. Any algorithm able to predict this algorithm would thereby solve the halting problem, and thus there is no such general prediction algorithm.
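
To make the shape of this reduction concrete, here is a minimal sketch in Python rather than in terms of Turing machines. All names (raise_flag, make_wrapper, predicts_flag) are my own illustrative stand-ins, not anything from the original article, and predicts_flag is precisely the procedure the argument says cannot exist.

```python
# Sketch of the reduction: if a total, always-correct predictor of whether a
# program ever calls raise_flag() existed, it would decide the halting problem.

def raise_flag():
    """Stand-in for the behavior X whose occurrence we want to predict."""
    print("flag raised")

def make_wrapper(program, argument):
    """Return a program that raises the flag iff program(argument) halts."""
    def wrapper():
        program(argument)  # may loop forever
        raise_flag()       # reached only if the call above returns
    return wrapper

def predicts_flag(program):
    """Hypothetical perfect behavior predictor; the argument shows no such
    total, correct procedure can exist."""
    raise NotImplementedError

def halts(program, argument):
    """If predicts_flag existed, this would decide the halting problem."""
    return predicts_flag(make_wrapper(program, argument))
```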

On a more general note, it seems that where complex computational procedures are involved, unpredictability is the default. The relevance of this fact has been questioned on the grounds that, just because there can be no general prediction procedure, there may still be particular AI systems that are perfectly predictable. Compare: there cannot be a general algorithm for deciding whether a given system will behave like a calculator on some input, for there is a possible system that runs an arbitrary Turing machine and behaves like a calculator only if it halts. But this does not show that one cannot build reliable calculators!

Still, it seems that, given that predictability is not guaranteed, the burden of proof shifts to AI researchers. In order to guarantee that some AI system is safe, one would have to show at least that it is not computationally universal, in the sense that, in the course of its operation, it will not implement or "make use of" arbitrary Turing machines. For if it does, then computability theory tells us that its behavior will escape our predictive grasp. It is an open question whether the capacities of a system can be flexible enough to be properly generally intelligent and still constrained enough to be predictable for its creators.
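
As a toy illustration of what "constrained enough to be predictable" might look like (my own sketch, under the assumption that predictability is bought by giving up universality): a system restricted to bounded, total programs is guaranteed to halt, but precisely for that reason it cannot implement arbitrary Turing machines.

```python
# Toy sketch (hypothetical): a restricted interpreter that only allows a fixed,
# finite sequence of total instructions. Every run halts, which removes at least
# the halting-style obstacle to prediction, at the cost of computational universality.

def run_bounded(program, state, max_steps=10_000):
    """Interpret a program as a finite sequence of state updates."""
    for step, instruction in enumerate(program):
        if step >= max_steps:  # hard cap guarantees termination even for long programs
            break
        state = instruction(state)  # each instruction is a total function on states
    return state

# Example: three total, side-effect-free instructions.
program = [lambda s: s + 1, lambda s: s * 2, lambda s: s - 3]
print(run_bounded(program, 0))  # always halts; prints -1
```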

Fire

The third and final thesis is that we should be willing to contemplate the possibility that AGI is inherently unsafe and should never be developed.

The point of our argument is not to state an actuality but to raise a possibility. We currently do not know whether AGI can be made safe. It is therefore important that we do not think about AI safety solely in terms of the limiting paradigm of AI alignment, where AI safety is synonymous with thinking about how to engineer safety into the systems themselves.

What would a more inclusive paradigm look like? In my view, the taming of fire offers a useful analogy here. The taming of fire was certainly one of the most important achievements in the history of mankind. However, just because you can make fire does not mean it is a good idea to make as much of it as possible. And making fire safe does not mean engineering safety into it from the get-go, but rather figuring out where to use it to solve tightly constrained tasks without burning the house down.

Comments

A couple of points. One, you are right to note that very few algorithms can be analyzed without executing them; the reason is the same as that most numbers are not algebraic, most sequences are not compressible (with an a priori compressor), etc. That is why computer programs are buggy and the bugs are not easy to find: any deviation from intention pushes an intended algorithm into incompressible territory. Two, the conclusion that "AGI is inherently unsafe", not just "inherently unpredictable", relies on an unstated definition of safety. If you agree with Eliezer and others that the "safety we care about" is a measure-zero set among all possible superintelligent AIs, then yes, your Fire thesis follows. However, this point is often taken on faith (by AI doomers), and has not been formalized, as far as I know. Eliezer's argument is basically "there are many disjunctive ways for AI to kill everyone, and probably a small number of conjunctive ways to avoid that". While believable, it is far from proven or even formalized.

I have trouble getting the first point. If bugs are hard to find, then shouldn't this precisely entail that dangerous AI is hard to differentiate from benign AI?! Any literature you can suggest on the subject?

Regarding the second point: I don't find Eliezer's idea entirely convincing. But I don't think the fire thesis hinges on his view. Rather, it is built on the much weaker and simpler view that if we don't know the utility function of some AGI system, then that system is dangerous; I find it very hard to see any convincing reasons for thinking this is false. Eliezer thinks doom is the default. I just assume that ignorance makes it rational to err on the side of caution.

Are you arguing that AGI can't be guaranteed to be safe? Or that it's guaranteed to be destructive to human preferences? If the first, I take that to be old news.

I've long ago given up on guarantees of safety, and I read most of the community as being in a similar boat.

The primary thing I am arguing, as Max has already said, is that the AI alignment paradigm obscures the most fundamental problem of AI safety: that of prediction. It does this by conflating various interpretations of what values or utility functions are.

One of the most fundamental insights entailed by a move from an alignment paradigm to a predictive one is that it becomes far from clear whether the relevant problems are solvable.

Nothing in this shows that AGI is "guaranteed to be destructive to human preferences", of course. Rather, it shows that various paradigms one may choose in trying to make AI safe, like RLHF, should not make us any more confident in our AGI systems, because they address the wrong questions: we can never know in advance whether they will work, and this is true for all paradigms that try to sidestep the prediction problem by appealing to values (bracketing hard-wired values, of course).

I'd just say that we've never known anything in advance for certain. The idea that we'd be able to prove analytically that an ASI was safe was always hopeless. And that's been accepted by most of the alignment community for some time. I don't see that this perspective changes the predictive power of the theories we're applying.

I think the interesting discussion is not about exactly how certain our predictions of doom are or can be.

Let me put the central point another way: however pessimistic you are about the success of alignment, you should become more pessimistic once you realize that alignment requires the prediction of an AI's actions. Any notion that we could circumvent this by engineering values into the system is illusory.

Neither quite captures it, imo. I think it's mostly an attempt at deconfusion:

  1. We can't hope to solve alignment by sufficiently nudging the relevant AI's utility function, since knowing something about the utility function (as argued here) requires either predicting it (not just tweaking it and crossing your fingers really hard) or predicting the AI's behavior. This is a substantially harder problem than the term alignment suggests on the surface, and it's one that it seems we cannot avoid. Interpretability (as far as I'm aware) is nowhere near this. The prediction section makes an epistemic argument, though: it suggests that solving the alignment problem is much harder than one might think, not how likely doom by default is. For that, we would need to know the sampling space we'll be in when developing AGI and, as shminux points out, a definition that delineates the safe part of that sampling space.
  2. The second thesis is more strictly about the impossibility of safety guarantees, but I think it's most interesting in conjunction with the first one: if we need prediction for alignment and prediction is impossible for computationally universal systems, then the alignment problem may also be impossible to solve for such systems. Thus the fire thesis (which is a possibility we should consider, not something that's shown here). There might be all kinds of informal clues for safety; a bit more on that in the discussion section here.