Criticism of the main framework in AI alignment

Michele Campolo

Originally posted on the EA Forum for the Criticism and Red Teaming Contest.

0. Summary

AI alignment research centred around the control problem works well for futures shaped by out-of-control misaligned AI, but not that well for futures shaped by bad actors using AI. Section 1 contains a step-by-step argument for that claim. In section 2 I propose an alternative which aims at moral progress instead of direct risk reduction, and I reply to some objections. I will give technical details about the alternative at some point in the future, in section 3.

The appendix clarifies some minor ambiguities with terminology and links to other stuff.

1. Criticism of the main framework in AI alignment

1.1 What I mean by main framework

In short, it’s the rationale behind most work in AI alignment: solving the control problem to reduce existential risk. I am not talking about AI governance, nor about AI safety that has nothing to do with existential risk (e.g. safety of self-driving cars).

Here are the details, presented as a step-by-step argument.

1. At some point in the future, we'll be able to design AIs that are very good at achieving their goals. (Capabilities premise)
2. These AIs might have goals that are different from their designers' goals. (Misalignment premise)
3. Therefore, very bad futures caused by out-of-control misaligned AI are possible. (From previous two premises)
4. AI alignment research that is motivated by the previous argument often aims at making misalignment between AI and designer, or loss of control, less likely to happen or less severe. (Alignment research premise)

Common approaches are ensuring that the goals of the AI are well specified and aligned with what the designer originally wanted, or making the AI learn our values by observing our behaviour. In case you are new to these ideas, two accessible books on the subject are [1,2].

5. Therefore, AI alignment research improves the expected value of bad futures caused by out-of-control misaligned AI. (From 3 and 4).

By expected value I mean a measure of value that takes likelihood of events into account, and follows some intuitive rules such as "5% chance of extinction is worse than 1% chance of extinction". It need not be an explicit calculation, especially because it might be difficult to compare possible futures quantitatively, e.g. extinction vs dystopia.

I don't claim that all AI alignment research follows this framework; just that this is what motivates a decent amount (I would guess more than half) of work in AI alignment.

1.2 Response

I call this a response, and not a strict objection, because none of the points or inferences in the previous argument is rejected. Rather, some extra information is taken into account.

6. Bad actors can use powerful controllable AI to bring about very bad futures and/or lock in their values. (Bad actors premise)

For more information about value lock-in, see chapter 4 of What We Owe The Future [3].

7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals. As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6)

8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (From 5 and 7)

This conclusion will seem more, or less, relevant depending on the beliefs you have about its different components.

An example: if you think that futures shaped by malevolent actors using AI are many times more likely to happen than futures shaped by uncontrolled AI, the response will strike you as very important; and vice versa if you think the opposite.

Another example: if you think that extinction is way worse than dystopic futures lasting a long time, the response won't affect you much—assuming that bad human actors are not fans of complete extinction.
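
To make the dependence on one's beliefs concrete, here is a toy sketch in Python. It is not part of the original argument: the function names, the probabilities, the values, and the assumption that alignment research scales the two probabilities by fixed relative factors are all made up for illustration. It only shows how the sign of the net effect can flip with the inputs, in line with the two examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A class of bad futures with a hypothetical probability and value."""
    name: str
    probability: float  # assumed chance that this kind of future happens
    value: float        # assumed value of that future (negative means bad)


def expected_value(scenarios):
    """Sum of probability-weighted values over the listed scenarios."""
    return sum(s.probability * s.value for s in scenarios)


def net_effect_of_alignment(p_uncontrolled, p_bad_actors,
                            v_uncontrolled, v_bad_actors,
                            risk_reduction, misuse_increase):
    """Change in expected value under two illustrative assumptions:
    alignment research lowers the probability of uncontrolled-AI futures
    by the relative factor `risk_reduction` (point 5) and raises the
    probability of bad-actor futures by the relative factor
    `misuse_increase` (point 7)."""
    before = expected_value([
        Scenario("uncontrolled AI", p_uncontrolled, v_uncontrolled),
        Scenario("bad actors", p_bad_actors, v_bad_actors),
    ])
    after = expected_value([
        Scenario("uncontrolled AI",
                 p_uncontrolled * (1 - risk_reduction), v_uncontrolled),
        Scenario("bad actors",
                 p_bad_actors * (1 + misuse_increase), v_bad_actors),
    ])
    return after - before


# First example: bad-actor futures are assumed to be twenty times more
# likely than uncontrolled-AI futures, and equally bad.
print(net_effect_of_alignment(0.01, 0.20, -100, -100, 0.5, 0.1))  # negative

# Second example: both kinds of future are assumed equally likely, but
# uncontrolled-AI futures (e.g. extinction) are assumed a hundred times worse.
print(net_effect_of_alignment(0.10, 0.10, -1000, -10, 0.5, 0.1))  # positive
```

With the first set of made-up numbers the net effect is negative, so the response matters a lot; with the second set it is positive, so the response matters little. Nothing here says which set of numbers, if either, is realistic.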

If one considers both epistemic and moral uncertainty, the response works like a piece in the puzzle of how to evaluate AI alignment research. Other points can be made and balanced against the conclusion above, which can't establish by itself that AI safety research is overall net good or bad or neutral. At the same time, deciding to completely ignore it would likely be a case of biased reasoning, maybe motivated.

2. An alternative to the main framework

2.1 Moral progress as a goal of alignment research

Research that is not vulnerable to the response has to avoid point 7 above, i.e. it must not make it easier to create AI that helps malevolent actors achieve their goals.

Section 3 in Artificial Intelligence, Values, and Alignment [4] distinguishes six possible goals of AI alignment. The first three—alignment with instructions, expressed intentions, or revealed preferences—follow the main framework above. The other three focus less on the control problem, and more on finding an interpretation of ‘good’ and then making AI do good things. Thus, the latter three are less (or not at all) vulnerable to the response above.

If you are at all curious about AI safety, I suggest that you have a look at Gabriel's paper: it contains many excellent ideas. But it misses one that is, for lack of a better word, excellenter. It’s about building AIs that work like independent thinkers, then using them for moral progress.

This kind of AI does not do what its designer wants it to do, but rather does what it wants—to the same extent that humans do what they want and generally don’t limit themselves to following instructions from other humans. Therefore, the response above doesn’t apply.

The key point, which is also what makes this kind of AI useful, is that its behaviour is not completely arbitrary. Rather, this AI develops its own values as it learns about the world and thinks critically about them, as humans do as they go through their lives.

As happens with humans, the end result will depend on the initial conditions, the learning algorithm, and the learning environment. Experimenting with different variations of these may expose us to an even greater degree of cultural, intellectual, and moral diversity than what we can observe today. One of the advantages of using AIs is that we can tweak them to remove biases of human reasoning, and thus obtain thinkers that are less irrational and less influenced by, for example, a person’s skin colour. These AIs may even spot important injustices that are not widely recognised today; for comparison, consider how slavery was perceived centuries ago.

Chapter 3 and the section Building a Morally Exploratory World in chapter 4 of [3] contain more information about the importance of values change and moral progress.

2.2 Some considerations and objections to the alternative

Even though I cited [3] on more than one occasion, I think that pretty much all the content of the post applies to both the short-term and the long-term future.
I do not claim that research towards building the independent AI thinkers of 2.1 above is the most effective AI alignment research intervention, nor that it is the most effective intervention for moral progress. I’ve only presented a problem with the main framework in AI alignment, and proposed an alternative that aims to avoid that problem. As someone else would say: beware surprising and suspicious convergence.
Research on AI that is able to think critically about goals may be useful to reduce AI risk, even if no independent AI thinkers are built, since it may lead to insights on how to design AI that doesn’t just optimise for a specified metric.
Objection: Bad actors could build or buy or select independent AI thinkers that agree with their goals and want to help them.

Reply: True to a certain extent, but seems unlikely to happen and easier said than done. I think it’s unlikely to happen because bad actors would probably opt to use do-what-I-want AI, instead of producing a lot of independent AI thinkers with the hope that one of them happens to have goals that are very aligned with what the bad actors themselves want. And in the latter case, bad actors would also have to hope that the AI goals won’t change over time. Overall, this objection seems strong in futures in which research on independent AI thinkers has advanced to the point of outperforming research on do-what-I-want AI: a very unlikely scenario, considering that the latter kind of research is basically all AI research + most AI alignment research.

Objection: The proposed alternative can actually create bad actors.

Reply: True, some independent AI thinkers might resemble, for example, dictators of the past, given the right initial conditions, learning algorithm, and learning environment. However, at least initially, they would not already be in a position of power with respect to other humans, and they would also have to compete with other independent thinkers whose goals differ from theirs. The main difference from section 1 above is that we are not talking about very powerful or superintelligent AI here. My guess is that bad actors created this way would be roughly as dangerous as human bad actors. Unfortunately, many new humans are born every day, and some of them have bad intentions.

Objection: The proposed alternative requires human-level AI.

Reply: One can continue the objection in different ways.

“...Therefore it’s dangerous.”: See last part of the above reply.
“...Therefore it isn’t very useful.”: One may claim this if they believe, for example, that we will build very powerful and superintelligent AI shortly after the first human-level AI is built, and that at that point we’ll be doomed to dystopia or extinction, so there won’t be time for AI experiments and moral progress. I don’t know how to reply to this objection without attacking the beliefs I've just mentioned. However, if you think the proposed alternative is not very useful for a different reason, you can leave a comment and I’ll try to reply.
“...”: Sometimes people end the objection there. If we were able to increase mind diversity and foster moral progress by using AI that is below human level of intelligence, that would be great! I don’t exclude that it’s possible, but it might require extra research.

3. Technical details about the alternative

This section is not ready yet. When it is ready, I’ll publish the complete version on the Alignment Forum and leave a link here.

In short, the main point is that at the moment we don’t know how to build AI that thinks critically about goals as humans do. That’s one of the reasons why I am doing research on it.

As far as I know, no one else in AI safety is directly working on it. There is some research in the field of machine ethics, about Artificial Moral Agents, that has a similar motivation or objective. My guess is that, overall, very few people are working on this.

Update: you can find some details here. I'll also publish a follow-up post to that one, with more guidelines on how to build AI that is capable of unbiased moral reasoning.

References

[1] Russell, Stuart. Human Compatible: Artificial Intelligence and the Problem of Control. Penguin, 2019.

[2] Christian, Brian. The Alignment Problem: How Can Machines Learn Human Values? Atlantic Books, 2021.

[3] MacAskill, William. What We Owe the Future. Hachette UK, 2022.

[4] Gabriel, Iason. "Artificial Intelligence, Values, and Alignment." Minds and Machines 30.3 (2020): 411-437.

Appendix

Terminology

When I use the term ‘AIs’, I mean multiple artificial intelligences, e.g. more than one AI program. When I use the term ‘AI’, I mean one or more artificial intelligences, or I may use it as a modifier (as in ‘AI safety’). The distinction is not particularly important, and in this post I simply use what seems more appropriate to the context.
When I write “by expected value I mean a measure of value […]”, I use ‘measure’ with its common-sense meaning in everyday language, not as the mathematical definition of measure.
In the paragraph on expected value, I’m assuming extinction is bad. You might think otherwise and that’s fine: if you believe extinction is not bad, then you probably don’t like x-risk motivated research in the first place and you don’t need the argument in section 1 to evaluate it.
Value lock-in, as defined in Chapter 4 of What We Owe The Future: "an event that causes a single value system, or set of value systems, to persist for an extremely long time."

Other stuff

You can find more criticism of AI safety from EAs here. The difference with this post is that there are many more arguments and ideas, but they are less structured.

In the past I wrote a short comparison between an idea similar to 2.1 and other alignment approaches; you can find it here.

This work was supported by CEEALAR, but these are not CEEALAR’s opinions. Note also that CEEALAR doesn't pay me to insert questionable humour in my posts: I do it on my own initiative.

Thanks to Charlie Steiner for feedback.

7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers' goals.
As a consequence, bad actors might have an easier time using powerful controllable AI to achieve their goals. (From 4 and 6)
8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (From 5 and 7)
This conclusion will seem more, or less, relevant depending on the beliefs you have about its different components.

It sounds to me like the claim you are making here is "the current AI Alignment paradigm might have a major hole, but also this hole might not be real". But then the thrust of your post is something like "I am going to work on filling this hole". You invoke epistemic and moral uncertainty in a somewhat handwavy way which leaves me skeptical. It's not clear to me what you believe, so it is hard for me to productively disagree or provide useful feedback. Assuming you are going to spend many hours working on this research direction, I think it's worth spending a few hours on determining if this proposed problem is in fact a problem, including making some personal guesses about the value of various futures (maybe you've already done this privately).

You later write:

I do not claim that research towards building the independent AI thinkers of 2.1 above is the most effective AI alignment research intervention, nor that it is the most effective intervention for moral progress. I’ve only presented a problem of the main framework in AI alignment, and proposed an alternative that aims to avoid that problem

To me, it's not obvious that the thing you presented is actually a problem. My quick thoughts: extinction is quite bad, some types of galaxy-spanning civilizations are far worse than extinction, but many are better, including some of the ones I think would be created by current "bad actors".

I'm furthermore unsure why the solution to this proposed problem is to try and design AIs to make moral progress; this seems possible but not obvious. One problem with bad actors is that they often don't base their actions on what the philosophers think is good (e.g., dictators don't seem concerned with this). On the other hand, perhaps the "bad actors" you are targeting are average Americans who eat a dozen farmed animals per year, and these are the values you are most worried about. Insofar as you want to avoid filling the universe with factory farming, you might want to investigate current approaches in Moral Progress or moral circle expansion; I suspect an AI approach to this problem won't help much. There's a similar story for impact here that looks like "get the Great Reflection started earlier", which I am more optimistic about, but which I suspect would fail for other reasons. Not sure if this paragraph made sense; I'm gesturing at the fact that the "bad actors" you are targeting will affect what research directions to pursue, and for the main class of bad actors that comes to mind with that word, moral progress seems unlikely to help.

Sorry for the late reply, I missed your comment.

It sounds to me like the claim you are making here is "the current AI Alignment paradigm might have a major hole, but also this hole might not be real".

I didn't write something like that because it is not what I meant. I gave an argument whose strength depends on other beliefs one has, and I just wanted to stress this fact. I also gave two examples (reported below), so I don't think I mentioned epistemic and moral uncertainty "in a somewhat handwavy way".

An example: if you think that futures shaped by malevolent actors using AI are many times more likely to happen than futures shaped by uncontrolled AI, the response will strike you as very important; and vice versa if you think the opposite.
Another example: if you think that extinction is way worse than dystopic futures lasting a long time, the response won't affect you much—assuming that bad human actors are not fans of complete extinction.

Maybe your scepticism is about my beliefs, i.e. you are saying that it is not clear, from the post, what my beliefs on the matter are. I think presenting the argument is more important than presenting my own beliefs: the argument can be used, or at least taken into consideration, by anyone who is interested in these topics, while my beliefs alone are useless if they are not backed up by evidence and/or arguments. In case you are curious: I do believe futures shaped by uncontrolled AI are unlikely to happen.

Now to the last part of your comment:

I'm furthermore unsure why the solution to this proposed problem is to try and design AIs to make moral progress; this seems possible but not obvious. One problem with bad actors is that they often don't base their actions on what the philosophers think is good

I agree that bad actors won't care. Actually, I think that even if we do manage to build some kind of AI that is considered superethical (better than humans at ethical reasoning) by a decent number of philosophers, very few people will care, especially at the beginning. But that doesn't mean it will be useless: at some point in the past, very few people believed slavery was bad; now it is a common belief. How much will such an AI accelerate moral progress, compared to other approaches? Hard to tell, but I wouldn't throw the idea in the bin.
