The importance of feedback
Consider trying to program a self-driving car to drive from San Francisco to Los Angeles -- with no sensors that allow it to gather information as it is driving. This is possible in principle. If you can predict the exact weather conditions, the exact movement of all of the other cars on the road, the exact amount of friction along every part of the road surface, the exact impact of (the equivalents of) pressing the gas or turning the steering wheel, and so on, then you could compute ahead of time how exactly to control the car such that it gets from SF to LA. Nevertheless, it seems unlikely that we will ever be able to accomplish such a feat, even with powerful AI systems.
In practice, there is going to be some uncertainty about how the world will evolve, such that any plan computed ahead of time will contain errors that compound over the course of the plan. The solution is to use sensors to gather information while executing the plan, so that we can notice any errors or deviations from the plan and take corrective action. It is much easier to build a controller that keeps you pointed in the general direction than to build a plan that will get you there perfectly without any adaptation.
Control theory studies these sorts of systems, and you can see the general power of feedback controllers in the theorems that can be proven. Especially for motion tasks, you can build feedback controllers that are guaranteed to safely achieve the goal, even in the presence of adversarial environmental forces (that are bounded in size, so you can’t have arbitrarily strong wind). In the presence of an adversary, in most environments it becomes impossible even in principle to make such a guarantee if you do not have any sensors or feedback and must compute a plan in advance. Typically, for every such plan, there is some environmental force that would cause it to fail.
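To make the contrast concrete, here is a minimal toy sketch (not a result from control theory) of a one-dimensional "car" that tries to reach a goal position while a bounded force pushes it away. The dynamics `x <- x + u + w`, the constants `GOAL`, `STEPS`, `U_MAX`, `W_MAX`, and all function names are assumptions made up for this illustration. It shows a single precomputed plan failing against a single adversarial disturbance, while a simple feedback controller facing the same disturbance stays close to the goal.

```python
# Toy 1-D comparison: a plan computed in advance vs. a feedback controller.
# All names and numbers are illustrative assumptions.

GOAL = 10.0   # target position
STEPS = 20    # number of timesteps
U_MAX = 1.0   # largest control input per step
W_MAX = 0.3   # bound on the disturbance per step (smaller than U_MAX)


def clip(value, limit):
    """Restrict value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def adversarial_disturbance(x):
    """A bounded force that always pushes the car away from the goal."""
    return -W_MAX if x <= GOAL else W_MAX


def open_loop_controller(t, x):
    """Plan computed ahead of time: ignores the sensed position x."""
    return GOAL / STEPS


def feedback_controller(t, x):
    """Senses the current position and steers toward the goal."""
    return clip(GOAL - x, U_MAX)


def simulate(controller):
    """Run x <- x + u + w for STEPS timesteps and return the final position."""
    x = 0.0
    for t in range(STEPS):
        u = controller(t, x)
        w = adversarial_disturbance(x)
        x = x + u + w
    return x


open_loop_error = abs(simulate(open_loop_controller) - GOAL)
feedback_error = abs(simulate(feedback_controller) - GOAL)
print(f"open-loop error: {open_loop_error:.2f}")
print(f"feedback error:  {feedback_error:.2f}")
```

In this toy setup the precomputed plan ends far short of the goal, while the feedback controller ends within `W_MAX` of it, because it can correct each step for the disturbance it has already experienced.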
The control theory perspective on AI alignment
With ambitious value learning, we’re hoping that we can learn a utility function that tells us the optimal thing to do into the future. This requires encoding exactly how to behave in all possible environments, no matter what new things happen in the future, even if it’s something we humans have never considered a possibility so far.
This is analogous to the problem of trying to program a self-driving car. Just as in that case, we might hope that we can solve the problem by introducing sensors and feedback. In this case, the “feedback” would be human data that informs our AI system what we want it to do, that is, data that can be used to learn values. The evolution of human values and preferences in new environments with new technologies is analogous to the unpredictable environmental disturbances that control theory assumes.
This does not mean that an AI system must be architected in such a way that human data is explicitly used to “control” the AI every few timesteps in order to keep it on track. It does mean that any AI alignment proposal should have some method of incorporating information about what humans want in radically different circumstances. I have found this an important frame with which to view AI alignment proposals. For example, with indirect normativity or idealized humans it’s important that the idealized or simulated humans are going through similar experiences that real humans go through, so that they provide good feedback.
Feedback through interaction
Of course, while the control theory perspective does not require the feedback controller to be explicit, one good way to ensure that there is feedback would be to make it explicit. This would mean that we create an AI system that explicitly collects fresh data about what humans want in order to inform what it should do. This is basically calling for an AI system that is constantly using tools from narrow value learning to figure out what to do. In practice, this will require interaction between the AI and the human. However, there are still issues to think about:
Convergent instrumental subgoals: A simple way of implementing human-AI interaction would be to have an estimate of a reward function that is continually updated using narrow value learning. Whenever the AI needs to choose an action, it uses the current reward estimate to choose.
With this sort of setup, we still have the problem that we are maximizing a reward function, which leads to convergent instrumental subgoals. In particular, the plan “disable the narrow value learning system” is likely very good according to the current estimate of the reward function, because it prevents the reward from changing, causing all future actions to continue to optimize the current reward estimate.
Another way of seeing that this setup is a bit weird is that it has inconsistent preferences over time -- at any given point in time, it treats the expected change in its reward as an obstacle that should be undone if possible.
That said, it is worth noting that in this setup, the goal-directedness is coming from the human. In fact, any approach where goal-directedness comes from the human requires some form of human-AI interaction. We might hope that some system of this form allows us to have a human-AI system that is overall goal-directed (in order to achieve economic efficiency), while the AI system itself is not goal-directed, and so the overall system pursues the human’s instrumental subgoals. The next post will talk about reward uncertainty as a potential approach to get this behavior.
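The incentive to disable the narrow value learning system, described above, can be illustrated with a small toy calculation. This is only a sketch under strong assumptions: the two reward estimates, the single possible update, the update probability, and the names `plan_value` and `best_action` are all hypothetical. The key feature is that plans are scored by the current reward estimate, even though a future update would change what the agent optimizes.

```python
# Toy model: a planner that scores plans with its CURRENT reward estimate.
# Every name and number below is a hypothetical chosen for illustration.

def best_action(reward):
    """Action that a reward maximizer would pick under this reward estimate."""
    return max(reward, key=reward.get)


def plan_value(current, possible_update, p_update, horizon,
               disable_learner, disable_cost=1.0):
    """Value of a plan as judged by the current reward estimate.

    Simplifying assumption: the learner either replaces the estimate with
    `possible_update` at the start (probability p_update) or never changes it.
    """
    reward_if_unchanged = current[best_action(current)]
    # After an update the agent optimizes the new estimate, but the plan is
    # still scored by the current one.
    reward_if_updated = current[best_action(possible_update)]
    if disable_learner:
        return horizon * reward_if_unchanged - disable_cost
    expected_per_step = ((1 - p_update) * reward_if_unchanged
                         + p_update * reward_if_updated)
    return horizon * expected_per_step


current_estimate = {"make_coffee": 1.0, "make_tea": 0.2}
possible_update = {"make_coffee": 0.1, "make_tea": 1.0}

value_keep = plan_value(current_estimate, possible_update,
                        p_update=0.5, horizon=10, disable_learner=False)
value_disable = plan_value(current_estimate, possible_update,
                           p_update=0.5, horizon=10, disable_learner=True)
print(f"keep learner:    {value_keep:.1f}")
print(f"disable learner: {value_disable:.1f}")
```

With these made-up numbers, disabling the learner scores higher than keeping it, even after paying a cost to disable it: from the point of view of the current estimate, any future update can only make later actions worse.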
Humans are unable to give feedback: As our AI systems become more and more powerful, we might worry that they are able to vastly outthink us, such that they would need our feedback on scenarios that are too hard for us to comprehend.
On the one hand, if we’re actually in this scenario I feel quite optimistic: if the questions are so difficult that we can’t answer them, we’ve probably already solved all the simple parts of the reward, which means we’ve probably stopped x-risk.
But even if it is imperative that we answer these questions accurately, I’m still optimistic: as our AI systems become more powerful, we can have better AI-enabled tools that help us understand the questions on which we are supposed to give feedback. This could be AI systems that do cognitive work on our behalf, as in recursive reward modeling, or it could be AI-created technologies that make us more capable, such as brain enhancement or the ability to be uploaded and have bigger “brains” that can understand larger things.
Humans don’t know the goal: An important disanalogy between the control theory/self-driving car example and the AI alignment problem is that in control theory it is assumed that the general path to the destination is known, and we simply need to stay on it; whereas in AI alignment even the human does not know the goal (i.e. the “true human reward”). As a result, we cannot rely on humans to always provide adequate feedback; we also need to manage the process by which humans learn what they want. Concerns about human safety problems and manipulation fall into this bucket.
If I want an AI system that acts autonomously over a long period of time, but it isn't doing ambitious value learning (only narrow value learning), then we necessarily require a feedback mechanism that keeps the AI system "on track" (since my instrumental values will change over that period of time).
While the feedback mechanism need not be explicit (and could arise simply because it is an effective way to actually help me), we could consider AI designs that have an explicit feedback mechanism. There are still many problems with such a design, most notably that in the obvious design, at any given point the AI system looks like it could be goal-directed with a long-term reward function, which is the sort of system that we are most worried about.
One small bit of such interaction could be rephrasing commands:
Human: "I want apple."
Robot: "Do you want a computer or a fruit?"
Another way of interacting is presenting the plan of action, maybe drawing it as a visual image:
Robot: "To give you an apple, I have to go to the shop, which will take at least one hour."
Human: "No, just find the apple in the refrigerator."
The third way is to confirm that the human still wants X after a reasonable amount of time, say, every day:
Robot: "Yesterday you asked for an apple. Do you still want it?"
The fourth is sending reports after every timestep, describing how the project is going and which new subgoals have appeared:
Robot: "There are no apples in the shop. I am going to another village, but it will take two more hours."
Human: "No, it is too long, buy me an orange."
The general intuition pump for such interactions is the relationship between a human and an ideal human secretary, and such a pump could even be used to train the robot. Again, this type of learning is possible only after the biggest part of AI safety is solved, or the robot will go foom after the first question.
I agree that all of these seem like good aspects of human-AI interaction to have, especially for narrow AI systems. For superhuman AI systems, there's a question of how much of this the AI should infer for itself vs. make sure to ask the human.
There is a problem of "moral unemployment" - that is, if a superintelligent AI does all the hard work of analysing "what I should want", it will strip from me the last pleasant duty I may have.
E.g.: Robot: "I know that your deepest desire, which you may not be fully aware of but which you will surely learn after a lot of suffering, is to write a novel. And I have already written this novel for you - the best one you could possibly write."
In the previous "Ambitious vs. narrow value learning" post, Paul Christiano characterized narrow value learning as learning "subgoals and instrumental values". From that post, I got the impression that ambitious vs narrow was about the scope of the task. However, in this post you suggest that ambitious vs narrow value learning is about the amount of feedback the algorithm requires. I think there is actually a 2x2 matrix of possible approaches here: we can imagine approaches which do or don't depend on feedback, and we can imagine approaches which try to learn all of my values or just some instrumental subset.
I think it depends on the details of the implementation:
We could construct the system's world model so that human feedback is a special event that exists in a separate magisterium from the physical world, and the system doesn't believe any action taken in the physical world could do anything to affect the type or quantity of human feedback that's given.
For redundancy, if the narrow value learning system is trying to learn how much humans approve of various actions, we can tell the system that the negative score from our disapproval of tampering with the value learning system outweighs any positive score it could achieve through tampering.
If the reward function weights rewards according to the certainty of the narrow value learning system that they are the correct reward, that creates incentives to keep the narrow value learning system operating, so the narrow value learning system can acquire greater certainty and provide a greater reward.
To elaborate a bit on the first two bullet points: It matters a lot whether the system thinks our approval is contingent on the physical configuration of the atoms in our brains. If the system thinks we will continue to disapprove of an action even after it's reconfigured our brain's atoms, that's what we want.
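As a rough illustration of the third point above (certainty-weighted rewards), here is a toy calculation. It is a sketch only: the linear growth of certainty with feedback, the function names `certainty_after` and `weighted_return`, and all numbers are assumptions invented for the example, and it ignores the possibility that the learned reward itself changes.

```python
# Toy model of certainty-weighted reward. Names and numbers are hypothetical.

def certainty_after(steps_of_learning, initial=0.5, gain_per_step=0.1):
    """Assumed: certainty grows linearly with feedback received, capped at 1."""
    return min(1.0, initial + gain_per_step * steps_of_learning)


def weighted_return(horizon, raw_reward, learner_enabled):
    """Sum of per-step rewards, each weighted by the learner's certainty."""
    total = 0.0
    for t in range(horizon):
        # If the learner is disabled at the start, certainty never improves.
        steps_of_learning = t if learner_enabled else 0
        total += certainty_after(steps_of_learning) * raw_reward
    return total


return_keep = weighted_return(horizon=10, raw_reward=1.0, learner_enabled=True)
return_disable = weighted_return(horizon=10, raw_reward=1.0, learner_enabled=False)
print(f"keep learner:    {return_keep:.1f}")
print(f"disable learner: {return_disable:.1f}")
```

Under these assumptions, keeping the learner running yields a higher weighted return than disabling it, which is the incentive the third point describes.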
That wasn't exactly my point. My main point was that if we want an AI system that acts autonomously over a long period of time (think centuries), but it isn't doing ambitious value learning (only narrow value learning), then we necessarily require a feedback mechanism that keeps the AI system "on track" (since my instrumental values will change over that period of time). Will add a summary sentence to the post.
Agreed, I was imagining the "default" implementation (eg. as in this paper).
Something along these lines seems promising, I hadn't thought of this possibility before.
Yeah, uncertainty can definitely help get around this problem. (See also the next post, which should hopefully go up soon.)
Thanks for the reply! Looking forward to the next post!
When thinking about how a smarter-than-human AI would treat human input to close the control loop, it pays to consider the cases where humans are the smarter intelligence. How do we close the loop when dealing with young children? Primates/dolphins/magpies? Dogs/cats? Fish? Insects? Bacteria? In all these cases the apparent values/preferences of the "environment" are basically adversarial, something that must be taken into account, but definitely not obeyed. In the original setup a super-intelligent aligned AI's actions would be incomprehensible to us, no matter how hard it tried to explain them to us (go explain to a baby that eating all the chocolate it wants is not a good idea, or to a cat that its favorite window must remain closed). Again, in the original setup it can be as drastic as an AI culling the human population to help save us from a worse fate, etc. Sadly, this is not far from the "God works in mysterious ways" excuse one hears as a universal answer to the questions of theodicy.
The problem with this line of reasoning is that it assumes that the goal-directedness comes from the smarter part of the duo of decision-maker and bearer of consequences. With children and animals, we consider their preferences as an input into our decision making, which mainly seeks to satisfy our own preferences. We do not raise children solely for the purpose of satisfying their preferences.
This is why Rohin particularly stresses the idea that the danger lies in the source of goal-directedness, and that if it comes from humans, then we are safer.
We raise children to satisfy their expected well being, not their naive preferences (for chocolate and toys), and that seems similar to what a smarter-than-human AI would do to/for us. Which was my point.
I think we raise children to satisfy our common expected well-being (ours + theirs + the overall societal one). Thus, the goal-directedness comes from society as a whole. I think there is a key difference between this system and one where a smarter-than-human AI focuses solely on the well-being of its users, even if it does Coherent Extrapolated Volition, which I think is what you are referring to when you talk about expected well-being (and I agree that if you look only at their CEV-like property, the two systems are equivalent).