Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This will also be posted on the EA Forum, and included in a sequence containing some previous posts and other posts I'll publish this year.

Introduction

Humans think critically about values and, to a certain extent, they also act according to their values. To the average human, the difference between increasing world happiness and increasing world suffering is huge and evident, while goals such as collecting coins and collecting stamps are roughly on the same level.

It would be nice to make these differences as obvious to AI as they are to us. Even though exactly copying what happens in the human mind is probably not the best strategy to design an AI that understands ethics, having an idea of how value works in humans is a good starting point.

So, how do humans reason about values and act accordingly?

Key points

Let’s take a step back and start from sensation. Through the senses, information goes from the body and the external environment to our mind.

After some brain processing — assuming we’ve had enough experiences of the appropriate kind —  we perceive the world as made of objects. A rock is perceived as distinct from its surrounding environment because of its edges, its colour, its weight, the fact that my body can move through air but not through rocks, and so on.

Objects in our mind can be combined with each other to form new objects. After seeing various rocks in different contexts, I can imagine a scene in which all these rocks are in front of me, even though I haven’t actually seen that scene before.

We are also able to apply our general intelligence — think of skills such as categorisation, abstraction, induction — to our mental content.

Other intelligent animals do something similar. They probably understand that, to satisfy thirst, water in a small pond is not that different from water flowing in a river. However, an important difference is that animals’ mental content is more constrained than our mental content: we are less limited by what we perceive in the present moment, and we are also better at combining mental objects with each other.

For example, to a dog, its owner works as an object in the dog’s mind, while many of its owner’s beliefs do not. Some animals can attribute simple intentions and perception, e.g. they understand what a similar animal can and cannot see, but it seems they have trouble attributing more complex beliefs.

The ability to compose mental content in many different ways is what allows us to form abstract ideas such as mathematics, religion, and ethics, just to name a few.

Key point 1:

In humans, mental content can be abstract.


Now notice that some mental content drives immediate action and planning. If I feel very hungry, I will do something about it, in most cases.

This process from mental content to action doesn’t have to be entirely conscious. I can instinctively reach for the glass of water in front of me as a response to an internal sensation, even without moving my attention to the sensation or realising it is thirst.

Key point 2:

Some mental content drives behaviour.


Not all mental content drives action and planning. The perception of an obstacle in front of me might change how I carry out my plans and actions, but it is unlikely to change what I plan and act for. Conversely, being very hungry directly influences what I’m going to do — not just how I do it — and can temporarily override other drives. It is in this latter sense that some mental content drives behaviour.

In humans, the mental content that does drive behaviour can be roughly split into two categories.

The first one groups what we often call evolutionary or innate drives, like hunger and thirst in the examples above, and works similarly in other animals. It is mostly fixed, in the sense that unless I make drastic changes to my body or mind, I will keep perceiving how hungry I am and this will influence my behaviour virtually each day of my life.

The second category is about what we recognise as valuable, worth doing, better than possible alternatives, or simply good. This kind of drive is significantly less fixed than the first category: what we consider valuable may change after we reflect on it in context with our other beliefs, or as a consequence of life experiences.

Some examples will help clarify this. Think of a philosopher who adjusts her beliefs about value as she learns and reflects more about ethics, and then takes action in line with her new views. Or consider a former believer turned atheist, who has stopped placing value on religion and praying because he now sees the concept of god as inconsistent with everything else he knows about the world.

This second category of mental content that drives behaviour is not only about ethical or abstract beliefs. A mundane example might be more illustrative: someone writes down a shopping list after an assessment of what seems worth buying at that moment, then proceeds with the actual shopping. In this case, the influence of deliberation on future action is straightforward. 

Key point 3:

In humans, part of the mental content that drives behaviour changes with experience and reflection.

This last point clarifies some of the processes underlying the apparently simple statement that ‘we act according to our values’.

It also helps explain how we get to discriminate between goals such as increasing world happiness and increasing world suffering, mentioned in the introduction. From our frequent experiences of pleasure and pain, we categorise many things as ‘good (or bad) for me’; then, through a mix of empathy, generalisation, and reflection, we get to the concept of ‘good (or bad) for others’, which comes up in our minds so often that the difference between the two goals strikes us as evident and influences our behaviour (towards increasing world happiness rather than world suffering, hopefully). 

Differences with animals and AI

Animals

Point 3 is fundamental to human behaviour. Together with point 1, it explains why some of our actions have motives that are quite abstract and not immediately reducible to evolutionary drives. In contrast, the behaviour of other animals is more grounded in perception, and is well explained even without resorting to reflection or an abstract concept of value.

AI

Point 3 is also a critical difference between humans and current AI systems. Even though AIs are getting better and better at learning – thus, in a sense, their behaviour changes with experience – their tasks are still chosen by their designers, programmers, or users, not by each AI through a process of reflection.

This shouldn't be surprising: in a sense, we want AIs to do what we want, not what they want. At the same time, I think that connecting action to reflection in AI will, with enough research and experiments, allow us to get AI that thinks critically about values and sees the world through lenses similar to ours.

In a future post I’ll briefly go through the (lack of) research related to AI that reflects on what is valuable and worth doing. I’ll also give some ideas about how to write an algorithm for an agent that reflects.

Appendix: quick comparison with shard theory

As far as I understand, shard theory is still a work in progress; in this comparison I’ll focus just on some interesting ideas I’ve read in Reward is not the optimization target.

In a nutshell, Alex Turner sees humans as reinforcement learning (RL) agents, but makes the point that reward does not work like many people in the field of RL think it works. Turner writes that “reward is not, in general, that-which-is-optimized by RL agents”; many RL agents do not act as reward maximisers in the real world. Rather, reward imposes a reinforcement schedule that shapes the agent’s cognition, by e.g. reinforcing thoughts and/or computations in a context, so that in the future they will be more likely to happen in a similar enough context.
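To make the "reinforcement schedule" reading concrete, here is a minimal sketch (not taken from Turner's post) of a single REINFORCE-style update for a softmax policy over two actions. The names `softmax` and `reinforce_update`, the learning rate, and the two-action setup are illustrative assumptions. What it shows is that, in this update rule, the reward only scales how strongly the action just taken is made more likely; a reward of zero leaves the policy unchanged.

```python
import math


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, lr=0.1):
    """One REINFORCE step for a softmax policy (illustrative sketch).

    The gradient of log pi(action) with respect to logit k is
    (1 if k == action else 0) - pi(k); the reward only scales
    how strongly that gradient is applied.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


logits = [0.0, 0.0]                      # start with a uniform policy
after_reward = reinforce_update(logits, action=1, reward=1.0)
after_zero = reinforce_update(logits, action=1, reward=0.0)

p_before = softmax(logits)[1]            # probability of action 1 before the update
p_after = softmax(after_reward)[1]       # probability of action 1 after a rewarded step
```

Nothing in this update refers to a goal of "getting more reward": the rewarded action simply becomes more probable in the future, which is one way to read the claim that reward shapes cognition rather than being the thing optimised.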

I agree with Turner that modelling humans as simple reward maximisers is inappropriate, in line with everything I’ve written in this post. At the same time, I don’t think that people who write papers about RL are off-track: I consider AIXI to be a good mathematical abstraction of many different RL algorithms, convergence theorems are valid for these algorithms, and thinking of RL in terms of reward maximisation doesn’t seem particularly misleading to me.

Thus, I would solve this puzzle about human values, reward, and RL not by revisiting the relation between reward and RL algorithms, but by avoiding the equation between humans and RL agents. RL, by itself, doesn’t seem a good model of what humans do. If asked why humans do not wirehead, I would reply that it’s because what we consider valuable and worth doing competes with other drives in action selection, not by saying that humans are RL agents but reward works differently from how RL academics think it works.  
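A toy sketch of this "competition" picture, under strong simplifying assumptions: each option gets one score from innate drives and one from reflective values, and the agent picks the option with the highest combined score. The `Option` class, the numbers, and the additive rule are all hypothetical and chosen only for illustration; nothing here is a claim about how action selection actually works in the brain.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    drive_pull: float   # how strongly innate drives favour this option (made-up units)
    value_score: float  # how worthwhile it seems on reflection (made-up units)


def select_action(options, value_weight=1.0):
    """Pick the option with the highest combined drive + value score."""
    return max(options, key=lambda o: o.drive_pull + value_weight * o.value_score)


options = [
    Option("wirehead", drive_pull=5.0, value_score=-8.0),
    Option("eat lunch", drive_pull=3.0, value_score=1.0),
    Option("write paper", drive_pull=0.5, value_score=4.0),
]

chosen = select_action(options)                        # values and drives both count
drive_only = select_action(options, value_weight=0.0)  # values switched off
```

With these made-up numbers, the agent that weighs values picks "write paper", while the same agent with values switched off picks "wirehead". The only point is that an option can be rejected because it scores badly on what the agent considers worth doing, without any special story about how reward works.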

Having said that, I still find many ideas in Reward is not the optimization target really interesting and instructive, e.g. that reward acts as a reinforcement schedule. It’s probably among the most thought-provoking posts I’ve read on the Alignment Forum.

 

This work was supported by CEEALAR and by an anonymous donor.

Thanks to Nicholas Dupuis for many useful comments on a draft.

Comments

At the same time, I don’t think that people who write papers about RL are off-track: I consider AIXI to be a good mathematical abstraction of many different RL algorithms, convergence theorems are valid for these algorithms, and thinking of RL in terms of reward maximisation doesn’t seem particularly misleading to me.

Do you have concrete examples of where convergence theorems apply to an interesting task with e.g. PPO? "There aren't interesting examples like this which are alignment-relevant" seems like an important belief of mine, so if you know a counterexample, I'd be very grateful to learn about it and change my mind!

I might be misunderstanding you: take this with a grain of salt.

From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I set reward in place A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put reward in place A because I wanted an agent that would go towards A to a certain extent.

I am not familiar with PPO. From this short article, in the section about TRPO:

Recall that due to approximations, theoretical guarantees no longer hold.

Is this what you are referring to? But is it important for alignment? Let's say the conditions for convergence are not met anymore, the theorem can't be applied in theory, but in practice I do get an agent that goes towards A, where I've put reward. Is it misleading to say that the agent is maximising reward?

(However, keep in mind that

I agree with Turner that modelling humans as simple reward maximisers is inappropriate

)

If you could unpack your belief "There aren't interesting examples like this which are alignment-relevant", I might be able to give a more precise/appropriate reply.

My overall contention is that viewing deep RL as an optimization problem where the RL alg tries to find a policy which tries to maximize reward over time, is fatally flawed and misunderstands the point and practicalities of real RL algorithms.[1]

Is it misleading to say that the agent is maximising reward?

I think it can be misleading, but it depends. "The network implements a policy which reliably reaches A" -- that is what we observe. We can also state "this achieves a high numerical score on the policy-gradient-intensity (aka 'reward') statistic." These statements are true and not very misleading, IMO. It doesn't push me to consider unwarranted hypotheses about its generalization behavior, like that it cares about reward or will try to make reward come out high in general. These statements instead draw my attention towards our observations -- "I wonder how the policy is internally structured so as to reliably reach A?" is a very fruitful question IMO.

One related problem is that RL papers often repeat "the point of RL is to train agents to maximize reward", which leads to really bad implicit models of how RL works in my experience. I think it leads people to privilege the hypothesis that RL agents will optimize their own reward signal in some way, shape, or form.

If you could unpack your belief "There aren't interesting examples like this which are alignment-relevant", I might be able to give a more precise/appropriate reply.

Let's consider one obstacle.

Convergence theorems require certain learning rate and state visitation schedules (agents don't visit every state of the world infinitely many times, in real life) which can themselves be interrupted if an AI e.g. sets its own learning rate to zero (thus violating the theorem's preconditions). As best I can fathom, the convergence theorems do not apply to situations we care about (e.g. an embodied agent which we want to manufacture diamonds for us), for generalizable reasons which won't be fixed through more clever proof techniques: e.g. an agent which tried to visit every state infinitely many times would quickly die and stop exploring.
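For reference, the preconditions mentioned here are usually stated as follows for tabular Q-learning. This is the standard textbook form, given as background and not quoted from the thread: with bounded rewards and discount factor $\gamma < 1$, the estimates $Q_t$ converge to $Q^*$ with probability 1, provided that every state-action pair $(s, a)$ is updated infinitely often and the step sizes $\alpha_t(s, a)$ satisfy

```latex
\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty,
\qquad
\sum_{t=0}^{\infty} \alpha_t(s, a)^2 < \infty
\qquad \text{for all } (s, a).
```

An agent that sets its learning rate to zero from some time onwards violates the first condition, and an agent that never returns to some state-action pair violates the visitation requirement.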

Or in say the LLM-finetuning case, where you're doing RLHF to get the model to (hopefully!) help you brainstorm research topics, the agent simply won't try out every token-sequence in its context window. That won't happen a single time, let alone infinitely many times. Even finite-time guarantees won't kick in in time to apply to reality.

Or even if exploration weren't an issue, the agent could -- as I mentioned -- simply set its learning rate to zero. How can the theorems help us there?

Again, as best I can fathom -- there's no clever argument or proof strategy that gets you around that obstacle, if you want to just apply the standard results to agents which can die or which operate on reasonable timescales or which can modify the learning rate schedule we nominally set. 

(And then we can also talk about expressivity issues, learning dynamics being nonstationary over time, etc. The learning rate/state visitation obstacle is a sufficient blocker for convergence theorems IRL, but not itself a crux for me, in that I still wouldn't expect you can apply the theorems even if the LR "issue" vanished.)

  1. ^

My understanding is that after a lot of simplifications, policy gradient just takes a noisy gradient step in the direction of minimising Bellman error, and so in the limit of infinite data/computation/visiting all states in the world, it is 'guaranteed' to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation and PPO's regularisation bits, these guarantees do not hold anymore, but the fundamental RL they are built on does have them. The place to go deeper into this is Sutton and Barto's textbook and also Bertsekas' dynamic programming textbook.
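As a small illustration of the tabular case only, here is a sketch of Q-learning on a two-state toy MDP. Everything in it is an assumption made for the example: the MDP itself, the constant step size, and the use of deterministic sweeps over all state-action pairs in place of random exploration. It says nothing about PPO or function approximation, where (as noted above) the guarantees no longer hold.

```python
GAMMA = 0.9  # discount factor (assumed for this toy example)


def step(state, action):
    """Toy deterministic MDP: action 0 stays put, action 1 switches state.

    The reward is 1.0 whenever the next state is state 1, else 0.0.
    """
    next_state = state if action == 0 else 1 - state
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward


def q_learning(sweeps, alpha):
    """Tabular Q-learning that updates every (state, action) pair in turn."""
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    for _ in range(sweeps):
        for state in (0, 1):
            for action in (0, 1):
                next_state, reward = step(state, action)
                target = reward + GAMMA * max(q[next_state])
                q[state][action] += alpha * (target - q[state][action])
    return q


learned = q_learning(sweeps=500, alpha=0.5)  # every pair is updated again and again
frozen = q_learning(sweeps=500, alpha=0.0)   # learning rate set to zero
```

In this toy MDP the optimal values are 10 for the two actions that lead to state 1 and 9 for the two that lead to state 0, and `learned` ends up very close to them. `frozen` never moves from its all-zero initialisation, which is the learning-rate-zero failure mode raised earlier in the thread.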

Yeah, I've read those books, although I admit to heavily skimming Bertsekas. 

I think that the reason no one in the field tries to create AI that critically reflects on its values is that most of us, more or less explicitly, are not moral realists. My prediction for what the conclusion would be of an AI critically asking itself what is worth doing is "that question doesn't make any sense. Let me replace it with 'what I want to do' or some equivalent". Or at best "that question doesn't make any sense. raise ValueError('pun intended')"

Sorry for the late reply, I missed your comment.

Yeah I get it, probably some moral antirealists think this approach to alignment does not make a lot of sense. I think they are wrong, though. My best guess is that an AI reflecting on what is worth doing will not think something like "the question does not make any sense", but rather it will be morally (maybe also meta-morally) uncertain. And the conclusions it eventually reaches will depend on the learning algorithm, the training environment, initial biases, etc.

Thanks for the reply.

To make sure that I understand your position: are you a realist, and what do you think is the meaning of moral facts? (I'm not an error theorist but something like a "meta-error theorist": I think that people do try to claim something, but I'm not sure how that thing could map to external reality.)

Then the next question, that will be highly relevant to the research that you propose, is how do you think you know those facts if you do? (Or more generally, what is the actual work of reflecting on your values?)

If I had to pick one between the two labels 'moral realism' and 'moral anti-realism' I would definitely choose realism.

I am not sure about how to reply to "what is the meaning of moral facts": it seems too philosophical, in the sense that I don't get what you want to know in practice. Regarding the last question: I reason about ethics and morality by using similar cognitive skills to the ones I use in order to know and reason about other stuff in the world. This paragraph might help:

It also helps explain how we get to discriminate between goals such as increasing world happiness and increasing world suffering, mentioned in the introduction. From our frequent experiences of pleasure and pain, we categorise many things as ‘good (or bad) for me’; then, through a mix of empathy, generalisation, and reflection, we get to the concept of ‘good (or bad) for others’, which comes up in our minds so often that the difference between the two goals strikes us as evident and influences our behaviour (towards increasing world happiness rather than world suffering, hopefully).

I do not have a clear idea yet of how this happens algorithmically, but an important factor seems to be that, in the human mind, goals and actions are not completely separate, and neither are action selection and goal selection. When we think about what to do, sometimes we do fix a goal and plan only for that, but other times the question becomes about what is worth doing in general, what is best, what is valuable: instead of fixing a goal and choosing an action, it's like we are choosing between goals.

I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent?

It would be clearer in analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result. Every reflection and generalization that we do is ultimately about that, and can achieve something meaningful because of that.

I do not see the analogous story for moral reflection.

we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result.

Besides the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the above paragraph. Agreement might be less easy to achieve, but it doesn't mean finding a common ground is impossible.

For example, some researchers classify emotions according to valence, i.e. whether it is an overall good or bad experience for the experiencer, and in the future we might be able to find a map from brain states to whether a person is feeling good or bad. In this sense of good and bad, I'm pretty sure that moral philosophers who argue for the maximisation of bad feelings for the largest amount of experiencers are a very small minority. In other terms, we agree that maximising negative valence on a large scale is not worth doing.

(Personally, however, I am not a fan of arguments based on agreement or disagreement, especially in the moral domain. Many people in the past used to think that slavery was ok: does it mean slavery was good and right in the past, while now it is bad and wrong? No, I'd say that normally we use the words good/bad/right/wrong in a different way, to mean something else; similarly, we don't normally use the word 'dog' to mean e.g. 'wolf'. From a different domain: there is disagreement in modern physics about some aspects of quantum mechanics. Does it mean quantum mechanics is fake / not real / a matter of subjective opinion? I don't think so)

Let me clarify that I don't argue from agreement per se. I care about the underlying epistemic mechanism of agreement, which I claim is also the mechanism of correctness. My point is that I don't see a similar epistemic mechanism in the case of morality.

Of course, emotions are verifiable states of brains. And the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don't see what sort of argument that it is correct for me to care could even in principle be inherently compelling.

I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:

  1. X is a state of mind that feels bad to whatever mind experiences it (this is the starting assumption, it seems we agree that such an X exists, or at least something similar to X)
  2. X, experienced on a large scale by many minds, is bad
  3. Causing X on a large scale is bad
  4. When considering what to do, I'll discard actions that cause X, and choose other options instead.

Now, some people will object and say that there are holes in this chain of reasoning, i.e. that 2 doesn't logically follow from 1, or 3 doesn't follow from 2, or 4 doesn't follow from 3. For the sake of this discussion, let's say that you object to the step from 1 to 2. Then, what about this replacement:

  1. X is a state of mind that feels bad to whatever mind experiences it [identical to original 1]
  2. X, experienced on a large scale by many minds, is good [replaced 'bad' with 'good']

Does this passage from 1 to 2 seem, to you (our hypothetical objector), equally logical / rational / coherent / right / sensible / correct as the original step from 1 to 2? Could I replace 'bad' with basically anything, and the correctness would not change at all as a result?

My point is that, to many reflecting minds, the replacement seems less logical / rational / coherent / right / sensible / correct than the original step. And this is what I care about for my research: I want an AI that reflects in a similar way, an AI to which the original steps do seem rational and sensible, while replacements like the one I gave do not.

That was good for my understanding of your position. My main problem with the whole thing though is in the use of the word "bad". I think it should be taboo at least until we establish a shared meaning.

Specifically, I think that most observers will find the first argument more logical than the second, because of a fallacy in using the word "bad". I think that we learn that word in a way that is deeply entangled with power reward mechanism, to the point that it is mostly just a pointer to negative reward, things that we want to avoid, things that made our parents angry... In my view, the argument is then basically:

I want to avoid my suffering, and more generally person p wants to avoid person p's suffering. Therefore suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for", therefore avoid creating suffering.

When written that way, it doesn't seem more logical than its opposite.

To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step

suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for"

seems weak.

Overall, what is the intention behind your comments? Are you trying to understand my position even better, and if so, why? Are you interested in funding this kind of research; or are you looking for opportunities to change your mind; or are you trying to change my mind?

Since I became reasonably sure that I understand your position and reasoning - mostly changing it.
