On value in humans, other animals, and AI

[-]TurnTrout3yΩ220

At the same time, I don’t think that people who write papers about RL are off-track: I consider AIXI to be a good mathematical abstraction of many different RL algorithms, convergence theorems are valid for these algorithms, and thinking of RL in terms of reward maximisation doesn’t seem particularly misleading to me.

Do you have concrete examples of where convergence theorems apply to an interesting task with e.g. PPO? "There aren't interesting examples like this which are alignment-relevant" seems like an important belief of mine, so if you know a counterexample, I'd be very grateful to learn about it and change my mind!

[-]Michele Campolo3yΩ330

I might be misunderstanding you: take this with a grain of salt.

From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I set reward in place A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put reward in place A because I wanted an agent that would go towards A to a certain extent.

I am not familiar with PPO. From this short article, in the section about TRPO:

Recall that due to approximations, theoretical guarantees no longer hold.

Is this what you are referring to? But is it important for alignment? Let's say the conditions for convergence are not met anymore, the theorem can't be applied in theory, but in practice I do get an agent that goes towards A, where I've put reward. Is it misleading to say that the agent is maximising reward?

(However, keep in mind that I agree with Turner that modelling humans as simple reward maximisers is inappropriate.)

If you could unpack your belief "There aren't interesting examples like this which are alignment-relevant", I might be able to give a more precise/appropriate reply.

[-]TurnTrout3yΩ330

My overall contention is that viewing deep RL as an optimization problem where the RL alg tries to find a policy which tries to maximize reward over time, is fatally flawed and misunderstands the point and practicalities of real RL algorithms.^[1]

Is it misleading to say that the agent is maximising reward?

I think it can be misleading, but it depends. "The network implements a policy which reliably reaches A" -- that is what we observe. We can also state "this achieves a high numerical score on the policy-gradient-intensity (aka 'reward') statistic." These statements are true and not very misleading, IMO. It doesn't push me to consider unwarranted hypotheses about its generalization behavior, like that it cares about reward or will try to make reward come out high in general. These statements instead draw my attention towards our observations -- "I wonder how the policy is internally structured so as to reliably reach A?" is a very fruitful question IMO.

One related problem is that RL papers often repeat "the point of RL is to train agents to maximize reward", which leads to really bad implicit models of how RL works in my experience. I think it leads people to privilege the hypothesis that RL agents will optimize their own reward signal in some way, shape, or form.

If you could unpack your belief "There aren't interesting examples like this which are alignment-relevant", I might be able to give a more precise/appropriate reply.

Let's consider one obstacle.

Convergence theorems require certain learning rate and state visitation schedules (agents don't visit every state of the world infinitely many times, in real life) which can themselves be interrupted if an AI e.g. sets its own learning rate to zero (thus violating the theorem's preconditions). As best I can fathom, the convergence theorems do not apply to situations we care about (e.g. an embodied agent which we want to manufacture diamonds for us), for generalizable reasons which won't be fixed through more clever proof techniques: e.g. an agent which tried to visit every state infinitely many times would quickly die and stop exploring.
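For concreteness, here is the usual textbook form of those preconditions for tabular Q-learning on a finite MDP. This is a sketch of the standard result in conventional notation; the symbols are not taken from this thread, and other algorithms have their own variants of these conditions.

$$
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\Big[\, r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \Big]
$$

$$
\sum_{t} \alpha_t(s, a) = \infty, \qquad \sum_{t} \alpha_t(s, a)^2 < \infty \qquad \text{for every pair } (s, a)
$$

Here $\alpha_t(s, a)$ is taken to be zero at every step where $(s, a)$ is not the pair being updated, so the first sum can only diverge if every state-action pair is visited infinitely often. With bounded rewards and $\gamma < 1$, these conditions give $Q_t \to Q^*$ with probability 1. A learning rate that is set to zero from some step onward makes the first sum finite, so the result no longer applies.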

Or in, say, the LLM-finetuning case, where you're doing RLHF to get the model to (hopefully!) help you brainstorm research topics, the agent simply won't try out every token-sequence in its context window. That won't happen a single time, let alone infinitely many times. Even finite-time guarantees won't kick in in time to apply to reality.

Or even if exploration weren't an issue, the agent could -- as I mentioned -- simply set its learning rate to zero. How can the theorems help us there?
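A toy illustration of the learning-rate point, under loudly stated assumptions: this is not an RL agent, just an incremental estimate of one noisy reward's mean, and the function name `estimate_mean` and the `freeze_after` parameter are hypothetical. It shows that once the step size is zero, more data changes nothing.

```python
import random


def estimate_mean(n_steps, freeze_after=None, seed=0):
    """Incrementally estimate the mean of a noisy reward (true mean 1.0).

    Uses alpha_n = 1/n, which satisfies the usual step-size conditions.
    If freeze_after is set, the step size is forced to zero from that
    step on, standing in for a learner whose learning rate has been
    set to zero (an illustrative assumption, not a real agent).
    """
    rng = random.Random(seed)
    estimate = 0.0
    for n in range(1, n_steps + 1):
        reward = rng.gauss(1.0, 1.0)
        alpha = 1.0 / n
        if freeze_after is not None and n > freeze_after:
            alpha = 0.0  # the update below becomes a no-op
        estimate += alpha * (reward - estimate)
    return estimate


learning = estimate_mean(100_000)                # keeps updating
frozen = estimate_mean(100_000, freeze_after=5)  # stops updating after 5 steps
early = estimate_mean(5)                         # what was known after 5 steps
```

With the same seed, `frozen` is exactly the five-sample estimate `early`, however long the run continues, while `learning` ends up close to the true mean.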

Again, as best I can fathom -- there's no clever argument or proof strategy that gets you around that obstacle, if you want to just apply the standard results to agents which can die or which operate on reasonable timescales or which can modify the learning rate schedule we nominally set.

(And then we can also talk about expressivity issues, learning dynamics being nonstationary over time, etc. The learning rate/state visitation obstacle is a sufficient blocker for convergence theorems IRL, but not itself a crux for me, in that I still wouldn't expect you can apply the theorems even if the LR "issue" vanished.)

^[1] See more writing on this: "Reward is not the optimization target" and "Models Don't 'Get Reward'".

[-]beren3yΩ110

My understanding is that after a lot of simplifications, policy gradients just take a noisy gradient step in the direction of minimising Bellman error, and so in the limit of infinite data/computation/visiting all states in the world, they are 'guaranteed' to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation and PPO's regularisation bits, these guarantees do not hold anymore, but the fundamental RL algorithms they are built on do have them. The place to go deeper into this is Sutton and Barto's textbook and also Bertsekas' dynamic programming textbook.
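As a minimal sketch of the tabular case where such guarantees do hold, here is plain Q-learning (not policy gradients or PPO) on a made-up two-state deterministic MDP. The MDP, the uniform exploration policy, and the step-size schedule `visits ** -0.6` are all illustrative assumptions. The table `Q_STAR` was worked out by hand from the Bellman optimality equations for this toy problem.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2


def step(state, action):
    """Toy deterministic MDP: action 0 stays, action 1 switches state.

    Landing in state 1 pays reward 1; landing in state 0 pays 0.
    """
    next_state = state if action == 0 else 1 - state
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward


def q_learning(n_steps, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    visits = [[0] * N_ACTIONS for _ in range(N_STATES)]
    state = 0
    for _ in range(n_steps):
        # Uniformly random behaviour, so every pair keeps being visited.
        action = rng.randrange(N_ACTIONS)
        next_state, reward = step(state, action)
        visits[state][action] += 1
        # Step sizes n^-0.6: their sum diverges, the sum of squares converges.
        alpha = visits[state][action] ** -0.6
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


# Optimal values for this toy MDP: V*(0) = V*(1) = 10.
Q_STAR = [[9.0, 10.0], [10.0, 9.0]]

q = q_learning(50_000)
max_error = max(
    abs(q[s][a] - Q_STAR[s][a])
    for s in range(N_STATES)
    for a in range(N_ACTIONS)
)
```

Here the learned table approaches `Q_STAR` because the setting is tiny, tabular, and fully explored. None of those three properties carries over to function approximation on large state spaces.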

[-]TurnTrout3yΩ220

Yeah, I've read those books, although I admit to heavily skimming Bertsekas.

[-]Ben Amitay3yΩ220

I think that the reason no one in the field tries to create AI that critically reflects on its values is that most of us, more or less explicitly, are not moral realists. My prediction for what the conclusion would be of an AI critically asking itself what is worth doing is "that question doesn't make any sense. Let me replace it with 'what I want to do' or some equivalent". Or at best "that question doesn't make any sense. raise ValueError('pun intended')"

[-]Michele Campolo3yΩ110

Sorry for the late reply, I missed your comment.

Yeah I get it, probably some moral antirealists think this approach to alignment does not make a lot of sense. I think they are wrong, though. My best guess is that an AI reflecting on what is worth doing will not think something like "the question does not make any sense", but rather it will be morally (maybe also meta-morally) uncertain. And the conclusions it eventually reaches will depend on the learning algorithm, the training environment, initial biases, etc.

[-]Ben Amitay3yΩ110

Thanks for the reply.

To make sure that I understand your position: are you a realist, and what do you think is the meaning of moral facts? (I'm not an error theorist but something like a "meta-error theorist": I think that people do try to claim something, but I'm not sure how that thing could map to external reality.)

Then the next question, which will be highly relevant to the research that you propose, is: how do you think you know those facts, if you do? (Or more generally, what is the actual work of reflecting on your values?)

[-]Michele Campolo3yΩ110

If I had to pick one between the two labels 'moral realism' and 'moral anti-realism' I would definitely choose realism.

I am not sure about how to reply to "what is the meaning of moral facts": it seems too philosophical, in the sense that I don't get what you want to know in practice. Regarding the last question: I reason about ethics and morality by using similar cognitive skills to the ones I use in order to know and reason about other stuff in the world. This paragraph might help:

It also helps explain how we get to discriminate between goals such as increasing world happiness and increasing world suffering, mentioned in the introduction. From our frequent experiences of pleasure and pain, we categorise many things as ‘good (or bad) for me’; then, through a mix of empathy, generalisation, and reflection, we get to the concept of ‘good (or bad) for others’, which comes up in our minds so often that the difference between the two goals strikes us as evident and influences our behaviour (towards increasing world happiness rather than world suffering, hopefully).

I do not have a clear idea yet of how this happens algorithmically, but an important factor seems to be that, in the human mind, goals and actions are not completely separate, and neither are action selection and goal selection. When we think about what to do, sometimes we do fix a goal and plan only for that, but other times the question becomes about what is worth doing in general, what is best, what is valuable: instead of fixing a goal and choosing an action, it's like we are choosing between goals.

[-]Ben Amitay3yΩ110

I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent?

It would be clearer in analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result. Every reflection and generalization that we do is ultimately about that, and can achieve something meaningful because of that.

I do not see the analogous story for moral reflection.

[-]Michele Campolo3yΩ110

we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result.

Besides the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the above paragraph. Agreement might be less easy to achieve, but it doesn't mean finding a common ground is impossible.

For example, some researchers classify emotions according to valence, i.e. whether it is an overall good or bad experience for the experiencer, and in the future we might be able to find a map from brain states to whether a person is feeling good or bad. In this sense of good and bad, I'm pretty sure that moral philosophers who argue for the maximisation of bad feelings for the largest number of experiencers are a very small minority. In other words, we agree that maximising negative valence on a large scale is not worth doing.

(Personally, however, I am not a fan of arguments based on agreement or disagreement, especially in the moral domain. Many people in the past used to think that slavery was ok: does it mean slavery was good and right in the past, while now it is bad and wrong? No, I'd say that normally we use the words good/bad/right/wrong in a different way, to mean something else; similarly, we don't normally use the word 'dog' to mean e.g. 'wolf'. From a different domain: there is disagreement in modern physics about some aspects of quantum mechanics. Does it mean quantum mechanics is fake / not real / a matter of subjective opinion? I don't think so.)

[-]Ben Amitay3yΩ110

Let me clarify that I don't argue from agreement per se. I care about the underlying epistemic mechanism of agreement, which I claim is also the mechanism of correctness. My point is that I don't see a similar epistemic mechanism in the case of morality.

Of course, emotions are verifiable states of brains. And the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don't see what sort of argument that it is correct for me to care could even in principle be inherently compelling.

[-]Michele Campolo3yΩ110

I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:

1. X is a state of mind that feels bad to whatever mind experiences it (this is the starting assumption, it seems we agree that such an X exists, or at least something similar to X)
2. X, experienced on a large scale by many minds, is bad
3. Causing X on a large scale is bad
4. When considering what to do, I'll discard actions that cause X, and choose other options instead.

Now, some people will object and say that there are holes in this chain of reasoning, i.e. that 2 doesn't logically follow from 1, or 3 doesn't follow from 2, or 4 doesn't follow from 3. For the sake of this discussion, let's say that you object to the step from 1 to 2. Then, what about this replacement:

1. X is a state of mind that feels bad to whatever mind experiences it [identical to original 1]
2. X, experienced on a large scale by many minds, is good [replaced 'bad' with 'good']

Does this passage from 1 to 2 seem, to you (our hypothetical objector), equally logical / rational / coherent / right / sensible / correct as the original step from 1 to 2? Could I replace 'bad' with basically anything, and the correctness would not change at all as a result?

My point is that, to many reflecting minds, the replacement seems less logical / rational / coherent / right / sensible / correct than the original step. And this is what I care about for my research: I want an AI that reflects in a similar way, an AI to which the original steps do seem rational and sensible, while replacements like the one I gave do not.

[-]Ben Amitay3yΩ11-1

That was good for my understanding of your position. My main problem with the whole thing, though, is in the use of the word "bad". I think it should be taboo at least until we establish a shared meaning.

Specifically, I think that most observers will find the first argument more logical than the second, because of a fallacy in using the word "bad". I think that we learn that word in a way that is deeply entangled with power reward mechanism, to the point that it is mostly just a pointer to negative reward, things that we want to avoid, things that made our parents angry... In my view, the argument is then basically:

I want to avoid my suffering, and more generally person p wants to avoid person p's suffering. Therefore suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for", therefore avoid creating suffering.

When written that way, it doesn't seem more logical than its opposite.

[-]Michele Campolo3yΩ110

To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step

suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for"

seems weak.

Overall, what is the intention behind your comments? Are you trying to understand my position even better, and if so, why? Are you interested in funding this kind of research; or are you looking for opportunities to change your mind; or are you trying to change my mind?

[-]Ben Amitay3yΩ110

Since I became reasonably sure that I understand your position and reasoning - mostly changing it.


