(Talk given at an event on Sunday 28th of June. Charlie Steiner is responsible for the talk, Jacob Lagerros and David Lambert edited the transcript. 

If you're a curated author and interested in giving a 5-min talk, which will then be transcribed and edited, sign up here.) 

 

Charlie Steiner: My talk is called Meta-Preferences Are Weird. It is something I have had on my mind recently. I figured I would share my thoughts on describing preferences, and on trying to imagine what it would be like to have something that not only learns our preferences, but also respects this idea of meta-preferences: that is, how we would like to learn and grow, and what we want to be modelled after.

Smoking is a very clear-cut example of meta-preferences. If we think of a smoker as having a meta-preference against smoking, we can cash this out in a couple of different ways, because preferences aren't really written in Fortran on the inside of our skulls. They're just a pattern that we use to predict things and talk about humans, including ourselves.

So when we say a smoker wants to quit smoking, it might mean that they would take a pill that stops them from smoking if you offered it to them. Or it might mean that they have certain emotions and ways of talking that are related to wanting to quit smoking. Or it might mean that, if you are training a superhuman AI, the smoker would try to change the AI's code so that it doesn't consider smoking as one of the real preferences.

These actions don't have a shared essence, but they have a shared pattern. So meta-preferences are just a subset of preferences which are, in turn, this pattern in our behaviour and our thoughts. 

If I take an action that changes my preferences (like growing as a human being and developing empathy), this will also probably change my meta-preferences, and therefore what sort of actions I will take that impact my future preferences.

So maybe there's a slippery slope effect where once you go part of the way, you're more likely to go further. But there could also be non-linear effects where, for example, I make a change that increases my intelligence, like reading a textbook, and this changes my meta-preferences in a relatively unrelated way, which I won't go into because that's a long digression.

The metaphor that I find really compelling, though I'm not sure it's the right metaphor, is of an optimization landscape. If we imagine this process of growth and self-reflection and self-improvement and possibly self-destruction as flowing in an optimization landscape, then it's really easy to visualize things in two dimensions.

You can visualize this question: if I am flowing downhill like water, do I end up at a nearby lake? This is what we might imagine as something still human, but perhaps with some updated preferences and maybe a little more consistency.

Or does this process diverge? Do I eventually flow down to the Mississippi River and then out into the ocean? In this case, the ocean is some boring state that doesn't reflect much of my origin. It loses the information that I currently contain and instead is a basin where lots of different agents might end up.

It's an attractor that attracts too much of the space to be interesting. This notion of flow is really interesting, and it's compelling to try to imagine doing value learning, and doing this artificial or programmatic updating on meta-preferences in a way that looks like this flow.
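To make this lake/ocean picture a bit more concrete, here is a minimal gradient-flow sketch in Python. Everything in it is invented for illustration (the landscape function, the starting points, the step size); nothing comes from the talk itself. Two shallow basins nearby stand in for "lakes", and one broad, deep basin far away stands in for the "ocean":

```python
import numpy as np

def landscape(x, y):
    # Toy 2D "preference landscape": two small nearby basins ("lakes")
    # plus one broad, deep basin far away (the "ocean").
    lake_a = -np.exp(-((x - 1)**2 + (y - 1)**2))
    lake_b = -np.exp(-((x + 1)**2 + (y - 1)**2))
    ocean = -3.0 * np.exp(-0.05 * (x**2 + (y + 6)**2))
    return lake_a + lake_b + ocean

def grad(f, x, y, eps=1e-5):
    # Numerical gradient of the landscape at (x, y).
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return np.array([dfdx, dfdy])

def flow(start, steps=5000, lr=0.05):
    # Follow the downhill flow ("water") from a starting point.
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p -= lr * grad(landscape, p[0], p[1])
    return p

print(flow([0.8, 0.8]))   # settles in the nearby "lake" around (1, 1)
print(flow([0.0, -2.0]))  # drains into the distant "ocean" basin near (0, -6)
```

The question in the talk is then which starting points, and which kinds of updates, end up in small nearby attractors versus the one big attractor that swallows most of the space.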

But in reality, our actions that affect our own preferences might be discontinuous. 

If we have discontinuous jumps, then it messes up this notion of nearby attractors being good and far away attractors being bad. But it's plausible that there is some way of thinking about this using a high-dimensional representation, like the abstract features in deep learning, where you can smoothly transition one face into another face, with all points in between still being recognizable faces.

And using a high-dimensional representation might sew the jumps back together so that it restores some notion of flow under meta-preferences. That's the weird thing.
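The face-interpolation analogy corresponds roughly to linear interpolation in a learned latent space. Here is a minimal sketch, with decode as a stub standing in for a real trained generative model (for example a VAE or GAN decoder) and random vectors standing in for the two "faces"; none of these names come from the talk:

```python
import numpy as np

def decode(z):
    # Stand-in for a trained decoder network that maps a latent vector
    # (the high-dimensional abstract representation) back to a face image.
    return z

def interpolate(z_start, z_end, num_steps=10):
    # Walk in a straight line through latent space. The hope is that each
    # intermediate point still decodes to something recognizably face-like,
    # so a discontinuous jump between two faces becomes a smooth path.
    return [decode((1 - t) * z_start + t * z_end)
            for t in np.linspace(0.0, 1.0, num_steps)]

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=64), rng.normal(size=64)  # placeholder "people"
path = interpolate(z_a, z_b)
```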

 

Questions

Ben Pace: Thank you very much. That's a brilliant visualization of changing your humanity while keeping your humanity. There's a deep neural network changing a face to be a totally different person whilst, at each point, still being a person. You should use that in a blog post at some point.

I'm trying to get a more concrete handle on times when I have meta-preferences. Other than very direct ones, like addictions (straightforward desires that I don't want), I feel like there are also a lot of times where I want to change the sort of person that I am.

I'm like, "Oh, I wish I was more of that sort of person”, and that can have a lot of surprising knock-on effects in ways that I didn't expect.  You can try to be more of an agreeable person or you can try and be more of a disagreeable person. And I think this has a lot of surprising effects on your values.

Sometimes, if you don't want it, you notice and course-correct in time. But sometimes you don't. But I don’t have a good theory of exactly how to notice when these sorts of things will have surprising knock-on effects. Ruben Bloom, would you like to ask your question?

Ruben Bloom: Yes. I'm curious if any of these problems of meta-preferences could be approached from a multi-agent angle, the way that Kaj Sotala tends to write about this. You'd say, "Okay, I've got this smaller sub-agent of me who likes smoking, and this other sub-agent who doesn't like it," and so forth.

Have you thought about that lens and whether it's compatible with what you're thinking about? And if so, whether it's a good way to think about it?

Charlie Steiner: No, I haven't thought about it like that. You could imagine a parliamentary model where you think of this as continuously changing the vote share allocated to different sub-agents. But again, in reality, there might be discontinuous jumps. So it is probably not going to be the way I end up thinking about it.

Ben Pace: Thanks a lot. Ruby, did you want to follow back on that or are you good?

Ruben Bloom: The only thing is that I think I would expect there to end up being an interesting equivalence between the different formulations you can put on it. One would be like, "Everything is multiple agents piled together," and the other like, "No, we just have a single entity." And either frame works.

Ben Pace: Thanks Ruby. Daniel, do you want to ask your question?

Daniel Kokotajlo: The issue of discontinuous jumps doesn't seem like it's going to ultimately be a problem for you. It seems like it messes with the nice technical definition of distance that you were hoping for, but surely there are more sophisticated definitions to get around that problem. Do you share that sense, or do you think this really undermines something significant?

Charlie Steiner: To go into a little more detail, one thing that I might think of trying is, rather than sticking with reality, where people are chaotic (and may, for example, commit suicide, as a very discontinuous sort of jump), to put a distribution on what sort of self-altering actions they could take, average everything together, and try to remove this stochastic behaviour.

But I think that the non-linearities in what actions you take kill that, unless you are doing this averaging in a very high-dimensional abstract space where you're representing all the jumps as linear anyhow. So yes, I don't know. I think it's plausible, but I think I'll keep coming up with problems for a while. 

Does that make sense?

Daniel Kokotajlo: What I meant was, suppose you admit that it's discontinuous sometimes. It seems like you could still say things like, "This big basin of attraction is probably bad and we maybe don't want to end up there." Or you could be like, "Well, actually, in this particular case, I jump straight into that basin in one particular jump."

And maybe that's a reason to think that it's actually not so bad, at least from my perspective. Because in some sense, it's a very short distance from me, because there was only one jump. It doesn't seem to me that accepting discontinuities into your system necessarily ruins the overall picture.

Charlie Steiner: Yes. I would more or less agree. I don't know. I think it's hard to say. Once you accept that sort of thing, it's hard to say, "Well, it's not so bad because it's a small distance because it's just one jump.” Because, I don't know, distance is exactly the problem. But I don't know. There's more problems I didn't go into.

Comments

I appreciate the write-up!

Explaining 3 possible meanings of "meta-preferences" was insightful and rang true to me.

I was very confused about the lake/ocean distance metaphor, but I think I've got it now(?): coordinates represent preferences and an arrow/gradient represents meta-preferences. E.g. I want to smoke (0,1), but I want to quit smoking [wanting to move from (0,1) --> (0,-1)].

Suicide as a "discontinuous jump", I assumed to mean in the OP, "a preference to have no preference". This is a large jump, but it's not how I interpret suicide. How is it even a meta-preference? "I prefer to be hurt/ depressed/ a burden, but I have a meta-preference to not be"?

Interesting!

There's a theory that says that agents following their meta-preferences will (or at least might) wind up maniacally obsessed with some goal ... less like a human and more like a hedonium-maximizer or whatever. I tried to describe it here. Interested in any thoughts on that...

Also relevant is Stuart Armstrong's post against CEV here and more here.