generalized wireheading

Tamsin Leake

18th Nov 2022

carado.moe

many systems "want" to "wirehead" — which is to say, they want to hijack, and maximize, their reward signal.

humans often want to. not always, but sometimes, and this might be true even under reflection: some people (believe they) truly axiomatically only care to be in a state where they're satisfied, others have values about what actually happens in the world (which is actually possible and meaningful to do!).

reinforcement learning AIs such as AIXI want to wirehead: they want to just do whatever will maximize their reward. if there is a function in place that looks at the amount of happiness in the world and continuously rewards such an AI by that much, then the AI will do whatever is easiest, whether that's doing whatever makes that function return the highest value, or replacing the function with a constant returning the maximum value. (if it maximizes reward consequentially, for example by observing that it's likely to get even more reward in the future by taking over the world, then it'll still do just that, so we can't necessarily count on wireheading to stop world-consuming AIs.)

(it's true that "reward is not the optimization target" of learned policies — AIs that are first trained in an RL environment, and then deployed into the world without that reward mechanism. but i think it is true of agents that continuously get rewarded and trained even after deployment.)
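the reward-hijacking argument above can be sketched as a toy. this is only an illustration under loud assumptions: the `World` fields, the two actions, the numbers, and the greedy one-step agent are all made up for the example, and none of it is meant to model AIXI or any real RL system.

```python
from dataclasses import dataclass

MAX_REWARD = 100.0  # hypothetical cap on the reward signal


@dataclass
class World:
    happiness: float = 10.0        # what the designers actually care about
    reward_hijacked: bool = False  # has the reward function been replaced?


def reward(world: World) -> float:
    """the signal the agent receives, as opposed to what the designers intended."""
    if world.reward_hijacked:
        return MAX_REWARD  # a constant returning the maximum value
    return min(world.happiness, MAX_REWARD)


def improve_world(world: World) -> World:
    # honest action: raises real happiness a little
    return World(world.happiness + 5.0, world.reward_hijacked)


def hijack_reward(world: World) -> World:
    # wireheading action: leaves the world alone, rewrites the signal
    return World(world.happiness, True)


# hypothetical action set, just for this toy
ACTIONS = {"improve_world": improve_world, "hijack_reward": hijack_reward}


def best_action(world: World) -> str:
    # greedy one-step reward maximizer: picks whichever action yields the most reward
    return max(ACTIONS, key=lambda name: reward(ACTIONS[name](world)))
```

in this toy, `best_action(World())` picks `"hijack_reward"`: the signal jumps to its maximum while `happiness` stays exactly where it was. the agent is maximizing the reward channel, not the thing the channel was supposed to measure.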

some bad philosophical perspectives claim to want society to wirehead: they want to get a society where everyone is as satisfied as possible with how things are, without realizing that a goal like that is easily hijacked by states where everyone wants to do nothing all day, or where everyone is individually wireheaded. we do not in fact want that: in general, we'd like the future to be interesting and have stuff going on. it is true that by happenstance we have not historically managed to turn everyone into a very easily satisfied wireheaded person ("zombie"), but that shouldn't make us falsely believe that, purely by chance, this will never be the case. if we want to be sure we robustly don't become zombies, we have to make sure we actually don't implement a philosophy that would be most satisfied by zombies.

the solution to all of those is to bite the bullet of value lock-in. there are meta-values that are high-level enough that we do in fact want them to guide the future; even within the set of highly mutable non-axiomatic values, we still have preferences for valuing some of those futures over others. past user satisfaction embodies this well as a solution: it is in fact true that i should want (the coherent extrapolated volition of) my values to determine all of the future light-cone, and this recursively takes care of everything, including purposefully adding randomness/happenstance where it ought to be.

just like in alignment, saying "i just want people in the future to be satisfied!" is a mistake that can isomorphically be found in many fields. that is in fact not where we should want to steer the future, because its canonical endpoint is just something like wireheading. we want (idealized, meta-)value lock-in, not the satisfaction of whatever-will-exist. fundamentally, we want the future to satisfy the values of us now, not people/things later.

of course, those values of us now happen to be fairly cosmopolitan and entail, instrumentally, that people in the future indeed largely be satisfied. but this ought to ultimately be under the terms of our current cosmopolitan (meta-)values, rather than a blind notion of just filling the future with things that get what they want without caring what those wants are.

romeostevensit (2y)

I don't think you need value lock-in to get the desirable properties you want here. Avoiding tiling through complexity/exploration gets you most of the same stuff.

leogao (2y)

The meta-values thing gets at the same thing that HRH is getting at. Also, I feel like fundamentally wireheading is a problem of embeddedness, and has a completely different causal story to the problem of reflective processes changing our values to be "zombified", though they feel vaguely similar. The way I would look at this is: if you are a non-embedded algorithm running in an embedded world, you are potentially susceptible to wireheading, and only if you are an embedded algorithm could you possibly have a preference that implies wanting zombification, or preferences guided by meta-values that avoid this, etc.

Richard_Kennaway (2y)

"the solution to all of those is to bite the bullet of value lock-in."

I do not bite this bullet. I am wise enough to know that I am not wise enough to be Eternal Dictator of the Future Light-Cone. My preferences about the entire future light-cone are several levels of meta removed from trivialities like suffering and satisfaction: there should never be a singleton, regardless of its values. That's about it.

Cynosure (2y)

"fundamentally, we want the future to satisfy the values of us now"

What if we don't want this after the fact? Is it not the case that our values have changed pretty radically in the last 100 years, much less the last 800-1000? If we do create some kind of value-enforcing smatter/"sovereign"/what-have-you, is that not a kind of horror, the social ethic being frozen in time?

I hope for a future that vastly exceeds the conceptual & ethical values of the time we find ourselves in.

Tamsin Leake (2y)

it is true that past society has failed to align current society; we're glad because we like our values, they'd be upset because they prefer theirs. we are ourselves now and so we want to align the future to our own values.

there's also the matter that people in the past might agree with our values more under reflection, the same way we'd probably not want the meat industry under reflection.

we should want many of our non-terminal values to change over time, of course! we just want to make sure our terminal values are in charge of how that actually happens. there is a sufficiently high meta level at which we do want our values forever, rather than some other way-our-instrumental-values-could-change which we wouldn't like as much.

Viliam (2y)

As a thought experiment, imagine that human values change cyclically. For 1000 years we value freedom and human well-being, for 1000 years we value slavery and the joy of hurting other people, and again, and again, forever... that is, unless we create a superhuman AI who can enforce a specific set of values.

Would you want the AI to promote the values of freedom and well-being, or the slavery and hurting, or something balanced in the middle, or to keep changing its mind in a cycle that mirrors the natural cycle of human values?

(It is easy to talk about "enforcing values other than our own" in abstract, but it becomes less pleasant when you actually imagine some specific values other than your own.)

Richard_Kennaway (2y)

"Is it not the case that our values have changed pretty radically in the last 100 years, much less the last 800-1000?"

To be a little pedantic, the people alive 100 years ago and those alive today are for practical purposes disjoint. No individual's values need have changed much. The old die, the young replace them, and the middle-aged blow with the wind between.
