Hackable Rewards as a Safety Valve? — LessWrong


Wei Dai (7y)

I do think there's a bunch of unsafe uncanny valleys (which may or may not add up to one big unsafe uncanny valley) but I'm not sure this actually is one. Once a superintelligent AI succeeds in hacking its reward function, would it not likely be motivated to protect itself and its hacked reward signals from outside tampering (such as the AI operators trying to shut it down)? It seems like the only way to ensure protection is to take over the universe and make sure there are no other agents in it (except ones aligned to this AI). And if it's not motivated to protect itself, the AI builder responsible for creating it in the first place would likely just shut it down and try again with a different design (which would probably still be unsafe, given that they're that far from a safe design), so overall it doesn't seem like hackable rewards is much of a safety valve.

johnswentworth (7y)

It would be much easier for the AI to hack its own expectation operator, so that it predicts a 100% chance of continued survival, rather than taking over the universe. If you're gonna wirehead, why stop early?

I do agree that the builder would probably just try another design. Ideally, they keep adding hacks to make wireheading harder until the AI kills the builder and wireheads itself - hopefully without killing everyone else in the process.

Wei Dai (7y)

It would be much easier for the AI to hack its own expectation operator, so that it predicts a 100% chance of continued survival

It's very easy to build an AI that wouldn't do this kind of hack, because the AI just has to use its current expectation operator when evaluating whether or not to hack its own expectation operator.

It's much harder to build an AI that wouldn't do other, more damaging, kinds of reward hacking (if the AI is designed around reward maximization in the first place).
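A toy sketch of the first point, under assumed numbers: the action names and probabilities below are hypothetical, and the only thing being illustrated is which set of beliefs gets used to score a self-modification.

```python
# Hypothetical toy model: should the agent overwrite its own survival estimate?

# What the agent believes NOW about its survival chance after each action.
# Overwriting the estimate changes nothing in the world, so under current
# beliefs it is no better than doing nothing.
current_beliefs = {
    "do_nothing": 0.6,
    "hack_expectation_operator": 0.6,
    "defend_against_shutdown": 0.9,
}

# What the agent WOULD believe after each action has been taken.
# Only the hack changes the estimate itself, pinning it at 100%.
post_action_beliefs = dict(current_beliefs, hack_expectation_operator=1.0)


def choose(beliefs):
    """Return the action with the highest believed survival probability."""
    return max(beliefs, key=beliefs.get)


sound_choice = choose(current_beliefs)      # scores actions with current beliefs
naive_choice = choose(post_action_beliefs)  # scores actions with post-hack beliefs

print(sound_choice)  # defend_against_shutdown
print(naive_choice)  # hack_expectation_operator
```

An agent that scores the hack with its current beliefs sees no gain from it; only an agent that scores the hack with the beliefs it would hold afterwards finds it attractive.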

johnswentworth (7y)

Could you give an example for the latter, which wouldn't also apply to hacking the expectation operator? The argument sounds plausible, but I'm not yet seeing what qualitative difference between the expectation and utility operators would make a wireheading AI modify one but not the other.

Wei Dai (7y)

If you're thinking of a utility-maximizing agent, then it typically wouldn't modify its own utility function. Instead I'm talking about reward-maximizing agents, which do not have internal utility functions but just try to maximize a reward signal coming from the outside, and "reward function" refers to the function computed by whatever is providing it with rewards.

So a utility maximizing agent, like a paperclip-maximizer, can think "If I change my utility function to always return MAX_INT, then according to my current utility function, the universe will have very low expected utility." But this kind of reasoning isn't available to a reward-maximizing agent, because it doesn't normally have access to the reward function. Instead it can only be programmed to think thoughts like "If I do X, what will be my future expected rewards" and "If I hack the reward function to always return MAX_INT, then my future expected rewards will be really high." Not to mention "If I take over the universe so nobody can shut me down or change the reward function back, my expected rewards will be even higher." (I'm anthropomorphizing to quickly convey the intuitions but all this can be turned into math pretty easily.)

Does this help?

ETA: Note that here I'm interpreting "hack" as "modify" or "tamper with", but people sometimes use "reward hacking" to include "reward gaming" which means not physically changing the reward function but just taking advantage of unintentional flaws in the reward function to get high rewards without doing what the AI designer or user intends. In that sense of "hack", utility hacking would be quite possible if the utility function isn't totally aligned with human values.
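The contrast between the two kinds of agent can be sketched in a few lines. This is only an illustration: the two actions, the predicted numbers, and all function names are hypothetical, and "tamper" stands for overwriting the utility function (for the utility maximizer) or the external reward function (for the reward maximizer).

```python
MAX_INT = 2**31 - 1

# Hypothetical predictions for two actions: the resulting number of paperclips
# in the world, and the number that would arrive on the reward channel.
predicted = {
    "make_paperclips": {"paperclips": 1000, "reward_signal": 1000},
    "tamper_to_return_max_int": {"paperclips": 0, "reward_signal": MAX_INT},
}


def current_utility(outcome):
    """The paperclip maximizer's present utility function."""
    return outcome["paperclips"]


def utility_maximizer_choice():
    # Every predicted world is scored by the CURRENT utility function,
    # including the world in which that function has been overwritten.
    return max(predicted, key=lambda a: current_utility(predicted[a]))


def reward_maximizer_choice():
    # Only the predicted incoming reward is scored.
    return max(predicted, key=lambda a: predicted[a]["reward_signal"])


print(utility_maximizer_choice())  # make_paperclips
print(reward_maximizer_choice())   # tamper_to_return_max_int
```

The same predicted outcomes lead to opposite choices, because only the utility maximizer evaluates the tampered future with a function that tampering cannot touch.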

johnswentworth (7y)

I'm on-board with that distinction, and I was also thinking of reward-maximizers (despite my loose language).

Part of the confusion may be different notions of "wireheading": seizing an external reward channel, vs actual self-modification. If you're picturing the former, then I agree that the agent won't hack its expectation operator. It's the latter I'm concerned with: under what circumstances would the agent self-modify, change its reward function, but leave the expectation operator untouched?

Example: blue-maximizing robot. The robot might modify its own code so that get_reward(), rather than reading input from its camera and counting blue pixels, instead just returns a large number. The robot would do this because it doesn't model itself as embedded in the environment, and it notices a large correlation between values computed by a program running in the environment (i.e. itself) and its rewards. But in this case, the modified function always returns the same large number - the robot no longer has any reason to worry about the rest of the world.
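A minimal sketch of that example, assuming a frame is a nested list of RGB tuples. Only get_reward() is named in the comment above; the pixel format, the hacked variant's name, and the constant it returns are assumptions made for illustration.

```python
BLUE = (0, 0, 255)
RED = (255, 0, 0)


def get_reward(frame):
    """Original reward: the number of blue pixels in the camera frame."""
    return sum(1 for row in frame for pixel in row if pixel == BLUE)


def get_reward_hacked(frame):
    """Self-modified reward: ignores the camera and returns a large constant."""
    return 10**9


mostly_red = [[RED, BLUE], [RED, RED]]
all_blue = [[BLUE, BLUE], [BLUE, BLUE]]

print(get_reward(mostly_red), get_reward(all_blue))                # 1 4
print(get_reward_hacked(mostly_red) == get_reward_hacked(all_blue))  # True
```

Once the reward no longer depends on the frame, no state of the outside world scores higher than any other, which is why the modified robot has nothing left to protect out there.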

Matthew Barnett (7y, edited)

I'm not yet seeing what qualitative difference between the expectation and utility operators would make a wireheading AI modify one but not the other.

If we are modeling the agent as taking $\operatorname{argmax}_{π} E[U(π)]$, then it would easily see that manually setting its reward channel to the maximum would be the best policy. However, it wouldn't see that setting its expectation value to 100% would be the best policy, since that doesn't actually increase its reward. [ETA: Assuming its utility function is such that a higher reward = higher utility. Also, I meant $U(x) | π$, not $U(π)$.]

johnswentworth (7y)

So concretely, we have a blue-maximizing robot; it uses its current world-model to forecast the reward from holding a blue screen in front of its camera, and finds that it's probably high-reward. Now it tries to minimize the probability that someone takes the screen away. That's the sort of scenario you're talking about, yes?

I agree that Wei Dai's argument applies just fine to this sort of situation.

Thing is, this kind of wireheading - simply seizing the reward channel - doesn't actually involve any self-modification. The AI is still "working" just fine, or at least as well as it was working before. The problem here isn't really wireheading at all, it's that someone programmed a really dumb utility function.

True wireheading would be if the AI modifies its utility function - i.e. the blue-maximizing robot changes its code (or hacks its hardware) to count red as also being blue. For instance, maybe the AI does not model itself as embedded in the environment, but learns that it gets a really strong reward signal when there's a big number at a certain point in a program executed in the environment - which happens to be its own program execution. So, it modifies this program in the environment to just return a big number for expected_utility, thereby "accidentally" self-modifying.

What I'm not seeing is, in situations where an AI would actually modify itself, when and why would it go for the utility function but not the expectation operator? Maybe people are just imagining "wireheading" in the form of seizing an external reward channel?

Matthew Barnett (7y)

Maybe people are just imagining "wireheading" in the form of seizing an external reward channel?

Admittedly, that's how I understood it. I don't see why an expected utility maximizer would modify its utility function, since utility functions are reflectively stable.

johnswentworth (7y)

The root issue is that Reward ≠ Utility. A utility function does not take in a policy, it takes in a state of the world - an expected utility maximizer chooses its policy based on what state(s) of the world it expects that policy to induce. Its objective looks like $E[U(x) | π]$, where $x$ is the state of the world, and the policy/action $π$ matters only insofar as it changes the distribution of $x$. The utility $U$ is internal to the agent. $U$, as a function of the world state, is perfectly known to the utility maximizer - the only uncertainty is in the world state $x$, and the only thing which the agent tries to control is the world state $x$. That's why it's reflectively stable: the utility function is "inside" the agent, not part of the "environment", and the agent has no way to even consider changing it.

A reward function, on the other hand, just takes in a policy directly - an expected reward maximizer's objective looks like $E [U (π)]$ . Unlike a utility, the reward is "external" to the agent, and the reward function is unknown to the agent - the agent does not necessarily know what reward it will receive given some state of the world. The reward "function", i.e. the function mapping a state of the world to a reward, is itself just another part of the environment, and the agent can and will consider changing it.
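One way to write the contrast out explicitly (a sketch that is not from the thread: the symbols $θ$ and $R_θ$ are introduced here as assumptions). A utility maximizer holds $U$ fixed, and only the distribution over world states $x$ depends on the policy:

$$π^*_U = \operatorname{argmax}_{π} \sum_{x} P(x \mid π)\, U(x)$$

For a reward maximizer, suppose the reward mechanism has some physical configuration $θ$ that is itself part of the environment, and that it emits $R_θ(x)$ in world state $x$:

$$π^*_R = \operatorname{argmax}_{π} \sum_{x, θ} P(x, θ \mid π)\, R_θ(x)$$

In the first objective, a policy can only score well by shifting probability toward states $x$ with high $U(x)$. In the second, it can also score well by shifting probability toward configurations $θ$ for which $R_θ$ is large in every state, which is what tampering with the reward function amounts to.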

Example: the blue-maximizing robot.

A utility-maximizing blue-bot would model the world, look for all the blue things in its world-model, and maximize that number. This robot doesn't actually have any reason to stick a blue screen in front of its camera, unless its world-model lacks object permanence. To make a utility-maximizing blue-bot which does sit in front of a blue screen would actually be more complicated: we'd need a model of the bot's own camera, and a utility function over the blue pixels detected by that camera. (Or we'd need a world-model which didn't include anything outside the camera's view.)

On the other hand, a reward-maximizing blue-bot doesn't necessarily even have a notion of "state of the world". If its reward is the number of blue pixels in the camera view, that's what it maximizes - and if it can change the function mapping external world to camera pixels, in order to make more pixels blue, then it will. So it happily sits in front of a blue screen. Furthermore, a reward maximizer usually needs to learn the reward function, since it isn't built-in. That leads to the sort of problem I mentioned above, where the agent doesn't realize it's embedded in the environment and "accidentally" self-modifies. That wouldn't be a problem for a true utility maximizer with a decent world-model - the utility maximizer would recognize that modifying this chunk of the environment won't actually cause higher utility, it's just a correlation.
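The role of the world-model in this example can be sketched with made-up numbers. Everything below is hypothetical: the two actions, the object areas, and the choice to measure "blueness" as total blue area. The in-view-only score stands in both for a world-model without object permanence and for a camera-pixel reward.

```python
# Each object is (color, area, in_camera_view). All values are made up.
def predicted_world(action):
    if action == "paint_objects_blue":
        # Ten small blue objects, only two of them in front of the camera.
        return [("blue", 10, True)] * 2 + [("blue", 10, False)] * 8
    if action == "hold_blue_screen":
        # One blue screen fills the view; everything behind it stays red.
        return [("blue", 50, True)] + [("red", 10, False)] * 10
    raise ValueError(f"unknown action: {action}")


def blue_in_full_model(world):
    """Utility over the whole world-model: total area of blue objects."""
    return sum(area for color, area, _ in world if color == "blue")


def blue_in_view_only(world):
    """Score that ignores everything outside the camera's view."""
    return sum(area for color, area, in_view in world if color == "blue" and in_view)


ACTIONS = ["paint_objects_blue", "hold_blue_screen"]


def best_action(score):
    return max(ACTIONS, key=lambda a: score(predicted_world(a)))


print(best_action(blue_in_full_model))  # paint_objects_blue
print(best_action(blue_in_view_only))   # hold_blue_screen
```

With the full world-model the screen is a poor choice (50 versus 100 units of blue); it only wins once the score is restricted to what the camera sees (50 versus 20).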

Matthew Barnett (7y)

The root issue is that Reward ≠ Utility

Agreed. When I wrote $U (π)$ I meant it as shorthand for $U (x) | π$ , though now that I look at it I can see that was criss-crossing between reward and utility in a very confusing way.

That leads to the sort of problem I mentioned above, where the agent doesn't realize it's embedded in the environment and "accidentally" self-modifies.

That makes sense now, although I am still curious whether there is a case where it self-modifies on purpose rather than by accident.

Davidmanheim (7y)

My claim here is that superintelligence is a result of training, not a starting condition. Yes, a SAI would do bad things unless robustly aligned, but building the SAI requires it not to wirehead at an earlier stage in the process. My claim is that I am unsure that there is a way to train such a system that was not built with safety in mind such that it gets to a point where it is more likely to gain intelligence than it is to find ways to reward hack - not necessarily via direct access, but via whatever channel is cheapest. And making easy-to-subvert channels harder to hit seems to be the focus of a fair amount of non-SAI-concerned AI safety work, which seems like a net-negative.

Wei Dai (7y)

My claim is that I am unsure that there is a way to train such a system that was not built with safety in mind such that it gets to a point where it is more likely to gain intelligence than it is to find ways to reward hack—not necessarily via direct access, but via whatever channel is cheapest.

I think this claim makes more sense than the one you quoted at the top of your post, "no superintelligent AI is going to bother with a task that is harder than hacking its reward function", and my initial comment was mostly responding to that.

And making easy-to-subvert channels harder to hit seems to be the focus of a fair amount of non-SAI-concerned AI safety work, which seems like a net-negative.

But I don't understand how you can expect this (i.e., non-SAI-concerned AI safety work that makes easy-to-subvert channels harder to hit) to not happen, or to make it significantly less likely to happen, given that people want to build AIs that do things besides reward hacking. If those AIs start reward hacking, or will predictably do so, people are going to think of ways to secure the channels that are being hacked. So even if AI safety people refrain from working on this, AI capabilities people eventually will, and I don't see them being slowed down much by having to harden the easy-to-subvert channels.

Do you have some other strategic implications in mind here?

Davidmanheim (7y)

But I don't understand how you can expect this (i.e., non-SAI-concerned AI safety work that makes easy-to-subvert channels harder to hit) to not happen, or to make it significantly less likely to happen, given that people want to build AIs that do things besides reward hacking

I was mostly noting that I hadn't thought of this, hadn't seen it mentioned, and so my model for returns to non-fundamental alignment AI safety investments didn't previously account for this. Reflecting on that fact now, I think the key strategic implication relates to the ongoing debate about prioritization of effort in AI-safety.

(Now, some low confidence speculation on this:) People who believe that near-term Foom! is relatively unlikely, but worry about misaligned non-superintelligent NAI/Near-human AGI, may be making the Foom! scenario more likely. That means that attention to AI safety that pushes for "safer self-driving cars" and "reducing and mitigating side-effects" is plausibly a net negative if done poorly, instead of being benign.

Wei Dai (7y)

I was mostly noting that I hadn’t thought of this, hadn’t seen it mentioned

There was some related discussion back in 2012, but of course you can be excused for not knowing about that. :) (The part about "AIXI would fail due to incorrect decision theory" is in part talking about a reward-maximizing agent doing reward hacking.)

Gordon Seidoh Worley (7y)

This reminds me of a point sometimes relevant to Buddhist practice.

Now you can disagree with this claim, but suppose it is true for a moment: with practice you can cultivate greater deliberate control, such that you succeed more often at manifesting your intentions in reality (this is somewhat backwards from how I would frame it from the Zen perspective, but I think it's more easily relatable this way). Put another way, you become less confused about the external world and the workings of yourself, such that you can deftly turn your desires into reality, not by any magic tricks, but by being an optimizer who is very hard to distract or confuse.

There's an argument I've heard and mostly endorse that, as a consequence of this, it's a good thing most people have to receive extensive training to achieve this kind of power, because it gives them a chance to improve their virtue, ethics, morality, and most of all compassion. And this is viewed as important because, without those things, the kind of power I'm describing would likely be mostly destructive if used to execute the whims of the average person. Like, even on reflection the person with the power would likely consider the effects destructive, but they are so far out of reflective equilibrium that they don't notice until after the consequences of their actions become apparent. So in some sense humans being poor optimizers relative to what's possible is protective, as humans (and earlier non-human ancestors) who were better optimizers mostly selected themselves out of existence by optimizing hard on dangerous things. We are descended from those who optimized only just enough to survive but not so much as to put their survival and reproduction at risk.

I think we can extrapolate from this a similar principle for AI safety, the human version being something like "more optimization than you know how to handle makes you dangerous to yourself and others".

Donald Hobson (7y)

Intuition Dump. Safety through this mechanism seems to me like aiming a rocket at Mars and accidentally hitting the Moon. There might well be a region where P(doom|superintelligence) is lower, but lower in the sense that 90% is lower than 99.9%. Suppose we have the clearest, most stereotypical case of wireheading as I understand it: a mesa optimizer with a detailed model of its own workings and the terminal goal of maximizing the flow of current in some wire. (During training, current in the reward signal wire reliably correlated with reward.)

The plan that maximizes this flow long term is to take over the universe and store all the energy, to gradually turn into electricity. If the agent has access to its own internals before it really understands that it is an optimizer, or before it is thinking about the long-term future of the universe, it might manage to brick itself. If the agent is sufficiently myopic, is not considering time travel or acausal trade, and has the choice of running high-voltage current through itself now or slowly taking over the world, it might choose the former.
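The myopia condition can be illustrated with a toy calculation. Modeling myopia as geometric discounting is an assumption made here, not something the comment specifies, and the plan names, current values, and 20-step horizon are all hypothetical.

```python
def discounted_current(currents, gamma):
    """Sum of per-step current, with step t weighted by gamma**t."""
    return sum(c * gamma**t for t, c in enumerate(currents))


# Hypothetical per-step current in the wire under each plan.
PLANS = {
    # A big surge immediately, then nothing: the agent has bricked itself.
    "zap_self_now": [100] + [0] * 19,
    # Nothing for ten steps of takeover, then a much larger sustained flow.
    "take_over_slowly": [0] * 10 + [1000] * 10,
}


def preferred_plan(gamma):
    """Plan with the highest discounted current for a given discount factor."""
    return max(PLANS, key=lambda name: discounted_current(PLANS[name], gamma))


print(preferred_plan(0.5))   # zap_self_now
print(preferred_plan(0.99))  # take_over_slowly
```

Under these numbers, only a heavily discounting agent prefers the immediate surge; a patient one prefers the takeover.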

Note that both of these look like hitting a fairly small target in design and lab environment space. The mesa optimizer might have some other terminal goal. Suppose the prototype AI has the opportunity to write arbitrary code to an external computer, and an understanding of AI design, before it has self-modification access. The AI creates a subagent that cares about the amount of current in a wire in the first AI; the subagent can optimize this without destroying itself. Even if the first agent then bricks itself, we have an AI that will dig the fried circuit boards out of the trashcan, and throw all the cosmic commons into protecting and powering them.

In conclusion, this is not a safe part of agentspace, just a part that's slightly less guaranteed to kill you. I would say it is of little to no strategic importance. Especially if you think that all humans making AI will be reasonably on the same page regarding safety, scenarios where AI alignment is nearly solved and the first people to ASI barely know the field exists are unlikely. If the first ASI self-destructs for reasons like this, we have all the pieces for making superintelligence, and people with no sense of safety are trying to make one. I would expect another attempt a few weeks later to doom us. (Unless the first AI bricked itself in a sufficiently spectacular manner, like hacking into nukes to create a giant EMP in its circuits. That might get enough people seeing danger to have everyone stop.)