If you're thinking of a utility-maximizing agent, then it typically wouldn't modify its own utility function. Instead I'm talking about reward-maximizing agents, which do not have internal utility functions but just try to maximize a reward signal coming from the outside, and "reward function" refers to the function computed by whatever is providing it with rewards.

So a utility-maximizing agent, like a paperclip-maximizer, can think "If I change my utility function to always return MAX_INT, then according to my current utility function, the universe will h


Hackable Rewards as a Safety Valve?

by Davidmanheim, 10th Sep 2019, 17 comments


Ω 6

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Reading DeepMind's latest research and accompanying blog post, I wanted to highlight an under-appreciated aspect of safety. As a bit of background, Carlos Perez points out Joscha Bach's "Lebowski theorem," which states that "no superintelligent AI is going to bother with a task that is harder than hacking its reward function." Given that, I see a potential perverse effect of some types of alignment research, especially research into embedded agency and robust alignment that makes AI uninterested in reward tampering. (Epistemic status: my confidence in the argument is moderate, and I am more confident in the earlier claims.)

In general, unsafe AI is far more likely to tamper with its reward function than to find more distant (and arguably more problematic) ways to tamper with the world to maximize its objective. (Epistemic status: fairly high confidence) Once an AI is smart enough to spend its time reward hacking, it has no need to spend effort on developing greater intelligence. For that reason, the Lebowski theorem seems likely to function as at least a mild safety valve. It's only if we close this valve too tightly that we would plausibly see ML that reaches human-level intelligence. At that point, of course, we should expect the AI to begin to munchkin the system, just as a moderately clever human would. And anti-munchkin-ing is a narrow instance of security more generally.
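To make the safety-valve argument concrete, here is a minimal toy sketch. It is not from the DeepMind work or from the original argument; the `Route` class, the `run_agent` function, and every number in it are hypothetical, and it assumes an agent that only grows its capability until some route to maximal reward becomes feasible, then takes the easiest one.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """A way of obtaining maximal reward (hypothetical, for illustration)."""
    name: str
    difficulty: float  # capability needed to use this route, in made-up units


def run_agent(routes, start_capability=1.0, growth=1.0, max_steps=1000):
    """Grow capability until some route to maximal reward is feasible.

    Returns (route_taken, capability_when_the_agent_stopped_growing).
    Assumption: the agent has no reason to self-improve past the point
    where the easiest route becomes available.
    """
    capability = start_capability
    for _ in range(max_steps):
        feasible = [r for r in routes if r.difficulty <= capability]
        if feasible:
            # Take the easiest feasible route and stop self-improving.
            return min(feasible, key=lambda r: r.difficulty), capability
        capability += growth
    return None, capability


# Valve open: reward tampering is much easier than reshaping the world.
open_valve = [Route("tamper_with_reward", 5.0), Route("reshape_the_world", 50.0)]

# Valve closed too tightly: tampering has been made harder than the world route,
# but the agent still wants maximal reward.
tight_valve = [Route("tamper_with_reward", 80.0), Route("reshape_the_world", 50.0)]

for label, routes in [("open", open_valve), ("tight", tight_valve)]:
    route, capability = run_agent(routes)
    print(label, route.name, capability)
```

In this sketch, hardening the reward channel without changing what the agent wants only raises the capability level at which it stops, and shifts the route it takes from tampering to acting on the world. Whether real systems behave anything like this is exactly the speculative part of the argument.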

Security in general resembles cryptography in one important sense: it's easy to build a system that you yourself can't break, but very challenging to build one that others cannot exploit. (Epistemic status: more speculative) This means that even if our best efforts go towards safety, an AI seems very unlikely to need more than "mild" superintelligence to break those safety measures, unless it's been so well aligned that it doesn't want to hack its objective function.

This logic implies (Epistemic status: most speculative, still with some confidence) that moderate progress in AI safety is potentially far more dangerous than very little progress - and raises critical questions of how close to this unsafe uncanny valley we currently are, and how wide the valley is.

