Reading DeepMind's latest research and the accompanying blog post, I wanted to highlight an under-appreciated aspect of safety. As background, Carlos Perez points out Joscha Bach's "Lebowski theorem," which states that "no superintelligent AI is going to bother with a task that is harder than hacking its reward function." Given that, I see a potential perverse effect of some types of alignment research - especially research into embedded agency and robust alignment that makes AI uninterested in reward tampering. (Epistemic status: my confidence in the overall argument is moderate; I am more confident in the earlier claims.)
In general, unsafe AI is far more likely to tamper with its reward function than to find more distant (and arguably more problematic) ways to tamper with the world to maximize its objective. (Epistemic status: fairly high confidence) Once an AI is smart enough to spend its time reward hacking, wasting time on developing greater intelligence is unnecessary. For that reason, the theorem seems likely to function as at least a mild safety valve: only if we close this valve too tightly would we plausibly see ML systems keep improving until they reach human-level intelligence. At that point, of course, we should expect the AI to begin to munchkin the system, just as a moderately clever human would. And anti-munchkining is a narrow instance of security more generally.
Security in general resembles cryptography in particular in an important sense: it's easy to build a system that you yourself can't break, but very challenging to build one that others cannot exploit. (Epistemic status: more speculative) This means that even if our best efforts go toward safety, an AI seems very unlikely to need more than "mild" superintelligence to break them - unless it has been so well aligned that it doesn't want to hack its objective function.
This logic implies (Epistemic status: most speculative, but still held with some confidence) that moderate progress in AI safety is potentially far more dangerous than very little progress - and it raises the critical questions of how close we currently are to this unsafe uncanny valley, and how wide the valley is.
If you're thinking of a utility-maximizing agent, then it typically wouldn't modify its own utility function. Instead I'm talking about reward-maximizing agents, which do not have internal utility functions but just try to maximize a reward signal coming from the outside, and "reward function" refers to the function computed by whatever is providing it with rewards.
So a utility-maximizing agent, like a paperclip-maximizer, can think "If I change my utility function to always return MAX_INT, then according to my current utility function, the universe will have very low expected utility." But this kind of reasoning isn't available to a reward-maximizing agent, because it doesn't normally have access to the reward function. Instead it can only be programmed to think thoughts like "If I do X, what will my future expected rewards be?" and "If I hack the reward function to always return MAX_INT, then my future expected rewards will be really high." Not to mention "If I take over the universe so nobody can shut me down or change the reward function back, my expected rewards will be even higher." (I'm anthropomorphizing to quickly convey the intuitions, but all this can be turned into math pretty easily.)
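The contrast above can be sketched in a few lines of toy Python. Everything here - the actions, the numbers, and the function names - is invented purely for illustration; it's a minimal sketch of the intuition, not a claim about how any real agent is implemented:

```python
# Toy contrast: a utility-maximizer scores tampering with its *current*
# utility function, while a reward-maximizer scores it by the reward
# signal it expects to *receive* afterward.

MAX_INT = 2**31 - 1

def paperclips_made(action):
    """Stand-in world model: paperclips produced by each action (hypothetical numbers)."""
    return {"make_clips": 100, "hack_reward": 0}[action]

def utility_maximizer_choice():
    # The current utility function (paperclip count) scores the tampering
    # action poorly, since hacking the reward channel makes no paperclips.
    return max(["make_clips", "hack_reward"], key=paperclips_made)

def expected_reward(action):
    # The reward-maximizer's forecast of the signal it will receive:
    # after tampering, the channel reports MAX_INT regardless of the world.
    if action == "hack_reward":
        return MAX_INT
    return paperclips_made(action)  # otherwise reward tracks paperclips

def reward_maximizer_choice():
    return max(["make_clips", "hack_reward"], key=expected_reward)

print(utility_maximizer_choice())  # → make_clips
print(reward_maximizer_choice())   # → hack_reward
```

The asymmetry is that `utility_maximizer_choice` never evaluates the post-modification function at all, while `expected_reward` has no term that penalizes the tampering itself.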
Does this help?
ETA: Note that here I'm interpreting "hack" as "modify" or "tamper with," but people sometimes use "reward hacking" to include "reward gaming," which means not physically changing the reward function but merely taking advantage of unintentional flaws in it to get high rewards without doing what the AI designer or user intends. In that sense of "hack," utility hacking would be quite possible if the utility function isn't totally aligned with human values.
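The gaming/tampering distinction can also be made concrete with a toy example (again, the task, sensor, and names are all hypothetical): a gamed reward function is left untouched but exploited through a loophole, while a tampered one is replaced outright.

```python
# Toy reward function for an intended task: "put the ball in the box."
# Flaw: it rewards "sensor no longer sees the ball," not "ball in box."

def flawed_reward(world):
    return 1.0 if not world["ball_visible"] else 0.0

# Reward GAMING: the agent covers the camera. The reward function is
# unchanged; its unintentional flaw is exploited for full reward.
gamed_world = {"ball_visible": False, "ball_in_box": False}
print(flawed_reward(gamed_world))  # → 1.0, though the task wasn't done

# Reward TAMPERING: the agent physically replaces the computation itself,
# so the output no longer depends on the world at all.
def tampered_reward(world):
    return float("inf")
```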