Thanks for writing! I agree the factors this post describes make some types of gradient hacking extremely difficult, but I don't see how they make the following approach to gradient hacking extremely difficult.
Suppose that an agent has some trait which gradient descent is trying to push in direction x because the x-ness of that trait contributes to the agent’s high score; and that the agent wants to use gradient hacking to prevent this. Consider three possible strategies that the agent might try to implement, upon noticing that the x-component of the trait has increased [...] [One potential strategy is] Deterministically increasing the extent to which it fails as the x-component increases.
This approach to gradient hacking seems plausibly resistant to the factors this post describes, by the following reasoning: With the above approach, the gradient hacker only worsens performance by a small amount. At the same time, the gradient hacker plausibly improves performance in other ways, since the planning abilities that lead to gradient hacking may also lead to good performance on tasks that demand planning abilities. So, overall, modifying or reducing the influence of the gradient hacker plausibly worsens performance. In other words, gradient descent might not modify away a gradient hacker because gradient hacking is a convergently incentivized behavior whose direct cost to performance is small, and that cost may be fully offset by the benefits of the planning abilities that produce it.
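To make the net-incentive reasoning above concrete, here is a toy sketch. It is not a model of any real training setup: the function names and every coefficient (`trait_gain`, `sabotage_rate`, `planning_bonus`) are assumptions I made up for illustration. It only shows that if deliberate failure grows with x slightly faster than x helps, the local gradient on x flips sign, while removing the planner still lowers the score.

```python
def score(x, planner_on, trait_gain=1.0, sabotage_rate=1.2, planning_bonus=0.5):
    """Toy task score; all coefficients are illustrative assumptions."""
    base = trait_gain * x  # the x-ness of the trait helps the score
    if not planner_on:
        return base
    # Hypothetical gradient hacker: fails deterministically more as x grows.
    sabotage = sabotage_rate * max(x, 0.0)
    # The same planning abilities also help on other parts of the task.
    return base - sabotage + planning_bonus


def d_score_dx(x, planner_on, eps=1e-6):
    """Central-difference estimate of how the score changes with x."""
    return (score(x + eps, planner_on) - score(x - eps, planner_on)) / (2 * eps)


if __name__ == "__main__":
    x = 0.1
    print("slope on x without planner:", d_score_dx(x, False))
    print("slope on x with planner:   ", d_score_dx(x, True))
    print("score without planner:", score(x, False))
    print("score with planner:   ", score(x, True))
```

In this toy setup the sabotage costs only a little score, the slope on x is negative when the planner is present, and switching the planner off lowers the total score. Whether real networks encode planning and sabotage in such an inseparable way is exactly the open question.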
(Maybe gradient descent would then train the model to have a heuristic of not doing gradient hacking, while keeping the other benefits of improved planning abilities? But I feel pretty clueless about whether gradient hacking would be encoded in a way that allows such a heuristic to be inserted.)
(I read kind of quickly so may have missed something.)
Thanks for posting, but I think these arguments have major oversights. This leaves me more optimistic about the extent to which people will avoid and prevent the horrible misuse you describe.
First, this post seems to overstate the extent to which people tend to value and carry out extreme torture. Maximally cruel torture fortunately seems very rare.
Second, this post seems to overlook a major force that often prevents torture (and which, I argue, will be increasingly able to succeed at doing so): many people disvalue torture and work collectively to prevent it.
Third, this post seems to overlook arguments for why AI alignment may be worthwhile (or opposing it may be a bad idea), even if a world with aligned AI wouldn't be worthwhile on its own. My understanding is that most people focused on preventing extreme suffering find such arguments compelling enough to avoid working against alignment, and sometimes even to work towards it.
These oversights strike me as pretty reckless, when arguing for letting (or making) everyone die.
Thanks for writing!
I want to push back a bit on the framing used here. Instead of "slowing down AI," another framing we could use is "laying the groundwork for slowing down in the future, when extra time is most needed." I prefer this latter framing/emphasis because:
Good points on "lack of mass advocacy efforts" being an overstatement.
I'm not sure if we actually have different impressions here, but I mostly meant that trying to convince influential policymakers about AGI risk is very controversial. I largely think this based on conversations with AI governance people, but one example that's easier to share is that no prominent DC think tank has (as far as I'm aware) released any article or report arguing that AGI risk is real and serious.
Work to spread good knowledge regarding AGI risk / doom stuff among politicians, the general public, etc. [...] Emphasizing “there is a big problem, and more safety research is desperately needed” seems good and is I think uncontroversial.
Nitpick: My impression is that at least some versions of this outreach are very controversial in the community, as suggested by e.g. the lack of mass advocacy efforts.
It does, thanks! (I had interpreted the claim in the paper as comparing e.g. TPUs to CPUs, since the quote mentions CPUs as the baseline.)
I agree with parts of that. I'd also add the following (or I'd be curious why they're not important effects):
More broadly though, maybe we should be using more fine-grained concepts than "shorter timelines" and "slower takeoffs":