Wiki Contributions



I agree with parts of that. I'd also add the following (or I'd be curious why they're not important effects):

  • Slower takeoff -> warning shots -> improved governance (e.g. through most/all major actors getting clear[er] evidence of risks) -> less pressure to rush
  • (As OP argued) Shorter timelines -> China has less of a chance to have leading AI companies -> less pressure to rush

More broadly though, maybe we should be using more fine-grained concepts than "shorter timelines" and "slower takeoffs":

  • The salient effects of "shorter timelines" seem pretty dependent on what the baseline is.
    • The point about China seems important if the baseline is 30 years, and not so much if the baseline is 10 years.
  • The salient effects of "slowing takeoff" seem pretty dependent on what part of the curve is being slowed. Slowing it down right before there's large risk seems much more valuable than (just) slowing it down earlier in the curve, as the last few year's investments in LLMs did.

Thanks for writing! I agree the factors this post describes make some types of gradient hacking extremely difficult, but I don't see how they make the following approach to gradient hacking extremely difficult.

Suppose that an agent has some trait which gradient descent is trying to push in direction x because the x-ness of that trait contributes to the agent’s high score; and that the agent wants to use gradient hacking to prevent this. Consider three possible strategies that the agent might try to implement, upon noticing that the x-component of the trait has increased [...] [One potential strategy is] Deterministically increasing the extent to which it fails as the x-component increases.

(from here)

This approach to gradient hacking seems plausibly resistant to the factors this post describes, by the following reasoning: With the above approach, the gradient hacker only worsens performance by a small amount. At the same time, the gradient hacker plausibly improves performance in other ways, since the planning abilities that lead to gradient hacking may also lead to good performance on tasks that demand planning abilities. So, overall, modifying or reducing the influence of the gradient hacker plausibly worsens performance. In other words, gradient descent might not modify away a gradient hacker because gradient hacking is convergently incentivized behavior that only worsens performance by a small amount (while not worsening it at all on net).

(Maybe gradient descent would then train the model to have a heuristic of not doing gradient hacking, while keeping the other benefits of improved planning abilities? But I feel pretty clueless about whether gradient hacking would be encoded in a way that allows such a heuristic to be inserted.)

(I read kind of quickly so may have missed something.)


Ah sorry, I meant the ideas introduced in this post and this one (though I haven't yet read either closely).


Thanks for posting, but I think these arguments have major oversights. This leaves me more optimistic about the extent to which people will avoid and prevent the horrible misuse you describe.

First, this post seems to overstate the extent to which people tend to value and carry out extreme torture. Maximally cruel torture fortunately seems very rare.

  • The post asks "How many people have you personally seen who insist on justifying some form of suffering for those they consider undesirable[?]" But "justifying some form of suffering" isn't actually an example of justifying extreme torture.
  • The post asks, "What society hasn’t had some underclass it wanted to put down in the dirt just to lord power over them?" But that isn't actually an example of people endorsing extreme torture.
  • The post asks, "How many ordinary, regular people throughout history have become the worst kind of sadist under the slightest excuse or social pressure to do so to their hated outgroup?" But has it really been as many as the post suggests? The historical and ongoing atrocities that come to mind were cases of serious suffering in the context of moderately strong social pressure/conditioning--not maximally cruel torture in the context of slight social pressure.
  • So history doesn't actually give us strong reasons to expect maximally suffering-inducing torture at scale (edit: or at least, the arguments this post makes for that aren't strong).

Second, this post seems to overlook a major force that often prevents torture (and which, I argue, will be increasingly able to succeed at doing so): many people disvalue torture and work collectively to prevent it.

  • Torture tends to be illegal and prosecuted. The trend here seems to be positive, with cruelty against children, animals, prisoners, and the mentally ill being increasingly stigmatized, criminalized, and prosecuted over the past few centuries.
  • We're already seeing AI development being highly centralized, with this leading AI developers working to make their AI systems hit some balance of helpful and harmless, i.e. not just letting users carry out whatever misuse they want.
  • Today, the cruelest acts of torture seem to be small-scale acts pursued by not-very-powerful individuals, while (as mentioned above) powerful actors tend to disvalue and work to prevent torture. Most people will probably continue to support the prevention and prosecution of very cruel torture, since that's the usual trend, and also because people would want to ensure that they do not themselves end up as victims of horrible torture. In the future, people will be better equipped to enforce these prohibitions, through improved monitoring technologies.

Third, this post seems to overlook arguments for why AI alignment may be worthwhile (or opposing it may be a bad idea), even if a world with aligned AI wouldn't be worthwhile on its own. My understanding is that most people focused on preventing extreme suffering find such arguments compelling enough to avoid working against alignment, and sometimes even to work towards it.

  • Concern over s-risks will lose support and goodwill if adherents try to kill everyone, as the poster suggests they intend to do ("I will oppose any measure which makes the singularity more likely to be aligned with somebody’s values"). Then, if we do end up with aligned AI, it'll be significantly less likely that powerful actors will work to stamp out extreme suffering.
  • The highest-leverage intervention for preventing suffering is arguably coordinating/trading with worlds where there is a lot of it, and humanity won't be able to do that if we lose control of this world.

These oversights strike me as pretty reckless, when arguing for letting (or making) everyone die.


Thanks for writing!

I want to push back a bit on the framing used here. Instead of the framing "slowing down AI," another framing we could use is, "lay the groundwork for slowing down in the future, when extra time is most needed." I prefer this latter framing/emphasis because:

  • An extra year in which the AI safety field has access to pretty advanced AI capabilities seems much more valuable for the field's progress (say, maybe 10x) than an extra year with current AI capabilities, since the former type of year would give the field much better opportunities to test safety ideas and more clarity about what types of AI systems are relevant.
    • One counterargument is that AI safety will likely be bottlenecked by serial time, because discarding bad theories and formulating better ones takes serial time, making extra years early on very useful. But my very spotty understanding of the history of science suggests that it doesn't just take time for bad theories to get replaced by better ones--it takes time along with the accumulation of lots of empirical evidence. This supports the view that late-stage time is much more valuable than early-stage time.
  • Slowing down in the future seems much more tractable than slowing down now, since many critical actors seem much more likely to support slowing down if and when there are clear, salient demonstrations of its importance (i.e. warning shots).
  • Given that slowing down later is much more valuable and much more tractable than just slowing down now, it seems much better to focus on slowing down later. But the broader framing of "slow down" doesn't really suggest that focus, and maybe it even discourages it.

Work to spread good knowledge regarding AGI risk / doom stuff among politicians, the general public, etc. [...] Emphasizing “there is a big problem, and more safety research is desperately needed” seems good and is I think uncontroversial.

Nitpick: My impression is that at least some versions of this outreach are very controversial in the community, as suggested by e.g. the lack of mass advocacy efforts. [Edit: "lack of" was an overstatement. But these are still much smaller than they could be.]


It does, thanks! (I had interpreted the claim in the paper as comparing e.g. TPUs to CPUs, since the quote mentions CPUs as the baseline.)


Thanks! To make sure I'm following, does optimization help just by improving utilization?


Sorry, I'm a bit confused. I'm interpreting the 1st and 3rd paragraphs of your response as expressing opposite opinions about the claimed efficiency gains (uncertainty and confidence, respectively), so I think I'm probably misinterpreting part of your response?


This is helpful for something I've been working on - thanks!

I was initially confused about how these results could fit with claims from this paper on AI chips, which emphasizes the importance of factors other than transistor density for AI-specialized chips' performance. But on second thought, the claims seem compatible:

  • The paper argues that increases in transistor density have (recently) been slow enough for investment in specialized chip design to be practical. But that's compatible with increases in transistor density still being the main driver of performance improvements (since a proportionally small boost that lasts several years could still make specialization profitable).
  • The paper claims that "AI[-specialized] chips are tens or even thousands of times faster and more efficient than CPUs for training and inference of AI algorithms." But the graph in this post shows less than thousands of times improvements since 2006. These are compatible if remaining efficiency gains of AI-specialized chips came before 2006, which is plausible since GPUs were first released in 1999 (or maybe the "thousands of times" suggestion was just too high).
Load More