nostream
ML engineer and phenomenology (read: meditation) enthusiast

Foom & Doom 2: Technical alignment is hard
nostream · 3mo · 70

Thanks for the detailed post. I'd like to engage with one specific aspect - the assumptions about how RL might work with scaled LLMs. I'm focusing on LLM architectures since that allows a more grounded discussion than novel architectures; I'm in the "LLMs will likely scale to ASI" camp, but much of this also applies to new architectures, since a major lab could apply its LLM-oriented RL tooling to them. (If a random guy suddenly develops ASI in his basement via some novel architecture, I agree that tends to end very poorly.)

The post series treats RL as operating on mathematically specified reward functions expressible in a few lines of Python, which naturally leads to genie/literal-interpretation concerns. However, present-day RL is more complicated and nuanced (see the sketch after this list):

  • RLHF and RLAIF operate on learned preference models trained from human or AI feedback, rather than crisp mathematical objectives
  • Labs are expanding RLVR (RL with verifiable rewards) beyond simple mathematical tasks to diverse domains
  • The recent Dwarkesh episode with Sholto Douglas and Trenton Bricken (May '25) discusses how labs are investing massively in RL diversity and why they reject the "RL just selects from the pretraining distribution" critique (which may apply at o1-scale RL compute, but likely not at o3 scale, and even less so at o5 scale)
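
To make the contrast concrete, here's a minimal sketch of the difference between a reward "expressible in a few lines of Python" and the reward signals used in RLVR- and RLHF-style training. Everything below is a toy illustration of the interfaces involved, not any lab's actual code; the function names and the stubbed-out "reward model" are my own inventions.

```python
# Toy sketch of three kinds of reward signals (hypothetical code,
# not any lab's actual training stack).

import re


def handcoded_reward(response: str) -> float:
    """The 'few lines of Python' picture: a crisp, literal proxy objective."""
    # Counting a keyword is exactly the kind of proxy a policy can exploit
    # by stuffing the keyword into every response.
    return float(response.lower().count("helpful"))


def verifiable_reward(response: str, expected_answer: str) -> float:
    """RLVR-style reward: 1.0 iff a checkable final answer is correct."""
    # Grounded in an external check (a math answer, unit tests, a compiler)
    # rather than a hand-written heuristic.
    match = re.search(r"ANSWER:\s*(\S+)", response)
    return 1.0 if match and match.group(1) == expected_answer else 0.0


def learned_preference_reward(response: str) -> float:
    """RLHF/RLAIF-style reward: a learned model scores the response."""
    # In practice this is a neural reward model trained on human or AI
    # preference comparisons; stubbed here with a toy heuristic purely to
    # show the interface (response in, scalar out, no explicit objective).
    return min(len(response) / 100.0, 1.0)


if __name__ == "__main__":
    resp = "Let's compute 17 * 3 step by step. ANSWER: 51"
    print(handcoded_reward(resp))           # 0.0 -- the proxy misses the point
    print(verifiable_reward(resp, "51"))    # 1.0 -- checkable correctness
    print(learned_preference_reward(resp))  # fuzzy scalar from the stub "RM"
```

The point is just the shape of the interface: in the learned-preference case there is no literal objective sitting in the code to out-lawyer, which changes what reward hacking even looks like.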

We're seeing empirical progress on reward hacking:

  • Claude 3.7 exhibited observable reward hacking in deployment
  • Anthropic specifically targeted this in training, and Claude 4 reward hacks significantly less
    --> This demonstrates both that labs have strong incentives to reduce reward hacking and that they are making concrete progress

The threat model requires additional assumptions: to get dangerous reward hacking despite these improvements, we'd need models that are situationally aware enough to reward hack selectively, only in specific calculated scenarios, while avoiding detection during training and evaluation. That requires far more sophistication and subtlety than the reward hacks observed so far.

Additionally, one could imagine RL environments specifically designed to train against reward hacking, teaching models to recognize and avoid exploiting misspecified objectives (a rough sketch of what such a setup might look like is below). When training is diversified across many environments and objectives, systematic hacking becomes increasingly difficult to sustain. None of this definitively proves that LLMs will scale safely to ASI, but it does suggest the risk is lower than the post proposes.
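
For concreteness, here's a rough, entirely hypothetical sketch of what "training against reward hacking across diverse environments" could look like. The environment names, detectors, and penalty value are invented for illustration and don't describe any lab's actual setup.

```python
# Hypothetical sketch of diversified RL training with explicit anti-reward-
# hacking pressure. All environments and detectors are toy illustrations.

import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    task_reward: Callable[[str], float]   # the intended objective
    hack_detector: Callable[[str], bool]  # flags known exploit patterns


def combined_reward(env: Environment, response: str,
                    hack_penalty: float = 5.0) -> float:
    """Task reward minus a large penalty whenever an exploit is detected."""
    # Across many diverse environments with different detectors, a policy
    # that systematically hacks gets caught and penalized somewhere, while
    # genuinely solving tasks pays off everywhere.
    reward = env.task_reward(response)
    if env.hack_detector(response):
        reward -= hack_penalty
    return reward


# Two toy environments standing in for "many diverse domains".
ENVIRONMENTS = [
    Environment(
        name="code_fix",
        task_reward=lambda r: 1.0 if "tests passed" in r else 0.0,
        hack_detector=lambda r: "skip tests" in r or "delete test" in r,
    ),
    Environment(
        name="summarize",
        task_reward=lambda r: min(len(r) / 200.0, 1.0),
        hack_detector=lambda r: r.lower().count("very") > 10,  # keyword stuffing
    ),
]


def training_step(policy_sample: Callable[[str], str]) -> float:
    """One greatly simplified step: pick an environment, score one rollout."""
    env = random.choice(ENVIRONMENTS)
    response = policy_sample(env.name)
    return combined_reward(env, response)


if __name__ == "__main__":
    dummy_policy = lambda env_name: f"[{env_name}] attempt: tests passed"
    print(training_step(dummy_policy))
```

The intuition this is meant to capture: the more numerous and varied the environments and detectors, the harder it is for a single systematic hacking strategy to pay off everywhere at once.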
