This is nice. It would be even nicer if I believed this debate would be resolved before we built dangerous AI.
People were too busy worrying about whether China or America would win the race to see Sirius sneaking up on us.
> assumption that an advanced civilization has larger receivers

If they are more advanced than us, wouldn't they either have aligned AI or be AI? In that case, I'm not sure what warning them about our possible AI would do for them.
I've read this story plenty of times before, but this was the first time I saw it on LessWrong. That was a pleasant surprise.
This is pretty insightful, but I'm not sure the assumption holds that we would halt development if there were legible unsolved problems. The core issue might not be illegibility, but a risk-tolerance threshold in leadership that is terrifyingly high.
Even if we legibly showed the powers that be that an AI had a 20% chance of catastrophic unsolved safety problems, I'd expect competitive pressure to lead them to deploy such a system anyway.
The agency and intentionality of current models are still up for debate, but the current versions of Claude, ChatGPT, etc. were all created with the assistance of their earlier versions.
I strongly agree. I expect AI to be able to "take over the world" before it can create a more powerful AI that perfectly shares its values. This matches the Sable scenario Yudkowsky outlined in "If Anyone Builds It, Everyone Dies", where the AI becomes dangerously capable before solving its own alignment problem.
The problem is that this doesn't avert doom. If modern AIs become smart enough to do self-improvement at all, then their makers will have them do it. In some ways, this has already started.
I'll do that next time if that's the preferred way.
This kind of scenario seems pretty reasonable and likely, but I'm much more optimistic about it being morally valuable, mostly because I expect "grabbiness" to happen sooner, and to be carried out by an AI that is itself morally valuable.
The researchers definitely did good work, and for me this is both bad and surprising news. The misses (e.g., targeting Stalin but getting Lenin, or Catholicism yielding Eastern Orthodoxy) have a clear explanation: the confused concepts are close conceptually and thus likely close in latent space. That might give us room for optimism. If fine-tuning on data with Stalinist or Satanist or other vibes can produce a misaligned model, then we either need to fine-tune on data with aligned vibes or just make sure that the bulk of the pre-training data is "aligned".