Simulating the *rest* of the political disagreement

by Raemon
2nd Sep 2025
3 min read
7 comments, sorted by top scoring
Seth Herd · 4h

On the epistemic point: yes, and this is something current-gen LLMs seem actually useful for, with little risk. This is where their idea generation is useful, and their poor taste and sycophancy don't matter.

I've had success asking LLMs for counterarguments. Most of the counterarguments are dumb and you can dismiss them, but the models are smart enough to come up with some good ones once you've steelmanned them and judged their worth for yourself.

This seems less helpful than getting pushback from informed people. But that's hard to find; I've had experiences like yours with Zac HD, in which a conversation fails to surface pretty obvious-in-hindsight counterarguments, just because the conversation focused elsewhere. And I have gotten good pushback by asking LLMs repeatedly in different ways, as far back as o1.

On the object level, regarding your example: I assume a lot of us aren't very engaged with pause efforts or hopes because it seems more productive and realistic to work on reducing misalignment risk from ~70% toward ~35%. It seems very likely that we're gonna barrel forward through any plausible pause movement, but not clear (even after trying to steelman every major alignment-difficulty argument) that alignment is insoluble - if we can just collectively pull our shit halfway together while racing toward that cliff.

Raemon · 3h

> I assume a lot of us aren't very engaged with pause efforts or hopes because it seems more productive and realistic to work on reducing misalignment risk from ~70% toward ~35%.

Nod. I just, like, don't think that's actually that great a strategy – it presupposes it is actually easier to get from 70% to 35% than from 35% to 5%. I don't see Anthropic-et-al actually getting ready to ask the sort of questions that would IMO be necessary to actually do-the-reducing.

Seth Herd · 3h

I'm not getting your 35% to 5% reference? I just have no hope of getting as low as 5%, but a lot of hope for improving on just letting the labs take a swing.

I fully agree that Anthropic and the other labs don't seem engaged with the relevant hard parts of the problem. That's why I want to convince more people that actually understand the problem to identify and work like mad on the hard parts like the world is on fire, instead of hoping it somehow isn't or can be put out.

It may not be that great a strategy, but to me it seems way better than hoping for a pause. I think we can get a public freakout before gametime, but even that won't produce a pause once the government and military are fully AGI-pilled.

This is a deep issue I've been wanting to write about, but haven't figured out how to address without risking further polarization within the alignment community. I'm sure there's a way to do it productively.

Raemon · 3h

> That's why I want to convince more people that actually understand the problem to identify and work like mad on the hard parts like the world is on fire, instead of hoping it somehow isn't or can be put out.

FYI, something similar to this was basically my "last year's plan", and it's on hold because I think it is plausible right now to meaningfully move the Overton window around pauses, or at least dramatic slowdowns. (This is based on seeing the amount of traffic AI 2027 got, the number of NatSec endorsements that If Anyone Builds It got, and having recently gotten to read it and thinking it is pretty good.)

I think if Yoshua Bengio, Geoffrey Hinton, or Dario actually really tried to move Overton windows instead of sort of trying to maneuver within the current one, it'd make a huge difference. (I don't think this means it's necessarily tractable for most people to help. It's a high-skill operation.)

(Another reason for me putting "increase the rate of people able to think seriously about the problem" on hold is that my plans there weren't getting that much traction. I have some models of what I'd try next when/if I return to it, but it wasn't a slam dunk to keep going.)

Cole Wyeth · 2h

Deserves a name. 

Raemon · 5m

Do you mean like a short pithy name for the fallacy/failure-mode?

Cole Wyeth · 1m

Yes.


There's a mistake I made a couple times and didn't really internalize the lesson as fast as I'd like. Moreover, it wasn't even a failure to generalize; it was basically a failure to even have a single update stick about a single situation.

The particular example was me saying, roughly:

> Look, I'm 60%+ on "Alignment is quite hard, in a way that's unlikely to be solved without a 6+ year pause." I can imagine believing it was lower, but it feels crazy to me to think it's lower than like 15%. And at 15%, it's still horrendously irresponsible to try to solve AI takeoff via rushing forward and winging it, rather than "everybody stop, and actually give yourselves time to think."

The error mode here is something like "I was imagining what I'd think if you slid this one belief slider from ~60%+ to 15%, without imagining all the other beliefs that would probably be different if I earnestly believed the 15%." 

That error feels like a "reasonable honest mistake."

But, the part where I was like "C'mon guys, even if you only, like, sorta-kinda agreed with me on this point, you'd still obviously be part of my political coalition for a global halt that is able to last 10+ years, right?"

...that feels like a more pernicious, political error. A desire to live in the world where my political coalition has more power, and a bit of an attempt to incept others into thinking it's true.

(This is an epistemic error, not necessarily a strategic error. Political coalitions are often won by people believing in them harder than it made sense to. But, given that I've also staked my macrostrategy on "LessWrong is a place for shared mapmaking, and putting a lot of effort to hold onto that even as the incentives push towards political maneuvering," I'd have to count it as a strategic error for me in this context.)

The specific counterarguments I heard were:

  • If "Superalignment is real hard risk" is only like 15%, you might have primary threat models that are pretty differently shaped, and be focusing your efforts on reducing risk in the other 85% of worlds.
  • Relatedly, my phrasing made more sense if the goal was to cut risk down to something "acceptable" (like, <5%). You might think it's more useful to focus on strategies that are more likely to work, and which cut risk down from, say, 70% to 35% (which would seem more plausible to me if I believed alignment wouldn't likely require a 6+ year pause to get right).

Now, I'm not arguing that those rejoinders are slam dunks. But I hadn't thought of them when I was making the argument, and I don't currently have a strong counter-counterargument. Upon reflection, I can see a little slippery-graspy move I was doing, where I was hoping to skip over the hard work of fully simulating another perspective and addressing all their points.

(To spell out: the above arguments are specifically against "if AI alignment is only 15% likely to be difficult enough to require a substantial pause, you should [be angling a bit to either pause or at least preserve option-value to pause]". It's not an argument against alignment likely requiring a pause.)

...

I do still overall think we need a long pause to have a decent chance of non-horrible things happening. And I still feel like something epistemically slippery is going on in the worldviews of most people who are hopeful about survival in a world where companies continue mostly rushing towards superintelligence. 

But, it seems good for me to acknowledge when I did something epistemically slippery myself. In particular, given that I think epistemic-slipperiness is a fairly central problem in the public conversation about AI, it'd probably help to get better at public convos about it.

The Takeaways

Notice thoughts like "anyone who even believes a weak version of My Thing should end up agreeing with my ultimate conclusion", and hold them with at least a bit of skepticism. (The exact TAP probably depends a bit on the situation)

More generally, remember that variation in belief often doesn't just turn on a single knob; if someone disagrees with one piece, they probably disagree about a bunch of other pieces. Disagreements are more frustratingly fractal than you might hope.

(See also: "You can't possibly succeed without [My Pet Issue]")

Appendix: The prior arguments

I first made this-sort-of-claim in a conversation with Zac Hatfield-Dodds that I'd later recount in *Anthropic, and taking "technical philosophy" more seriously* (I don't think I actually made the error there, exactly). But in the comments, Ryan Greenblatt replied with some counterarguments and I said "oh, yeah that makes sense", and later in *The Problem* I ended up running through the same loop with Buck.