On the epistemic point: yes, and this is something current-gen LLMs seem actually useful for, with little risk. This is where their idea generation is useful, and their poor taste and sycophancy don't matter.
I've had success asking LLMs for counterarguments. Most of them are dumb and you can dismiss them, but the LLMs are smart enough to come up with some good ones once you've steelmanned them and judged their worth for yourself.
This seems less helpful than getting pushback from informed people. But that's hard to find; I've had experiences like yours with Zac HD, in which a conversation fails to surface pretty obvious-in-hindsight counterarguments, just because the conversation focused elsewhere. And I have gotten good pushback by asking LLMs repeatedly in different ways, as far back as o1.
On the object level, on your example: I assume a lot of us aren't very engaged with pause efforts or hopes because it seems more productive and realistic to work on reducing misalignment risk from ~70% toward ~35%. It seems very likely that we're gonna barrel forward through any plausible pause movement, but not clear (even after trying to steelman every major alignment-difficulty argument) that alignment is insoluble - if we can just collectively pull our shit halfway together while racing toward that cliff.
I assume a lot of us aren't very engaged with pause efforts or hopes because it seems more productive and realistic to work on reducing misalignment risk from ~70% toward ~35%.
Nod. I just, like, don't think that's actually that great a strategy – it presupposes it is actually easier to get from 70% to 35% than from 35% to 5%. I don't see Anthropic et al. actually getting ready to ask the sort of questions that would IMO be necessary to actually do-the-reducing.
I'm not getting your 35% to 5% reference? I just have no hope of getting as low as 5%, but a lot of hope for improving on just letting the labs take a swing.
I fully agree that Anthropic and the other labs don't seem engaged with the relevant hard parts of the problem. That's why I want to convince more people who actually understand the problem to identify and work like mad on the hard parts like the world is on fire, instead of hoping it somehow isn't or can be put out.
It may not be that great a strategy, but to me it seems way better than hoping for a pause. I think we can get a public freakout before gametime, but even that won't produce a pause once the government and military are fully AGI-pilled.
This is a deep issue I've been wanting to write about, but haven't figured out how to address without risking further polarization within the alignment community. I'm sure there's a way to do it productively.
That's why I want to convince more people who actually understand the problem to identify and work like mad on the hard parts like the world is on fire, instead of hoping it somehow isn't or can be put out.
FYI, something similar to this was basically my "last year's plan", and it's on hold because I think it is plausible right now to meaningfully move the Overton window around pauses or at least dramatic slowdowns. (This is based on seeing the amount of traffic AI 2027 got, the number of NatSec endorsements that If Anyone Builds It got, and having recently gotten to read it and thinking it is pretty good.)
I think if Yoshua Bengio, Geoffrey Hinton, or Dario actually really tried to move Overton windows instead of sort of trying to maneuver within the current one, it'd make a huge difference. (I don't think this means it's necessarily tractable for most people to help. It's a high-skill operation.)
(Another reason for me putting "increase the rate of people able to think seriously about the problem" on hold is that my plans there weren't getting that much traction. I have some models of what I'd try next when/if I return to it, but it wasn't a slam dunk to keep going.)
There's a mistake I made a couple of times and didn't really internalize the lesson as fast as I'd like. Moreover, it wasn't even a failure to generalize; it was basically a failure to have even a single update stick about a single situation.
The particular example was me saying, roughly:
Look, I'm 60%+ on "Alignment is quite hard, in a way that's unlikely to be solved without a 6+ year pause." I can imagine believing it was lower, but it feels crazy to me to think it's lower than like 15%. And at 15%, it's still horrendously irresponsible to try to solve AI takeoff by rushing forward and winging it rather than "everybody stop, and actually give yourselves time to think."
The error mode here is something like "I was imagining what I'd think if you slid this one belief slider from ~60%+ to 15%, without imagining all the other beliefs that would probably be different if I earnestly believed the 15%."
That error feels like a "reasonable honest mistake."
But, the part where I was like "C'mon guys, even if you only, like, sorta-kinda agreed with me on this point, you'd still obviously be part of my political coalition for a global halt that is able to last 10+ years, right?"
...that feels like a more pernicious, political error. A desire to live in the world where my political coalition has more power, and a bit of an attempt to incept others into thinking it's true.
(This is an epistemic error, not necessarily a strategic error. Political coalitions are often won by people believing in them harder than it made sense to. But, given that I've also staked my macrostrategy on "LessWrong is a place for shared mapmaking, and putting a lot of effort into holding onto that even as the incentives push towards political maneuvering," I'd have to count it as a strategic error for me in this context.)
The specific counterarguments I heard were:
Now, I'm not arguing that those rejoinders are slam dunks. But I hadn't thought of them when I was making the argument, and I don't have a strong counter-counterargument at the moment. Upon reflection, I can see a little slippery-graspy move I was doing, where I was hoping to skip over the hard work of fully simulating another perspective and addressing all their points.
(To spell out: the above arguments are specifically against "if AI alignment is only 15% likely to be difficult enough to require a substantial pause, you should [be angling a bit to either pause or at least preserve option-value to pause]". They're not arguments against alignment likely requiring a pause.)
...
I do still overall think we need a long pause to have a decent chance of non-horrible things happening. And I still feel like something epistemically slippery is going on in the worldviews of most people who are hopeful about survival in a world where companies continue mostly rushing towards superintelligence.
But, seems good for me to acknowledge when I did something epistemically slippery myself. In particular, given that I think epistemic-slipperiness is a fairly central problem in the public conversation about AI, it'd probably help to get better at public convos about it.
Notice thoughts like "anyone who even believes a weak version of My Thing should end up agreeing with my ultimate conclusion", and hold them with at least a bit of skepticism. (The exact TAP probably depends a bit on the situation.)
More generally, remember that variation in belief often doesn't just turn on a single knob: if someone disagrees with one piece, they probably disagree about a bunch of other pieces. Disagreements are more frustratingly fractal than you might hope.
(See also: "You can't possibly succeed without [My Pet Issue]")
I first made this-sort-of-claim in a conversation with Zac Hatfield-Dodds that I'd later recount in a post on Anthropic and taking "technical philosophy" more seriously. (I don't think I actually made the error there, exactly.) But in the comments, Ryan Greenblatt replied with some counterarguments and I said "oh, yeah, that makes sense," and later, in The Problem, I ended up running through the same loop with Buck.