I found myself repeating the same words to multiple people, hence a short post.

I think some of the AI pause/governance advocacy might be net-negative. Three reasons:

  • Most importantly, it's easy to get regulation implemented for reasons different from addressing x-risk, which leads to all sorts of failure modes, where it becomes actually harder to prevent x-risk with further regulation, and we all simply die a bit later[1];
  • Less importantly, when talking about a dangerous technology, it's easy to incentivise governments to race to invest in that technology instead of preventing everyone from getting it;
  • To keep in mind: when talking about non-x-risk concerns that can help the pause, you might be outside your area of expertise and say something that any technical expert would recognise as wrong, leading them to conclude that you don't know what you're talking about.

Edit: I have somewhat updated my views around that since writing my post; in particular, see this comment.

Edit 2: I have further updated my views; I think this post and the comment still communicate something useful, but I don't necessarily endorse everything here.

Epistemic status: idk, handwavy models, a bunch of relevant experience; some people who disagreed with me have changed their mind when I talked about these points and they haven't made good points in response; I've seen docs that would've been harmful if important people saw them; the authors agreed with some of my object-level objections and changed the texts. Seems good to put out there.

If AI regulation isn't explicitly aimed at x-risk, it can be net-negative

What I think:

It’s pretty important to remember what the aim is. It’s not to slow down AI but to prevent an existential catastrophe. “Slowing AI” might help somewhat, but it’s not enough, and some kinds of "slowing down AI" can make it much harder to get policymakers to also introduce regulation that prevents x-risk.

Some strategies involve advocating for/introducing AI regulations without mentioning x-risk, with the hope of locally slowing down AI progress, building frameworks that can later be used to address x-risk, or fostering relationships with policymakers. Many of them carry significant downside risks and are net-negative.

Many people don't seem to consider politicians and voters to be grown-ups who can listen to arguments for why AI poses an x-risk, and implement AI regulation that slows down AI for the reasons we (and everyone who thought about the problem) want AI to be slowed down. These people propose regulations that they think can help with x-risk but don't present the regulations as motivated by x-risk. Aside from this being dishonest (don't be), it can backfire badly. The proposed regulations helping with other problems as well can be a nice bonus, but if addressing other problems is the only aim the policymakers have, you can end up with AI systems that are safe and ethical until they’re smart enough to kill you.

Instead, you can explain the actual problem! Not necessarily your full thinking: obviously, it makes sense to simplify a lot. But the audience are not children; they’re smart, and they can understand what you’re talking about. And it’s possible to reach them and get them to listen because you have a comparative advantage over every other problem that demands their time: many experts agree that yours is going to kill everyone, soon, unless something highly unusual is done. When this works, it produces a huge incentive for them to try to address the problem, and maybe to find experts in AI regulation with proposals that can address the x-risk at hand.

It might be more important to carefully explain why x-risk is real than to propose specific regulation that can, as we know, help with x-risk (especially if we're not locked in a specific form and can get the policymakers to adjust it).


My guess is that historically, either the politicians trying to prevent technological progress have lost, or their countries have lost.

By default, that probably makes the relevant governments more interested in addressing concerns about technology's impact than in preventing technology from developing. We need regulation that prevents anyone from developing ASI without solving ASI alignment, but governments are extremely unlikely to implement policies like that, unless they are presented with arguments that couldn't be made about civilian tech that existed in the past; attempts to implement "helpful" regulation might be actively harmful.

Regulation designed to address shorter-term problems might delay AI development and make it less attractive to invest in general AI, but there will still be a huge financial incentive to train and utilise more general AI systems, and this sort of regulation won't prevent x-risk. There are proposals along the lines of making AI companies responsible for the harm their models can cause, prohibiting public access to models that can be jailbroken or have enough understanding of biology to help develop novel pathogens, etc. All of these are great, but on their own, they don't prevent more advanced AI from being trained and don't help at all if/once the labs can solve these problems.

Imagine you persuade the policymakers to take the "ai+bioweapons=bad", "jailbreaks=not cool", or "explainability=a must" position and address these issues with regulation. Things are seemingly slowing down. But actually, OpenAI invests $2B in preventing jailbreaks in GPT-6, Anthropic creates and sells a service to exclude potentially dangerous bio info from the training datasets, etc., they continue scaling with the hope nothing goes off the rails, but their complicated scalable oversight system starts successfully pursuing alien goals, and soon, everyone dies[2].

Betting the future on the hope that it will be impossible to solve jailbreaks seems bad.

If the regulation doesn’t prevent training potentially existentially dangerous systems, and you haven’t added a robust mechanism that would allow you to make the policymakers change the regulation into one prohibiting potentially existentially dangerous training runs, then the regulation just makes everyone die a bit later.

Approximately, to change already implemented or introduced treaties, laws, or bills, you’d have to make policymakers:

  • listen to updates from you, even if things have since moved from the stage/format where your input was taken into consideration;
  • buy the idea that there are other, actually existential (not job-loss-and-bioterror) dangers as well;
  • listen to you and want to help you despite the surprise that you knew and were worrying about existential risk all along (by the way, your plan is bad if it stops working when people you want to befriend read a NYT article on “these Luddites try to infiltrate political elites to slow down the economically valuable progress because they fear the Terminator”; proposing policies not because you honestly believe they’re good but because you might come off as an expert and gain “political influence”, so later you can raise awareness of the x-risk, seems wrong);
  • persuade everyone who at that point needs to be persuaded to influence and amend the regulation so it addresses x-risk, even though getting everyone on board with amending the regulations can be harder, there can be some institutional lock-in into the existing regulations, with many people already convinced that these regulations need to exist for the stated reasons and address the stated problems (e.g., imagine you got the law-makers on board because they thought it's fine to get the economic benefits of AI slightly later: after jailbreaks are solved and the models don't help anyone with bioweapons; after jailbreaks are no longer a problem, you tell them their country still can't get the economic benefits for completely unrelated reasons; you can easily turn this into something political, and the AI labs won't be on your side);
  • etc.

Your theory of change needs to have the aim of preventing x-risks. Slowing down AI is a leaky proxy, especially if you lose sight of the impact on x-risk; please don't blindly optimise for the proxy.

It seems to be far easier to explain the problem and incentivise regulations that would address it because they’re designed to address it. There are sets of people such that if you persuade everyone in a set and they all know they all worry about the problem, you’ll get the regulation that tries to address the problem. If you do that, you’ll likely be able to send specific proposals for dealing with the problem, direct them at existing AI governance experts who would love to work on something that reduces the x-risk, and people you’ll have persuaded are likely to be able to direct resources at improving the policy proposals, while keeping them directed at addressing x-risk.

If you say that this is dangerous, it might sound attractive instead

Advocates for an AI pause often miss ways they can incentivise policymakers to move in the wrong direction.

The intelligence community will be hearing capabilities-related points ten times louder than danger-related points. We don't want anyone training large models with dangerous capabilities, even the military that wants to get those before bad actors do.

E.g., advocating for compute governance with the goal of preventing bad actors from gaining access to cutting-edge models that can hack everything, could be okay if done extremely carefully, but we don't want some of what we might say in support of regulation to incentivise state actors to invest heavily in advancing general AI.

If you don't have a gears-level understanding, don't communicate the problems unless you clearly state the epistemic status and point to the experts

One of the groups working on the advocacy had a page stating something along the lines of AI being more likely than not to be used by bad actors to hack literally everything, shut down the internet, and cause a societal collapse. If someone who it's important to persuade asked a cybersecurity expert about the threat model, the cybersecurity expert would've said that even given the assumption of AIs powerful enough to hack being trained, the story wouldn't go this way, and the authors don't know what they're talking about.[3] The page could also have incentivised people to invest in AI cyber offence capabilities. I thought all of that could've had a downside risk big enough for spending my time on this to make sense, so I talked to them, and they've made some changes.

Many people advocating for AI regulation don't understand why the problem is, indeed, a problem. They often don't have a technical background that allows them to explain the x-risk. Technical experts might say that these people don't understand AI, and they would be right.[4] People from the leading AI labs might be able to predict everything they can say and prepare replies to all their arguments.


This is really important to get right.

People working on advocacy are awesome. They're smart and caring. But I think some of their strategies carry significant downside risks.

AI Alignment won't be solved if given a couple more years.

Make actual arguments people believe in, rather than trying to play 4d PR chess.

Honest arguments for slowing down that don't include the x-risk might be fine if made extremely carefully. Be honest, be careful, do premortems of advocacy and public outreach attempts, and find people who can help you with those premortems.

  1. ^

    Not central, but possibly makes sense to address: some people might have a theory of change that their interventions will cause everyone to die a bit later than they otherwise would. This might seem pretty valuable to some people, especially if they think it's the best they can do. However, many of these interventions decrease the likelihood of preventing an existential catastrophe. If you care about future generations, and there's a meaningful chance of stumbling across a way to increase the probability of humanity surviving, then the total potential change in the probability times the sheer number of future people and the value of their lives is probably worth much more than making 8 billion people live for a couple of years longer. If a direction of work makes it harder to prevent the x-risk, maybe don't pursue it? (And if you're not a long-termist, maybe it makes more sense to donate to charities that prolong the life the most per dollar within the time limit and not focus on AI.)

  2. ^

    I expect people here to agree the labs shouldn't be able to train anything too smart until everyone agrees it is safe, so this post isn't about why this happens by default. But if you're working on alignment, and think scalable oversight might work by default, in less than 20 years, you might be losing sight of the hard parts of the problem. I wrote a post on this.

  3. ^

    In particular, the page did not address the fact that (I think; I'm not a cybersecurity expert) people training the AI might try to find and patch potential vulnerabilities, and even with widely available models, an increase in AI abilities is likely to increase the number of bugs good actors can find and patch by more than the number of bugs bad actors can find and exploit. It talked about a likely societal collapse caused by bad actors shutting down the internet, but bad actors don't really have an incentive to shut down the internet. With proprietary codebases, good actors get the ability to find and fix bugs before bad actors are able to exploit their systems as black boxes.[5]

  4. ^

    Technically, I don't think anyone really understands AI, but there are varying degrees.

  5. ^

    I also think that at the point where AI is smart enough to discover important zero-days and write exploits for them much faster than humans, it's probably already powerful enough to be around the sharp left turn dynamics and the capabilities necessary to escape and then kill everyone, and you have to discover a way to align it before that happens, or it causes the deaths of everyone instead of a societal collapse.


Back in January, Michael Cohen talked at the House of Commons about the possibility of AI killing everyone. At this point, when policymakers want to understand the problem and turn to you, downplaying x-risk doesn't make them listen to you more; it makes them less worried and more dismissive. I think a lot of AI governance people/think tanks haven't updated on this.

(I already decided I probably want to publish it without editing/getting into the details much further, but now my laptop got stolen, so I wouldn’t be able to edit it in any case, so I published the post)

Sorry to hear your laptop was stolen :(

This seems to me an instantiation of a classic debate about realpolitik.

I disagree with the main point in this post because raising concerns over x-risk is not mutually exclusive with advocating for more palatable policies (such as requiring evals before deployment). I think the actual thing that many EAs are trying to do is to talk loudly about near term policies while also mentioning x-risk concerns to the extent that they think is currently politically useful. The aim of this is to slow down AI progress (giving us more time to find a permanent solution), gain traction within the political system and actually make AI safer (although if alignment is hard then these policies may not actually reduce x-risk directly).

Gaining knowledge, experience and contacts in AI policy making will make it easier to advocate policies that actually deal with x-risk in the future. The concern about being seen as dishonest for not raising x-risk sooner feels unrealistic to me because it is so standard in public discourse to say something not because you believe it but because it aligns with your tribe (i.e., to operate at higher Simulacrum Levels).

In summary

Implement as much AI regulation as you can today, while gaining influence and gradually raising the salience of x-risk so that you can implement better regulation in the future.

seems like a reasonable strategy and better than the proposed alternative of

Only communicate x-risk concerns to policy makers.

Eh, my reason for supporting regulations even if imperfect is that perfection in these things is the enemy of good - yes, you shouldn't pass rules that end up making things straight up worse, but you need to push into the Overton window the concept of treating AI as an unsafe industry that needs some oversight to begin with. Honestly, if risks like an extremely fast take-off to unfathomably smart ASI and subsequent paperclipping are allowed by the rules of this universe, I don't even know what we can do to stop it at this point; the only two alternatives would be either truly solving alignment, if it even can be done (and any slowing down gives us a small increase in the chance of that), or just stopping developing AI altogether, which is just not going to happen, politically speaking. But in scenarios in which the threat isn't quite as dire - in which take-off is slow, we get warning shots, and/or existential risk emerges from a feedback loop of AGIs and social structures - I think even relatively flawed regulation is a better starting point than nothing at all. Of course it still depends on which regulations precisely, so we'd need to go into specifics for that.

(Quick thoughts, I’ve read your comment and written this one from my phone, sorry if I misunderstood something/the reply missed the points or is not very concise/etc.)

I’m not sure whether your reasoning is “well, in the scenarios with what I heard Yudkowsky calls a “fast take off”, we’re dead, so let’s think about other scenarios where we have an easier time having impact”. Like, no, we live in some specific world; if this is a world where surviving is hard, better try to actually save everyone, than flinch away and hope for a “slow takeoff”.

If Meta trained an AI on hundreds of thousands of GPUs and it requires much less to run/can run faster at inference, and it’s as smart as the smartest humans, and can find and exploit zero-day vulnerabilities, then it can do what smart/specialised humans can do, just much faster, including being able to hack literally everything. If it has a lot of GPUs to run on, it can do even more per unit of time, can multitask, can have a hierarchical AutoGPT kind of thing, etc., and the whole thing is smarter than humans. No self-modification, coding new AIs, etc. required. It kills everyone before it focuses on recursive self-improvement and paperclipping.

What’s your model of what you call “slow takeoff” after the AI is smarter than humans, including being better than humans at finding and exploiting zero-days? Or what’s your model for how we don’t get to that point?

I’m pretty sure that in our world, indeed, being on the level of the smartest humans at hacking but running much faster and being able to use hacked machines to run even faster, means that you can hack almost literally everything, easily, and be smart enough to kill everyone.

If stopping developing generally capable AI doesn’t happen, because politicians “can never agree to this”, then, sorry to inform you, we’ll all be dead, soon. Having a couple of months more to live doesn’t change this fact. Alignment won’t be solved in time; there’s no insight we’re on a path to that would get us a full solution to alignment. (Problems that need to be solved are disjoint enough that stumbling across a solution to one doesn’t increase the chances of stumbling across solutions to others that much. A couple of months don’t matter.)

But this seems false.

Sure, you can predict biorisks and other warning shots and work with the govs to prevent those. If you also mention x-risk, does this really change the other things you’re informing the govs of? If your competence in other AI-related risks has already been confirmed by the experts who work for/with the government on eg biorisks, does talking about x-risk before warning shots happen make it harder to persuade them about x-risk later?

By default, warning shots make it easier to show that these things can be smart and potentially dangerous. This makes it easier to introduce some regulation. It also incentivises governments to invest heavily in these dangerous toys that can come up with offensive things. It doesn’t prevent the labs from training anything in an environment inaccessible by bad actors, and it doesn’t make the government worry about future AI systems as potential agentic bad actors. Without x-risk, there isn’t a reason to prohibit insanely valuable technology from being developed. Governments think that AIs are powerful guns that can kill but also spill out money. Protecting these guns from bad actors makes sense; protecting humanity from the guns going rogue and killing literally everyone isn’t a threat model they have at all, unless you explain it.

There are sets of people such that, if you persuade everyone in the set of the x-risk, you significantly increase the chances of not being dead. I don’t think it’s impossible to persuade one specific person, although it takes resources. It takes more people and more resources to persuade more people. If you want eight billion people alive today and hardly countable generations to come to live, maybe just actually try?

I’m not sure whether your reasoning is “well, in the scenarios with what I heard Yudkowsky calls a “fast take off”, we’re dead, so let’s think about other scenarios where we have an easier time having impact”. Like, no, we live in some specific world; if this is a world where surviving is hard, better try to actually save everyone, than flinch away and hope for a “slow takeoff”.

Yes, we live in some specific world, but we don't know which. We can only guess. To simplify, if I have 80% belief that we live in a slow take off world, and 20% we live in a fast take off world, and I think that strategy A has a 50% chance of working in the former kind of world and none in the latter, whereas strategy B has a 5% chance of working regardless of the world, I'll still go with strategy A because that gives me a 40% overall chance of getting out of this alive. And yes, in this case it helps that I do place a higher probability on this being a slow takeoff kind of world.
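The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration of the toy numbers above: the function and variable names are hypothetical, and the probabilities are the simplified guesses from the example, not estimates of anything.

```python
# Toy calculation: overall chance that a strategy works, averaged over
# which kind of world we might be in. All numbers are the illustrative
# guesses from the comment, not real estimates.

def overall_success(p_world: dict, p_success: dict) -> float:
    """Sum over worlds of P(world) * P(strategy works | world)."""
    return sum(p_world[w] * p_success[w] for w in p_world)

# Assumed credences over the two kinds of world.
p_world = {"slow_takeoff": 0.8, "fast_takeoff": 0.2}

# Strategy A only works in a slow-takeoff world; strategy B works
# equally (rarely) in either.
strategy_a = {"slow_takeoff": 0.5, "fast_takeoff": 0.0}
strategy_b = {"slow_takeoff": 0.05, "fast_takeoff": 0.05}

p_a = overall_success(p_world, strategy_a)
p_b = overall_success(p_world, strategy_b)

print(f"Strategy A: {p_a:.0%}")  # 0.8 * 0.5 + 0.2 * 0.0
print(f"Strategy B: {p_b:.0%}")  # 0.8 * 0.05 + 0.2 * 0.05
```

The conclusion is sensitive to the credences: with the same success rates, strategy A only comes out ahead while the probability of a slow takeoff stays above 10%.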

What’s your model of what you call “slow takeoff” after the AI is smarter than humans, including being better than humans at finding and exploiting zero-days? Or what’s your model for how we don’t get to that point?

Honestly I think it's a tad more complicated than that. Don't get me wrong, any world with smarter-than-humans AGI in it is already incredibly dangerous and teetering on the brink of a bunch of possible disasters, but I don't expect it to end in instant paperclipping either. There are two main thrusts for this belief:

  1. the means problem: to me, the big distinction between a slow and fast takeoff world is how much technological low-hanging fruit there exists, potentially, for such a smarter-than-humans AGI to reap and use. Whatever its goals, killing us is unlikely to be a terminal one. It can be instrumental, but then it's only optimal policy if the AI can exist independently of us. In a world in which its intelligence is easily translated into effective replacements for us for all maintenance and infrastructural needs (e.g. repair nanobots and stuff), then we're dead. In a world in which that's not the case, we experience a time in which the AI behaves nicely and feeds us robot technology to put all the pieces in place before it can kill us;

  2. the motive problem: I think that different kinds of AGI would also wildly differ in their drive to kill us at all. There are still inherent dangers from having them around, but not every AGI would be a paperclipper that very deliberately aims at removing us from the board as one of its first steps. I'd expect that from a general AlphaZero-like RL agent trained on a specific goal. I wouldn't expect it from a really really smart LLM, because those are less focused (the goal that the LLM was trained on isn't the same as the goal you give to the simulacra) and more shaped specifically by human content. This doesn't make them aligned, but it makes them I think less alien than the former example, to an extent where probably they'd have a different approach. Again, I would still be extremely wary of them - I just don't think they'd get murder-y as their very first step.

If your competence in other AI-related risks has already been confirmed by the experts who work for/with the government on eg biorisks, does talking about x-risk before warning shots happen make it harder to persuade them about x-risk later?

Ok, so to be clear, I'm not saying we should NOT talk about x-risk or push the fact that it's absolutely an important possibility to always keep in mind. But I see this more as preparing the terrain. IF we ever get a warning shot, then, if the seed was planted already, we get a more coherent and consistent response. But I don't think anyone would commit to anything sufficiently drastic on purely theoretical grounds. So I expect that right now the achievable wins are much more limited, and many are still good in how they might e.g. shift incentives to make alignment and interpretability more desirable and valuable than just blind capability improvement. But yes, by all means, x-risk should be put on the table right away.

I talked to some policy people, will talk to more; I slightly updated towards some things: e.g.,

  • it might be fine to talk about non-existential risks to get you into the room, and then talk about existential risks as well;
  • people confirmed my uncertain assumptions that there might be downside risks of reaching out to policymakers (no matter whether you talk about x-risk or not). It’s dangerous to talk to them without having consulted with people with experience in policy and lawyers who specialise in that sort of stuff, and when talking to policymakers, it’s not great not to have someone with extensive experience talking to them together with you and controlling the situation. Other risks also exist (such as stepping on someone’s toes and ruining some efforts that would’ve succeeded otherwise; there are ways it might be dangerous to unilaterally reach policymakers without any coordination with other groups);
  • with a lot of policymakers, a better way to start the conversation is to have a proposed bill at hand and talk about it and justifications for it;
  • when policymakers react (even to the point of having a meeting with you) to cold outreach, their interest in the issue and willingness to engage with it/understand it might be quite limited, and it might be a huge difference compared to having a meeting of the same length but after being introduced etc.

Also got some updates on what are the mechanisms through which it might make sense to reach policymakers, what are the different categories of them and what sort of the material it makes sense to present to them; who to talk to; etc.

I agree with the concern about accidentally making it harder for X-risk regulations to be passed - probably also something to keep in mind for the part of the community that works on mitigating the misuse of AI. 
Here are some concerns I have on this specific point; I'm curious what people think about them: 

1. Policy Feasibility: Policymakers often operate on short-term electoral cycles, which inherently conflict with the long-term nature of x-risks. This temporal mismatch reduces the likelihood of substantial policy action. Therefore, advocacy strategies should focus on aligning x-risk mitigation with short-term political incentives. 

2. Incrementalism as Bayesian Updating: A step-by-step regulatory approach can serve as real-world Bayesian updating. Initial, simpler policies can act as 'experiments,' the outcomes of which can inform more complex policies. This iterative process increases the likelihood of effective long-term strategies. 

3. Balanced Multi-Tiered Regulatory Approach: Addressing immediate societal concerns or misuse (like deep fakes) seems necessary for any sweeping AI x-risk regulation, since these concerns are already in the Overton window and on constituents' minds. In such a scenario, it would require significant political or social capital to pass something aimed only at x-risks and not at the other concerns. 

By establishing regulatory frameworks that address more immediate concerns based on multi-variate utility functions, we can probably lay the groundwork for more complex regulations aimed at existential risks. This is also why I think X-risk policy advocates come off as radical, robotic or "a bit out there" - they are so focused on talking about X-risk that they forget the more immediate or short-term human concerns. 

With X-risk regulation, there doesn't seem to be a silver bullet; these things will require intellectual rigour, pragmatic compromise and iterations themselves (also say hello to policy inertia).