[Co-written by Mateusz Bagiński and Samuel Buteau (Ishual)]
Many X-risk-concerned people who join AI capabilities labs with the intent to contribute to existential safety think that the labs are currently engaging in a race that is unacceptably likely to lead to human disempowerment and/or extinction, and would prefer an AGI ban[1] over the current path. This post makes the case that such people should speak out publicly[2] against the current AI R&D regime and in favor of an AGI ban[3]. They should explicitly communicate that a saner world would coordinate not to build existentially dangerous intelligences, at least until we know how to do it in a principled, safe way. They can choose to maintain their political capital by not calling the current AI R&D regime insane, or by finding a way to lean into the valid persona of “we will either cooperate (if enough others cooperate) or win the competition in style (otherwise)”.
X-risk-concerned people who have some influence within AI capabilities labs should additionally advocate internally for the lab to let its employees speak out publicly, as described above, without any official retaliation, and should truthfully state in public whether it does. If they are unable to get a lab to adopt this policy, they should say so publicly.
X-risk-concerned people in our communities should enforce the norm of praising the heroism of those who [join AI capabilities labs while speaking out publicly against the current mad race], and of being deeply skeptical of the motives of those who [join without publicly speaking out].
Not being public about one's views on this hinders the development of common knowledge, nearly guarantees that the exposure to corrupting influence from working inside the lab (which does not depend on whether one speaks out publicly) will partially reshape one into a worse version of oneself, and gives an alibi[4] to people who want to join labs for other reasons that would otherwise be condemned by their community.
Liron: "Do you really think that we should be encouraging people to go work at [these frontier labs]?"
Rob: "Do you think anyone who understands and cares about these [risks from superintelligence] should not be in the room where they can affect what actually happens?"
— Rob Miles on Doom Debates
Rob Wiblin: Should people who are worried about AI alignment and safety go work at the AI labs? There’s kind of two aspects to this. Firstly, should they do so in alignment-focused roles? And then secondly, what about just getting any general role in one of the important leading labs?
Zvi Mowshowitz: This is a place I feel very, very strongly that the 80,000 Hours guidelines are very wrong. So my advice, if you want to improve the situation on the chance that we all die for existential risk concerns, is that you absolutely can go to a lab that you have evaluated as doing legitimate safety work, that will not effectively end up as capabilities work, in a role of doing that work. That is a very reasonable thing to be doing.
— Zvi Mowshowitz on the 80,000 Hours podcast
The reasoning exemplified in the above quotes can often be heard in circles concerned with AI X-risk (or even AI safety more broadly), including from those who think that we are on a bad trajectory tending towards an existential catastrophe, and that a saner trajectory would involve coordinating to pause the development of AI that may lead to capabilities sufficient for an existential catastrophe, at least until we figure out whatever needs to be figured out to ensure that that kind of AI has a robustly good impact.
The motivation of ensuring that "there be good people in the room" (if truthful) is, in itself, noble and virtuous. It makes a lot of sense from a perspective that is largely focused on marginal, tractable impact, which is a staple of, among others, practical/as-applied EA philosophy.
However, this strategy carries a great risk. Once a person enters the monster's belly, the monster becomes capable of gradually constraining the person's degrees of freedom, so that, at each point, it is "locally rational" for the person to continue working, business as usual, while their agency is gradually being trimmed and shaped to better serve the monster. The ambitious positive impact that was initially intended erodes into "I am one of the few good guys, and if I leave, a worse guy is gonna replace me, so I should stay and do whatever I can on the margin (even if what I'm doing now is very far from what I initially intended)." This can take more corrupted/pernicious forms as well, such as the person's worldview and/or values[5] actually adapting to the new situation, so as to rationalize their prior behavior.
You have a moral obligation not to let it happen. The world's fate is at stake[6].
Moreover, this strategy does not involve any costly signals that would make the statement of intent credible. How can we know (at the point where we choose whether to enforce the norm), absent additional information, that making the lab's outcome marginally better from the inside is their true motivation? A similarly credible explanation is that their actual motive (whether they are consciously aware of it or not) is something like a fun job with a good salary (monetary or paid in status), justified by paying lip service to the threat models endorsed by those whose trust and validation they want. All of these motives are fine in isolation, but none of them justifies contributing to summoning a demon. Worse still, staying silent allows people to remain strategically ambiguous, so that people of different views/affiliations can each interpret them as "one of my people".
There is a claim that by being on the inside, by being promoted, by befriending people within the labs, you will get an opportunity to steer these labs somewhat. "You just have to play the long game and win friends and influence people, and maybe at a critical point you will be able to get this corporation to do counterfactually better." There is an implied claim that what you would be there to do is to advocate for small changes with bearable costs, to be adopted voluntarily and unilaterally by a lab trying to win a race to superintelligence. There is also an implied claim that, in expectation, you will do this well enough to offset the direct negative impact of your labor on the race, and, more importantly, to offset your reinforcement of the frame that these companies are being responsible and have "good people" working there (and that therefore we need not coordinate for something better). Further justifications include "if not me, then someone worse", "if not this careful lab, then some careless lab", and "if not the USA, then China" (and therefore we cannot coordinate for something better).
Just as extraordinary claims require extraordinary evidence, strategies from a reference class that can have very large negative effects require very good justifications for expecting them not to succumb to one of that reference class's characteristic failure modes.
First, how surprised would you truly be if you found out that in the default future, you either quit and regret having joined, or end up just getting along and not pushing as hard as you thought you would for incremental change?
Second, how surprised would you truly be if your words ended up not being enough? Ended up not steering this corporation as much as you expected[7]?
Sometimes even heroes just have to play the game, and sometimes they just have to be hawkish, stare oblivion in the face, and keep doing the bad thing, potentially even with the burning intent to win, until the others come to their senses and also support plan A (which is to ban superintelligence at least until we mostly agree that we know what we are doing, and can make progress in sane, non-racing conditions).
But there is no chance of magically coordinating around plan A without common knowledge of the desire to coordinate around plan A.
The USA did not unilaterally reduce its nuclear arsenal. There was a lot of hawkishness, a lot of will to actually build more nuclear warheads if needed. But people also clearly signaled support for doing something better and saner: a clear, sober intent to go for a coordinated solution, to actually enforce it, and to make sure the other side didn't cheat, but nevertheless a clear intent not to cheat oneself, and to go for plan A, to make it an option, if at all possible.
If you're joining an org that is, in your assessment, ~net-negative because it seems like your role is actually locally good, you should run this assessment by people whose epistemics you trust, so that they can red team the hell out of it, especially given that "apparently locally good positions within an EvilCorp" are an effective lure for people in a reference class that includes ambitiously benevolent LessWrongers, Effective Altruists, etc.
Making plan A (coordination around not building X-risk-posing AI) happen requires a sufficient buildup of common knowledge. Building common knowledge requires speaking publicly about what is sane to do, even if — especially if — on your own, you are pursuing a plan B that superficially seems to hypocritically go against the grain of the plan A you are publicly supporting. The default Schelling point is "Rabbit", not "Stag", and this will not change unless the widespread desire to "hunt the Stag" becomes common knowledge.
To show that you actually care about reducing AI X-risk, state publicly that you would support coordination around not building dangerous ASI, that not building it is plan A, and that whatever you're doing inside the lab is either plan B (if plan A does not succeed), or building science and technology that you expect to be helpful if plan A is well-accomplished. The ~personal costs imposed by such a statement make it a credible signal of commitment[8].
An optional emotional dialogue on betrayal
(This section is about emotions. If you are cringing and running away after reading this sentence, this is not meant for you, and I’d encourage you to skip.)
I [Ishual / Samuel Buteau] have had many private discussions with friends that went essentially as follows. If you recognize yourself, please know that you are far from alone; I am pointing out a dynamic with an anonymized example, which is not about you personally.
Me: So, at least you should publicly state you’d rather we reached an international agreement so that the race could stop.
Friend: I don’t think I can do that. I don’t think you understand how much face I’d lose with my colleagues.
Me: This makes no logical sense. If you are trying to signal loyalty to EvilCorp (or LeastBadAI), calling for all players to be bound by what you ask EvilCorp to voluntarily do is strictly more loyal.
Friend: I understand that it is *logically* more loyal, but the vibes are all wrong, and my colleagues are not reasonable about this. They will just react very poorly if I say anything.
Me: It sounds to me like you don’t understand the depth of the betrayal I feel here. I think that no matter how unreasonable your colleagues are, I am reasonably very upset that you won’t even do this. It feels like defection. I don’t think your tiny incremental improvements to safety at EvilCorp will matter, and you don’t think my attempt at international cooperation will matter. But the difference is that you are shooting my hopes in the face, and I am accepting that you have to go work at EvilCorp and try your best. I am just asking you to stop shooting my hopes in the face! You are willing to accommodate your colleagues so much more than you are willing to accommodate me. Am I really asking for so much?
Friend: I think you are asking me to maybe get fired here.
Me: Do you know how fucked up it would be if they fired you over this?
Friend: Fucked up things happen lol (you should know that!)
Me: Yes, but if the culture is so utterly irredeemable internally that you are worried about getting fired over vibes despite logically being more on their side than if you just nag about voluntary burdens they should take on, … I don’t even know what to say. Don’t you think the world has a right to know? Don’t you think outsiders would care? Don’t you think maybe EvilCorp would have to not fire you? Don’t you think the impact you’d have on putting the world on a safer path would be bigger than what you’ll have from within this dysfunctional culture?
Friend: … I don’t know, man, let's talk about it in a few months.
Me: Look, I get that the incentives are not on my side here. I get it. I just want you to know that many people on the outside would have your back if you got fired. And maybe it is all a mirage that you’d be dispelling, and you’d have many people at EvilCorp at your side also.
More generally, a ban on whatever sort of AI they expect to be pursued and to lead to human extinction.
How to speak out publicly: maybe say it in the comments? Maybe write your own post about it? Maybe say it on podcasts? Maybe, if someone says some high-profile version of the idea, stand behind them? Probably, if you don't do any of these, you are not speaking out publicly in our eyes (but you can reach out and we will maybe include your thing in the comments).
If your colleagues can't tell if you'd prefer a ban to racing, you are not speaking out publicly.
More precisely, we think you should speak out both publicly and legibly to outsiders.
They should either take a public stance that plan A (coordinating not to build existentially dangerous AI) is significantly higher in their preference ordering than plan B (making the current race marginally less bad) or say separately that "plan A good" and "plan B bad".
Or more directly, a lack of negative social consequences for doing a very naughty thing.
In the wild, this might instead take the form of people not actually changing their worldview, but of severing their morality from their actions (unless the action is only seen by people who share the worldview).
It is reasonable to doubt that people will really coordinate. But if you do not say that you will coordinate, you are making coordination harder. If not you, then who will enable coordination?
Perhaps because this corporation contains a lot of people with incentives (monetary/hedonic) to not really get it, or to not really support you in group discussions, and few people trying to do what you are trying to do.
That said, we’d endorse you making this cost as low as possible. There is a consistent persona that might cut through some of the bad vibes and cost you less respect from your lab-mates (who may themselves be working on capabilities): be clear that you won’t advocate internally for any voluntary measure that you wouldn’t also publicly support imposing on all companies, that you’ll race with them until the outside world decides to stop, and that you’ll support stopping externally in the meantime. You are on their team (“if any lab must win, let it be us”), but you think this is a mad race and you’d prefer that all the labs be stopped across the globe.