[Co-written by Mateusz Bagiński and Samuel Buteau (Ishual)]
Many X-risk-concerned people who join AI capabilities labs with the intent to contribute to existential safety think that the labs are currently engaging in a race that is unacceptably likely to lead to human disempowerment and/or extinction, and would prefer an AGI ban[1] over the current path. This post makes the case that such people should speak out publicly[2] against the current AI R&D regime and in favor of an AGI ban[3]. They should explicitly communicate that a saner world would coordinate not to build existentially dangerous intelligences, at least until we know how to do it in a principled, safe way. They can choose to maintain their political capital by not calling the current AI R&D regime insane, or by finding a way to lean into the valid persona of “we will either cooperate (if enough others cooperate) or win the competition in style (otherwise)”.
X-risk-concerned people who have some influence within AI capabilities labs should additionally advocate internally for the lab to let its employees speak out publicly, as described above, without any official retaliation, and should truthfully state in public whether it does. If they are unable to get a lab to adopt this policy, they should say so publicly.
X-risk-concerned people in our communities should enforce the norm of praising the heroism of those who [join AI capabilities labs while speaking out publicly against the current mad race], and of being deeply skeptical of the motives of those who [join without publicly speaking out].
Not being public about one's views on this hinders the development of common knowledge, nearly guarantees that the exposure to corrupting influence from working inside the lab (which does not depend on whether one speaks out publicly) will partially reshape one into a worse version of oneself, and gives an alibi[4] to people who want to join labs for other reasons that would otherwise be condemned by their community.
Liron: "Do you really think that we should be encouraging people to go work at [these frontier labs]?"
Rob: "Do you think anyone who understands and cares about these [risks from superintelligence] should not be in the room where they can affect what actually happens?"
— Rob Miles on Doom Debates
Rob Wiblin: Should people who are worried about AI alignment and safety go work at the AI labs? There’s kind of two aspects to this. Firstly, should they do so in alignment-focused roles? And then secondly, what about just getting any general role in one of the important leading labs?
Zvi Mowshowitz: This is a place I feel very, very strongly that the 80,000 Hours guidelines are very wrong. So my advice, if you want to improve the situation on the chance that we all die for existential risk concerns, is that you absolutely can go to a lab that you have evaluated as doing legitimate safety work, that will not effectively end up as capabilities work, in a role of doing that work. That is a very reasonable thing to be doing.
— Zvi Mowshowitz on the 80,000 Hours podcast
The reasoning exemplified in the above quotes can often be heard in circles concerned with AI X-risk (or even AI safety more broadly), including from those who think that we are on a bad trajectory tending towards an existential catastrophe, and that a saner trajectory would involve coordinating to pause the development of AI that may lead to capabilities sufficient for an existential catastrophe, at least until we figure out whatever needs to be figured out to ensure that that kind of AI has a robustly good impact.
The motivation of ensuring that "there be good people in the room" (if truthful) is, in itself, noble and virtuous. It makes a lot of sense from a perspective that is largely focused on marginal, tractable impact, which is a staple of, among others, practical/as-applied EA philosophy.
However, this strategy carries a great risk. Once a person enters the monster's belly, the monster becomes capable of gradually constraining the person's degrees of freedom, so that, at each point, it is "locally rational" for the person to continue working, business as usual, while their agency is gradually being trimmed and shaped to better serve the monster. The ambitious positive impact that was initially intended erodes into "I am one of the few good guys, and if I leave, a worse guy is gonna replace me, so I should stay and do whatever I can on the margin (even if what I'm doing now is very far from what I initially intended)." This can take more corrupted/pernicious forms as well, such as the person's worldview and/or values[5] actually adapting to the new situation, so as to rationalize their prior behavior.
You have a moral obligation not to let it happen. The world's fate is at stake[6].
Moreover, this strategy does not involve any costly signals that would make the statement of intent credible. How can we know (at the point where we choose whether to enforce the norm), absent additional information, that making the lab's outcome marginally better from the inside is their true motivation? A similarly credible explanation is that their actual motive (whether they are consciously aware of it or not) is something like a fun job with a good salary (monetary or paid in status), justified by paying lip service to the threat models endorsed by those whose trust and validation they want. All of these motives are fine in isolation, but none of them justifies contributing to summoning a demon. Worse still, staying silent allows people to remain strategically ambiguous, so that people of different views/affiliations can each interpret them as "one of my people".
There is a claim that by being on the inside, by being promoted, by befriending people within the labs, you will get an opportunity to steer these labs somewhat. "You just have to play the long game and win friends and influence people, and maybe at a critical point you will be able to get this corporation to do counterfactually better." There is an implied claim that what you would be there to do is to advocate for small changes with bearable costs, to be adopted voluntarily and unilaterally by a lab trying to win a race to superintelligence. There is also an implied claim that, in expectation, you will do this well enough to offset the direct negative impact of your labor on the race, and, more importantly, to offset your reinforcement of the frame that these companies are being responsible and have "good people" working there (and that therefore we need not coordinate for something better). Further justifications include "if not me, then someone worse", "if not this careful lab, then some careless lab", and "if not the USA, then China" (and therefore we cannot coordinate for something better).
Just as extraordinary claims require extraordinary evidence, strategies from a reference class that can have very large negative effects require very good justifications for expecting them not to succumb to one of that reference class's characteristic failure modes.
First, how surprised would you truly be if you found out that in the default future, you either quit and regret having joined, or end up just getting along and not pushing as hard as you thought you would for incremental change?
Second, how surprised would you truly be if your words ended up not being enough? Ended up not steering this corporation as much as you expected[7]?
Sometimes even heroes just have to play the game, and sometimes they just have to be hawkish, stare oblivion in the face, and keep doing the bad thing, potentially even with the burning intent to win, until the others come to their senses and also support plan A (which is to ban superintelligence at least until we mostly agree that we know what we are doing, and can make progress in sane, non-racing conditions).
But there is no chance of magically coordinating around plan A without common knowledge of the desire to coordinate around plan A.
The USA did not unilaterally reduce its nuclear arsenal. There was a lot of hawkishness, a lot of will to actually build more nuclear warheads if needed. But people also clearly signaled support for doing something better and saner: a clear, sober intent to go for a coordinated solution, to actually enforce it, and to make sure the other side didn't cheat, but nevertheless a clear intent not to cheat oneself, and to go for plan A, to make it an option, if at all possible.
If you're joining an org that is, in your assessment, ~net-negative because it seems like your role is actually locally good, you should run this assessment by people whose epistemics you trust, so that they can red team the hell out of it, especially given that "apparently locally good positions within an EvilCorp" are an effective lure for people in a reference class that includes ambitiously benevolent LessWrongers, Effective Altruists, etc.
Making plan A (coordination around not building X-risk-posing AI) happen requires a sufficient buildup of common knowledge. Building common knowledge requires speaking publicly about what is sane to do, even if — especially if — on your own, you are pursuing a plan B that superficially seems to hypocritically go against the grain of the plan A you are publicly supporting. The default Schelling point is "Rabbit", not "Stag", and this will not change unless the widespread desire to "hunt the Stag" becomes common knowledge.
To show that you actually care about reducing AI X-risk, state publicly that you would support coordination around not building dangerous ASI, that not building it is plan A, and that whatever you're doing inside the lab is either plan B (if plan A does not succeed), or building science and technology that you expect to be helpful if plan A is well-accomplished. The ~personal costs imposed by such a statement make it a credible signal of commitment[8].
An optional emotional dialogue on betrayal
(This section is about emotions. If you are cringing and running away after reading this sentence, this is not meant for you, and I’d encourage you to skip.)
I [Ishual / Samuel Buteau] have had many private discussions with friends that went essentially as follows. If you recognize yourself, please know that you are far from alone; I am pointing out a dynamic with an anonymized example, which is not about you personally.
Me: So, at least you should publicly state you’d rather we reached an international agreement so that the race could stop.
Friend: I don’t think I can do that. I don’t think you understand how much face I’d lose with my colleagues.
Me: This makes no logical sense. If you are trying to signal loyalty to EvilCorp (or LeastBadAI), calling for all players to be bound by what you ask EvilCorp to voluntarily do is strictly more loyal.
Friend: I understand that it is *logically* more loyal, but the vibes are all wrong, and my colleagues are not reasonable about this. They will just react very poorly if I say anything.
Me: It sounds to me like you don’t understand the depth of the betrayal I feel here. I think that no matter how unreasonable your colleagues are, I am reasonably very upset that you won’t even do this. It feels like defection. I don’t think your tiny incremental improvements to safety at EvilCorp will matter, and you don’t think my attempt at international cooperation will matter. But the difference is that you are shooting my hopes in the face, and I am accepting that you have to go work at EvilCorp and try your best. I am just asking you to stop shooting my hopes in the face! You are willing to accommodate your colleagues so much more than you are willing to accommodate me. Am I really asking for so much?
Friend: I think you are asking me to maybe get fired here.
Me: Do you know how fucked up it would be if they fired you over this?
Friend: Fucked up things happen lol (you should know that!)
Me: Yes, but if the culture is so utterly irredeemable internally that you are worried about getting fired over vibes despite logically being more on their side than if you just nag about voluntary burdens they should take on, … I don’t even know what to say. Don’t you think the world has a right to know? Don’t you think outsiders would care? Don’t you think maybe EvilCorp would have to not fire you? Don’t you think the impact you’d have on putting the world on a safer path would be bigger than what you’ll have from within this dysfunctional culture?
Friend: … I don’t know, man, let's talk about it in a few months.
Me: Look, I get that the incentives are not on my side here. I get it. I just want you to know that many people on the outside would have your back if you got fired. And maybe it is all a mirage that you’d be dispelling, and you’d have many people at EvilCorp at your side also.
More generally, a ban on whatever sort of AI they expect to be pursued and to lead to human extinction.
How to speak out publicly: maybe say it in the comments? Maybe write your own post about it? Maybe say it on podcasts? Maybe, if someone says some high-profile version of the idea, stand behind them? Probably, if you don't do any of these, you are not speaking out publicly in our eyes (but you can reach out and we will maybe include your thing in the comments).
If your colleagues can't tell if you'd prefer a ban to racing, you are not speaking out publicly.
More precisely, we think you should speak out both publicly and legibly to outsiders.
They should either take a public stance that plan A (coordinating not to build existentially dangerous AI) is significantly higher in their preference ordering than plan B (making the current race marginally less bad) or say separately that "plan A good" and "plan B bad".
Or more directly, a lack of negative social consequences for doing a very naughty thing.
In the wild, this might instead take the form of people not actually changing their worldview, but of severing their morality from their actions (unless the action is only seen by people who share the worldview).
It is reasonable to doubt that people will really coordinate. But if you do not say that you will coordinate, you are making coordination harder. If not you, then who will enable coordination?
Perhaps because this corporation contains a lot of people with incentives (monetary/hedonic) to not really get it, or to not really support you in group discussions, and few people trying to do what you are trying to do.
That said, we’d endorse you making this cost as low as possible. There is a consistent persona that might cut through some of the bad vibes and cost you less respect from your lab-mates (who may themselves be working on capabilities): be clear that you won’t advocate internally for any voluntary measure that you wouldn’t also publicly support imposing on all companies, that you’ll race with them until the outside world decides to stop, and that you’ll support stopping externally in the meantime. You are on their team (“if any lab must win, let it be us”), but you think this is a mad race and you’d prefer that all the labs be stopped across the globe.