I've been thinking about the role of outsiders over the next decade as we transition to ASI. By "outsiders", I mean the AI safety community outside the frontier labs, plus some other actors trying to make things go well. This is in contrast to other groups like: (i) employees at the frontier labs ("insiders"), (ii) senior government officials and policymakers, and (iii) the general public.[1]
My current views:
Outsiders are a key component of things going well. In this article, I go through the main reasons I expect this (with some percentages representing the strengths of the reasons).
There's a bunch of activities outsiders might be doing. We should maintain optionality over activities and be prepared to pivot quickly.
The best prioritisation over these activities will depend on a rapidly evolving and unpredictable strategic landscape, e.g. "help a clueless US government understand the takeover risks from the current deployment", or "develop cheap techniques which can be exported to unreasonable labs", or "verify claims made by labs to a very clued-in international community".
The current zeitgeist seems to be "general managers", i.e. orgs that focus on one topic. By topic, I mean things like "AI epistemics", "white-box control", "reward hacking". I think this is good, despite the need to pivot between activities, because you can pivot between activities while staying in the same topic. E.g. you can pivot between "developing training techniques for eliciting good advice from AIs" and "embarrassing labs by publicising low-quality AI epistemics" and "helping governments decide which AI model to use for strategic advice" etc.
I'm excited about ensuring outsiders have access to resources — e.g. information, compute, funding, headcount, model access. This seems robust to different activities.
I think outsiders should maintain a reputation for epistemic integrity, i.e. saying what we believe and why we believe it. This seems like an important asset in the most impactful activities. It seems more important than (e.g.) acting friendly with labs.
How do outsiders make things better?
Why do I expect that outsiders are a key component of things going well? Below are some reasons I've thought about. The percentages capture something like: how much of the case for outsiders being a key component of things going well comes from this reason? But the exact numbers are mostly vibes.
In short: there's a bunch of activities that outsiders seem well-suited for, compared to safety-minded lab employees. The best things for us to work on seem pretty tied to how things unfold during the next couple of years. There are also some minor considerations.
Outsiders currently do a plurality of good work [5%]
Epistemic checks on labs [5%]
Mitigating risks from reasonable labs [4%]
Mitigating risks from unreasonable labs — Plan D labs [15%], Plan E labs [3%]
Increasing reasonableness of the labs — Plan D labs [15%], Plan E labs [5%]
Slowing unreasonable labs, differentially — Plan D labs [3%], Plan E labs [3%]
Neutral third-parties [9%]
Achieving government buy-in [5%]
Better futures stuff [10%]
Minor considerations
Outsiders amortise across labs. [1%]
Labs are often a mess. [2%]
Switching costs across labs. [1%]
Power concentration and coups. [4%]
Other/unknown unknowns/heuristics [10%]
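A quick sanity check on the weights listed above: the sketch below tallies them and splits out the Plan D and Plan E portions. The numbers are copied from the list; the tuple layout and the helper name `total_for` are illustrative choices only, not anything the list itself defines.

```python
# Weights (in percent) copied from the list above.
# The (label, lab_type, weight) layout and `total_for` are illustrative
# assumptions, not part of the original list.
weights = [
    ("Outsiders currently do a plurality of good work", "all", 5),
    ("Epistemic checks on labs", "all", 5),
    ("Mitigating risks from reasonable labs", "all", 4),
    ("Mitigating risks from unreasonable labs", "Plan D", 15),
    ("Mitigating risks from unreasonable labs", "Plan E", 3),
    ("Increasing reasonableness of the labs", "Plan D", 15),
    ("Increasing reasonableness of the labs", "Plan E", 5),
    ("Slowing unreasonable labs, differentially", "Plan D", 3),
    ("Slowing unreasonable labs, differentially", "Plan E", 3),
    ("Neutral third-parties", "all", 9),
    ("Achieving government buy-in", "all", 5),
    ("Better futures stuff", "all", 10),
    ("Outsiders amortise across labs", "all", 1),
    ("Labs are often a mess", "all", 2),
    ("Switching costs across labs", "all", 1),
    ("Power concentration and coups", "all", 4),
    ("Other/unknown unknowns/heuristics", "all", 10),
]


def total_for(lab_type=None):
    """Sum the weights, optionally restricted to one lab type."""
    return sum(w for _, t, w in weights if lab_type is None or t == lab_type)


total = total_for()
plan_d = total_for("Plan D")
plan_e = total_for("Plan E")
print(total, plan_d, plan_e)  # 100 33 11
```

So the weights sum to 100, with 33 points attached specifically to Plan D labs and 11 to Plan E labs.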
I'll go through these in more detail.
Outsiders currently do a plurality of good work.
I think that, in terms of making AI go well, the top 500 outsiders currently outperform any particular frontier lab or government. Naively extrapolating into the future, the outsiders would be the primary force making things go well.
I'm pretty sure this extrapolation is an overestimate. But I nonetheless expect outsiders will remain a big chunk of the value-weighted work. Relative to the leading lab, my guess is the outsider impact might drop from like 200% today to 50% during crunch-time. The drop is due to three trends:
Trend 1: Many outsiders will probably join the labs, becoming insiders, which implies that outsiders will become relatively less important. However:
Many outsiders won't join the labs. This is true even if, from your perspective, you think they are making a mistake by not joining. This is due to personal reasons, their own biases, etc.
In shortish timelines, outsiders might not have enough time to join the labs. This is because it might be slow for outsiders to (i) recognise they should join the labs, (ii) offboard from their orgs, and (iii) onboard into the labs.
There is also a coordination problem: if everyone else in your org is remaining at the org, then maybe your best move is to also remain, but you would all be better off joining the labs. (I think one solution here is the labs acquiring outsider orgs wholesale.)
Trend 2: The insiders will probably become more productive due to better funding, compute access, and model access.
I think this is a big deal, and the main thing cutting against the importance of outsiders. That said, I don't want to accept this as an immutable exogenous constraint. I think it's worthwhile to fight for better funding, compute, and model access for the outsiders.
Trend 3: In the final months before handoff, the most important work will be best-suited to insiders, i.e. implementing safety techniques, assigning tasks to the AIs, etc.
This is also a big deal. But I still think there will be enough work for outsiders to do before that point.
Epistemic checks on labs
I think it's very plausible that, if labs build misaligned AIs that take over, then many of the insiders had a genuine but incorrect belief that the AI wouldn't take over based on evidence that was actually flimsy and misleading. In general, a huge proportion of risk comes from scenarios where a bunch of stuff happens really quickly, with decisions made by an insular group of people in a terrible epistemic environment with no external checks. (See also Daniel Kokotajlo on this.)
I'm personally unsure how bad the labs are as epistemic environments. Some people tell me they are pretty bad and getting worse, but I don't know any egregious examples.
Here are some mechanisms you might worry about:
Selection effects — e.g. optimists joined labs earlier, and gained more influence, and pessimists were purged or sidelined.
Treatment effects — e.g. groupthink, financial incentives, echo chamber, social conditioning, status-y stuff
"Is our AI aligned?" is the kind of question where it's easy for the lab to get caught up in random dumb beliefs, especially in the presence of some motivated reasoning and selective argumentation.
You might also be worried about lab leadership lying to the insiders, or withholding information, to avoid drastic internal action. This gets worse as lab employees lose influence due to R&D automation, and as teams grow increasingly siloed.
Things look a lot better if there is a decent state of public scientific understanding of what's going on. This probably requires outsiders leading the conversation, because traditional academia and media will probably drop the ball.
Outsiders could independently assess the risks of AI development as it goes on and point out when evidence for imminent catastrophic risk becomes very strong.
If the insiders knew that their claims would be scrutinised by the outsiders, then they would probably feel more pressure to believe defensible claims.
On the other hand, you might expect the labs to become better epistemic environments, due to a combination of: (i) taking risks more seriously as the stakes increase, (ii) having access to R&D automation. But I think it's quite likely that the epistemics would grow dangerously poor absent the outsiders.
Mitigating risks from reasonable labs
Consider a hypothetical lab making good choices by our lights. They share our worldview on AI risks, and they're broadly aiming to achieve near-best futures. That said, they still have constraints from external investors; they still consist of ordinary humans with bounded rationality; and some of the non-leadership employees might be unreasonable.
How should outsiders reduce the chance that the reasonable lab deploys a catastrophic AI?
I think outsiders probably aren't well-positioned to do this, because the insiders will be so much more productive.[3] The insiders are probably 3-100x more productive at mitigating risks from the lab. So if outsiders want to mitigate risks from the reasonable lab, they should probably join the lab.
That said, the strategic landscape is unpredictable, so I wouldn't be surprised if there will be a bunch of useful stuff for outsiders to do:
Activities with low uplift. The insiders probably have better model access than the outsiders. Hence, if tasks can be automated with AI (e.g. coding, empirical research, verifiable tasks) then it's probably not helpful for outsiders to add their own labour, because this labour will be much less productive than the insider labour on those tasks. However, there might be some tasks which can't be automated with AI (e.g. macrostrategy, agent foundations, conceptual work, research taste, hard-to-verify tasks). Maybe the outsiders could do that and export to the lab.
Generating training data. If the lab wants to train the models on certain domains, then it might be worthwhile for outsiders to generate that data. In the extreme, this might involve outsiders talking to AIs a lot.
Sensitive research. Maybe the labs are leaky, so it's better for certain sensitive research to happen within a highly secure outsider organisation. The lab might be leaky due to spies, hacking from external adversaries, hacking from the AI agents, or simply Bay Area house parties.
Sabotage hedge. This is a bit galaxy-brained. Maybe outsiders could run parallel versions of the research that's happening in the labs. This provides an external check that the research isn't being sabotaged by scheming AIs or scheming employees.
Forcing function for reduced-permission infra. This is also galaxy-brained. If outsiders do a bunch of research outside the labs, then labs are incentivised to provide tooling which allows interacting with the AIs using reduced permissions, e.g. Tinker. This infra might mitigate threats from scheming AIs or spies. See Buck's The Thinking Machines Tinker API is good news for AI control and security.
Mitigating risks from unreasonable labs
What if the lab is unreasonable? How should outsiders reduce the chance that the unreasonable lab deploys a catastrophic AI? I think this looks somewhat different if there are a decent number of safety-minded insiders, versus if there are none. Slightly abusing Ryan's taxonomy,[4] we could call these "Plan D" labs and "Plan E" labs.
Plan D. The lab won't take reasonable measures to reduce catastrophic deployment, at least by default. However, there are 10-30 safety-minded insiders who can direct ~3% of compute. See also "Ten people on the inside".
What should outsiders do?
Export safety tech to safety-minded insiders. The 10-30 insiders will be so overwhelmed by implementing safety techniques (even with AI uplift) that they won't have time to invent and refine the techniques themselves. So outsiders might do this, running experiments on open-source models and public APIs.
Sabotage hedge. This is less galaxy-brained in a Plan D lab. The insiders might be pretty worried that the research is being sabotaged, e.g. because a schemer is running a rogue internal deployment and interfering with the safety-minded insiders' experiments.
Sensitive research. This matters more in the Plan D lab. The insiders might worry that the rest of the lab will take their findings and use them for capabilities. So they might want outsiders to do safety research with significant capability externalities.
Increasing the reasonableness. I'll discuss this in the next section. This most likely reduces the risks from the lab's deployment. But I want to note that, theoretically, increasing the reasonableness of the lab could increase the risks from the lab's deployment, so long as this was outweighed by a bigger reduction in the risks from rivals. For example: maybe the lab is deploying their AIs with too many safeguards, and the outsiders should encourage a trusted handoff to reduce x-risk elsewhere.
But these don't seem great. Probably the outsiders should try to join the 10-30 safety-minded insiders. Or they should join a reasonable lab to do the activities above.
Plan E labs. The lab has no safety-minded insiders, and is dismissive of catastrophic risks. How can outsiders mitigate the risks from the lab? I'm not optimistic about any of these:
Hardening — Maybe the outsiders can try to "harden" the external world, so that a rogue internal deployment is less likely to lead to AI takeover. This might involve cyberdefence, biodefence, anti-manipulation tech, d/acc, etc. I'm not optimistic about this because you'd probably need to harden the external world against an ASI, without an ASI yourself.
Deals — Potentially, you could try to negotiate with the misaligned AIs. I'm not optimistic about this, because the negotiating position looks so bad for the outsiders. But it's possible this works if (1) the AIs are worried about the Plan E lab being disempowered by a rival, or (2) the AIs within the Plan E lab don't share common values.
Negative alignment tax — Potentially, the outsiders could search for safety techniques which are more competitive than what the lab is currently using. The lab might implement this themselves. I'm very pessimistic about this. Firstly, the lab is probably ahead of the outsiders, and so it'll be hard to beat them at their own game. More importantly: this would only mitigate risks from AIs trained with that technique, but the lab is happy to deploy those AIs to build the maximally capable successor, so you're relying on those AIs finding a negative-alignment-tax technique, and so on. That sounds absurd to me.
Increasing reasonableness of the labs
It's probably better if the labs were making wiser and more altruistic decisions. This involves:
Improving epistemics — e.g. better auditing techniques, risk reports, etc.
Improving incentives — e.g. communicating risks to the public, the government, safety-minded lab employees; increasing the weight that the lab places on good motivations (e.g. mitigating risk from their own deployment, mitigating risks from their rivals) compared with bad motivations (e.g. superficial appearance of safety, making lots of money, prestige).
In practice, improving the lab's epistemics and improving their incentives looks pretty similar. That's because we can often improve their incentives by improving the epistemics of other actors (the public, the government, etc) — the hope is that, if these other actors had a better sense of what's going on, then they would take (drastic) actions in their interest, and this would be good. Of course, you might face a regress problem, where you also want to improve the incentives of those other actors, such that they're incentivised to take good actions.
Plan D. If the lab has 10-30 safety-minded insiders, then here's what outsiders might do to make the lab more reasonable:
Incrimination tech — This includes better alignment auditing, better honeypotting, better sandbagging detection, better classifiers for malign action, etc. This seems good, but probably better left to insiders, because it will enjoy a decent amount of AI uplift.
Dangerous capability evals and scary demos — e.g. UK AISI's cyberevals. This is well-suited to outsiders, because insiders might not be trusted to run the evals. The insiders might be suspected of sandbagging on the evals (to avoid stringent regulation) or exaggerating performance (to generate hype).
Scrutinising safety cases — If the lab makes arguments for why their deployment risk is low, then outsiders might generate counterarguments. They can communicate these counterarguments to lab leadership (to improve epistemics), and to the government and public (to improve incentives). Unfortunately, I think it'll grow increasingly difficult to improve the lab's incentives by scrutinising their safety cases, because the government and public will have a harder time understanding the arguments.
Evidence of non-catastrophic harms — This might be more promising. Here, you gather examples of non-catastrophic harms caused by the lab (e.g. jailbreaks), and use this to embarrass the labs. The hope is that, if the lab puts more effort into mitigating the non-catastrophic harms, then this would spill over into mitigating catastrophic harms as well. You can gather these examples naturally, or by actively trying to elicit them (red-teaming).
Case study: jailbreaks. Arguably, labs prioritised preventing jailbreaks because external researchers tried to jailbreak their models (e.g. UK AISI, FAR.ai, academics, people on Twitter). It’s possible that if you wanted labs to improve jailbreak robustness, then working externally on jailbreaking models would’ve been competitive with working internally on improving robustness.
Forecasting and sensemaking — Paradigm examples include things like AI 2027, Europe 2031, etc. The goal here is to explain the risks in a concrete and vivid way. I think these have been pretty successful. I would also include things like METR time-horizons, and EpochAI's work. This improves epistemics within the lab, and also the incentives of the lab by encouraging more government and public oversight.
Third-party risk reports — I'm including here risk assessments of particular models, safety techniques, or deployments. The paradigm example is METR's frontier risk report. This incentivises labs to implement safety techniques that are costly but reduce risk, for fear of criticism from employees, the public, government, and other actors.
Empowering the 10-30 insiders. This might involve joining the unreasonable lab, so they have more headcount, or helping them out in other ways, e.g. subsidising their compute.
Slowing down unreasonable labs, differentially. I'll discuss this more below. Basically, this involves setting up obstacles which require acting reasonably to overcome. If there are enough obstacles then this has a selection effect (i.e. the reasonable labs have an advantage and pull ahead). But it also has a treatment effect (i.e. labs become reasonable in order to overcome the obstacles). So this is helpful for increasing the reasonableness of the labs.
Do outsiders have a comparative advantage? My guess is that insiders have a comparative advantage for improving epistemics, and the outsiders for improving incentives. Overall, I think a good dynamic is having safety-pilled insiders communicating risks to lab leadership, and outsiders communicating risks to policymakers and the public.
This is because improving incentives is adversarial, and outsiders are better positioned to be adversarial to labs:
The lab has leverage over insiders, e.g. employment, and equity (see the equity clawback situation with Daniel Kokotajlo).
The insiders are probably less trusted to be neutral by the government and the public.
The outsiders have freedom to communicate candidly with the government and the public, whereas insiders have signed NDAs.
However, I think insiders could also be well-positioned for adversarial comms, if they think they have enough leverage (e.g. social connections, prestige and respect, threatening to quit and join competitors, threatening to whistleblow, etc).
Plan E. If the lab has no safety-minded insiders, then outsiders need to behave differently:
Obviously, less focus on helping the insiders increase reasonableness (because there are no insiders).
Outsiders should spend more time communicating risks directly to lab leadership, because there are no insiders covering this. But I'm pessimistic about this working, because the outsiders will lack knowledge about how the models are trained.
Developing incrimination tech looks not great, because the lab won't implement it. Maybe the outsiders could try incriminating the AIs via their observable effects on the outside world (e.g. searching for rogue external deployments) or by honeypotting them via public APIs. But I'm not optimistic about this, because the AIs probably don't need to do anything incriminating outside the lab's datacentre.
Making legal challenges against the lab is still promising, but probably too slow.
Slowing unreasonable labs, differentially
If a reasonable lab is leading ahead of unreasonable labs, then we should try to slow down the unreasonable labs. This gives the reasonable lab a longer lead-time to burn on safety, and makes it harder for the unreasonable lab to overtake.
If a reasonable lab is trailing behind the unreasonable labs, then we should try to slow down the unreasonable labs. This gives the reasonable lab a chance to overtake, and gives them a better bargaining position in a negotiation.
Ideally, this would happen impartially. That is, the outsiders set up obstacles which require acting reasonably to overcome.
Plan D. Suppose we want to slow down a Plan D lab. Here's what outsiders can do:
Trigger existing commitments and regulations. The Plan D lab might've made various commitments to their own employees, investors, governments, e.g. to stop scaling unless bla occurs. The outsiders can try to demonstrate that bla has not yet occurred, which spurs the relevant counterparty to act. For example, they might join the relevant regulatory body, or advise a legal challenge. The outsiders are better positioned than insiders because they can be more outspoken.
Push for shutdown. The outsiders can try to push the government to reverse a deployment (cf. Fable) or shut down the unreasonable lab entirely. The outsiders might be better positioned because they are trusted by the public and the government — especially if the labs are deeply unpopular by this point. Although potentially this ask would sound more scary if it came from the insiders. I'm not sure. Probably a combination is best.
Plan E. This looks similar to the above, except:
The lab probably isn't bound by any commitments or regulations. Potentially there would still be time to push hard on getting these, but I would be pessimistic, unless you could first increase the reasonableness of the lab.
If the Plan E lab is sufficiently terrible, it might be worthwhile for the outsiders to speed up the more reasonable rival labs, e.g. by working on capabilities.
Neutral third-parties
Suppose we head for a world with a high amount of political will, such as Plan A or Plan B. Again, this comes from Ryan's taxonomy.
Plan A: There is enough will for some sort of strong international agreement that mostly eliminates race dynamics and allows for slowing down (at least for some reasonably long period, e.g. 10 years) along with massive investment in security/safety work.
Plan B: The US government agrees that buying lead time for US AI companies is among the top few national security priorities (not necessarily due to misalignment concerns) and we can spend 1-3 years on mitigating misalignment risk.
I'll discuss how outsiders help in Plan B and then in Plan A.
How do outsiders help in Plan B? If there is buy-in from the government, then there will be a critical role for neutral third-parties in shaping and enforcing the government's plan for AI development.
Examples:
Auditing the models for misalignment or secret loyalties
Third-party risk assessments (3PRA)
Designing new regulations/standards/orders
Evaluating labs against those regulations/standards/orders
Verifying claims made by the labs to the government
I think that these activities are best done by neutral parties, without an affiliation with (or financial stake in) a frontier lab. Even a former affiliation might be disqualifying.[5] This might be a bigger problem in the future because (some or all) labs may be deeply unpopular with the public or the governments.
Of course, the third-parties probably can't be completely neutral — e.g. they might still rely on the labs for compute, model access, and other support. And they will still have social connections that might undermine the perception of their neutrality. And the outsiders probably have preferences about which lab takes the lead. But I think this can be mitigated by the third-party using transparent, replicable processes, and being auditable by other actors.
How do outsiders help in Plan A? I think the case here is much weaker. I think it's unlikely that the US and China would both trust the outsiders to act neutrally between them. That said, if there is an international agreement, then this probably requires actors which are neutral among the US labs to coordinate with China. And the outsiders seem well-positioned to form those parties.
And the outsiders are much less important in Plan C-E because the coordination and verification would occur between the labs directly (e.g. labs agreeing to audit each other’s models, or share safety research, or mutually slow down).
Achieving government buy-in
I discussed above how outsiders might be useful in Plan B (and somewhat in Plan A) as neutral third-parties. What about for achieving a Plan B in the first place? That is, how can outsiders gain enough government buy-in for serious government involvement to make a big difference to the strategic picture?
Activities that seem good here:
Preempting Plan A/B. As we saw above, there's a bunch of activities that outsiders should expect to do in Plan A/B. They could start doing that now, e.g. start designing tamper-proof GPUs because you think that will be useful in Plan A/B. This increases the perception that Plan A/B is likely to work, because some of the technical challenges have already been met, and there's a legible ecosystem poised to address further challenges. Outsiders have the advantage in this preemption work, because they will be the ones who would do this "for real" in an actual Plan A/B.
Improving epistemics of the government. We discussed this a little in "Increasing reasonableness of the labs". Here are the best things:
Dangerous capability evals and scary demos, e.g. UK AISI's cyberevals
Evidence of non-catastrophic harms, e.g. jailbreaks, manipulation, etc
Forecasting and sensemaking, e.g. AI 2027, EpochAI, METR.
Third-party risk reports, e.g. METR.
Improving incentives of the government.
Communicating risks to the public, so the government feels they would lose support by failing to take appropriate measures. Outsiders have an advantage here because the labs are not perceived as neutral, and might be deeply unpopular in the future.
Supporting the electoral campaigns of reasonable people.
Reducing the cost to the labs of government buy-in, so they expend less effort in lobbying against this. For example: designing regulations which impose less cost on labs per unit of safety.
Better futures stuff
Some outsiders are doing things other than directly avoiding AI takeover, which are still important for achieving a near-best future. Examples:
Cosmic resource allocation — This is probably non-puntable, because we need to bargain over cosmic resources before a single actor has secured a high BATNA. If a scope-sensitive actor thinks they have a 50% chance of grabbing 50% of the cosmic resources, they won't accept an allocation of 10%.
AI welfare — My guess is that this is puntable, but I'm excited for outsiders to scan for non-puntable interventions and execute on them.
Power concentration — I'm imagining both threat modelling and research. But also object-level stuff (OSINT, investigative journalism, and active resistance to unfolding coups).
Gradual disempowerment — There's probably a bunch of stuff for outsiders to do if gradual disempowerment becomes an issue. See here.
Post-ASI governance — Much of this work might be non-puntable, e.g. What should be the limitations on manipulating other humans' values and beliefs? Should we limit the velocity of Von Neumann probes? etc. Outsiders can forecast threats, suggest proposals, and implement the best ones.
Speculative considerations — This includes object-level work on acausal, ECL, simulators, infinite ethics, etc. I'm excited for outsiders to scan for non-puntable interventions and execute on them.
Automated conceptual reasoning — This makes all the above go better. Labs might do this by default, but they might not. And maybe outsiders can help here, e.g. developing better techniques for eliciting conceptual reasoning. Even if the labs are excited by this, they might need outsiders as a trusted source of labelling, e.g. maybe the outsiders are just discussing macrostrategy all day, so labs can train models on the transcripts.
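The arithmetic behind the cosmic resource allocation example can be made explicit. This is a minimal sketch under two simplifying assumptions that go beyond the text: that "scope-sensitive" means the actor values resources linearly, and that a failed grab leaves the actor with nothing. The function names `expected_share` and `accepts_offer` are hypothetical.

```python
def expected_share(p_win: float, share_if_win: float, share_if_lose: float = 0.0) -> float:
    """Expected fraction of resources for an actor who values them linearly.

    Assumption (not from the text): a failed grab yields `share_if_lose`,
    which defaults to zero.
    """
    return p_win * share_if_win + (1.0 - p_win) * share_if_lose


def accepts_offer(offer: float, p_win: float, share_if_win: float) -> bool:
    """A linear-utility actor accepts only offers at least as good as their BATNA."""
    return offer >= expected_share(p_win, share_if_win)


# The example from the text: a 50% chance of grabbing 50% of the resources.
batna = expected_share(0.5, 0.5)
print(batna)                           # 0.25
print(accepts_offer(0.10, 0.5, 0.5))   # False
```

Under these assumptions the actor's fallback is worth 25% of the resources in expectation, so a 10% allocation is rejected. This is why the bargaining has to happen before anyone's BATNA gets that high.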
Minor considerations
Here are other miscellaneous considerations I’ve seen for how outsiders are helpful.
Outsiders amortise across labs. Research produced by outsiders can be exported to all the frontier labs. By contrast, research produced by insiders mostly stays within the lab. This is because labs are quite paranoid about competitiveness risks, and there might be a bit of a not-invented-here bias. (Of course, labs often share safety research — this is good for hiring, and I think it's incentivised by the game theory.) Overall, there might be a 20% haircut to any research done inside a lab, due to making it harder to export.
Labs are often a mess. We've already discussed that labs have poor incentives, and might be epistemically compromised. I can imagine that labs are defective in other ways, just because large bureaucracies are often messy. This might include: dysfunctional infra, a silly internal policy, corporate drama, political retaliation, a wacky leadership. Of course, outsider orgs might also suffer these issues, but there are more of them, which is healthier. This might mean that outsiders are more productive than insiders.
Switching costs across labs. We don't know which company will end up in the lead. The outsiders might find it easier to onboard into the leading lab, compared with employees at the trailing lab who might face more friction.
Power concentration. Heuristically, if the outsiders are doing a bunch of stuff around AI, then this seems to help reduce concentrations of power, compared with the same work happening within a lab or within government.
Here are some questions on my mind, which I might discuss in future articles:
What should outsiders be doing, and when? How should they prioritise their time? In particular, what activities are best left to safety-minded insiders?
How important are outsiders? Do they provide a marginal benefit, or are they the primary force making things go well?
Should outsiders remain outsiders (as opposed to joining the frontier labs)? Should safety-minded insiders become outsiders (as opposed to remaining at the frontier labs)?
How can we maximise the impact of the outsiders? In particular:
How can we ensure the outsiders know what they need to know?
How can we ensure the outsiders work productively through the transition, especially with the benefits of AI uplift? This includes having enough headcount, funding, compute, and model access.
How can we ensure the outsiders have enough influence over the frontier labs, governments, and other parts of the ecosystem?
How can we ensure the outsiders stay on-target? This includes having good values, fixing bad incentives, and generally being virtuous.
“In the meantime it will have become very hard for you to learn from anybody who doesn't have these clearances. Because you'll be thinking as you listen to them:
'What would this man be telling me if he knew what I know? Would he be giving me the same advice, or would it totally change his predictions and recommendations?'
And that mental exercise is so torturous that after a while you give it up and just stop listening.
I've seen this with my superiors, my colleagues….and with myself. You will deal with a person who doesn't have those clearances only from the point of view of what you want him to believe and what impression you want him to go away with, since you'll have to lie carefully to him about what you know. In effect, you will have to manipulate him. You'll give up trying to assess what he has to say. The danger is, you'll become something like a moron. You'll become incapable of learning from most people in the world, no matter how much experience they may have in their particular areas that may be much greater than yours.”
Why? They know the available techniques; they know which techniques have been implemented; they have access to unpublished data on those techniques; they have more compute to study the techniques; they have the permissions to choose which techniques to implement; they have the model access to assist in the implementation.
See here for Ryan's taxonomy. I'm slightly abusing his terms, because his "Plan D" and "Plan E" refer to scenarios where the leading lab has ~10 safety-minded insiders and no safety-minded insiders, respectively. But I'm using these as adjectives for a lab, regardless of whether it's leading or not.
"Obviously what happened is Burns was bumped because of his association with Anthropic. A dumb but predictable own goal. A lib admin would have done the same to an xAI technical safety researcher, assuming any of those still exist." — Dean Ball
How do outsiders make things better?
Why do I expect that outsiders are a key component of things going well? Below are some reasons I've thought about. The percentages capture something like: how much of the case for outsiders being a key component rests on this reason? But the exact numbers are mostly vibes.
In short: there's a bunch of activities that outsiders seem well-suited for, compared to safety-minded lab employees. The best things for us to work on seem pretty tied to how things unfold during the next couple of years. There are also some minor considerations.
I'll go through these in more detail.
Outsiders currently do a plurality of good work.
I think that, in terms of making AI go well, the top 500 outsiders currently outperform any particular frontier lab or government. Naively extrapolating into the future, the outsiders would be the primary force making things go well.
I'm pretty sure this extrapolation is an overestimate. But I nonetheless expect outsiders will remain a big chunk of the value-weighted work. Relative to the leading lab, my guess is the outsider impact might drop from like 200% today to 50% during crunch-time. The drop is due to three trends:
Trend 1: Many outsiders will probably join the labs, becoming insiders, which implies that outsiders will become relatively less important. However:
Trend 2: The insiders will probably become more productive due to better funding, compute access, and model access.
I think this is a big deal, and the main thing cutting against the importance of outsiders. That said, I don't want to accept this as an immutable exogenous constraint. I think it's worthwhile to fight for better funding, compute, and model access for the outsiders.
Trend 3: In the final months before handoff, the most important work will be best-suited to insiders, i.e. implementing safety techniques, assigning tasks to the AIs, etc.
This is also a big deal. But I still think there will be enough work for outsiders to do before that point.
Epistemic checks on labs
I think it's very plausible that, if labs build misaligned AIs that take over, then many of the insiders had a genuine but incorrect belief that the AI wouldn't take over based on evidence that was actually flimsy and misleading. In general, a huge proportion of risk comes from scenarios where a bunch of stuff happens really quickly, with decisions made by an insular group of people in a terrible epistemic environment with no external checks. (See also Daniel Kokotajlo on this.)
I'm personally unsure how bad the labs are as epistemic environments. Some people tell me they are pretty bad and getting worse, but I don't know any egregious examples.
Here are some mechanisms you might worry about:
Here's how outsiders improve the situation:
On the other hand, you might expect the labs to become better epistemic environments, due to a combination of: (i) taking risks more seriously as the stakes increase, (ii) having access to R&D automation. But I think it's quite likely that the epistemics would grow dangerously poor absent the outsiders.
Mitigating risks from reasonable labs
Consider a hypothetical lab making good choices by our lights. They share our worldview on AI risks, and they're broadly aiming to achieve near-best futures. That said, they still have constraints from external investors; they still consist of ordinary humans with bounded rationality; and some of the non-leadership employees might be unreasonable.
How should outsiders reduce the chance that the reasonable lab deploys a catastrophic AI?
I think outsiders probably aren't well-positioned to do this, because the insiders will be so much more productive.[3] The insiders are probably 3-100x more productive at mitigating risks from the lab. So if outsiders want to mitigate risks from the reasonable lab, they should probably join the lab.
That said, the strategic landscape is unpredictable, so I wouldn't be surprised if there will be a bunch of useful stuff for outsiders to do:
Mitigating risks from unreasonable labs
What if the lab is unreasonable? How should outsiders reduce the chance that the unreasonable lab deploys a catastrophic AI? I think this looks somewhat different if there are a decent number of safety-minded insiders, versus if there are none. Slightly abusing Ryan's taxonomy,[4] we could call these "Plan D" labs and "Plan E" labs.
Plan D. The lab won't take reasonable measures to reduce catastrophic deployment, at least by default. However, there are 10-30 safety-minded insiders who can direct ~3% of compute. See also Ten people on the inside.
What should outsiders do?
But these don't seem great. Probably the outsiders should try to join the 10-30 safety-minded insiders. Or they should join a reasonable lab to do the activities above.
Plan E labs. The lab has no safety-minded insiders, and is dismissive of catastrophic risks. How can outsiders mitigate the risks from the lab? I'm not optimistic about any of these:
Increasing reasonableness of the labs
It would probably be better if the labs were making wiser and more altruistic decisions. This involves:
In practice, improving the lab's epistemics and improving their incentives looks pretty similar. That's because we can often improve their incentives by improving the epistemics of other actors (the public, the government, etc) — the hope is that, if these other actors had a better sense of what's going on, then they would take (drastic) actions in their interest, and this would be good. Of course, you might face a regress problem, where you also want to improve the incentives of those other actors, such that they're incentivised to take good actions.
Plan D. If the lab has 10-30 safety-minded insiders, then here's what outsiders might do to make the lab more reasonable:
Do outsiders have a comparative advantage? My guess is that insiders have a comparative advantage for improving epistemics, and the outsiders for improving incentives. Overall, I think a good dynamic is having safety-pilled insiders communicating risks to lab leadership, and outsiders communicating risks to policymakers and the public.
This is because improving incentives is adversarial, and outsiders are better positioned to be adversarial to labs:
However, I think insiders could also be well-positioned for adversarial comms, if they think they have enough leverage (e.g. social connections, prestige and respect, threatening to quit and join competitors, threatening to whistleblow, etc).
Plan E. If the lab has no safety-minded insiders, then outsiders need to behave differently:
Slowing unreasonable labs, differentially
If a reasonable lab is leading ahead of unreasonable labs, then we should try to slow down the unreasonable labs. This gives the reasonable lab a longer lead-time to burn on safety, and makes it harder for the unreasonable lab to overtake.
If a reasonable lab is trailing behind the unreasonable labs, then we should try to slow down the unreasonable labs. This gives the reasonable lab a chance to overtake, and gives them a better bargaining position in a negotiation.
Ideally, this would happen impartially. That is, the outsiders set up obstacles which require acting reasonably to overcome.
Plan D. Suppose we want to slow down a Plan D lab. Here's what outsiders can do:
Plan E. This looks similar to the above, except:
Neutral third-parties
Suppose we head for a world with a high amount of political will, such as Plan A or Plan B. Again, this comes from Ryan's taxonomy.
I'll discuss how outsiders help in Plan B and then in Plan A.
How do outsiders help in Plan B? If there is buy-in from the government, then there will be a critical role for neutral third-parties in shaping and enforcing the government's plan for AI development.
Examples:
I think that these activities are best done by neutral parties, without an affiliation with (or financial stake in) a frontier lab. Even a former affiliation might be disqualifying.[5] This might be a bigger problem in the future because (some or all) labs may be deeply unpopular with the public or the governments.
Of course, the third-parties probably can't be completely neutral, e.g. they might still rely on the labs for compute, model access, and other support. They will also still have social connections that might undermine perceptions of their neutrality. And the outsiders probably have preferences about which lab takes the lead. But I think this can be mitigated by the third-party using transparent, replicable processes, and being auditable by other actors.
How do outsiders help in Plan A? I think the case here is much weaker. I think it's unlikely that the US and China would both trust the outsiders to act neutrally between them. That said, if there is an international agreement, then this probably requires actors which are neutral among the US labs to coordinate with China. And the outsiders seem well-positioned to form those parties.
And the outsiders are much less important in Plan C-E because the coordination and verification would occur between the labs directly (e.g. labs agreeing to audit each other’s models, or share safety research, or mutually slow down).
Achieving government buy-in
I discussed above how outsiders might be useful in Plan B (and somewhat in Plan A) as neutral third-parties. What about for achieving a Plan B in the first place? That is, how can outsiders gain enough government buy-in for serious government involvement to make a big difference to the strategic picture?
Activities that seem good here:
Better futures stuff
Some outsiders are doing things other than directly avoiding AI takeover, which are still important for achieving a near-best future. Examples:
Minor considerations
Here are other miscellaneous considerations I’ve seen for how outsiders are helpful.
Here are some questions on my mind, which I might discuss in future articles: