Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The wider AI research community is an almost-optimal engine of apocalypse. The primary metric of a paper's success is how much it improves capabilities along concrete metrics, publish-or-perish dynamics supercharge that, the safety side of things is neglected to the tune of a 1:49 ratio of safety to other research, and most results are made public so as to give everyone else in the world a fair shot at ending it too.

It doesn't have to be this way. The overwhelming majority of the people involved do not actually want to end the world. There must exist an equilibrium in which their intentions match their actions.

Even fractionally shifting the status quo towards that equilibrium would have massive pay-offs, as far as timelines are concerned. Fully overturning it may well constitute a sufficient condition for humanity's survival. Yet I've seen precious little work done in this area, compared to the technical questions of AI alignment. It seems to be picking up in recent months, though — and I'm happy to contribute.

This post is an attempt at a comprehensive high-level overview of the tactical and strategic options available to us.


1. Rationale

Why is it important? Why is it crucial?

First. We, uh, need to make sure that if we figure alignment out people actually implement it. Like, imagine that tomorrow someone comes up with a clever hack that robustly solves the alignment problem... but it increases the compute necessary to train any given ML model by 10%, or it's a bit tricky to implement, or something. Does the wider AI community universally adopt that solution? Or do they ignore it? Or do the industry leaders, after we extensively campaign, pinky-swear to use it the moment they start training models they feel might actually pose a threat, then predictably and fatally misjudge that?

In other words: When the time comes, we'll need to convince people that safety is important enough to fuss around a bit for its sake. But if we can't convince them to do that now, what would change then?

I suppose having a concrete solution, instead of vague prophecies of doom, would give us more credibility. But... would it, really? And what if we won't have a concrete solution even then, just a bunch of weird heuristics that may nonetheless measurably improve our odds?

The latter seems reasonably likely, too. As this excellent post points out, most of the contemporary deep-learning progress comes from "messy bottom-up atheoretical empirical tinkering". If AGI comes from DL, it's plausible that, even if we arrive at the solution to alignment from mathematical foundations, the actual implementation will take the form of messy hacks. Ones that will probably need to be fine-tuned for any given model architecture. And given the no-fire-alarm principle[1], to be safe, we'll need to ensure that any sufficiently big model is only run with these hacks built-in.

If any given AI researcher is still not taking alignment seriously by then, how will we make them bother with all of that every time they run an experiment? How will we ensure they don't half-ass it even once?

Second. If that figure I quoted, the 49:1 ratio, is even remotely correct, there's plenty of room to upscale our research efforts. Imagine if every researcher started spending 2% more of their time thinking about alignment. That'd double the researcher-hours spent on the problem!
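As a rough check on that arithmetic, here is a minimal sketch. It assumes the 1:49 ratio holds exactly and that "2% more of their time" means each non-safety researcher redirects 2% of their hours to alignment; the function and parameter names are illustrative only.

```python
def alignment_hours_multiplier(safety_share: float, time_shifted: float) -> float:
    """Factor by which alignment researcher-hours grow if every non-safety
    researcher redirects `time_shifted` of their hours to alignment.

    safety_share: fraction of all research hours currently spent on safety (assumed).
    time_shifted: fraction of their time the other researchers redirect (assumed).
    """
    other_share = 1.0 - safety_share
    new_safety_share = safety_share + other_share * time_shifted
    return new_safety_share / safety_share


# A 1:49 ratio means safety is 1/50 = 2% of the total.
multiplier = alignment_hours_multiplier(safety_share=1 / 50, time_shifted=0.02)
print(f"{multiplier:.2f}x")  # prints "1.98x"
```

Under these assumptions the hours come out at just under double, which matches the claim as a round figure.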

Which doesn't directly translate into 2x progress, I'll grant. Given the field's pre-paradigmatic status, the returns to scale might be relatively small... but by no means negligible. Even if we don't necessarily have research directions clearly outlined, having many more people stumbling around in the dark still increases the chances of bumping into something useful.

Another argument I've seen is that upscaling may increase the capabilities gain too much. I don't find this convincing:

  • I'm doubtful an hour spent on alignment research can reasonably be expected to speed up capabilities more than an hour spent directly trying to increase capabilities. I don't rule that out, though.
  • That isn't an argument against upscaling, it's an argument against incompetent upscaling. If it's true, that means we'll need to introduce strict information-security measures. It'd be a (significant) complication, but not a "no".

(The progress on mechanistic interpretability, in particular, can absolutely be usefully upscaled. One of the main bottlenecks there is understanding individual circuits, which is a (relatively) low-skill task that's 1) mainly bottlenecked by researcher-hours, not top-researcher-hours, 2) can be easily parallelized. It's probably not the only research direction like this, too — we just need to look for them.)

Third, slowing progress down. This one is straightforward enough, I suppose.

  • Every researcher-hour or dollar we can convince to spend on alignment research is an hour or dollar not spent on capabilities research.
  • Every day we buy is a day we might solve alignment.
  • The effects could be fairly outsized. The top researchers are orders of magnitude more productive than the median ones, most of the progress is funded by a few multi-million corporations, and most of the compute in the world is controlled by a few private and state actors. We don't have to convince half the world to make the timelines twice as long; just a few people.
  • Even if we can't buy enough time to solve alignment, every day we buy is a day the world doesn't end. If life is worth living, that has to mean something.

2. Existing Work

  • Pragmatic AI Safety seems to be aimed at the same thing I'm arguing for, and their work had a significant impact on this post.
    • I disagree with some of their mission parameters, however. See 3.4.
  • Chris Olah suggests some interesting interventions here.
    • A hypothesis that achieving a fractional mechanistic understanding of neural networks would create a snowball effect. The status quo is that it's okay to have models be black boxes, but opening them up a bit would put into stark relief what we don't know about them, and that may make researchers at large more worried/interested in understanding them.
    • A strategy where we recruit neuroscientists to work on mechanistic interpretability. Chris suggests it won't require much re-training.
    • The Distill project, which attempts (attempted) to soften the publish-or-perish dynamic and make interpretability projects appealing to entrants in the field.
  • Logan Riggs suggests paying the top researchers to work on alignment, with the idea that either they'll solve it (unlikely) or appreciate the difficulty of the problem and start taking it more seriously.
  • Here Eliezer makes some off-hand suggestions; e. g., that turning OpenAI into ClosedAI and generally making research more closed-doors would be useful (if insufficient).
  • lc's call for more serious activism, with which I fully agree.
  • Not Relevant's post. Though it's retracted, I appreciate it and the discussion it generated.
  • AI Safety Arguments Competition.

I'm sure I've missed a lot of things, but this seems like a good start.

3. Types of Interventions

I would broadly categorize them into the following:

  1. Direct appeals to "insiders": researchers and the leadership of AI labs. Difficult, but worthwhile. Most of them already know about AI risk and have dismissed it for one reason or another, but convincing the industry leaders would be highly valuable.
  2. Indirect appeals to insiders. If we can't convince them that their work is going to end the world, perhaps we can attract them away from it by other means, such as better career opportunities, more interesting research directions, or higher-ROI projects?
  3. Appeals to "outsiders": general public, governments. Very tricky. Achieving any effect at all would be difficult, but it also needs to be done carefully, lest we make the situation worse.
  4. Finding a way to progress AI Safety that fits the current tastes of AI researchers. E. g., practically-useful scalable mechanistic interpretability techniques.
  5. Shifting academic incentives or changing research tastes. E. g., Chris Olah's suggestion that achieving partial interpretability would make people more interested in understanding ML models.

If you feel that any of this is very ill-advised or icky, GOTO 6.

3.0. Effective Strategies for Changing Public Opinion

The titular paper is very relevant here. I'll summarize a few points.

  • The main two forms of intervention are persuasion and framing.
  • Persuasion is, to wit, an attempt to change someone's set of beliefs, either by introducing new ones or by changing existing ones.
  • Framing is a more subtle form: an attempt to change the relative weights of someone's beliefs, by emphasizing different aspects of the situation, recontextualizing it.
  • There's a dichotomy between the two. Persuasion is found to be very ineffective if used on someone with high domain knowledge. Framing-style arguments, on the other hand, are more effective the more the recipient knows about the topic.
  • Thus, persuasion is better used on non-specialists, and it's most advantageous the first time it's used. If someone tries it and fails, they raise the recipient's domain knowledge, and the second persuasion attempt would be correspondingly hampered. Cached thoughts are also in effect.
  • Framing, conversely, is better for specialists.

It should really go without saying, but epistemic hygiene stays important here. We're not "aiming to explain, not persuade" anymore, we're very much aiming to persuade, but that in itself is not unethical. It's no excuse to slip into an arguments-are-soldiers employ-all-the-fallacies lie-your-head-off mindset. (If you need a "pragmatic" reason not to do that, GOTO 5.)

3.1. Straightforward Appeals to Insiders

As per above, we'd be fighting an uphill battle here. Researchers and managers are knowledgeable on the subject, have undoubtedly heard about AI risk already, and weren't convinced. Arguments that recontextualize AI risk, AI, or existential risks in general, are likely to be more effective than attempts to tell them things they already know. They're more likely to misprioritize safety, rather than be totally ignorant of the need for it.

An in-depth overview of the best ways to craft an argument is beyond the scope of this post (though this might be a good place to look). Two important meta-principles to keep in mind:

  • Know your audience. The shorter the distance between your model of your audience and their actual beliefs, the more effective your arguments will be.
    • Why do AI specialists who know about AI risk remain unconvinced? Why is the specific person you're talking to not convinced? What language or terminology do the people in your audience respect, and how can you translate your arguments into it?[2] What language or terminology do they dislike, such that using it would downgrade your argument?
    • For example, talking about impending AGI, or about superintelligences taking over the world, is very much a faux pas. It's better to taboo these terms, and replace them with specialized context-appropriate definitions. Most of our arguments work perfectly well like that.
    • (If you want to get acquainted with the viewpoint, here you will find 11 interviews with AI researchers about AI Safety.)
  • Don't treat your interlocutors as enemies, or ideological opponents. This style of engagement is notoriously ineffective, and may end up directly counter-productive: burn bridges, severely compromise further persuasion attempts, or even damage our ability to collaborate at all. It's a radioactive mindset.

There are two tacks to take here: macro-level and micro-level.

Macro. Broad appeals to the entire industry, with the aim of changing the agreed-upon social reality, de-stigmatizing AI Safety, and so on. Concrete projects may look like this.

Micro. Targeted efforts to convince industry leaders. 50% of AI progress is made by far fewer than 50% of the researchers: orders of magnitude fewer. Similarly, getting the leadership of DeepMind and OpenAI fully on our side would have an outsized impact. In theory, a project here may go all the way down to "find effective arguments to convince this specific very important person" levels of fidelity.

I'm more optimistic about the second tack, and generally about activism that has precise focused short-term objectives whose success or failure can be clearly evaluated, and which we can quickly iterate on.

One of the flaws of the "micro" approach is that our victories may be washed away by a paradigm shift. Most of the top GOFAI researchers didn't keep their positions into the ML era, and the top ML researchers may not survive into the next one. I expect this isn't much of a problem, though. If we manage to convince the leading researchers, their views should quickly trickle down to the rest of the field, and the field's structure is likely to survive an upheaval.

3.2. Sideways Appeals to Insiders

There's one dimension along which we can broaden our standards for persuasion.

When trying to influence people — either individually or en masse — we usually argue that addressing existential risks is necessary because, duh, the looming end of humanity. The importance of that work should be self-evident to any moral person. They'd agree with us if we can only make them recognize the existential threat for the existential threat it is. No, it isn't just sci-fi! Yes, working out these weird math problems really can save the world!

But it's not the only reason someone might decide to work on AI Safety. People's career choices are motivated by all kinds of things:

  • They want to work on cool stuff.
  • They want a flexible work agenda.
  • They want to be paid well.
  • They want to advance their career.
  • They want to work in a culture they like.
  • They want to be put into contact with influential people.
  • They want to be respected, or do something that's viewed as respectable.

And so on. This approach is relatively neglected, I suspect, because it's steeped in ulterior motives. There's a very specific reason we'd be making these arguments, and it's not because we want our interlocutors to have fun/get rich/etc.

But there's no reason we can't be open about these motives, which would take all under-handedness out of it. And there are many legitimate reasons to prefer AI Safety over "mainstream" capabilities research:

  • Our field is less saturated. There are more low-hanging fruits, more chances to make a major contribution.
  • We're not lacking in funding: they may well get higher pay here.
  • Significant progress can be made without access to vast amounts of compute.
  • Some of the AI Safety subfields are very unlike mainstream ML (more focus on math, etc.), so some people might find them a better fit for their skills/tastes.
  • We do seem to have a pretty good culture.

Overall, I don't expect this approach to work on the top-tier researchers, for obvious reasons. But it might work to attract the people entering the field, potentially en masse. It might also work as a good supplement to straightforward appeals: when we're trying to convince someone not to work on X, it's good to have a ready offer for what to do instead. Even better if that offer is more lucrative than their current job.

Another thing to keep in mind: trivial inconveniences. Making entering or transitioning to AI Safety 10% easier might have disproportionate effects, like doubling the number of entrants.

Concrete projects in this area may involve creating organizations working on reducing AI risk that make competitive job offers, providing funding or career assistance to ML specialists, explaining how to start, advertising personally appealing features of working on AI risk, encouraging people to have fun, and lowering the barrier to entry by distilling research. On that note, Chris Olah's Distill project is also a good example of a "macro-level" intervention of this type, though it's on hiatus/potentially failed.

3.3. Appeals to Outsiders

Any effective work along this dimension requires answering an exciting question: how do you put out a flame using a flamethrower?

Perhaps that's a bit harsh. Perhaps even counter-productively harsh, given my previous calls for treating audiences with respect. But let's not kid ourselves: we've seen how the world handled COVID-19.

An initiative that pushes for X might convince people or governments to do anti-X instead. If we convince them to do X after all, they might do extremely ineffective things that accomplish nothing, or even somehow do things that actually make anti-X happen. And conversely, activism completely unrelated to X might make it happen!

Good news, though: COVID-19 has shown us just how badly things are broken. Keeping the Simulacra Levels and the autopsies of the failures in mind, it might be possible to find interventions that have the effects we want.

That's explicitly what we'd be doing, though: deciding what effect we want to cause, then searching for an action that would cause it, once propagated through the broken pathways of our civilization. For that reason, I'm not making the distinction between "straightforward" and "sideways" appeals here: surface-level efforts to achieve something aren't strongly correlated with that thing happening, even given their surface-level success. All appeals are sideways appeals.

Having a good model of realpolitik is a necessity here.

The general principles of "know your audience" and "maintain epistemic hygiene" still apply, though. The inference gap is much larger, but that has its advantages: direct persuasion would be more effective, on average.

Useful consequences in this area may include:

  • Passing laws that introduce new regulations over AI development.
  • Passing laws that require major AI labs to spend a fraction of their budget on AI Safety.
  • Passing laws that increase oversight of the use of large quantities of compute.
  • Pushing back on research transparency in AI. Creating the sociopolitical conditions under which long periods of closed-doors research are possible.
  • Convincing major cloud computing providers to only provide vast amounts of compute if the request has been approved by an AI Safety committee.
  • Restricting the supply of GPUs and other specialized computing modules.
  • Raising the societal levels of concern over massive AI models.
  • Convincing public figures (celebrities, billionaires) to apply pressure or raise concerns with regard to AI Safety.
  • Putting (diffuse) political or social pressure on major AI labs to prioritize safety.
  • Causing major economic actors to divest from the entire AI industry, or from specific labs.

Again, causing these consequences is not as simple as pushing for them on the object-level. Lobbying for laws that regulate AI is likely to lead to poorly-targeted bans with lots of loopholes that just burn our political capital; mandating AI Safety oversight is useless if the safety committees will consist of yes-men; trying to rile up the public might well see their ire redirected our way.

Some inspirations here might be what happened to nuclear energy or human cloning, or how the woke movement managed to infiltrate the corporate/academic culture. Anti-corporate movements might be good allies of convenience here in general. An anti-inspiration, what not to do, is what happened to cryonics.

Warning: This entire space of interventions has elevated levels of background radiation. Most interventions of this type are deceptively ineffective, and if you're aiming for impact first and foremost, it'd be very easy to slip into bad epistemic habits or unethical practices.

Moreover, it's necessarily antagonistic with regards to AI researchers and major AI labs. Any success here would worsen the public or legal landscape for them, and they'd be less likely to listen to straightforward appeals from us.

I've not completely despaired of this approach; the payoffs are significant, and I'm sure there are some interventions that are at once effective and reliable and ethical and only scorch our bridges. Furthermore, if straightforward appeals and other cooperative ideas won't work, pivoting to this is an obvious Plan B.

As far as directly useful consequences are concerned, I would emphasize well-targeted interventions with easily evaluated victory/failure conditions, even more so than in 3.1.

However, the best plays here might be tactically useful interventions, aimed at what Pragmatic AI Safety calls diffuse factors. Such interventions don't directly decrease AI risk, but they create favourable conditions for other interventions. E. g., causing the public to be vaguely concerned about AI won't directly help, but a world in which the public is concerned is a world in which we're more well-positioned to influence the AI industry in other ways. (I'll come back to that in Part 4.)

3.4. Joining the Winning Side

In some sense, the easiest way to accomplish our goal is not to try to change the AI industry's incentive structures, but to ride them. The industry as a whole is agnostic with regards to alignment. It cares about:

  • Easily measurable success metrics.
  • Clear, factorizable research directions.
  • Profit, funding.

Current alignment research directions are none of these things. Progress is difficult to measure, the problem doesn't cleanly break down into sub-problems you could work on in isolation, and the results don't translate to e. g. more powerful and marketable ML models. Reducing the alignment problem to this sort of "digestible" form is non-trivial — that's the whole problem with our lack of an established paradigm.

But. I think there are certain potential avenues of alignment research that are relatively amenable to being transformed into a form the AI industry would find digestible, while also being pretty effective and practical ways to make progress on alignment.

And herein lies my disagreement with Pragmatic AI Safety. They suggest that alignment research should aim for zero capabilities advancement. I think the metric to keep in mind here, instead, is expected return on capabilities for an hour of research.

  • As far as influencing a current AI researcher is concerned, if we can make them pursue a research direction that somewhat progresses alignment while progressing capabilities no faster than what they'd be doing otherwise, our impact is positive.
    • Even if the research direction doesn't progress alignment very much on a per-hour basis, recall how relatively tiny AI Safety is. A significant fraction of AI researchers starting to progress alignment at 5% the rate of a specialized alignment researcher might double or triple raw progress industry-wide.
  • It would, in fact, be ideal to find some research direction that has a fractionally higher return on capabilities than whatever the ML field is doing right now, but which also significantly progresses alignment. It wouldn't, in practice, shift 100% of the AI industry there, but it would go a great way towards that.
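The "double or triple" figure above can be sanity-checked with a small sketch. It assumes the 1:49 safety-to-other ratio quoted earlier and assumes that progress simply adds up linearly across researchers, which is a simplification; the function and its parameters are hypothetical names chosen for illustration.

```python
def relative_alignment_progress(other_per_safety: float,
                                fraction_participating: float,
                                relative_rate: float) -> float:
    """Industry-wide alignment progress relative to today's baseline of 1.0
    (the baseline being specialized alignment researchers only).

    other_per_safety: non-safety researchers per safety researcher (49 assumed).
    fraction_participating: share of non-safety researchers who shift direction.
    relative_rate: their alignment progress rate relative to a specialist's.
    """
    extra = other_per_safety * fraction_participating * relative_rate
    return 1.0 + extra


for fraction in (0.4, 0.8, 1.0):
    total = relative_alignment_progress(49, fraction, 0.05)
    print(f"{fraction:.0%} of other researchers -> {total:.2f}x current progress")
```

Under these assumptions, roughly 40% participation is enough to double raw progress and roughly 80% to triple it, so "a significant fraction" is doing real work in that sentence.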

It pays to play along with the current industry incentives, most notably the profit motive. In addition, the idea of aiming for minimal capabilities externalities seems deeply unnatural to me for other reasons:

  • In terms of "in-house" research, our field is, again, tiny. Any progress we make on capabilities will be a rounding error next to the rest of the AI industry's. We should aim for maximum alignment progress, period, it's difficult enough without any additional constraints.
  • Moreover, I don't think alignment and capabilities are orthogonal. I think they're very much positively correlated. Alignment could be viewed as an interface, or as being able to make a program Do What You Mean. An ML model that goes crazy off-distribution and kills everyone isn't just poorly aligned, it's also not very good at demonstrating good performance along the metrics the researchers actually care about (relevant post). The unidentifiability problem isn't just a mesa-optimizer problem, it's also an acknowledged generic DL problem.

All in all, I'm optimistic about the existence of research projects that are at once 1) quickly reachable, 2) would find traction with the current status quo of the AI industry, 3) efficiently progress alignment. Crucially, finding such a research direction would nearly guarantee that the alignment solution is implemented in any future major model.

The most obvious candidate is mechanistic interpretability, of course (and it's no coincidence that it seems to be the most popular AI Safety direction outside our circle), but I have a few other ideas that I hope to post soon.

3.5. Influencing the Research Culture

All of the other approaches attempt to influence the AI industry through intermediaries: through the research projects it pursues, through the people it's implemented on, through the wider social environment it's embedded in. But perhaps there is room for a more direct intervention?

The industry is a social construct. The qualities that make a project a good one, the tastes the researchers have, the incentives they operate under — all of this is, to some extent, arbitrary. It has a ground-truth component, but the current configuration is not uniquely determined by the ground truth of the research subject. Rather, it's defined by weights that this social construct currently assigns to different features of the ground truth.

The current AI industry prefers tinkering to theory, and capabilities to safety. How can we shift this?

There have been two proposals that I've already mentioned:

  1. Chris Olah's: Improve interpretability techniques enough that understanding some fraction of your model becomes normalized. Conceit:
    • Right now, not knowing anything about how your model works internally is completely normal. If even partly understanding one’s model became normal, however, then the amount we don’t know might become glaring and concerning. Chris provides the following analogy to illustrate this: if the only way you’ve seen a bridge be built before is through unprincipled piling of wood, you might not realize what there is to worry about in building bigger bridges. On the other hand, once you’ve seen an example of carefully analyzing the structural properties of bridges, the absence of such an analysis would stand out.
  2. Logan Riggs': Pay N top ML researchers a fat chunk of money to work on alignment for 3 months, with the promise of a truly vast amount of money if they solve it.
    • Conceit: They won't manage to do it, but forcing them to engage with the problem would make them appreciate its difficulty, and shift research tastes field-wide towards it.
    • Issue: Defining what "working on the alignment problem" means, and ensuring they actually try to solve it, instead of "trying to solve" it but actually working on some trivialization of it, then disagreeing that their solution doesn't work.

I think something like 2) is worth implementing. I'm unclear on how to evaluate 1); I'm guessing mechanistic interpretability just hasn't progressed that far yet. If we generalize from those two, though...

We want to synthesize a construct C with the following properties:

  1. C is true.
  2. C could be easily made part of the social reality of the AI industry or academic culture.
  3. Common knowledge of C changes the AI industry's incentive gradients towards less focus on capabilities improvement and/or towards more work on AI Safety.

What is a fact such that, if the researchers knew of it and knew other researchers or the general public or the grant-makers also knew of it, would make them pursue alignment research? What is a fact that would change the sociopolitical landscape such that the incentives shift, even fractionally, towards the things we want?

"Sufficiently powerful non-aligned AI is going to end the world and aligning AI is difficult" as C evidently doesn't satisfy the second criterion. Logan's idea is to force it into the industry's social reality by a monetary injection. Chris Olah's idea has a more niche definition of AI risk as C, but either 2) is still difficult, or 3) didn't work out.

Both approaches take C for granted, then attempt to find paths to satisfying 2). I think there's some promise in directly mining the concept-space for a C that'd have the desired properties "out of the box". It's essentially the "sideways appeals" approach writ large.

I suspect studying the history of science for cases where research tastes/standards changed would be useful here. An example of such a shift might be the replication crisis in psychology (in the broadest of terms).

I suspect this approach hasn't been exhausted, given that it's pretty non-intuitive.

I suspect this idea is overly clever in a very stupid way and will not actually work.

Still, if it does work, there are probably some low-hanging fruits there. And just intuitively... If you go looking for a reason, you generally find one, don't you?

4. What You Can Do

The logistics graph that leads to a superintelligent AI's deployment has many bottlenecks, and controlling any one of them would be sufficient. Taking over the researcher supply, or the money supply, or the compute supply, or the research project supply, or the reputation supply, or the supply of any other crucial resource I'm not thinking of, would ensure excellent conditions for a safe advanced AI to emerge.

But the path to this doesn't look like a concentrated push along the corresponding dimension. As Pragmatic AI Safety points out, diversification is key.

There are interdependencies everywhere: success at one thing affects the probabilities of success of all other projects. Finding an appealing research direction would make it easier to attract people our way. Putting social pressure on major AI labs would make safety-adjacent research directions more appealing. Shifting research tastes in a subfield would make it easier to change people's minds. And so on.

Moreover, it's not obvious what bottleneck would be the easiest to gain control of, without the benefit of hindsight. Future events and novel discoveries may shift any part of the landscape in unpredictable ways, open or close doors for us.

Improving AI Safety's future position means pursuing a strategy that is robust to such random environmental fluctuations. It means maximizing our far-away action space. We need to have a diversified portfolio of plans; we need to be improving our position all across the board, always looking for what new opportunities have arisen.

In theory, it would be great to have central coordination. Some organization or resource which tracks the feasibility of various interventions across the entire gameboard, and pursues/recommends those that move the gameboard into the most advantageous states while spending the least resources, and also you should put me in charge.

In practice, this sort of coordination is both difficult and fragile, with a single point of failure. We're not a single organization, either, but a diverse conglomerate of organizations, movements, groups and individuals.

But we can approximate central coordination.

It's often pointed out that impact in the modern world has a tail-heavy distribution. In some areas, it's effective to have many separate groups putting their full strength behind diverse high-variance projects. Many of them will fail, but some will succeed massively.

The project of advancing AI Safety is, to a large extent, one such area.[3] My general advice would be as follows:

  • Be opportunistic. Keep an up-to-date mental picture of the world, look for local opportunities to advance our cause, and coordinate with others where possible.
    • A good project also creates opportunities upon success; ideally, the whole thing should work as a positive feedback loop.
  • Use your Pareto frontier. We need diversity; we don't want our projects to converge because we're using the same heuristics to pick them. Look for what you or your group specifically is uniquely well-placed or well-suited to do, and do that.
  • Be ambitious. Again, tail-heavy distribution: going for low-success-rate high-reward plays is a good policy to adopt (as long as the only thing at stake is your plan's success!).
  • Cheat. Not in the sense of being unethical, but in the sense of reframing and revising the problem so that you can achieve 80% of the impact with 20% of the effort. It's not always possible, but it frequently is, in sociopolitical interventions especially. Tug sideways.
  • Keep scale in mind. Changing the value of a crucial industry-wide variable by 1% is enormous impact.
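
To make the tail-heavy intuition behind this advice concrete, here is a minimal sketch, not a model of the real world. It assumes, purely for illustration, that each project's impact is an independent draw from a Pareto distribution; the function names, the shape parameter `alpha=1.1`, and the project count are all hypothetical choices, not figures from this post.

```python
import random


def simulate_portfolio(n_projects, alpha, seed):
    """Draw one hypothetical impact value per project from a Pareto distribution.

    Assumption: impacts are independent and Pareto-distributed. A smaller
    alpha means a heavier tail. Results are sorted from largest to smallest.
    """
    rng = random.Random(seed)
    draws = [rng.paretovariate(alpha) for _ in range(n_projects)]
    return sorted(draws, reverse=True)


def top_share(impacts, k):
    """Fraction of total impact contributed by the k largest projects."""
    return sum(impacts[:k]) / sum(impacts)


# Hypothetical portfolio: 1000 independent projects with a heavy tail.
impacts = simulate_portfolio(n_projects=1000, alpha=1.1, seed=0)
share = top_share(impacts, k=10)
print(f"top 10 of 1000 projects account for {share:.0%} of total impact")
```

Under a heavy tail, the top handful of projects tends to hold a share of the total far above their 1% headcount, which is the sense in which many diverse high-variance bets can beat a few safe ones. The sketch says nothing about downside risk, which is where the caveat in the next paragraph comes in.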

There's a caveat here, though: without central coordination, how can we ensure that none of these disparate projects trip each other up? As I'd mentioned, successfully appealing to governments might mess up our relations with AI researchers, and failed persuasion attempts (macro- and micro- both) often make subsequent ones much harder.

Some amount of that is probably inevitable. Part of it can be mitigated by trying out minimal/small-scale versions of any projects that might result in net-negative impact on failure. But a much larger chunk of it could be mitigated by ensuring that we—

5. Avoid Thermonuclear Ideas

You likely know what I'm talking about. The class of ideas that includes lying and manipulation as its most tame members, and expands to cover some much worse extremes.

I know some of these ideas may seem very clever and Appropriately Drastic, and the stakes — literally astronomical — could not be higher. We're accelerating directly into a wall, and our attempts to swerve away seem ineffectual. It may feel emotionally resonant to resolve to Stop Being Nice and Pull Out All the Stops and solve the problem in some gravely decisive fashion, By Any Means Necessary.

But it will not work in the real world, outside fantasies. It will not solve the problem in the long term, and in the meantime it will crash and burn, and hurt people, and ruin our PR, and tank the chances of other, more productive and realistic approaches. Even if you think your idea will definitely succeed, you're failing to think at scale. What would you expect to work better: a policy under which some of us pursue plans that blow up so hard they set us collectively back a few years, or a policy under which our plans only ever compound on each other's successes?

Following the first policy is a defection, not just against the rest of society, but against all our other risk-mitigation initiatives. We're better than this.

As a rule-of-thumb, you can use something like Shannon's maxim. If whatever clever plan you're considering and the entire causal chain that led to it became common knowledge, would it fail and destroy our credibility and our other plans? If yes, this is a radioactive plan, get it away.

Things that seem like ruthless pragmatism are frequently not actually ruthlessly pragmatic. They're just excuses to indulge your base instincts.

Be cool, in general. Find ways to be cool about this mess. We have resources for that and everything.

6. The Thin Line

I concur with lc's post and the people in that post's comments: we have a slight taboo against the sort of full-scale activism I'm arguing for. It's exemplified by this sort of sentiment. I suspect it's a combination of two things:

  • An instinctive desire to stay far, far away from the radioactive plans I've described in the previous section.
  • A failure to shift between epistemic and instrumental rationality; between enforcing community norms internally and having an external impact.

It makes sense that it exists. One of the foundations of this movement is "raising the sanity waterline" — and approaching interactions with people outside the movement with less rigor is not how you set an example. It's also easier to enforce the same norms upon yourself and each other in all situations, instead of switching between different sets depending on context.

I'm tempted to say that we've overcorrected here; that we can or must relax our standards somewhat, in the light of shortening timelines and in the face of our slow progress.

But I'm not sure. Relaxing the standards is absolutely a slippery slope. This decision might be "meta-radioactive", in the sense that it will see us accelerating straight into the epicenter of a nuclear explosion.

I don't know how to strike the right balance here. It definitely seems like we can opt to be more effective without inching towards self-defeating Stupid Evil, but maybe the mere act of acknowledging that possibility would shift our social reality in undesirable ways? Maybe John's position is right, and we should call out our epistemically suspect behavior even as we agree that it's the right thing to do.

7. Closing Thoughts

The recent months have seen increasing amounts of alarm and doom-saying in our circles. AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace. There are optimistic voices, but the general disposition seems quite grim.

Well. If alignment is really so hard, maybe we should quit trying to solve it?

In hindsight, I'm a bit baffled that field-building wasn't our main focus this entire time. Getting the AI industry to take AI risk seriously is a necessary and sufficient condition for survival. Solving alignment by ourselves is... neither. If the technical problems are truly insurmountable in the time we have left — and I don't yet know that they are, but I can certainly imagine it — we should just shift our focus to social-based solutions.

The goal, I should note, is not outreach. Convincing a few, or many, AI researchers to switch to alignment won't solve the problem where we have a multi-billion dollar industry stockpiling uranium in the hopes of spontaneously assembling a nuclear reactor. The aim should be to shift that status quo. Changing people's minds is a fine instrumental goal, but the terminal one is to influence the robust agent-agnostic process itself.

I'd like to suggest that there might be a snowball effect involved — that a 10% progress at this task would make the subsequent 90% easier, and so on. There might, indeed, be. I'm not that optimistic, though. I expect it'll be an uphill battle all the while, because the sort of carefulness we'd like to cultivate has the tendency to rot away, as organizations become corrupted and people value-drift.

It's possible that this is also impossible. That we can't change the AI industry in time, any more than we can independently solve alignment in time. But it seems less impossible to me.

And if we keep looking for approaches that are less and less impossible, perhaps we'll find one that isn't impossible at all.

  1. ^

    Which may or may not have been recently confirmed by this.

  2. ^

    Very important. See point 7 here.

  3. ^

    But not AI Safety itself, of course, only the project of spreading it. AI is very much Scott's Distribution 1, and the fact that our civilization is treating it as a Distribution 2 is the entire bloody problem.

evhub

As someone who has really not been a fan of a lot of the recent conversations on LessWrong that you mentioned, I thought this was substantially better in an actually productive way with some really good analysis.

Also, if you or anyone else has a good concrete idea along these lines, feel free to reach out to me and I can help you get support, funding, etc. if I think the idea is a good one.

(Moderation note: added to the Alignment Forum from LessWrong.)

I'd be curious to hear what your thoughts are on the other conversations, or at least specifically which conversations you're not a fan of?

My guess is that Evan dislikes the apocalyptic/panicky conversations that people have recently been having on LessWrong.

That's my guess also, but I'm more asking just in case that's not the case, and he disagrees with (for example) the Pragmatic AI Safety sequence, in which case I'd like to know why.

I was referring to stuff like this, this, and this.

I haven't finished it yet, but I've so far very much enjoyed the Pragmatic AI Safety sequence, though I certainly have disagreements with it.

IMO prosaic alignment techniques (say, around improving supervision quality through RRM & debate type methods) are highly underrated by the ML research community, even if you ignore x-risk and just optimize for near-term usefulness and intellectual interestingness. I think this is due to a combination of (1) they haven't been marketed well to the ML community, (2) lack of benchmarks and datasets, (3) need to use human subjects in experiments, (4) it takes a decent amount of compute, which was out of reach, perhaps until recently.

Getting the AI industry to take AI risk seriously is a necessary and sufficient condition for survival.

I'm going to play devil's advocate against this claim. 

First: the AI industry taking AI risk seriously is not obviously a sufficient condition for survival. In the long run, the hard technical problems would still have to be solved in order to safely use AI. And there would still be a timer before someone built an unsafe AI: FLOPs would presumably still keep getting cheaper, and publicly-available algorithms and insights would keep accumulating (even if somewhat less quickly). Even with the whole AI industry on board, sooner or later some hacker would build an unsafe AI in their basement.

Getting the whole AI industry on board would buy time. It would not, in itself, be a win condition.

Second: getting the AI industry to take AI risk seriously is not obviously a necessary condition. It is necessary that people working on alignment have a capabilities lead. However, as you mention in the post:

Moreover, I don't think alignment and capabilities are orthogonal. I think they're very much positively correlated.

It is true that today's alignment researchers do not have any significant capabilities edge (or at least aren't showing it). But today's alignment researchers are also not even close to solving the alignment problem. I expect that an alignment research group which was able to solve the hard parts of alignment would also be far ahead of the mainstream on capabilities, because the two are so strongly correlated. I very much doubt that one could figure out how to robustly align general intelligence without also figuring out how to build it efficiently.

Strong positive correlation between alignment and capabilities research problems means that non-alignment researchers win the capabilities race mainly in worlds where the alignment researchers aren't able to solve the alignment problem anyway.

Getting the whole AI industry on board would buy time. It would not, in itself, be a win condition.

Mm, I don't think we're disagreeing here, I just played fast and loose with definitions. Statement: "If we get the AI industry to take AI Safety seriously, it's a sufficient condition for survival."

If "we" = "humanity", then yes, there'll still be the work of actually figuring out alignment left to do.

I had "we" = "the extant AI Safety community", in the sense that if the AI industry is moved to that desirable state, we could (in theory) just sit on our hands and expect others to solve alignment "on their own".

I expect that an alignment research group which was able to solve the hard parts of alignment would also be far ahead of the mainstream on capabilities, because the two are so strongly correlated

But isn't that a one-way relationship? Progressing alignment progresses capabilities, but progressing capabilities doesn't necessarily strongly progress alignment (otherwise there'd be no problem to begin with). And I guess I still expect that alignment-orthogonal research would progress capabilities faster. (Or, at least, that it'd be faster up to some point. Past that point alignment research might become necessary for further progress... But that point is not necessarily above the level of capabilities that kills everyone.)

Specifically, do you agree with Eliezer that preventing existential risks requires a "pivotal act" as described here (#6 and #7)?

Eliezer did define "pivotal act" so as to be necessary. It's an act which makes it so that nobody will build an unaligned AI; that's pretty straightforwardly necessary for preventing existential risk, assuming that unaligned AI poses an existential risk in the first place.

However, the danger in introducing concepts via definitions is that there may be "pivotal acts" which satisfy the definition but do not match the prototypical picture of a "pivotal act".

Yeah, I guess the answer is yes by definition. Still wondering what kind of pivotal acts people are thinking about -- whether they're closer to big power-grabs like "burn all the GPUs", or softer governance methods like "publishing papers with alignment techniques" and "encouraging safe development with industry groups and policy standards". And whether the need for a pivotal act is the main reason why alignment researchers need to be on the cutting edge of capabilities.

I can't see how "publishing papers with alignment techniques" or "encouraging safe development with industry groups and policy standards" could be pivotal acts. To prevent anyone from building unaligned AI, building an unaligned AI in your garage needs to be prevented. That requires preventing people who don't read the alignment papers or policy standards and aren't members of the industry groups from building unaligned AI.

That, in turn, appears to me to require at least one of 1) limiting access to computation resources from your garage, 2) limiting knowledge by garage hackers of techniques to build unaligned AI, 3) somehow convincing all garage hackers not to build unaligned AI even though they could, or 4) surveillance and intervention to prevent anyone from actually building an unaligned AI even though they have the computation resources and knowledge to do it. Surveillance, under option 4, could (theoretically, I'm not saying all of these possibilities are practical) be by humans, by too-weak-to-be-dangerous AI, or by aligned AI.

"Publishing papers with alignment techniques" and "encouraging safe development with industry groups and policy standards" might well be useful actions. It doesn't seem to me that anything like that can ever be pivotal. Building an actual aligned AI, of course, would be a pivotal act.

"Building an actual aligned AI, of course, would be a pivotal act." What would an aligned AI do that would prevent anybody from ever building an unaligned AI?

I mostly agree with what you wrote. Preventing all unaligned AIs forever seems very difficult and cannot be guaranteed by soft influence and governance methods. These would only achieve a lower degree of reliability, perhaps constraining governments and corporations via access to compute and critical algorithms but remaining susceptible to bad actors who find loopholes in the system. I guess what I'm poking at is, does everyone here believe that the only way to prevent AI catastrophe is through power-grab pivotal acts that are way outside the Overton Window, e.g. burning all GPUs? 

"Building an actual aligned AI, of course, would be a pivotal act." What would an aligned AI do that would prevent anybody from ever building an unaligned AI?

My guess is that it would implement universal surveillance and intervene, when necessary, to directly stop people from doing just that. Sorry, I should've been clearer that I was talking about an aligned superintelligent AI. Since unaligned AI killing everyone seems pretty obviously extremely bad according to the vast majority of humans' preferences, preventing that would be a very high priority for any sufficiently powerful aligned AI.

Thanks, that really clarifies things. Frankly I’m not on board with any plan to “save the world” that calls for developing AGI in order to implement universal surveillance or otherwise take over the world. Global totalitarianism dictated by a small group of all-powerful individuals is just so terrible in expectation that I’d want to take my chances on other paths to AI safety.

I’m surprised that these kinds of pivotal acts are not more openly debated as a source of s-risk and x-risk. Publish your plans, open yourselves to critique, and perhaps you’ll revise your goals. If not, you’ll still be in a position to follow your original plan. Better yet, you might convince the eventual decision makers of it.

“It is necessary that people working on alignment have a capabilities lead.” Could you say a little more about this? Seems true but I’d be curious about your line of thought.

The theory of change could be as simple as “once we know how to build aligned AGI, we’ll tell everybody”, or as radical as “once we have an aligned AGI, we can steer the course of human events to prevent future catastrophe”. The more boring argument would be that any good ML research happens on the cutting edge of the field, so alignment needs big budgets and fancy labs just like any other researcher. Would you take a specific stance on which is most important?

There are three books that I massively recommend for anyone who thinks the AI industry is easy to reshape or influence in any direction. The first two are Mearsheimer's Tragedy of Great Power Politics and Nye's Soft Power (2004). The third is basically any book that covers the military significance of AI, in any way whatsoever, such as how AI is mounted on nuclear stealth missiles.

In addition, I highly recommend against trying to formulate (or even think about) AI policy without meeting a ton of people with experience with AI in the policy space. Trying to reinvent the wheel on this is a losing strategy, it's time-inefficient at best, and at worst it can attract unwanted attention from extremely wealthy, powerful, and vicious people. If your proposals are good, and many of them are, it's best to have them evaluated by experienced individuals who you know personally, not shoved in front of the eyes of as many strangers as possible.

This sounds important. Could you say more?

Yes, books are a big investment, so it was rude of me to fail to explain why it is worth people's time to look into getting them.

Mearsheimer's Tragedy of Great Power Politics (Ch. 1 and 2): Explains in detail why governments and militaries keep doing all these horrible things, like gain-of-function research, or creating offensive nuclear stealth missiles that deliberately disguise their radar signatures as computer glitches.

Nye's Soft Power (2004, Ch. 1 and 4): Explains why governments take the media so seriously, and it gives one of the best explanations I've seen for why massive, competent lies are critical for national security. Chapter 4 also gives a fantastic history of propaganda, including describing the nitty-gritty of how propaganda has become prevalent in modern media.

Both of these books are absolutely critical for anyone trying to understand AI policy, and only a small fraction of each book needs to be read in order to get 95% of the necessary information.

I didn't mean to imply any rudeness on your part. Thank you for the recommendation and summary.

Could you say in short what reasons Mearsheimer and Nye give, and how/why you think they bear on AI safety?

I think it would be good to hear some different perspectives on the issue of A(G)I policy, especially less socially desirable/cynical ones. 

Great post!

In hindsight, I'm a bit baffled that field-building wasn't our main focus this entire time.

I have a sneaking suspicion that the implicit thought process here is something like:

I'm a smart computer guy, and this looks like a really important technical problem! Therefore, I should help out by doing what I'm best at, which is writing software and mathematical proofs!

I'm not really one of those management-type community organizers, so I'll leave that part of the problem to someone else.

The problem being, of course, that smart technical people are way more likely to be convinced by x-risk arguments in the first place than management-type people, so if the nerds stay within their comfort zone, very little field-building will ever get done.

I have a pretty confident understanding that that is what happened, not a sneaking suspicion. 

This is an excellent post. It clarified much of my thoughts on the subject; I also hadn't stumbled upon the Pragmatic AI Safety sequence. Thank you very much for writing this.

There are some good thoughts here, I like this enough that I am going to comment on the effective strategies angle. You state that

The wider AI research community is an almost-optimal engine of apocalypse.

and

AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace.

I have to observe that, even though certain people on this forum definitely do believe the above two statements, even on this forum this extreme level of pessimism is a minority opinion. Personally, I have been quite pleased with the pace of progress in alignment research.

This level of disagreement, which is almost inevitable as it involves estimates about the future, has important implications for the problem of convincing people:

As per above, we'd be fighting an uphill battle here. Researchers and managers are knowledgeable on the subject, have undoubtedly heard about AI risk already, and weren't convinced.

I'd say that you would indeed be facing an uphill battle, if you'd want to convince most researchers and managers that the recent late-stage Yudkowsky estimates about the inevitability of an AI apocalypse are correct.

The effective framing you are looking for, even if you believe yourself that Yudkowsky is fully correct, is that more work is needed on reducing long-term AI risks. Researchers and managers in the AI industry might agree with you on that, even if they disagree with you and Yudkowsky about other things.

Whether these researchers and managers will change their whole career just because they agree with you is a different matter. Most will not. This is a separate problem, and should be treated as such. Trying to solve both problems at once by making people deeply afraid about the AI apocalypse is a losing strategy.

Small request: given that it's plausible that a bunch of LW material on this topic will end up quoted out of context, would you mind changing the headline example in section 5 to something less bad-if-quoted-out-of-context?

Yeah, I'd been worrying about that as well. Though if anything, I'm more concerned about the examples I provide in the second paragraph of that section. The titular one is a bit absurd; the other ones are more plausible.

Edited all of that. Better?

Yepp, thanks!

Though if anything, I'm more concerned about the examples I provide in the second paragraph of that section

This is what I initially thought Richard's suggestion was referring to. I was thinking that you'd keep the same structure minus some of the more extreme examples in the second paragraph. It would have been a much better solution; the "do not start a thermonuclear war" line was pretty funny!

It was, it was. Unfortunately, we can't have nice things.

I was thinking replacing it with something like "But No Galaxy-Brain Stuff, Alright?" and "radioactive plans" with "galaxy-brain plans", but that seems a bit too tongue-in-cheek/positive.

most academic research work is done by grad students, and grad students need incremental, legible wins to put on their CV so they can prove they are capable of doing research. this has to happen pretty fast. an ML grad student who hasn't contributed to any top conference papers by their second or third year in grad school might get pulled aside for a talk about their future.

ideally you want a topic where you can go from zero to paper in less than a year, with multiple opportunities for followup work. get a few such projects going and you have a very strong chance of getting at least one through in time to not get managed out of your program -- and of course, usually more will succeed and you'll be doing great.

I don't think there's anything like this in AI safety research. Section 3.4 seems to acknowledge this a little bit. If you want AI safety to become more popular, you'd hope that an incoming PhD student could say "I want to work on AI Safety" and be confident that in a year or two, they'll have a finished research project that they can claim as a success and submit to a top venue. Otherwise, they are taking a pretty huge career risk, and most people won't take it.
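
The "get a few such projects going" reasoning above can be sketched as simple arithmetic. This is a minimal illustration only: it assumes each project succeeds independently with the same probability, which real research projects will not satisfy exactly, and the function name and the numbers used are hypothetical.

```python
def p_at_least_one(p_single, n_projects):
    """Probability that at least one of n independent projects succeeds.

    Assumption: every project has the same success probability p_single
    and outcomes are independent, so P(all fail) = (1 - p_single) ** n.
    """
    return 1.0 - (1.0 - p_single) ** n_projects


# Hypothetical example: three parallel projects, each a coin flip.
print(p_at_least_one(0.5, 3))  # 0.875
```

The same formula shows why slow, high-risk topics are a career hazard: if a student can only fit one project into the available time, the chance of having something to show is just that single project's success rate.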

Important article with a bunch of very good ideas.

Wanted to add a simple one.

Physical proximity and microinteractions build trust. If you want to convince the people working on AI, you need to hang out with them more. Visits, internships, cooperations, conferences. Show them you are sane, show them you can play their games and conform to their standards. Listening. Learning what they care about, what terms piss them off, how they speak, what stuff they need and want, what you need to know and be able to do to be taken seriously by them, showing them respect and appreciation. If you are receptive to learning from them, they are more receptive to learning from you. A lot of concerns on the alignment and control problem have immediate payoffs for issues that they are concerned with now, e.g. making sure their current AIs are useful without giving malicious advice or reproducing racism. And with a problem as abstract and hard to approach as alignment, focussing on such specific problems can also be a path to new approaches and progress.

Thanks for this. There's been an excess of panic and defeatism here lately, and it's not good for our chances at success, or our mental health.

This is actionable, and feels like it could help.

Where's the money? People who want more AI safety research should be willing to pay for it. Other industries seem to understand that you need to pay for the 90% crap to get the 10% good.

AI safety is a coordination problem, whereas other ML research like the kind OpenAI does can, to a point, be captured for profit by the actors that pioneer it. 

This is a wonderful piece and echoes many sentiments I have about the current state of AI safety. Lately, I have also thought more and more about the limitations of the technical focus, given the scope necessary to handle the problems of AGI; i.e., the steam engine was an engineering/tinkering feat long before it was described technically/scientifically, and ML research seems much the same. When this is the case, focusing purely on hard technical solutions seems less important than focusing on AI governance or prosaic alignment. Not doing this, as echoed in other comments, might indeed be a pitfall of specialists, some of which are also warned of here.