Who Aligns the Alignment Researchers?

Ben Smith

There may be an incentives problem for AI researchers and research organizations who face a choice between researching Capabilities, Alignment, or neither. The incentive structure will lead individuals and organizations to gravitate towards Capabilities work rather than Alignment. The incentives problem is a lot clearer at the organizational level than the individual level, but bears considering at both levels, and of course, funding available to organizations has downstream implications for the jobs available for researchers employed to work on Alignment or Capabilities.

In this post, I’ll describe a couple of key moments in the history of AI organizations. I’ll then survey incentives researchers might have for doing either Alignment work or Capabilities work. We’ll see that it may be that, even considering normal levels of altruism, the average person might prefer to do Capabilities rather than Alignment work. There is a relevant collective action dynamic. I’ll then survey the organizational level and global level. After that, I’ll finish by looking very briefly at why investment in Alignment might be worthwhile.

A note on the dichotomous framing of this essay: I understand that the line between Capabilities and Alignment work is blurry, or worse, some Capabilities work plausibly advances Alignment, and some Alignment work advances Capabilities, at least in the short term. However, in order to model the lay of the land, it’s helpful as a simplifying assumption to examine Capabilities and Alignment as distinct fields of research and try to understand the motivations for researchers in each.

History

As a historical matter, DeepMind and OpenAI were both founded with explicit missions to create safe, Aligned AI for the benefit of all humanity. There are different views on the extent to which each of these organizations remains aligned to that mission. Some people maintain they are, while others maintain they are doing incredible harm by shortening AI timelines. No one can deny that they have moved at least somewhat in the direction of more profit-making behavior, and are very much focused on Capabilities research. So, at best, they’ve stuck to their original mission but watered it down to allow a certain amount of profit-seeking; at worst, their overall efforts are net-negative for alignment by accelerating development of AGI.

OpenAI took investment from Microsoft in January 2023, reportedly to the tune of $10b. At the time, they said:

This multi-year, multi-billion dollar investment from Microsoft follows their previous investments in 2019 and 2021, and will allow us to continue our independent research and develop AI that is increasingly safe, useful, and powerful.

This plausibly reflects a systemic pressure other AI research organizations will face, too. Because so much more capital is available for Capabilities, any AI research organization that wants to fund research in AI Safety will be incentivized to do Capabilities research.

On the other hand, it’s striking that no organizations founded with the goal of AI Capabilities research have drifted towards Alignment research over time. Organizations under this category might include John Carmack’s recent start-up, Keen Technologies, Alphabet, and many other organizations. Systemically, this can be explained by the rules of the capitalist environment organizations work within. If you create a company to do for-profit work, and get investors to invest in the project, they’ll expect a return. If you go public, you’ll have a fiduciary duty to obtain a return for investors. For organizations, Alignment doesn’t earn money (except in so far as it improves capabilities for tasks); Capabilities does. As the amount of money available to investors grows, more and more will choose to invest in Capabilities research, because of the return available.

Incentives to research Alignment

First, let’s consider the incentives available to individuals. A self-interested, rational AI researcher can choose to work on Capabilities, or work on Alignment. What are the advantages for a rational researcher facing this choice?

There are a few. I’ve identified three:

You don’t want to die.
You don’t want humanity to die.
You will be respected by other people who admire that you’re doing something to help prevent humanity dying.

These incentives could be relevant to individual agents and corporations acting as agents.

How substantial will each of these incentives be? Let's take each in turn.

So you don’t die

How much is “you don’t want to die” worth? Empirically, for the average person, it’s not “everything you have”. Estimates of the value of a statistical life range from about $5m to $8m. These preferences can be inferred from studies that ask people how much they are willing to pay for a small reduction in their risk of death. But people value the lives of others, too, in a way that might magnify the value of a statistical life by a factor of two or three. Overall we might imagine the adjusted value of a single statistical life is somewhere around $20m.

If your P(Doom) is 20%, and you place the value on your own life that an average person places on theirs, then the value of single-handedly avoiding death through AI misalignment is 20%*$6.5m, or $1.3m. But it’s highly unlikely a rational AI researcher will be 100% confident their work will make the difference between saving the world or not; it also seems somewhat unlikely a single research agenda will reduce P(Doom) from its current level to zero. Say a rational researcher believes there is a 5% chance their particular work will reduce P(Doom) by 5%; then the rational amount they’d be willing to pay is 5%*5%*$6.5m=$16,250. AI researchers probably hold higher values of statistical life than the average person, because, on average, they have more income to spend on marginal safety improvements, so you can imagine the true value is several times higher than that: if you think 5x higher, then we get to $81,250.
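The arithmetic above can be written out as a short Python sketch. The function and variable names are made up for illustration, and the $6.5m figure is simply the midpoint of the $5m to $8m range quoted earlier; none of this is a standard valuation method, just the essay's back-of-the-envelope numbers restated.

```python
def expected_personal_value(vsl, p_work_matters, p_doom_reduction, vsl_multiplier=1.0):
    """Dollar value a self-interested researcher might place on their own Alignment work.

    vsl              -- value of a statistical life, in dollars
    p_work_matters   -- chance that this researcher's work makes any difference
    p_doom_reduction -- how much P(Doom) falls if the work does make a difference
    vsl_multiplier   -- hypothetical adjustment for researchers who value their life more
    """
    return p_work_matters * p_doom_reduction * vsl * vsl_multiplier


VSL_MIDPOINT = 6_500_000  # midpoint of the $5m to $8m range (an assumption of this sketch)

# Single-handedly removing a 20% chance of death
value_if_solo_fix = 0.20 * VSL_MIDPOINT

# 5% chance of reducing P(Doom) by 5 percentage points
baseline = expected_personal_value(VSL_MIDPOINT, 0.05, 0.05)

# Same, for a researcher whose value of statistical life is 5x the average
high_vsl = expected_personal_value(VSL_MIDPOINT, 0.05, 0.05, vsl_multiplier=5)

print(f"Solo fix:  ${value_if_solo_fix:,.0f}")  # $1,300,000
print(f"Baseline:  ${baseline:,.0f}")           # $16,250
print(f"5x VSL:    ${high_vsl:,.0f}")           # $81,250
```

Changing either 5% input shows how sensitive the result is: the value scales linearly in each probability, so doubling both quadruples the figure.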

On the other hand, if our Alignment researcher was choosing between graduate school with a $35k stipend and an entry-level software engineering job paying (conservatively taking the 25th percentile software engineering salary) $123k, their expected income sacrifice over the next five years is $440k.

In real life, choices are much more complicated than this: there are all sorts of different levels of remuneration for Alignment research vs. Capabilities research, or whatever a potential Alignment researcher might work in. Maybe it’s not a PhD vs. a lower-quartile software engineering job; perhaps it’s doing independent alignment work for $90k vs. Capabilities work for $250k. But if we take the Value of Statistical Life research seriously, it’s far from clear that the value people typically assign to their own lives makes it worth the sacrifice for an average researcher to add their own marginal effort to the Alignment problem.
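To make the comparison concrete, here is a small sketch of the income sacrifice in both scenarios. The helper name is hypothetical, and applying the same five-year horizon to the $90k vs. $250k case is an assumption of this sketch, not something stated above.

```python
def forgone_income(alignment_pay, capabilities_pay, years):
    """Total income given up by taking the lower-paid path for `years` years."""
    return (capabilities_pay - alignment_pay) * years


HORIZON_YEARS = 5  # five-year window; for the second scenario this is an assumption

scenarios = {
    "PhD stipend vs. 25th-percentile software job": forgone_income(35_000, 123_000, HORIZON_YEARS),
    "independent Alignment work vs. Capabilities job": forgone_income(90_000, 250_000, HORIZON_YEARS),
}

# The more generous private valuation from the text: 5 * (5% * 5% * $6.5m)
PRIVATE_VALUE_HIGH_VSL = 81_250

for label, sacrifice in scenarios.items():
    shortfall = sacrifice - PRIVATE_VALUE_HIGH_VSL
    print(f"{label}: sacrifice ${sacrifice:,}, exceeds private value by ${shortfall:,}")
```

Under these numbers the sacrifice is $440,000 in the first scenario and $800,000 in the second, several times the $81,250 valuation in both cases.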

Generally speaking, there’s a collective action problem. If our AI researcher valued each of the 8 billion human lives as much as their own, then with the same marginal expected impact on P(Doom) as above, that one researcher working on Alignment would have a payoff of 5%*5%*$20m*8 billion=$400 trillion, or roughly four years of 2021 world GDP. But the researcher personally captures only a sliver of that benefit, while bearing the full cost of the salary sacrifice.
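The gap between the private and the collective payoff is the heart of the collective action problem, and a few lines of Python make its size visible. All constants are the essay's own round numbers; the variable names are illustrative only.

```python
P_WORK_MATTERS = 0.05     # chance the researcher's work makes a difference
P_DOOM_REDUCTION = 0.05   # reduction in P(Doom) if it does

PERSONAL_VSL = 6_500_000          # value placed on one's own life
ADJUSTED_VSL = 20_000_000         # value of a statistical life including concern for others
WORLD_POPULATION = 8_000_000_000

# What the researcher gains if only their own life counts
private_payoff = P_WORK_MATTERS * P_DOOM_REDUCTION * PERSONAL_VSL

# What the world gains if every life is valued at the adjusted figure
social_payoff = P_WORK_MATTERS * P_DOOM_REDUCTION * ADJUSTED_VSL * WORLD_POPULATION

print(f"Private payoff: ${private_payoff:,.0f}")
print(f"Social payoff:  ${social_payoff / 1e12:,.0f} trillion")
print(f"Social / private ratio: {social_payoff / private_payoff:.2e}")
```

The social payoff comes out at $400 trillion, tens of billions of times the private figure, which is why purely self-interested reasoning undervalues the work so badly.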

Most potential AI researchers do genuinely care about others somewhat. So perhaps that makes working on Alignment worthwhile?

So humanity doesn’t die

Perhaps a motivator to work on alignment is to make sure that humanity doesn’t die. One paper suggested that altruistic concerns push up estimates of the value of a single statistical life (that is, how much the average person wants the government to be willing to pay, out of their taxes, to save a life) by a factor of 2 or 3. However, in Alignment, we're not concerned with the value of a single statistical life; we're ultimately concerned about the value to a decision-maker of saving all lives. If most of us are scope-insensitive, possibly there are swiftly diminishing returns on how much we’re willing to pay to save other lives. Dunbar’s number suggests we can only maintain social ties with around 150 others. But it seems implausible we all intuitively value all 150 people the same amount as ourselves.

Assuming we’ll value others’ lives about 10% as much as our own, over 150 people, the value of working on alignment research as described above would be revised upwards from $16,250 to $243,750 for those 150 lives alone. On top of that, if you think alignment researchers have a Value of Statistical Life for themselves that is five times as much as the average person's, then we reach a value (roughly $1.2m) large enough to justify taking the salary sacrifice to work on Alignment. So, depending on how much AI researchers differ from the average person, perhaps saving humanity really is enough of a motivator. It doesn't seem like a slam-dunk, though, and it's possible the market forces mentioned below in the "Organizational level" section coordinate to ensure the monetary return on Capabilities research is high enough.
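Here is a rough sketch of that altruism adjustment, set against the $440k five-year income sacrifice estimated earlier. The function name is hypothetical, and the 10% weight and 150-person circle are the illustrative assumptions from the paragraph above rather than empirical estimates.

```python
def altruism_adjusted_value(own_value, weight_on_others, n_others):
    """Return (value from caring about others, total including your own life)."""
    from_others = own_value * weight_on_others * n_others
    return from_others, own_value + from_others


OWN_VALUE = 16_250          # 5% * 5% * $6.5m, the self-interested valuation
INCOME_SACRIFICE = 440_000  # PhD stipend vs. software job over five years

# Average person: others weighted at 10%, across a Dunbar-sized circle of 150
from_others, total = altruism_adjusted_value(OWN_VALUE, 0.10, 150)

# Researcher with a 5x value of statistical life
from_others_5x, total_5x = altruism_adjusted_value(5 * OWN_VALUE, 0.10, 150)

print(f"Average person: ${from_others:,.0f} from others, ${total:,.0f} in total")
print(f"5x VSL:         ${from_others_5x:,.0f} from others, ${total_5x:,.0f} in total")
print(f"Average case covers the sacrifice: {total >= INCOME_SACRIFICE}")
print(f"5x case covers the sacrifice:      {total_5x >= INCOME_SACRIFICE}")
```

With these inputs the average case ($243,750 from others, $260,000 in total) falls short of the sacrifice, while the 5x case (about $1.2m from others) clears it, so the conclusion hinges on how far AI researchers depart from the average.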

It's possible that there are a large number of effective altruists out there who really do value the lives of billions of people at a non-zero rate, such that working on alignment to save humanity is a genuine motivator. It's possible there are quite a few long-termists who are concerned about ensuring there are large numbers of human-descended people living throughout the light cone. But overall, these folks are marginal, and probably don't represent the values of the average AI researcher deciding between Capabilities, Alignment, or an orthogonal project. Thus, even if long-termists and effective altruists overwhelmingly choose to work on Alignment research (an outcome that is very far from clear to me, in 2023), the marginal worker may still be faced with a higher incentive to work on Capabilities research.

For social reputation

If there is not sufficient intrinsic motivation for working on alignment, even when we consider altruism, perhaps building social capital motivates working on alignment. In general, altruistic acts accrue social reputation. Perhaps others respect people working on Alignment, but not Capabilities, because Alignment researchers work for the benefit of all whereas Capabilities researchers benefit themselves through their development of marketable AI technology while creating a negative externality of AI risk.

However, there is an important difference in how social capital vs. reduction in risk will accrue. Reduction in risk accrues specifically for impactful alignment research, but social capital accrues for alignment research that seems to be impactful to relevant others.

What are the failure modes here? There’s a very blurry line between Capabilities work and Alignment work. OpenAI and DeepMind do Capabilities work, but also do Alignment work. Some capabilities research is arguably necessary, in the big picture, to do Alignment work. Some work that at least appears to be useful for alignment may assist in advancing Capabilities. So researchers working to accrue social capital, rather than to save their own lives and the lives of those around them, may drift towards work that looks like Alignment but in practice advances Capabilities.

This is not something I expect many people will consciously calculate. In fact, I think most researchers who are aware of the AI risk problem and buy into it would recoil at the thought that their work is net negative. But there are a million ways for motivated reasoning to convince you that your net-negative capabilities work is actually net-positive Alignment work.

Precisely because the line between Alignment and Capabilities work is blurry, and perhaps difficult to see much of the time, we are all vulnerable to a motivated reasoning process that leads people to do Capabilities work while telling themselves and others they are doing Alignment work.

Incentives to research Capabilities

There are several incentives for researchers to do research on advancing AI capabilities, in a way that is potentially distinct from Alignment work. These include:

Commercial opportunities
Recognition
Social impact
Funding opportunities.

In more detail:

Commercial opportunities

AI technology is in high demand in the tech industry, and researchers who develop new AI capabilities may have the opportunity to start their own companies, work for existing tech companies, or license their technology. Given the funding available for capabilities research (see next section) it seems likely this is a much more lucrative industry to be in relative to alignment research.

This might be the crux on which this entire post mostly stands or falls. If there is not more money available in Capabilities research than Alignment research, then Alignment seems just as appealing as Capabilities work, just on the direct monetary benefits, although there are (probably relatively minor) differences in recognition to consider as well. However, although I couldn't locate salary levels for Alignment vs. Capabilities work, I would be very surprised if there is not a disparity, considering the relative levels of funding available at an organizational level (see below).

Recognition

Advancing AI capabilities can lead to groundbreaking research that can gain recognition from the academic community. This recognition can translate into career advancement in the academic world and outside of it, as well as funding for future research.

One interesting dynamic of the incentive structures available for researchers is that there may be good reasons why recognition for achievements in Capabilities is more directly aligned to actual achievements, relative to recognition for research in Alignment, which might less directly track actual achievements. The reason for this is that it is relatively straightforward to identify Capabilities achievements because we can simply test an AI for the relevant capabilities. In contrast, although of course there is plenty of empirical testing that can be done to test out Alignment solutions, the end target of alignment remains somewhat mysterious, because no one knows exactly what we’re aligning to.

Social impact

Many researchers are motivated by the potential social impact of their work. Advancing AI capabilities can lead to breakthroughs in healthcare, environmental sustainability, and other areas that can improve people's lives. While Alignment work also has social impact in terms of reducing P(Doom), this must compete with the possible benefits of Capabilities work.

Funding opportunities

Government agencies and private organizations often provide funding for research on advancing AI capabilities. Researchers who are successful in obtaining funding can use it to support their work and advance their careers.

On that note, what sort of funding opportunities are available?

Organizational level

The same incentives that apply to individuals in theory apply to corporations. The total available funding specifically for Alignment research is on the order of $10b, considering various sources of funding available to the LTFF and other sources. We can expect the yearly funding available to be substantially less than that. On the other hand, according to Forbes, investors and major companies intend to pour $50b into AI this year and $110b by 2024. While much of this money will go to implementing existing technology rather than pushing the envelope, we might also expect a snowball effect where investments in implementing existing tech fund the next generation of cutting-edge AI Capabilities research. For instance, Microsoft’s Bing Chat and GitHub Copilot are implementations of OpenAI’s GPT models that followed a major investment by Microsoft in OpenAI, much of which will be spent pushing the envelope on Capabilities work.

All this is a very broad-brush first glance. I don’t mean to suggest there isn’t safety research at OpenAI, DeepMind, or any other specific organization that does capabilities research, nor even that these organizations are spending more money on Capabilities than Safety work. Even if these organizations do more Alignment research overall than Capabilities, there are many other potential competitors who can spring up to focus on Capabilities. Thus, leaders like OpenAI and DeepMind, to maintain their lead, must continue spending on Capabilities, or they’ll be outcompeted by others who prefer to spend only on Capabilities. Very roughly speaking,

One of the primary objections I’d consider to this is that commercial incentives don’t exactly target Capabilities work; they target productionization, or implementation, i.e., applying models to specific marketable tasks, like Copilot. So it is possible that the funding imbalance I’m describing won’t really disadvantage Alignment work, because when you look closer, there isn’t much money going to research after all. Overall, my current perspective is that this funding will primarily accrue to organizations who either produce the bleeding edge of capabilities research, or pay other people who do (e.g., an LLM-based chatbot app which uses OpenAI’s API under the hood), and without special intervention, that money will “trickle down” to Capabilities research rather than safety research, because the capabilities research funds the next generation of productionizable models.

A second objection might be that actually, productionizable models rely just as much or even more on Alignment research (e.g., RLHF), and so funding in the sector could spur innovation in Alignment even more than innovation in Capabilities. I think this is an argument worth considering, but I wouldn’t want to take it for granted. If it turned out to be true, I think it would still be worth exploring, because there may be ways to systemically steer incentives even more in this direction.

Global level

Though it’s not a focus here, I emphasize that geopolitical game-theoretic factors also fit into the question of Alignment broadly speaking. Competing great powers gain an edge over each other if they push the envelope on Capabilities research, but the benefits of Alignment work accrue to the whole world in general.

Across all levels

Broadly speaking, it seems like the same principles are at play for both organizations and individuals, whether the question is where to invest or what to work on. There is more money available to fund Capabilities research, and the ROI for doing the research is higher. But for individuals, the case is more complex: social capital could also play a role. In fact, it could even be the overriding factor. However, it’s not clear whether social capital motivates work on Alignment, rather than things that look like Alignment.

Do we need investment in alignment research?

Maybe you’re convinced that the incentives favor researchers focusing on Capabilities over Alignment. But of course there is a distribution of motivations that researchers have and many will still prefer to directly work on Alignment. Perhaps, although the incentives tip towards Capabilities, we have sufficient incentives for Alignment to be solved. However:

Alignment researchers are up against all the people who say they work in Alignment but actually work in Capabilities.
At least one prominent Alignment leader believes there is currently no promising path forward for Alignment research.

If this view is correct, then perhaps it’s possible that increasing the amount of Alignment research, relative to Capabilities research, could improve the situation. On the other hand, it may be that research is generally futile, or moving the marginal researcher into AI Risk is futile. Eliezer Yudkowsky suggested a third option: that Alignment research is mostly futile, but it might be worthwhile buying out Capabilities researchers just to prevent them from pushing the envelope on Capabilities:

If you gave Miri a billion dollars, I would not know how to–Well, at a billion dollars, I might try to bribe people to move out of AI development that gets broadcast to the whole world and move to the equivalent of an island somewhere, not even to make any kind of critical discovery, but just to remove them from the system, if I had a billion dollars.

Overall, if you think that more funding for research in Alignment could help reduce P(Doom), or you think that less funding for research in Capabilities could help by buying us more time, then perhaps you're buying into a version of the argument I'm making here.

Conclusion

Depending on researchers’ individual incentive structure, it may be more attractive to work in Capabilities than Alignment work. It might seem that reducing P(Doom) is all the incentive someone needs to do research in Alignment if they can, but a whole field of research into the Value of Statistical Life suggests there is a finite amount of money people are willing to pay to marginally reduce their risk of death. For an altruistic and fully scope-sensitive person, the expected payoff is enormous, but even though most people are willing to pay to help others, few are entirely scope-sensitive, and reasonable estimates of scope insensitivity suggest the average person doesn’t rate saving humanity highly enough to make it worth doing Alignment research rather than the more rewarding capabilities research. Generating social capital could be a motivator for Alignment work, but social capital can also be generated by working in Capabilities, and whereas for Capabilities accruing social capital seems tightly aligned to actual Capabilities improvement, it isn’t clear whether the same is true for Alignment research.

LESSWRONG