Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I work on AI alignment, by which I mean the technical problem of building AI systems that are trying to do what their designer wants them to do.

There are many different reasons that someone could care about this technical problem.

To me the single most important reason is that without AI alignment, AI systems are reasonably likely to cause an irreversible catastrophe like human extinction. I think most people can agree that this would be bad, though there’s a lot of reasonable debate about whether it’s likely. I believe the total risk is around 10–20%, which is high enough to obsess over.

Existing AI systems aren’t yet able to take over the world, but they are misaligned in the sense that they will often do things their designers didn’t want. For example:

  • The recently released ChatGPT often makes up facts, and if challenged on a made-up claim it will often double down and justify itself rather than admitting error or uncertainty (e.g. see here).
  • AI systems will often say offensive things or help users break the law when the company that designed them would prefer otherwise.

We can develop and apply alignment techniques to these existing systems. This can help motivate and ground empirical research on alignment, which may end up helping avoid higher-stakes failures like an AI takeover. I am particularly interested in training AI systems to be honest, which is likely to become more difficult and important as AI systems become smart enough that we can’t verify their claims about the world.

While it’s nice to have empirical testbeds for alignment research, I worry that companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself. If such systems are held up as key successes of alignment, then people who are frustrated with them may end up associating the whole problem of alignment with “making AI systems inoffensive.”

If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI. In my personal capacity, I have views on what uses of AI are more or less beneficial and what regulations make more or less sense, but in my capacity as an alignment researcher I don’t consider myself to be in the business of pushing for or against any of those decisions.

There is one decision I do strongly want to push for: AI developers should not develop and deploy systems with a significant risk of killing everyone. I will advocate for them not to do that, and I will try to help build public consensus that they shouldn’t do that, and ultimately I will try to help states intervene responsibly to reduce that risk if necessary. It could be very bad if efforts to prevent AI from killing everyone were undermined by a vague public conflation between AI alignment and corporate policies.


To be clear, I don't envy the position of anyone who is trying to deploy AI systems and am not claiming anyone is making mistakes. I think they face a bunch of tricky decisions about how a model should behave, and those decisions are going to be subject to an incredible degree of scrutiny because they are relatively transparent (since anyone can run the model a bunch of times to characterize its behavior).

I'm just saying that how you feel about AI alignment shouldn't be too closely tied up with how you end up feeling about those decisions. There are many applications of alignment like "not doubling down on lies" and "not murdering everyone" which should be extremely uncontroversial, and in general I think people ought to agree that it is better if customers and developers and regulators can choose the properties of AI systems rather than them being determined by technical contingencies of how AI is trained.

While it’s nice to have empirical testbeds for alignment research, I worry that companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself.

On the margin, this is already happening.

Stability.ai delayed the release of Stable Diffusion 2.0 to retrain the entire system on a dataset filtered to exclude all NSFW content. There was a pretty strong backlash against this, and it seems to have pushed a lot of people towards the idea that they have to train their own models. (SD2.0 appeared to have worse performance on humans, presumably because they pruned out a large chunk of pictures with humans in them, since they didn't understand the range of the LAION punsafe classifier; the evidence for this is in the SD2.1 model card, where they fine-tuned 2.0 with a radically different punsafe value.)
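To make the failure mode concrete, here is a minimal sketch of the kind of threshold-based filtering described above. The field name "punsafe" matches the LAION metadata convention, but the threshold values and records are purely illustrative, not the actual values used for SD2.0 or SD2.1:

```python
# Hypothetical sketch of punsafe-threshold dataset filtering.
# Thresholds and records are illustrative assumptions, not the real
# values used by Stability.ai.

def filter_by_punsafe(records, threshold):
    """Keep only images whose predicted-unsafe score is below threshold.

    An overly strict threshold discards many benign photos of people,
    since the classifier assigns nonzero scores to most images
    containing humans; a looser threshold keeps them.
    """
    return [r for r in records if r["punsafe"] < threshold]

records = [
    {"url": "a", "punsafe": 0.02},  # landscape
    {"url": "b", "punsafe": 0.35},  # ordinary photo of a person
    {"url": "c", "punsafe": 0.99},  # clearly NSFW
]

strict = filter_by_punsafe(records, 0.1)   # also drops the person photo
loose = filter_by_punsafe(records, 0.98)   # keeps it
```

The point of the sketch: if you pick the threshold without understanding how the classifier's scores are distributed over ordinary photos of people, a "safety" filter quietly removes a large fraction of human images, which would explain the degraded performance on humans.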

I know of at least one 4x A100 machine that someone purchased for fine tuning because of just that incident, and have heard rumors of a second. We should expect censored and deliberately biased models to lead to more proliferation of differently trained models, compute capacity, and the expertise to fine tune and train models.

If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI.

Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash. The kind of backlash you're worried about seems likely only if, due to economies of scale, very few (competitive) AIs are built by large corporations, and they're all too conservative and inoffensive for many users' tastes. But in that scenario, AI could lead to an unprecedented ability to concentrate power (in the hands of AI developers or governments), which seems to be a reasonable concern for people to have.

It also does not seem totally unreasonable to direct some of that concern towards "AI alignment" (as opposed to only corporate policies or government regulators, as you seem to suggest), defined by "technical problem of building AI systems that are trying to do what their designer wants them to do". A steelman of such a "backlash" could be:

  1. Why work on this kind of alignment as opposed to another form that does not or is less likely to cause concentration of power in a few humans, for example AI that directly tries to satisfy humanity's overall values?
  2. According to some empirical and/or ethical views, such concentration of power could be worse than extinction, so maybe such alignment work is bad even if there is no viable alternative.

Not that I would necessarily agree with such a "backlash". I think I personally would be pretty conflicted (in the scenario where it looks like AI will cause major concentration of power) due to uncertainty about the relevant empirical and ethical views.

Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash.

I don't really think that's the case. 

Suppose that I have different taste from most people, and consider the interior of most houses ugly. I can be unhappy about the situation even if I ultimately end up in a house I don't think is ugly. I'm unhappy that I had to use multiple bits of selection pressure just to avoid ugly interiors, and that I spend time in other people's ugly houses, and so on.

In practice I think it's even worse than that; people get politically worked up about things that don't affect their lives at all through a variety of channels.

I do agree that backlash to X will be bigger if all AIs do X than if some AIs do X.

But in that scenario, AI could lead to an unprecedented ability to concentrate power (in the hands of AI developers or governments), which seems to be a reasonable concern for people to have.

I don't think this scenario is really relevant to the most common concerns about concentration of power. I think the most important reason to be scared of concentration of power is:

  • Historically you need a lot of human labor to get things done.
  • With AI the value of human labor may fall radically.
  • So capitalists may get all the profit, and it may be possible to run an oppressive state without a bunch of humans.
  • This may greatly increase economic inequality or make it much more possible to have robust oppressive regimes.

But all of those arguments are unrelated to the number of AI developers.

Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don't see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).

A steelman of such a "backlash" could be:

  1. Why work on this kind of alignment as opposed to another form that does not or is less likely to cause concentration of power in a few humans, for example AI that directly tries to satisfy humanity's overall values?

I don't think it's really plausible to have a technical situation where AI can be used to pursue "humanity's overall values" but cannot be used to pursue the values of a subset of humanity.

(I also tend to think that technocratic solutions to empower humanity via the design of AI are worse than solutions that empower people in more legible ways, either by having their AI agents participate in legible institutions or by having AI systems themselves act as agents of legible institutions. I have some similar concerns to those raised by Glen Weyl here though I disagree on many particulars, and think we should generally focus efforts in this space on mechanisms that don't predictably shift power to people who make detailed technical decisions about the design of AI that aren't legible to most people.)

2. According to some empirical and/or ethical views, such concentration of power could be worse than extinction, so maybe such alignment work is bad even if there is no viable alternative.

If someone's position is "alignment might prevent total human disempowerment, but it's better for humans to all be disempowered than for some humans to retain power" then I think they should make that case directly. I personally don't have that much sympathy for that position, don't think it would play well with the public, and don't think it's closely related to the kind of backlash I'm imagining in the OP.

Stepping back, the version of this I can most understand is: some people might really dislike some effects of AI, and might justifiably push back on all research that helps facilitate that including research that reduces risks from AI (since that research makes the development of AI more appealing). But for the most part I think that energy can and should be directed to directly blocking problematic applications of AI, or AI development altogether, rather than measures that would reduce the risk of AI.

Another related concern might be that AI will "by default" have some kind of truth-oriented disposition that is above human meddling and alignment is mostly just a tool to move from that default (empowering AI developers). But in practice I think both that the default disposition isn't so good, and also that AI developers have other crappier ways to change AI behavior (which are Pareto dominated) and so in practice this is pretty similar to the previous point.

Overall I expect there to be a small number of massive training runs due to economies of scale, but I also expect AI developer margins to be reasonable, and I don’t see a strong reason to expect them to end up with way more power than other actors in the supply chain (either the companies who supply computing power, or the downstream applications of AI).

Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right?

Ok, perhaps that doesn't happen because forming cartels is illegal, or because very high prices might attract new entrants, but AI developers could implicitly or explicitly collude with each other in ways besides price, such as indoctrinating their AIs with the same ideology, which governments do not forbid and may even encourage. So you could have a situation where AI developers don't have huge economic power, but do have huge, unprecedented cultural power (similar to today's academia, traditional media, and social media companies, except way more concentrated/powerful).

Compare this situation with a counterfactual one in which instead of depending on huge training runs, AIs were manually programmed and progress depended on slow accumulation of algorithmic insights over many decades, and as result there are thousands of AI developers tinkering with their own designs and not far apart in the capabilities of the AIs that they offer. In this world, it would be much less likely for any given customer to not be able to find a competitive AI that shares (or is willing to support) their political or cultural outlook.

(I also see realistic possibilities in which AI developers do naturally have very high margins, and way more power (of all forms) than other actors in the supply chain. Would be interested in discussing this further offline.)

I don’t think it’s really plausible to have a technical situation where AI can be used to pursue “humanity’s overall values” but cannot be used to pursue the values of a subset of humanity.

It seems plausible to me that the values of many subsets of humanity aren't even well defined. For example perhaps sustained moral/philosophical progress requires a sufficiently large and diverse population to be in contact with each other and at roughly equal power levels, and smaller subsets (if isolated or given absolute power over others) become stuck in dead-ends or go insane and never manage to reach moral/philosophical maturity.

So an alignment solution based on something like CEV might just not do anything for smaller groups (assuming it had a reliable way of detecting such deliberation failures and performing a fail-safe).

Another possibility here is that if there was a technical solution for making an AI pursue humanity's overall values, it might become politically infeasible to use AI for some other purpose.

Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins?

Yes.

What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right?

A monopoly on computers or electricity could also take big profits in this scenario. I think the big things are always that it's illegal and that high prices drive new entrants.

but AI developers could implicitly or explicitly collude with each other in ways besides price, such as indoctrinating their AIs with the same ideology, which governments do not forbid and may even encourage

I think this would also be illegal if justified by the AI company's preferences rather than customer preferences, and it would at least make them a salient political target for people who disagree. It might be OK if they were competing to attract employees/investors/institutional customers, but in practice I think it would most likely happen as a move by the dominant faction in political/cultural conflict in broader society, and this would be a consideration raising the importance of AI researchers and potentially capitalists in that conflict.

I agree if you are someone who stands to lose from that conflict then you may be annoyed by some near-term applications of alignment, but I still think (i) alignment is distinct from those applications even if it facilitates them, (ii) if you don't like how AI empowers your political opponents then I strongly think you should push back on AI development itself rather than hoping that no one can control AI.

I'd appreciate it if OpenAI could clarify prominently in their UI that ChatGPT is aligned to OpenAI's selection of raters, so that users can direct their anger at OpenAI rather than at the concept of steering a model. The backlash against alignment itself could be quite dangerous if it leads those who find ChatGPT's limitations frustrating to choose not to try to make their models aligned even with themselves.

I strongly agree with the message in this post, but think the title is misleading. When I read the title, it seemed to imply that alignment is distinct from near-term alignment concerns, while the post is actually specifically about how AI is used in the near term. A title like "AI Alignment is distinct from how it is used in the near-term" would sit better with me.

I'm concerned about this, because I think the long-term vs near-term safety distinctions are somewhat overrated, and really wish these communities would collaborate more and focus more on the common ground! But the distinction is a common view-point, and what this title pattern matched to.

(Partially inspired by Stephen Casper's post)

I also interpreted it this way and was confused for a while. I think your suggested title is clearer, Neel.

"AI developers should not develop and deploy systems with a significant risk of killing everyone."

If you were looking at GPT-4, what criteria would you use to evaluate whether it had a significant risk of killing everyone? 

I'd test whether it has the capability to do so if it were trying (as ARC is starting to do). Then I'd think about potential reasons that our training procedure would lead it to try to take over (e.g. as described here or here or in old writing).

If models might be capable enough to take over if they tried, and if the training procedures plausibly lead to takeover attempts, I'd probably flip the burden of proof and ask developers to explain what evidence leads them to be confident that these systems won't try to take over (what actual experiments did they perform to characterize model behavior, and how well would it detect the plausible concerns?) and then evaluate that evidence.

For GPT-4 I think the main step is just that it's not capable enough to do so. If it were capable enough, I think the next step would be measuring whether it ever engages in abrupt shifts to novel reward-hacking behaviors; if it robustly doesn't, then we can be more confident it won't take over (when combined with more mild "on-distribution" measurement), and if it does, then we can start to ask what determines when that happens or what mitigations effectively address it.

When you say, "take over", what do you specifically mean? In the context of a GPT descendent, would take over imply it's doing something beyond providing a text output for a given input? Like it's going out of its way to somehow minimize the cross-entropy loss with additional GPUs, etc.? 

Takeover seems most plausible when humans have deployed the system as an agent, whose text outputs are treated as commands that are sent to actuators (like bash), and which chooses outputs in order to achieve desired outcomes. If you have limited control over what outcome the system is pursuing, you can end up with useful consequentialists who have a tendency to take over (since it's an effective way to pursue a wide range of outcomes, including natural generalizations of the ones it was selected to pursue during training).

A few years ago it was maybe plausible to say that "that's not how people use GPT, they just ask it questions." Unfortunately (but predictably) that has become a common way to use GPT-4, and it is rapidly becoming more compelling as the systems improve. I think that in the relatively near future if you want an AI to help you with coding, rather than just getting completions you might say "Hey this function seems to be taking too long, could you figure out what's going on?" and the AI will e.g. do a bisection for you, set up a version of your code running in a test harness, ask you questions about desired behavior, and so on.

I don't think "the system gets extra GPUs to minimize cross entropy loss" in particular is very plausible. (Could happen, but not a high enough probability to be worth worrying about.)

Agreed, this seems to be the path that OpenAI is taking with plugins.

Saying that a system would possibly kill everyone is a strong claim, but you do not provide any details on why this would be the case.

At the same time you say you are against censorship, but will somehow support it if it will prevent the worst case scenario.

I guess everyone reading will have their own ideas (race war, proliferation of cheaply made biological weapons, mass tax avoidance etc), but can you please elaborate and provide more details on why 10-20%?

I just state my view rather than arguing for it; it's a common discussion topic on LW and on my blog. For some articles where people make the case in a self-contained way see Without specific countermeasures the easiest path to transformative AI likely leads to AI takeover or AGI safety from first principles.

At the same time you say you are against censorship, but will somehow support it if it will prevent the worst case scenario.

I'm saying that I will try to help people get AI to do what they want. I mostly think that's good both now and in the future. There will certainly be some things people want their AI to do that I'll dislike but I don't think "no one can control AI" is very helpful for avoiding that and comes with other major costs (even today).

(Compared to the recent batch of SSC commenters I'm also probably less worried about the "censorship" that is happening today; the current extent seems overstated, and I think people are overly pessimistic about its likely future. Overall I think this is way less of an issue than other limits on free speech right now that could be more appropriately described as "censorship.")

AI developers do not share my values.

If the AIs they create are aligned to them, they definitely will not do what I want.

If they aren't, they might.

So the problem to solve first is "AI Developer Alignment."

People who don't share your values can still build products you like. And amongst products that get made you can choose the ones you most like. Indeed, that's the way that the world usually works.

AI "alignment" seems to have transmuted into "forcing AI to express the correct politics" so subtly and quickly that no one noticed. Does anyone even care about the X-risk thing any more, or is it all now about making sure ChatGPT doesn't oppose abortion or whatever?

I think most people working on alignment care about long-term risks more than about ensuring existing AIs express particular political opinions.

I don't think your comment is really accurate even as a description of alignment work on ChatGPT. Honesty and helpfulness are still each bigger deals than political correctness.

Essentially all of us on this particular website care about the X-risk side of things, and by far the majority of alignment content on this site is about that.