
A safety-capabilities tradeoff is when you have something like a dial on your AGI, and one end of the dial says “more safe but less capable”, and the other end of the dial says “less safe but more capable”.

Make no mistake: safety-capabilities tradeoff dials stink. But I want to argue that they are inevitable, and we better get used to them. I will argue that the discussion should be framed as “Just how problematic is this dial? How do we minimize its negative impact?”, not “This particular approach has a dial, so it’s automatically doomed. Let’s throw it out and talk about something else instead.”

(Recent examples of the latter attitude, at least arguably: here, here.)

1. Background (if it’s not obvious): why do safety-capabilities tradeoff dials stink?

The biggest problem is that the world has ambitious people and competitive people and optimistic people and people desperate to solve problems (like climate change, or the fact that their child is dying of cancer, or the fact that their company is about to go bust and they’ll have to lay off thousands of employees who are depending on them, etc. etc.). Also, we don’t have research liability insurance that internalizes the societal cost of catastrophic accidents. And even if we did, I don’t think that would work for catastrophes so bad that no humans would be left to sue the insurer, or so weird and new that nobody knows how to price them, or if the insurance is so expensive that AGI research shifts to a country where insurance is not required, or shifts underground, etc.

So I expect by default that a subset of people would set their dial to “less safe but more capable”, and then eventually we get catastrophic accidents with out-of-control AGIs self-replicating around the internet and all that.

2. Why do I say that these dials are inevitable?

Here are a few examples.

  • Pre-deployment testing. Will pre-deployment testing help with safety? Yeah, duh! (It won’t catch every problem, but I think it’s undeniable that it would help more than zero.) And will pre-deployment testing hurt capabilities? You bet! The time and money and compute and personnel on the Sandbox Testing Team are time and money and compute and personnel that are not training the next generation of more powerful AGIs. Well, that’s an oversimplification. Some amount of sandbox testing would help capabilities, by helping the team better understand how things are going. But there’s an optimal amount of sandbox testing for capabilities, and doing further testing beyond that point is a safety-capabilities tradeoff.
  • Humans in the loop. If we run the model without ever pausing to study and test it, and allow the AGI to execute plans that we humans do not or even cannot understand, it will be able to accomplish things faster, but less safely.
  • Resources. As we give an AGI things like full unfiltered internet access and money and extra computers to use and so on, it will make the AGI more capable, but also less safe (since an alignment failure would more quickly and reliably turn into an unrecoverable catastrophe).
  • Following human norms and laws. Presumably we want people to try to make AGIs that follow the letter and spirit of all laws and social customs (cf. “Following Human Norms”). There’s a sliding scale of how conservative the AGI might be in those respects. Higher conservatism reduces the risk of dangerous and undesired behavior, but also prevents the AGI from taking some actions that would have been very good and effective. So this is yet another safety-capabilities tradeoff.

3. A better way to think about it: “alignment tax”

I like the "alignment tax" framing (the terminology comes from Eliezer Yudkowsky via Paul Christiano). (Maybe it would have been slightly better to call it “safety tax” in the context of this post, but whatever.)

  • If there’s no tradeoff whatsoever between safety and capabilities, that would be an "alignment tax" of 0%—the best case.
  • If we know how to make unsafe AGIs but we don't know how to make safe AGIs, that would be an “alignment tax” of infinity%—the worst case.
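
To make these two endpoints a bit more concrete, here is one rough way to quantify the tax (my own illustrative notation, not anything from the original “alignment tax” discussion): let $C_{\text{unsafe}}$ be the cost (time / money / compute / whatever) to build an unsafe AGI at some capability level, and $C_{\text{safe}}$ the cost to build a safe AGI at that same capability level. Then:

$$\text{alignment tax} = \frac{C_{\text{safe}} - C_{\text{unsafe}}}{C_{\text{unsafe}}}$$

This is 0% when safety comes for free ($C_{\text{safe}} = C_{\text{unsafe}}$), and infinity% when no amount of extra resources gets you a safe AGI at that capability level ($C_{\text{safe}} = \infty$), which lines up with the “not even with extra time / money / compute / whatever” question below.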

So if we’re talking about a proposal that entails a safety-capabilities tradeoff dial, the first questions I would ask are:

  • Does this safety proposal entail an alignment tax of infinity%? That is, are there plans / ideas that an unsafe AI could come up with and execute, but a safe AI can’t, not even with extra time / money / compute / whatever?
  • If so, would the capability ceiling for “safe AI” be so low that the safe AI is not even up to the task of doing superhuman AGI safety research? (i.e., is the “safe AI” inadequate for use in a “bootstrapping” approach to safe powerful AGI?)

If yes to both, that’s VERY BAD. If any safety proposal is like that, it goes straight into the garbage!

Then the next question to ask is: if the alignment tax is less than infinity, that’s a good start, but just how high or low is the tax? There’s no right answer anymore: it’s just “The lower the better”.

After all, we would like a world of perfect compliance, where all relevant actors are always paying the alignment tax in order to set their dial to “safe”. I don’t think this is utterly impossible. There are three things that can work together to make it happen:

  • As-low-as-possible alignment tax
  • People’s selfish desire for their own AGIs to stay safe and under control.
  • _____________    [this bullet point is deliberately left blank, to be filled in by some coordination mechanism / legal framework / something-or-other that the AGI strategy / governance folks will hopefully eventually come up with!!]

Updated to add: For that last one, it would also help if the settings of the important safety-capabilities tradeoff dials were somehow externally legible / auditable by third parties. (Thanks Koen Holtman in the comments.)

 

(Thanks Alex Turner & Miranda Dixon-Luinenburg for comments on a draft.)

Comments

I agree with your central observation that safety-capabilities tradeoff dials are inevitable in AGI. It is useless to search only for a safety mechanism that every purely selfish owner of an AGI will use voluntarily, at a safe setting, even when such selfish owners are engaged in an arms race.

However, I disagree with another point you make:

Then the next question to ask is: if the alignment tax is less than infinity, that’s a good start, but just how high or low is the tax? There’s no right answer anymore: it’s just “The lower the better”.

The right answer is definitely not 'the lower the better'. I'd say that you are framing the cost of the alignment tax as 'the tax is higher if it gets more in the way of selfish people's desire to fully pursue their selfish ends'. In this, you are following the usual 'alignment mechanisms need to be commercially viable' framing on this forum. But with this limited framing, lower alignment taxes will not necessarily imply better outcomes for all of society.

What you need to do is examine the blank in your third bullet point above more carefully, the blank line you expect the AGI strategy / governance folks to hopefully fill in. What these AGI strategy / governance folks are looking for is clear: they want to prevent destructive arms races, and races-to-the-bottom-of-safety, from happening between somewhat selfish actors, where these actors may be humans or organisations run by humans. This prevention can only happen if there is a countervailing force to all this selfishness.

(The alternative to managing a countervailing force would be to put every selfish human in a re-education camp, and hope that it works. Not the kind of solution that the strategy/governance folks are looking for, but I am sure there is an SF novel somewhere about a re-education camp run by an AGI. Another alternative would be to abandon the moral justification for the market system and markets altogether: abandon the idea of having any system of resource allocation where self-interested actions can be leveraged to promote the common good. Again, not really what most governance folks are looking for.)

Strategy/governance folks are not necessarily looking for low-cost dials as you define them; they are looking for AGI safety dials that they can all force selfish actors to use, as part of the law or of a social contract (where a social contract can be between people but also between governments or government-shaped entities). To enforce the use of a certain dial at a certain setting, one must be able to detect whether an actor is breaking the law or social contract by not using the dial, so that sanctions can be applied. The easier and more robust this detection process is, the more useful the dial is.

The cost to the selfish actors of being forced to use the dial is only a secondary consideration. The less self-aware of these selfish actors can be relied on to always complain about being bound to any law or social contract whatsoever.

The part I'll agree with is: if we look at a dial, we can ask the question:

If there's an AGI with a safety-capabilities tradeoff dial, to what extent is the dial's setting externally legible / auditable to third parties?

More legible / auditable is better, because it could help enforcement.

I agree with this, and I have just added it to the article. But I disagree with your suggestion that this is counter to what I wrote. In my mind, it's an orthogonal dimension along which dials can vary. I think it's good if the dial is auditable, and I think it's also good if the dial corresponds to a very low alignment tax rate.

I interpret your comment as saying that the alignment tax rate doesn't matter because there will be enforcement, but I disagree with that. I would invoke an analogy to actual taxes. It is already required and enforced that individuals and companies pay (normal) taxes. But everyone knows that a 0.1% tax on Thing X will have a higher compliance rate than an 80% tax on Thing X, other things equal.

After all, everyone is making decisions about whether to pay the tax, versus not pay the tax. Not paying the tax has costs. It's a cost to hire lawyers that can do complicated accounting tricks. It's a cost to run the risk of getting fined or imprisoned. It's a cost to pack up your stuff and move to an anarchic war zone, or to a barge in the middle of the ocean, etc. It's a cost to get pilloried in the media for tax evasion. People will ask themselves: are these costs worth the benefits? If the tax is 0.1%, maybe it's not worth it, maybe it's just way better to avoid all that trouble by paying the tax. If the tax is 80%, maybe it is worth it to engage in tax evasion.

So anyway, I agree that "there will be good enforcement" is plausibly part of the answer. But good enforcement plus low tax will sum up to higher compliance than good enforcement by itself. Unless you think "perfect watertight enforcement" is easy, so that "willingness to comply" becomes completely irrelevant. That strikes me as overly optimistic. Perfect watertight enforcement of anything is practically nonexistent in this world. Perfect watertight enforcement of experimental AGI research would strike me as especially hard. After all, AGI research is feasible to do in a hidden basement / anarchic war zone / barge in the middle of the ocean / secret military base / etc. And there are already several billion GPUs untraceably dispersed all across the surface of Earth.

Based on what you say above, I do not think we fundamentally disagree. There are orthogonal dimensions to safety mechanism design which are all important.

I somewhat singled out your line of 'the lower the better' because I felt that your taxation framing was too one-dimensional.

There is another matter: in US/UK political discourse, it is common that if someone wants to prevent the government from doing something useful, this something will be framed as a tax, or as interfering with economic efficiency. If someone does want the government to actually do a thing, in fact spend lavishly on doing it, the same thing will often be framed as enforcement. This observation says something about the quality of the political discourse. But as a continental European, I do not want to examine the quality of the discourse here, only the rhetorical implications.

When you frame your safety dials as taxation, you are rhetorically somewhat shooting yourself in the foot, if you want to proceed by arguing that these dials should not be thrown out of the discussion.

When re-framed as enforcement, the cost of using these safety dials suddenly does not sound as problematic anymore.

But enforcement, in a way that limits freedom of action, is indeed a burden to those at the receiving end, and if enforcement is too heavy they might seek to escape it altogether. I agree that perfectly inescapable, watertight enforcement is practically nonexistent in this world; in fact, I consider its non-existence to be more of a desirable feature of society than a bug.

But to use your terminology, the level of enforcement applied to something is just one of these tradeoff dials that stink. That does not mean we should throw out the dial.

Just posted an analysis of the epistemic strategies used in this post, which helps make the reasoning more explicit IMO.