Crossposted from my personal blog.

It is often argued that any alignment technique that works primarily by constraining the capabilities of an AI system to be within some bounds cannot work because it imposes too high an 'alignment tax' on the ML system. The argument is that people will either refuse to apply any method that has an alignment tax, or else apply it and be outcompeted by those who refuse. I think that this argument is applied too liberally and often without consideration for several key points:

1.) 'Capabilities' is not always a dial with two settings 'more' and 'less'. Capabilities are highly multifaceted and certain aspects of capabilities can be taxed or constrained without affecting others. Often, it is precisely these constraints that make the AI system economically valuable in the first place. We have seen this story play out very recently with language models where techniques that strongly constrain capabilities such as instruction finetuning and RLHF are, in fact, what create the economic value. Base LLMs are pretty much useless in practice for most economic tasks, and RLHF'd and finetuned LLMs are much more useful even though the universe of text that they can generate has been massively constrained. It just so happens that the constrained universe has a much greater proportion of useful text than the unconstrained universe of the base LLM. People are often, rationally, very willing to trade off capability and generalizability for reliability in practice.

2.) 'Capabilities' are not always good from our perspective economically. Many AGI doom scenarios require behaviour and planning far removed from anything that would have economic value to any current actor. As an extreme case, the classic paperclipper scenario typically arises because the model calculates that if it kills all humans it gets to tile the universe with paperclips in billions of years. Effectively, it Pascal's-mugs itself over the dream of universal paperclips. Having an AGI that can plan billions of years in the future is valuable to nobody today compared to one with a much, much, shorter planning horizon. Constraining this 'capability' has an essentially negligible alignment tax.

3.) Small alignment taxes being intolerable is an efficient market argument and the near-term AGI market is likely to be extremely inefficient. Specifically, it appears likely to be dominated by a few relatively conservative tech behemoths. The current brewing arms race between Google and Microsoft/OpenAI is bad for this, but notably this is the transition from there being *literally no competition* to *any competition at all*. Economic history also shows us that the typical results of setups like this is that the arms race will quickly defuse into a cosy and slow oligopoly. Even now there is still apparently huge slack. OpenAI have almost certainly been sitting on GPT4 for many months before partially releasing it as Bing. Google have many, many unreleased large language models, almost certainly including SOTA ones.

4.) Alignment taxes can (and should) be mandated by governments. Having regulations slow development and force safety protocols to be implemented is not a radical proposal and is in fact the case in many other industries, where it can completely throttle progress (e.g. nuclear power, with much less reason for concern). This should clearly be a focus for policy efforts.


Having an AGI that can plan billions of years in the future is valuable to nobody today compared to one with a much, much, shorter planning horizon. Constraining this ‘capability’ has an essentially negligible alignment tax.

Well, sure, if someone can figure out how to limit an AGI to plans that span only 3 months, that would be welcome, but what makes you think that anyone is going to figure that out before the end?

What progress has been made towards that welcome end in the 19 years or so since researchers started publishing on AI alignment?

This can be incentivised through an appropriate discount rate in the reward function?

Your comment makes me realize that I don't have enough knowledge to know how practical it is to impose such a limit on the length of plans.

But even if it is practical, the resulting AGI will still see the humans as a threat to its 3-month-long plan (e.g., the humans might create a second superintelligent AI during the 3 months) which is dangerous to the humans.

But maybe killing every last one of us is too much trouble with the result that the AGI 'merely' destroys our ability to generate electricity combined with creating enough chaos to prevent us from even starting to repair our electrical grid during the 3 months.

Also, the 3-month limit on plans would prevent the AGI from ending other species in other solar systems (because it takes more than 3 months to travel to other solar systems) after it ends the human species, which is an attractive feature relative to other AGI designs.


There are numerous technical measures for myopia. CAIS is one. Another is a high discount rate in the reward heuristic.
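To make the discount-rate suggestion a bit more concrete, here is a minimal sketch assuming standard exponential discounting as used in reinforcement learning, where a reward received `t` steps in the future is weighted by `gamma ** t`. The function names are hypothetical and chosen only for illustration. This shows how quickly far-future payoffs stop mattering under such an objective; it says nothing about whether a trained system would actually end up with myopic preferences.

```python
import math


def effective_horizon(gamma: float) -> float:
    """Total weight of all future steps: 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def steps_until_negligible(gamma: float, eps: float = 1e-6) -> int:
    """Smallest number of steps t at which the weight gamma ** t drops to eps or below."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be in (0, 1)")
    return math.ceil(math.log(eps) / math.log(gamma))


def discounted_value(reward: float, delay_steps: int, gamma: float) -> float:
    """Present value of a reward that arrives delay_steps steps in the future."""
    return reward * gamma ** delay_steps


if __name__ == "__main__":
    gamma = 0.9  # illustrative value, not a recommendation
    print("effective horizon:", effective_horizon(gamma))
    print("steps until weight <= 1e-6:", steps_until_negligible(gamma))
    # A huge payoff far in the future versus a tiny payoff right now.
    print("distant payoff:", discounted_value(1e12, 1000, gamma))
    print("immediate payoff:", discounted_value(1.0, 0, gamma))
```

Under these assumptions, an enormous reward a thousand steps away is worth less than a unit reward now, which is the sense in which a high discount rate penalises very long plans.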

This all seems basically sensible and checks out.

Re: your arguments #1, #2 and #4, we very well might make the decision to pursue modular implementations of transformative artificial intelligence such as Drexler's Open Agency architecture or Comprehensive AI Services over autonomous sovereigns and accept the inefficiency from humans in the loop and modularity because:

  1. Modular architectures are much easier to oversee/govern (i.e. "scalable oversight" is more tractable)
  2. Correctness/robustness of particular components/services can be locally verified; modular architectures may be more reliable/trustworthy for this reason and thus more economically competitive
  3. Such implementations are less vulnerable/prone to (or at least offer fewer affordances for) "power seeking"/"influence seeking" behaviour; the risk of takeover and disempowerment is lower
  4. Misaligned AI is likely to cause small local failures before global catastrophic failures, and hostile sociocultural/political/regulatory reactions to such failures (see nuclear) could well incentivise the big AI labs to play it (very) safe lest they strangle their golden goose

Re: #3 many of the biggest/main labs have safety teams and seem to take existential risk from advanced artificial intelligence seriously:

  • Anthropic
  • DeepMind
  • OpenAI

I guess Google Brain and Meta AI stand out as big/well-funded teams that aren't (yet) safety pilled.

I think 1 and 2 are good points. A big issue is that 2 might have problems with "picking up pennies in front of the steamroller," where increases in planning ability and long-term coherence are valuable all the way from minutes to years, so pushing the envelope might be incentivized even if going too far would lose economic value.

3 I think neglects that extra features are harder for individual actors too, not just in competition. Alignment taxes that just look like "extra steps that you have to get right" will predictably increase the difficulty of getting the steps right.

For 4, what kind of substance to the regulation are you imagining? I think government regulation could be useful for slowing down data and compute intensive development. But what standards would you use for detecting when someone is developing AI that needs to be safe, and what safety work do you want mandated? The EU's recent regulation is maybe a good place to start from, but it doesn't address AGI and it doesn't address the sort of safety work that AGI implies.

Economic history also shows us that the typical results of setups like this is that the arms race will quickly defuse into a cosy and slow oligopoly.

It took over half a century and enormous geopolitical changes for the nuclear arms race to 'defuse into a cosy and slow oligopoly.'

In principle, AI regulation sounds like the type of coordination problem that governments were designed to solve.

In practice… I once saw a clip of a US senator bringing in a snowball from outside the building to show that global warming wasn’t real. And that’s just the tip of the iceberg (pun intended) for the pretty egregious level of, I don’t even know what to call it, childishness I guess, of lots of the people in power.

And that’s just the incompetence side, not even mentioning the grift side.

With the current state of our legal institutions, I’m not sure we’d actually be better off with any regulation. I could be wrong, but I highly doubt there’d be a net benefit, and wouldn’t be surprised if it ends up actually benefitting the worst actors.

I can’t even imagine what a conversation trying to explain alignment tax to a legal body would look like.


To add to the above: consider the autonomous car and general robotics use cases. These have essentially zero economic value if reliability is so low that they cause more damage than gain (damage to equipment, injuries to humans, damage to materials being handled).

"Sovereign" complex systems can't be debugged. If they are arbitrarily able to do anything, it isn't possible to constrain them to be productive.

Economic history also shows us that the typical results of setups like this is that the arms race will quickly defuse into a cosy and slow oligopoly.

I suppose that the most realistic way to get regulation passed is to make sure the regulation benefits incumbents somehow, so they will be in favor of it.