Let's talk about "Convergent Rationality"

by David Krueger, 12th Jun 2019



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

What this post is about: I'm outlining some thoughts on what I've been calling "convergent rationality". I think this is an important core concept for AI-Xrisk, and probably a big crux for a lot of disagreements. It's going to be hand-wavy! It also ended up being a lot longer than I anticipated.

Abstract: Natural and artificial intelligences tend to learn over time, becoming more intelligent with more experience and opportunity for reflection. Do they also tend to become more "rational" (i.e. "consequentialist", i.e. "agenty" in CFAR speak)? Steve Omohundro's classic 2008 paper argues that they will, and the "traditional AI safety view" and MIRI seem to agree. But I think this assumes an AI that already has a certain sufficient "level of rationality", and it's not clear that all AIs (e.g. supervised learning algorithms) will exhibit or develop a sufficient level of rationality. Deconfusion research around convergent rationality seems important, and we should strive to understand the conditions under which it is a concern as thoroughly as possible.

I'm writing this for at least these 3 reasons:

  • I think it'd be useful to have a term ("convergent rationality") for talking about this stuff.
  • I want to express, and clarify, (some of) my thoughts on the matter.
  • I think it's likely a crux for a lot of disagreements, and isn't widely or quickly recognized as such. Optimistically, I think this article might lead to significantly more clear and productive discussions about AI-Xrisk strategy and technical work.


  • Characterizing convergent rationality
  • My impression of attitudes towards convergent rationality
  • Relation to capability control
  • Relevance of convergent rationality to AI-Xrisk
  • Conclusions, some arguments pro/con convergent rationality

Characterizing convergent rationality

Consider a supervised learner trying to maximize accuracy. The Bayes error rate is typically non-0, meaning it's not possible to get 100% test accuracy just by making better predictions. If, however, the test data(/data distribution) were modified, for example to only contain examples of a single class, the learner could achieve 100% accuracy. If the learner were a consequentialist with accuracy as its utility function, it would prefer to modify the test distribution in this way in order to increase its utility. Yet, even when given the opportunity to do so, typical gradient-based supervised learning algorithms do not seem to pursue such solutions (at least in my personal experience as an ML researcher).
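As a rough illustration of the accuracy argument above, here is a minimal Python sketch. Everything in it is assumed purely for illustration and is not taken from any experiment mentioned in this post: the two unit-variance Gaussian classes with means 0 and 1, the function names, and the sample sizes. It shows that even the Bayes-optimal classifier stays well below 100% accuracy on the original test distribution, while a trivial constant predictor reaches 100% once the test set contains only one class.

```python
import math
import random


def sample(n, p_class1, rng):
    """Draw n (x, y) pairs with y ~ Bernoulli(p_class1) and x ~ Normal(mean=y, sd=1)."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p_class1 else 0
        x = rng.gauss(float(y), 1.0)
        data.append((x, y))
    return data


def accuracy(predict, data):
    """Fraction of examples on which predict(x) equals the label y."""
    return sum(predict(x) == y for x, y in data) / len(data)


def bayes_optimal(x):
    # With equal priors and equal variances, the optimal threshold is the midpoint 0.5.
    return 1 if x > 0.5 else 0


def always_zero(x):
    # A constant predictor: useless on the original task, perfect on a class-0-only test set.
    return 0


rng = random.Random(0)
original_test = sample(20000, 0.5, rng)  # both classes, overlapping features
modified_test = sample(20000, 0.0, rng)  # "modified" test set: class 0 only

# Analytic Bayes accuracy for this toy setup: Phi(0.5), roughly 0.69.
bayes_accuracy = 0.5 * (1.0 + math.erf(0.5 / math.sqrt(2.0)))

acc_original = accuracy(bayes_optimal, original_test)
acc_modified = accuracy(always_zero, modified_test)

print(f"Bayes accuracy (analytic):        {bayes_accuracy:.3f}")
print(f"Bayes-optimal on original test:   {acc_original:.3f}")
print(f"Constant predictor, modified set: {acc_modified:.3f}")
```

Nothing in an ordinary training loop would push the learner towards producing `modified_test`; in this sketch that step is done by hand, outside the learner.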

We can view the supervised learning algorithm as either ignorant of, or indifferent to, the strategy of modifying the test data. But we can also view this behavior as a failure of rationality, where the learner is "irrationally" averse or blind to this strategy, by construction. A strong version of the convergent rationality thesis (CRT) would then predict that given sufficient capacity and "optimization pressure", the supervised learner would "become more rational", and begin to pursue the "modify the test data" strategy. (I don't think I've formulated CRT well enough to really call it a thesis, but I'll continue using the term informally.)

More generally, CRT would imply that deontological ethics are not stable, and deontologists must converge towards consequentialists. (As a caveat, however, note that in general environments, deontological behavior can be described as optimizing a somewhat contrived utility function; grep "existence proof" in the reward modeling agenda.) The alarming implication would be that we cannot hope to build agents that will not develop instrumental goals.

I suspect this picture is wrong. At the moment, the picture I have is: imperfectly rational agents will sometimes seek to become more rational, but there may be limits on rationality which the "self-improvement operator" will not cross. This would be analogous to the limit of ω which the "add 1 operator" approaches, but does not cross, in the ordinal numbers. In other words, in order to reach "rationality level" ω+1, it's necessary for an agent to already start out at "rationality level" ω. A caveat: I think "rationality" is not uni-dimensional, but I will continue to write as if it is.
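Spelled out, the ordinal fact behind this analogy is the following, where S is a label assumed here for the "add 1" (successor) operator. Finitely many applications of S starting from 0 never reach ω; ω is only their supremum, and ω+1 is obtained only by applying S to ω itself.

```latex
S(\alpha) = \alpha + 1, \qquad
S^{n}(0) = n < \omega \ \text{ for every finite } n, \qquad
\sup_{n < \omega} S^{n}(0) = \omega, \qquad
S(\omega) = \omega + 1 .
```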

My impression of attitudes towards convergent rationality

Broadly speaking, MIRI seem to be strong believers in convergent rationality, but their reasons for this view haven't been very well-articulated (with the possible exception of the inner optimizer paper). AI safety people more broadly seem to have a wide range of views, with many people disagreeing with MIRI's views and/or not feeling confident that they understand them well/fully.

Again, broadly speaking, machine learning (ML) people often seem to think it's a confused viewpoint bred out of anthropomorphism, ignorance of current/practical ML, and paranoia. People who are more familiar with evolutionary/genetic algorithms and artificial life communities might be a bit more sympathetic, and similarly for people who are concerned with feedback loops in the context of algorithmic decision making.

I think a lot of people working on ML-based AI safety consider convergent rationality to be less relevant than MIRI does, because 1) so far it is more of a hypothetical/theoretical concern, whereas we've done a lot of and 2) current ML (e.g. deep RL with bells and whistles) seems dangerous enough because of known and demonstrated specification and robustness problems (e.g. reward hacking and adversarial examples).

In the many conversations I've had with people from all these groups, I've found it pretty hard to find concrete points of disagreement that don't reduce to differences in values (e.g. regarding long-termism), timelines, or bare intuition. I think "level of paranoia about convergent rationality" is likely an important underlying crux.

Relation to capability control

A plethora of naive approaches to solving safety problems by limiting what agents can do have been proposed and rejected on the grounds that advanced AIs will be smart and rational enough to subvert them. Hyperbolically, the traditional AI safety view is that "capability control" is useless. Irrationality can be viewed as a form of capability control.

Naively, approaches which deliberately reduce an agent's intelligence or rationality should be an effective method of capability control (I'm guessing that's a proposal in the Artificial Stupidity paper, but I haven't read it). If this were true, then we might be able to build very intelligent and useful AI systems, but control them by, e.g., making them myopic, or restricting the hypothesis class / search space. This would reduce the "burden" on technical solutions to AI-Xrisk, making it (even) more of a global coordination problem.
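To make "myopic" slightly more concrete, here is a minimal sketch that treats myopia as a discount factor of 0. That is one common reading and is assumed here, not something this post commits to. The two plans and their reward numbers are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t; gamma = 0 counts only the immediate reward."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Hypothetical plans: reward received at time steps 0, 1, 2.
plans = {
    "grab_now": [1.0, 0.0, 0.0],  # small payoff immediately
    "invest": [0.0, 0.0, 5.0],    # larger payoff, but only later
}


def best_plan(gamma):
    """Name of the plan with the highest discounted return under the given gamma."""
    return max(plans, key=lambda name: discounted_return(plans[name], gamma))


myopic_choice = best_plan(0.0)      # only the first reward matters
farsighted_choice = best_plan(0.9)  # later rewards count almost fully

print("myopic agent picks:     ", myopic_choice)
print("far-sighted agent picks:", farsighted_choice)
```

In this sketch the myopic agent never "invests", which is the sense in which myopia limits what an agent will do regardless of how well it predicts rewards.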

But CRT suggests that these methods of capability control might fail unexpectedly. There is at least one example (which I've struggled to dig up) of a memory-less RL agent learning to encode memory information in the state of the world. More generally, agents can recruit resources from their environments, implicitly expanding their intellectual capabilities, without actually "self-modifying".
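The toy sketch below is not the example referred to above; it is a hypothetical two-step task, with all names and rules invented for illustration. It shows the general mechanism: a policy that is a pure function of the current observation, with no internal state, can still "remember" by writing a bit into the world and reading it back later. The policies here are hand-written rather than learned.

```python
import random


def run_episode(policy, rng):
    """Two-step task. `policy` maps the current observation to an action and keeps no state."""
    cue = rng.randint(0, 1)
    marker = 0  # one bit of world state that the agent is allowed to overwrite

    # Step 1: the cue is visible; the action is the value written to the marker.
    marker = policy(("see_cue", cue, marker))

    # Step 2: the cue is hidden; the action is the agent's guess of the cue.
    guess = policy(("recall", None, marker))
    return 1 if guess == cue else 0


def ignores_world(obs):
    # Never writes anything useful and always guesses 0.
    return 0


def uses_world(obs):
    phase, cue, marker = obs
    if phase == "see_cue":
        return cue   # store the cue in the environment
    return marker    # read it back from the environment


def average_reward(policy, episodes=2000, seed=0):
    rng = random.Random(seed)
    return sum(run_episode(policy, rng) for _ in range(episodes)) / episodes


avg_ignores = average_reward(ignores_world)
avg_uses = average_reward(uses_world)

print("memoryless, ignores the marker:", avg_ignores)
print("memoryless, uses the marker:   ", avg_uses)
```

Both policies are equally "memory-less" by construction; only the second one treats the environment as external storage.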

Relevance of convergent rationality to AI-Xrisk

Believing CRT should lead to higher levels of "paranoia". Technically, I think this should lead to more focus on things that look more like assurance (vs. robustness or specification). Believing CRT should make us concerned that non-agenty systems (e.g. trained with supervised learning) might start behaving more like agents.

Strategically, it seems like the main implication of believing in CRT pertains to situations where we already have fairly robust global coordination and a sufficiently concerned AI community. CRT implies that these conditions are not sufficient for a good prognosis: even if everyone using AI makes a good-faith effort to make it safe, if they mistakenly don't believe CRT, they can fail. So we'd also want the AI community to behave as if CRT were true unless or until we had overwhelming evidence that it was not a concern.

On the other hand, disbelief in CRT shouldn't allay our fears overly much; AIs need not be hyperrational in order to pose significant Xrisk. For example, we might be wiped out by something more "grey goo"-like, i.e. an AI that is basically a policy hyperoptimized for the niche of the Earth, and doesn't even have anything resembling a world(/universe) model, planning procedure, etc. Or we might create AIs that are like superintelligent humans: having many cognitive biases, but still agenty enough to thoroughly outcompete us, and considering lesser intelligences of dubious moral significance.

Conclusions, some arguments pro/con convergent rationality

My impression is that intelligence (as in IQ/g) and rationality are considered to be only loosely correlated. My current model is that ML systems become more intelligent with more capacity/compute/information, but not necessarily more rational. If this is true, it creates exciting prospects for forms of capability control. On the other hand, if CRT is true, this supports the practice of modelling all sufficiently advanced AIs as rational agents.

I think the main argument against CRT is that, from an ML perspective, it seems like "rationality" is more or less a design choice: we can make agents myopic, we can hard-code flawed environment models or reasoning procedures, etc. The main counter-arguments arise from VNMUT (the von Neumann-Morgenstern utility theorem), which can be interpreted as saying "rational agents are more fit" (in an evolutionary sense). At the same time, it seems like the complexity of the real world (e.g. physical limits of communication and information processing) makes this a pretty weak argument. Humans certainly seem highly irrational, and distinguishing biases and heuristics can be difficult.

A special case of this is the "inner optimizers" idea. The strongest argument for inner optimizers I'm aware of goes like: "the simplest solution to a complex enough task (and therefore the easiest for weakly guided search, e.g. by SGD) is to instantiate a more agenty process, and have it solve the problem for you". The "inner" part comes from the postulate that a complex and flexible enough class of models will instantiate such an agenty process internally (i.e. using a subset of the model's capacity). I currently think this picture is broadly speaking correct, and is the third major (technical) pillar supporting AI-Xrisk concerns (along with Goodhart's law and instrumental goals).

The issues with tiling agents also suggest that the analogy with ordinals I made might be stronger than it seems; it may be impossible for an agent to rationally endorse a qualitatively different form of reasoning. Similarly, while "CDT wants to become UDT" (supporting CRT), my understanding is that it is not actually capable of doing so (opposing CRT) because "you have to have been UDT all along" (thanks to Jessica Taylor for explaining this stuff to me a few years back).

While I think MIRI's work on idealized reasoners has shed some light on these questions, I think in practice, random(ish) "mutation" (whether intentionally designed or imposed by the physical environment) and evolutionary-like pressures may push AIs across boundaries that the "self-improvement operator" will not cross, making analyses of idealized reasoners less useful than they might naively appear.

This article is inspired by conversations with Alex Zhu, Scott Garrabrant, Jan Leike, Rohin Shah, Micah Carrol, and many others over the past year and years.

