Epistemic status: trying to vaguely gesture at vague intuitions. A similar idea was explored here under the heading "the intelligibility of intelligence", although I hadn't seen it before writing this post. As of 2020, I consider this follow-up comment to be a better summary of the thing I was trying to convey with this post than the post itself. The core disagreement is about how much we expect the limiting case of arbitrarily high intelligence to tell us about the AGIs whose behaviour we're worried about.
There’s a mindset which is common in the rationalist community, which I call “realism about rationality” (the name being intended as a parallel to moral realism). I feel like my skepticism about agent foundations research is closely tied to my skepticism about this mindset, and so in this essay I try to articulate what it is.
Humans ascribe properties to entities in the world in order to describe and predict them. Here are three such properties: "momentum", "evolutionary fitness", and "intelligence". These are all pretty useful properties for high-level reasoning in the fields of physics, biology and AI, respectively. There's a key difference between the first two, though. Momentum is very amenable to formalisation: we can describe it using precise equations, and even prove things about it. Evolutionary fitness is the opposite: although nothing in biology makes sense without it, no biologist can take an organism and write down a simple equation to define its fitness in terms of more basic traits. This isn't just because biologists haven't figured out that equation yet. Rather, we have excellent reasons to think that fitness is an incredibly complicated "function" which basically requires you to describe that organism's entire phenotype, genotype and environment.
In a nutshell, then, realism about rationality is a mindset in which reasoning and intelligence are more like momentum than like fitness. It's a mindset which makes the following ideas seem natural:
- The idea that there is a simple yet powerful theoretical framework which describes human intelligence and/or intelligence in general. (I don't count brute force approaches like AIXI for the same reason I don't consider physics a simple yet powerful description of biology).
- The idea that there is an “ideal” decision theory.
- The idea that AGI will very likely be an “agent”.
- The idea that Turing machines and Kolmogorov complexity are foundational for epistemology.
- The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints.
- The idea that Aumann's agreement theorem is relevant to humans.
- The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct.
- The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn’t depend very much on morally arbitrary factors.
- The idea that having having contradictory preferences or beliefs is really bad, even when there’s no clear way that they’ll lead to bad consequences (and you’re very good at avoiding dutch books and money pumps and so on).
To be clear, I am neither claiming that realism about rationality makes people dogmatic about such ideas, nor claiming that they're all false. In fact, from a historical point of view I’m quite optimistic about using maths to describe things in general. But starting from that historical baseline, I’m inclined to adjust downwards on questions related to formalising intelligent thought, whereas rationality realism would endorse adjusting upwards. This essay is primarily intended to explain my position, not justify it, but one important consideration for me is that intelligence as implemented in humans and animals is very messy, and so are our concepts and inferences, and so is the closest replica we have so far (intelligence in neural networks). It's true that "messy" human intelligence is able to generalise to a wide variety of domains it hadn't evolved to deal with, which supports rationality realism, but analogously an animal can be evolutionarily fit in novel environments without implying that fitness is easily formalisable.
Another way of pointing at rationality realism: suppose we model humans as internally-consistent agents with beliefs and goals. This model is obviously flawed, but also predictively powerful on the level of our everyday lives. When we use this model to extrapolate much further (e.g. imagining a much smarter agent with the same beliefs and goals), or base morality on this model (e.g. preference utilitarianism, CEV), is that more like using Newtonian physics to approximate relativity (works well, breaks down in edge cases) or more like cavemen using their physics intuitions to reason about space (a fundamentally flawed approach)?
Another gesture towards the thing: a popular metaphor for Kahneman and Tversky's dual process theory is a rider trying to control an elephant. Implicit in this metaphor is the localisation of personal identity primarily in the system 2 rider. Imagine reversing that, so that the experience and behaviour you identify with are primarily driven by your system 1, with a system 2 that is mostly a Hansonian rationalisation engine on top (one which occasionally also does useful maths). Does this shift your intuitions about the ideas above, e.g. by making your CEV feel less well-defined? I claim that the latter perspective is just as sensible as the former, and perhaps even more so - see, for example, Paul Christiano's model of the mind, which leads him to conclude that "imagining conscious deliberation as fundamental, rather than a product and input to reflexes that actually drive behavior, seems likely to cause confusion."
These ideas have been stewing in my mind for a while, but the immediate trigger for this post was a conversation about morality which went along these lines:
R (me): Evolution gave us a jumble of intuitions, which might contradict when we extrapolate them. So it’s fine to accept that our moral preferences may contain some contradictions.
O (a friend): You can’t just accept a contradiction! It’s like saying “I have an intuition that 51 is prime, so I’ll just accept that as an axiom.”
R: Morality isn’t like maths. It’s more like having tastes in food, and then having preferences that the tastes have certain consistency properties - but if your tastes are strong enough, you might just ignore some of those preferences.
O: For me, my meta-level preferences about the ways to reason about ethics (e.g. that you shouldn’t allow contradictions) are so much stronger than my object-level preferences that this wouldn’t happen. Maybe you can ignore the fact that your preferences contain a contradiction, but if we scaled you up to be much more intelligent, running on a brain orders of magnitude larger, having such a contradiction would break your thought processes.
R: Actually, I think a much smarter agent could still be weirdly modular like humans are, and work in such a way that describing it as having “beliefs” is still a very lossy approximation. And it’s plausible that there’s no canonical way to “scale me up”.
I had a lot of difficulty in figuring out what I actually meant during that conversation, but I think a quick way to summarise the disagreement is that O is a rationality realist, and I’m not. This is not a problem, per se: I'm happy that some people are already working on AI safety from this mindset, and I can imagine becoming convinced that rationality realism is a more correct mindset than my own. But I think it's a distinction worth keeping in mind, because assumptions baked into underlying worldviews are often difficult to notice, and also because the rationality community has selection effects favouring this particular worldview even though it doesn't necessarily follow from the community's founding thesis (that humans can and should be more rational).
In this essay, ricraz argues that we shouldn't expect a clean mathematical theory of rationality and intelligence to exist. I have debated em about this, and I continue to endorse more or less everything I said in that debate. Here I want to restate some of my (critical) position by building it from the ground up, instead of responding to ricraz point by point.
When should we expect a domain to be "clean" or "messy"? Let's look at everything we know about science. The "cleanest" domains are mathematics and fundamental physics. There, we have crisply defined concepts and elegant, parsimonious theories. We can then "move up the ladder" from fundamental to emergent phenomena, going through high energy physics, molecular physics, condensed matter physics, biology, geophysics / astrophysics, psychology, sociology, economics... On each level more "mess" appears. Why? Occam's razor tells us that we should prioritize simple theories over complex theories. But, we shouldn't expect a theory to be more simple than the specification of the domain. The general theory of planets should be simpler than a detailed description of planet Earth, the general theory of atomic matter should be simpler than the theory of planets, the general theory of everything should be simpler than the theory of atomic matter. That's because when we're "moving up the ladder", we are actually zooming in on particular phenomena, and the information we need to specify "where to zoom in" is translated to the description complexity of theory.
What does it mean in practice about understanding messy domains? The way science solves this problem is by building a tower of knowledge. In this tower, each floor benefits from the interactions both with the floor above it and the floor beneath it. Without understanding macroscopic physics we wouldn't figure out atomic physics, and without figuring out atomic physics we wouldn't figure out high energy physics. This is knowledge "flowing down". But knowledge also "flows up": knowledge of high energy physics allows understanding particular phenomena in atomic physics, knowledge of atomic physics allows predicting the properties of materials and chemical reactions. (Admittedly, some floors in the tower we have now are rather ramshackle, but I think that ultimately the "tower method" succeeds everywhere, as much as success is possible at all).
How does mathematics come in here? Importantly, mathematics is not used only on the lower floors of the tower, but on all floors. The way "messiness" manifests is, the mathematical models for the higher floors are either less quantitatively accurate (but still contain qualitative inputs) or have a lot of parameters that need to be determined either empirically, or using the models of the lower floors (which is one way how knowledge flows up), or some combination of both. Nevertheless, scientists continue to successfully build and apply mathematical models even in "messy" fields like biology and economics.
So, what does it all mean for rationality and intelligence? On what floor does it sit? In fact, the subject of rationality of intelligence is not a single floor, but its own tower (maybe we should imagine science as a castle with many towers connected by bridges).
The foundation of this tower should be the general abstract theory of rationality. This theory is even more fundamental than fundamental physics, since it describes the principles from which all other knowledge is derived, including fundamental physics. We can regard it as a "theory of everything": it predicts everything by making those predictions that a rational agent should do. Solomonoff's theory and AIXI are a part of this foundation, but not all it. Considerations like computational resource constraints should also enter the picture: complexity theory teaches us that they are also fundamental, they don't requiring "zooming in" a lot.
But, computational resource constrains are only entirely natural when they are not tied to a particular model of computation. This only covers constraints such as "polynomial time" but not constraints such as O(n3) time and even less so 245n3 time. Therefore, once we introduce a particular model of computation (such as a RAM machine), we need to build another floor in the tower, one that will necessarily be "messier". Considering even more detailed properties of the hardware we have, the input/output channels we have, the goal system, the physical environment and the software tools we employ will correspond to adding more and more floors.
Once we agree that it shoud be possible to create a clean mathematical theory of rationality and intelligence, we can still debate whether it's useful. If we consider the problem of creating aligned AGI from an engineering perspective, it might seem for a moment that we don't really need the bottom layers. After all, when designing an airplane you don't need high energy physics. Well, high energy physics might help indirectly: perhaps it allowed predicting some exotic condensed matter phenomenon which we used to make a better power source, or better materials from which to build the aircraft. But often we can make do without those.
Such an approach might be fine, except that we also need to remember the risks. Now, safety is part of most engineering, and is definitely a part of airplane design. What level of the tower does it require? It depends on the kind of risks you face. If you're afraid the aircraft will not handle the stress and break apart, then you need mechanics and aerodynamics. If you're afraid the fuel will combust and explode, you better know chemistry. If you're afraid a lightning will strike the aircraft, you need knowledge of meteorology and electromagnetism, possibly plasma physics as well. The relevant domain of knowledge, and the relevant floor in the tower is a function of the nature of the risk.
What level of the tower do we need to understand AI risk? What is the source of AI risk? It is not in any detailed peculiarities of the world we inhabit. It is not in the details of the hardware used by the AI. It is not even related to a particular model of computation. AI risk is the result of Goodhart's curse, an extremely general property of optimization systems and intelligent agents. Therefore, addressing AI risk requires understanding the general abstract theory of rationality and intelligence. The upper floors will be needed as well, since the technology itself requires the upper floors (and since we're aligning with humans, who are messy). But, without the lower floors the aircraft will crash.