Epistemic status: trying to vaguely gesture at vague intuitions. A similar idea was explored here under the heading "the intelligibility of intelligence", although I hadn't seen it before writing this post. As of 2020, I consider this follow-up comment to be a better summary of the thing I was trying to convey with this post than the post itself. The core disagreement is about how much we expect the limiting case of arbitrarily high intelligence to tell us about the AGIs whose behaviour we're worried about.
There’s a mindset which is common in the rationalist community, which I call “realism about rationality” (the name being intended as a parallel to moral realism). I feel like my skepticism about agent foundations research is closely tied to my skepticism about this mindset, and so in this essay I try to articulate what it is.
Humans ascribe properties to entities in the world in order to describe and predict them. Here are three such properties: "momentum", "evolutionary fitness", and "intelligence". These are all pretty useful properties for high-level reasoning in the fields of physics, biology and AI, respectively. There's a key difference between the first two, though. Momentum is very amenable to formalisation: we can describe it using precise equations, and even prove things about it. Evolutionary fitness is the opposite: although nothing in biology makes sense without it, no biologist can take an organism and write down a simple equation to define its fitness in terms of more basic traits. This isn't just because biologists haven't figured out that equation yet. Rather, we have excellent reasons to think that fitness is an incredibly complicated "function" which basically requires you to describe that organism's entire phenotype, genotype and environment.
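As a minimal illustration of what "amenable to formalisation" means here (the symbols are standard Newtonian notation, assumed for this sketch rather than defined anywhere in this post): for a body of mass $m$ moving with velocity $\mathbf{v}$,

$$\mathbf{p} = m\mathbf{v}, \qquad \mathbf{F} = \frac{d\mathbf{p}}{dt},$$

so for an isolated system, where the net external force is zero, the total momentum $\sum_i \mathbf{p}_i$ stays constant. That is the kind of short definition plus provable consequence which, on the view above, fitness does not admit.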
In a nutshell, then, realism about rationality is a mindset in which reasoning and intelligence are more like momentum than like fitness. It's a mindset which makes the following ideas seem natural:
- The idea that there is a simple yet powerful theoretical framework which describes human intelligence and/or intelligence in general. (I don't count brute-force approaches like AIXI, for the same reason I don't consider physics a simple yet powerful description of biology.)
- The idea that there is an “ideal” decision theory.
- The idea that AGI will very likely be an “agent”.
- The idea that Turing machines and Kolmogorov complexity are foundational for epistemology.
- The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints.
- The idea that Aumann's agreement theorem is relevant to humans.
- The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct.
- The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn’t depend very much on morally arbitrary factors.
- The idea that having contradictory preferences or beliefs is really bad, even when there’s no clear way that they’ll lead to bad consequences (and you’re very good at avoiding Dutch books and money pumps and so on).
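For readers unfamiliar with the term in the last item, here is a minimal sketch of a money pump. Everything in it is a hypothetical illustration: the items `A`, `B`, `C`, the cyclic preference relation `PREFERS`, the flat `FEE`, and the function `run_money_pump` are assumptions made for this example, not a standard formalism or anything from the decision theory literature verbatim.

```python
# Hypothetical illustration of a money pump.
# The agent has cyclic preferences: A over B, B over C, and C over A.
# (x, y) in PREFERS means "x is strictly preferred to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}

# Assumed flat price the agent is willing to pay for any swap it prefers.
FEE = 1


def run_money_pump(start_item, wealth, offers):
    """Offer the agent a sequence of swaps.

    The agent accepts a swap whenever it prefers the offered item to the
    one it currently holds and can still afford the fee.
    Returns the item it ends up holding and its remaining wealth.
    """
    item = start_item
    for offered in offers:
        if (offered, item) in PREFERS and wealth >= FEE:
            item, wealth = offered, wealth - FEE
    return item, wealth


# One trip around the cycle: the agent pays three fees and holds A again.
final_item, final_wealth = run_money_pump("A", wealth=10, offers=["C", "B", "A"])
print(final_item, final_wealth)
```

The agent accepts each trade because each looks like an improvement locally, yet it finishes with the item it started with and less money. The item above concerns people who would notice and refuse such sequences but are still bothered by the underlying inconsistency.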
To be clear, I am neither claiming that realism about rationality makes people dogmatic about such ideas, nor claiming that they're all false. In fact, from a historical point of view I’m quite optimistic about using maths to describe things in general. But starting from that historical baseline, I’m inclined to adjust downwards on questions related to formalising intelligent thought, whereas rationality realism would endorse adjusting upwards. This essay is primarily intended to explain my position, not justify it. Still, one important consideration for me is that intelligence as implemented in humans and animals is very messy, and so are our concepts and inferences, and so is the closest replica we have so far (intelligence in neural networks). It's true that "messy" human intelligence is able to generalise to a wide variety of domains it hadn't evolved to deal with, which supports rationality realism. But analogously, an animal can be evolutionarily fit in novel environments without this implying that fitness is easily formalisable.
Another way of pointing at rationality realism: suppose we model humans as internally-consistent agents with beliefs and goals. This model is obviously flawed, but also predictively powerful on the level of our everyday lives. When we use this model to extrapolate much further (e.g. imagining a much smarter agent with the same beliefs and goals), or base morality on this model (e.g. preference utilitarianism, CEV), is that more like using Newtonian physics to approximate relativity (works well, breaks down in edge cases) or more like cavemen using their physics intuitions to reason about space (a fundamentally flawed approach)?
Another gesture towards the thing: a popular metaphor for Kahneman and Tversky's dual process theory is a rider trying to control an elephant. Implicit in this metaphor is the localisation of personal identity primarily in the system 2 rider. Imagine reversing that, so that the experience and behaviour you identify with are primarily driven by your system 1, with a system 2 that is mostly a Hansonian rationalisation engine on top (one which occasionally also does useful maths). Does this shift your intuitions about the ideas above, e.g. by making your CEV feel less well-defined? I claim that the latter perspective is just as sensible as the former, and perhaps even more so - see, for example, Paul Christiano's model of the mind, which leads him to conclude that "imagining conscious deliberation as fundamental, rather than a product and input to reflexes that actually drive behavior, seems likely to cause confusion."
These ideas have been stewing in my mind for a while, but the immediate trigger for this post was a conversation about morality which went along these lines:
R (me): Evolution gave us a jumble of intuitions, which might contradict when we extrapolate them. So it’s fine to accept that our moral preferences may contain some contradictions.
O (a friend): You can’t just accept a contradiction! It’s like saying “I have an intuition that 51 is prime, so I’ll just accept that as an axiom.”
R: Morality isn’t like maths. It’s more like having tastes in food, and then having preferences that the tastes have certain consistency properties - but if your tastes are strong enough, you might just ignore some of those preferences.
O: For me, my meta-level preferences about the ways to reason about ethics (e.g. that you shouldn’t allow contradictions) are so much stronger than my object-level preferences that this wouldn’t happen. Maybe you can ignore the fact that your preferences contain a contradiction, but if we scaled you up to be much more intelligent, running on a brain orders of magnitude larger, having such a contradiction would break your thought processes.
R: Actually, I think a much smarter agent could still be weirdly modular like humans are, and work in such a way that describing it as having “beliefs” is still a very lossy approximation. And it’s plausible that there’s no canonical way to “scale me up”.
I had a lot of difficulty in figuring out what I actually meant during that conversation, but I think a quick way to summarise the disagreement is that O is a rationality realist, and I’m not. This is not a problem, per se: I'm happy that some people are already working on AI safety from this mindset, and I can imagine becoming convinced that rationality realism is a more correct mindset than my own. But I think it's a distinction worth keeping in mind, because assumptions baked into underlying worldviews are often difficult to notice, and also because the rationality community has selection effects favouring this particular worldview even though it doesn't necessarily follow from the community's founding thesis (that humans can and should be more rational).