In this essay, ricraz argues that we shouldn't expect a clean mathematical theory of rationality and intelligence to exist. I have debated em about this, and I continue to endorse more or less everything I said in that debate. Here I want to restate some of my (critical) position by building it from the ground up, instead of responding to ricraz point by point.
When should we expect a domain to be "clean" or "messy"? Let's look at everything we know about science. The "cleanest" domains are mathematics and fundamental physics. There, we have crisply defined concepts and elegant, parsimonious theories. We can then "move up the ladder" from fundamental to emergent phenomena, going through high energy physics, molecular physics, condensed matter physics, biology, geophysics / astrophysics, psychology, sociology, economics... On each level more "mess" appears. Why? Occam's razor tells us that we should prioritize simple theories over complex theories. But, we shouldn't expect a theory to be more simple than the specification of the domain. The general theory of planets should be simpler than a detailed description of planet Earth, the general theory of atomic matter should be simpler than the theory of planets, the general theory of everything should be simpler than the theory of atomic matter. That's because when we're "moving up the ladder", we are actually zooming in on particular phenomena, and the information we need to specify "where to zoom in" is translated to the description complexity of theory.
What does it mean in practice about understanding messy domains? The way science solves this problem is by building a tower of knowledge. In this tower, each floor benefits from the interactions both with the floor above it and the floor beneath it. Without understanding macroscopic physics we wouldn't figure out atomic physics, and without figuring out atomic physics we wouldn't figure out high energy physics. This is knowledge "flowing down". But knowledge also "flows up": knowledge of high energy physics allows understanding particular phenomena in atomic physics, knowledge of atomic physics allows predicting the properties of materials and chemical reactions. (Admittedly, some floors in the tower we have now are rather ramshackle, but I think that ultimately the "tower method" succeeds everywhere, as much as success is possible at all).
How does mathematics come in here? Importantly, mathematics is not used only on the lower floors of the tower, but on all floors. The way "messiness" manifests is, the mathematical models for the higher floors are either less quantitatively accurate (but still contain qualitative inputs) or have a lot of parameters that need to be determined either empirically, or using the models of the lower floors (which is one way how knowledge flows up), or some combination of both. Nevertheless, scientists continue to successfully build and apply mathematical models even in "messy" fields like biology and economics.
So, what does it all mean for rationality and intelligence? On what floor does it sit? In fact, the subject of rationality of intelligence is not a single floor, but its own tower (maybe we should imagine science as a castle with many towers connected by bridges).
The foundation of this tower should be the general abstract theory of rationality. This theory is even more fundamental than fundamental physics, since it describes the principles from which all other knowledge is derived, including fundamental physics. We can regard it as a "theory of everything": it predicts everything by making those predictions that a rational agent should do. Solomonoff's theory and AIXI are a part of this foundation, but not all it. Considerations like computational resource constraints should also enter the picture: complexity theory teaches us that they are also fundamental, they don't requiring "zooming in" a lot.
But, computational resource constrains are only entirely natural when they are not tied to a particular model of computation. This only covers constraints such as "polynomial time" but not constraints such as O(n3) time and even less so 245n3 time. Therefore, once we introduce a particular model of computation (such as a RAM machine), we need to build another floor in the tower, one that will necessarily be "messier". Considering even more detailed properties of the hardware we have, the input/output channels we have, the goal system, the physical environment and the software tools we employ will correspond to adding more and more floors.
Once we agree that it shoud be possible to create a clean mathematical theory of rationality and intelligence, we can still debate whether it's useful. If we consider the problem of creating aligned AGI from an engineering perspective, it might seem for a moment that we don't really need the bottom layers. After all, when designing an airplane you don't need high energy physics. Well, high energy physics might help indirectly: perhaps it allowed predicting some exotic condensed matter phenomenon which we used to make a better power source, or better materials from which to build the aircraft. But often we can make do without those.
Such an approach might be fine, except that we also need to remember the risks. Now, safety is part of most engineering, and is definitely a part of airplane design. What level of the tower does it require? It depends on the kind of risks you face. If you're afraid the aircraft will not handle the stress and break apart, then you need mechanics and aerodynamics. If you're afraid the fuel will combust and explode, you better know chemistry. If you're afraid a lightning will strike the aircraft, you need knowledge of meteorology and electromagnetism, possibly plasma physics as well. The relevant domain of knowledge, and the relevant floor in the tower is a function of the nature of the risk.
What level of the tower do we need to understand AI risk? What is the source of AI risk? It is not in any detailed peculiarities of the world we inhabit. It is not in the details of the hardware used by the AI. It is not even related to a particular model of computation. AI risk is the result of Goodhart's curse, an extremely general property of optimization systems and intelligent agents. Therefore, addressing AI risk requires understanding the general abstract theory of rationality and intelligence. The upper floors will be needed as well, since the technology itself requires the upper floors (and since we're aligning with humans, who are messy). But, without the lower floors the aircraft will crash.
In this essay Paul Christiano proposes a definition of "AI alignment" which is more narrow than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be, helping the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.
Rohin Shah's comment on the essay (which I believe is endorsed by Paul) reframes it as a particular way to decompose the AI safety problem. An often used decomposition is "definition-optimization": first we define what it means for an AI to be safe, then we understand how to implement a safe AI. In contrast, Paul's definition of alignment decomposes the AI safety problem as "motivation-competence": first we learn how to design AIs with good motivations, then we learn how to make them competent. Both Paul and Rohin argue that the "motivation" is the urgent part of the problem, the part on which technical AI safety research should focus.
In contrast, I will argue that the "motivation-competence" decomposition is not as useful as Paul and Rohin believe, and the "definition-optimization" decomposition is more useful.
The thesis behind the "motivation-competence" decomposition implicitly assumes a linear, one-dimensional scale of competence. Agents with good motivations and subhuman competence might make silly mistakes but are not catastrophically dangerous (since they are subhuman). Agents with good motivations and superhuman competence will only do mistakes that are "forgivable" in the sense that, our own mistakes would be as bad or worse. Ergo (the thesis concludes), good motivations are sufficient to solve AI safety.
However, in reality competence is multi-dimensional. AI systems can have subhuman skills in some domains and superhuman skills in other domains, as AI history showed time and time again. This opens the possibility of agents that with "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user. Moreover, there might be limits to the agent's knowledge about certain questions (such as, the user's preferences) that are inherent in the agent's epistemology (more on this below). Given such limits, the agent's competence becomes systematically lopsided. Furthermore, the elimination of such limits is as a large part of the "definition" part in the "definition-optimization" framing that the thesis rejects.
As a consequence of the multi-dimensional natural of competence, the difference between "well intentioned mistake" and "malicious sabotage" is much less clear than naively assumed, and I'm not convinced there is a natural way to remove the ambiguity. For example, consider a superhuman AI Alpha subject to an acausal attack. In this scenario, some agent Beta in the "multiverse" (= prior) convinces Alpha that Alpha exists in a simulation controlled by Beta. The simulation is set up to look like the real Earth for a while, making it a plausible hypothesis. Then, a "treacherous turn" moment arrives in which the simulation diverges from Earth, in a way calculated to make Alpha take irreversible actions that are beneficial for Beta and disastrous for the user.
In the above scenario, is Alpha "motivation-aligned"? We could argue it is not, because it is running the malicious agent Beta. But we could also argue it is motivtion-aligned, it just makes the innocent mistake of falling for Beta's trick. Perhaps it is possible to clarify the concept of "motivation" such that in this case, Alpha's motivations are considered bad. But, such a concept would depend in complicated ways on the agent's internals. I think that this is a difficult and unnatural approach, compared to "definition-optimization" where the focus is not on the internals but on what the agent actually does (more on this later).
The possibility of acausal attacks is a symptom of the fact that, environments with irreversible transitions are usually not learnable (this is the problem of traps in reinforcement learning, that I discussed for example here and here), i.e. it is impossible to guarantee convergence to optimal expected utility without further assumptions. When we add preference learning to the mix, the problem gets worse because now even if there are no irreversible transitions, it is not clear the agent will converge to optimal utility. Indeed, depending on the value learning protocol, there might be uncertainties about the user's preferences that the agent can never resolve (this is an example of what I meant by "inherent limits" before). For example, this happens in CIRL (even if the user is perfectly rational, this happens because the user and the AI have different action sets).
These difficulties with the "motivation-competence" framing are much more natural to handle in the "definition-optimization" framing. Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK). Specifically, the mathematical criteria of alignment I proposed are the "dynamic subjective regret bound" and the "dangerousness bound". The former is a criterion which simultaneous guarantees motivation-alignment and competence (as evidence that this criterion can be satisfied, I have the Dialogic Reinforcement Learning proposal). The latter is a criterion that doesn't guarantee competence in general, but guarantees specifically avoiding catastrophic mistakes. This makes it closer to motivation-alignment compated to subjective regret, but different in important ways: it refers to the actual things that agent does, and the ways in which these things might have catastrophic consequences.
In summary, I am skeptical that "motivation" and "competence" can be cleanly separately in a way that is useful for AI safety, whereas "definition" and "optimization" can be so separated: for example the dynamic subjective regret bound is a "definition" whereas dialogic RL and putative more concrete implementations thereof are "optimizations". My specific proposals might have fatal flaws that weren't discovered yet, but I believe that the general principle of "definition-optimization" is sound, while "motivation-competence" is not.
In this essay, Rohin sets out to debunk what ey perceive as a prevalent but erroneous idea in the AI alignment community, namely: "VNM and similar theorems imply goal-directed behavior". This is placed in the context of Rohin's thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: "coherence arguments" imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximizing the expectation of some utility function.
I have mixed feelings about this essay. On the one hand, the core argument that VNM and similar theorems do not imply goal-directed behavior is true. To the extent that some people believed the opposite, correcting this mistake is important. On the other hand, (i) I don't think the claim Rohin is debunking is the claim Eliezer had in mind in those sources Rohin cites (ii) I don't think that the conclusions Rohin draws or at least implies are the right conclusions.
The actual claim that Eliezer was making (or at least my interpretation of it) is, coherence arguments imply that if we assume an agent is goal-directed then it must be an expected utility maximizer, and therefore EU maximization is the correct mathematical model to apply to such agents.
Why do we care about goal-directed agents in the first place? The reason is, on the one hand goal-directed agents are the main source of AI risk, and on the other hand, goal-directed agents are also the most straightforward approach to solving AI risk. Indeed, if we could design powerful agents with the goals we want, these agents would protect us from unaligned AIs and solve all other problems as well (or at least solve them better than we can solve them ourselves). Conversely, if we want to protect ourselves from unaligned AIs, we need to generate very sophisticated long-term plans of action in the physical world, possibly restructuring the world in a rather extreme way to safe-guard it (compare with Bostrom's arguments for mass surveillance). The ability to generate such plans is almost by definition goal-directed behavior.
Now, knowing that goal-directed agents are EU maximizers doesn't buy us much. As Rohin justly observes, without further constraints it is a vacuous claim (although the situation becomes better if we constraint ourselves to instrumental reward functions). Moreover, the model of reasoning in complex environments that I'm advocating myself (quasi-Bayesian reinforcement learning) doesn't even look like EU maximization (technically there is a way to interpret it as EU maximization but it underspecifies the behavior). This is a symptom of the fact that the setting and assumptions of VNM and similar theorems are not good enough to study goal-directed behavior. However, I think that it can be an interesting and important line of research, to try and figure out the right setting and assumptions.
This last point is IMO the correct takeaway from Rohin's initial observation. In contrast, I remain skeptical about Rohin's thesis that we should dispense with goal-directedness altogether, for the reason I mentioned before: powerful goal-directed agents seem necessary or at least very desirable to create a defense system from unaligned AI. Moreover, the study of goal-directed agents is important to understand the impact of any powerful AI system on the world, since even a system not designed to be goal-directed can develop such agency (due to reasons like
malign hypotheses, mesa-optimization and self-fullfiling prophecies).
This essay provides some fascinating case studies and insights about coordination problems and their solutions, from a book by Elinor Ostrom. Coordination problems are a major theme in LessWrongian thinking (for good reasons) and the essay is a valuable addition to the discussion. I especially liked the 8 features of sustainable governance systems (although I wish we got a little more explanation for "nested enterprises").
However, I think that the dichotomy between "absolutism (bad)" and "organically grown institutions (good)" that the essay creates needs more nuance or more explanation. What is the difference between "organic" and "inorganic" institutions? All institutions "grew" somehow. The relevant questions are e.g. how democratic is the institution, whether the scope of the institution is the right scope for this problem, whether the stakeholders have skin in the game (feature 3) et cetera. The 8 features address some of that, but I wish it was more explicit.
Also, It's notable that all examples focus on relatively small scale problems. While it makes perfect sense to start by studying small problems before trying to understand the big problems, it does make me wonder whether going to higher scales brings in qualitatively new issues and difficulties. Paying to officials with parcels in the tail end works for water conflicts, but what is the analogous approach to global warming or multinational arms races?
Although normally I am all for judging arguments by their merits, regardless of who speaks them, I think that in this particular case we need to think twice before including the essay in the "Best of 2018" book. The notoriety of the author is such that including it risks serious reputation damage for the community, especially that the content of the essay might be interpreted as a veiled attempt to justify the author's moral transgressions. To be clear, I am not saying we should censor everything that this man ever said, but giving it the spotlight in "Best of 2018" seems like a bad choice.
This essay makes a valuable contribution to the vocabulary we use to discuss and think about AI risk. Building a common vocabulary like this is very important for productive knowledge transmission and debate, and makes it easier to think clearly about the subject.
This essay defines and clearly explains an important property of human moral intuitions: the divergence of possible extrapolations from the part of the state spaces we're used to think about. This property is a challenge in moral philosophy, that has implications on AI alignment and long-term or "extreme" thinking in effective altruism. Although I don't think that it was especially novel to me personally, it is valuable to have a solid reference for explaining this concept.
I think this post is a good and memorable explanation-by-analogy of what kind of research MIRI is doing to solve AI risk and a good and memorable response to some common criticism of or confusion about the former.
I think this post introduces a useful concept / way of thinking that I kept applying in my own life since reading it and that helped me understanding and dealing with certain social situations.